How to use Python for social media sentiment analysis
This article introduces how to use Python for social media sentiment analysis. It covers the material in detail, and I hope it proves helpful for your own work.
Learn the basics of natural language processing and explore two useful Python packages.
Natural language processing (NLP) is a type of machine learning concerned with spoken or written language and the computer-aided analysis of that language. We experience numerous NLP innovations in daily life, from writing assistance and suggestions to real-time speech translation and interpretation.
This article examines one specific area of NLP: sentiment analysis, which focuses on determining the positive, negative, or neutral nature of the input language. This part explains the background of NLP and sentiment analysis and explores two open source Python packages. Part 2 demonstrates how to begin building your own scalable sentiment analysis service.
When learning sentiment analysis, it is helpful to have a general understanding of NLP first. This article will not dig into the mathematical details. Instead, our goal is to clarify the key concepts in NLP that are crucial to incorporating these methods into your solutions in practice.
Natural language and text data
A reasonable place to start is with a definition: "What is natural language?" It is the way we humans communicate with one another, primarily through speech and writing. We can go a step further and focus solely on text communication. After all, living in an age of ubiquitous Siri and Alexa, we know that speech is a group of computations removed from text.
The data landscape and challenges
Let's consider only textual data. What can we do with language and text? First, language, particularly English, is full of exceptions to its rules, varieties of meaning, and contextual differences that can confuse even a human interpreter, let alone a machine. In primary school we learn parts of speech and punctuation, and by speaking our native language we gain an intuition for which words carry meaning. Words such as "a", "the", and "or", for example, carry little meaning on their own; in NLP these are called stop words, since traditionally an NLP algorithm's search for meaning stops upon reaching one in a sequence.
Since our goal is to automatically classify text into sentiment classes, we need a way to work with text data computationally. Therefore, we must consider how text data is represented to a machine. As is well known, the rules for using and interpreting language are complicated, and the size and structure of input text can vary considerably. We need to convert text data into numeric data, the form favored by machines and mathematics. This transformation falls under the category of feature extraction.
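To make this concrete, here is a minimal sketch of feature extraction with scikit-learn's CountVectorizer. Note that scikit-learn is not part of this article's toolkit; it is used here purely as an illustration of turning text into count vectors (the get_feature_names_out call assumes scikit-learn 1.0 or later):

    # Turn a tiny corpus into numeric count vectors.
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "I love sunny days",
        "I hate rainy days",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)  # sparse matrix of word counts

    print(vectorizer.get_feature_names_out())  # the learned vocabulary
    print(X.toarray())                         # one count vector per document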
After extracting a numeric representation of input text data, one refinement might be: given a body of text inputs, determine a set of quantitative statistics for the articles listed above and classify documents based on them. For example, a glut of adverbs might make a copywriter bristle, or excessive use of stop words might help identify a term paper padded with filler. Admittedly, this may not have much bearing on our goal of sentiment analysis.
Bag of words
When you evaluate whether a textual statement is positive or negative, what context do you use to judge its polarity (i.e., whether the text carries positive, negative, or neutral sentiment)? One way is connotative adjectives: something described as "disgusting" is regarded as negative, but if the same thing is described as "beautiful", you judge it as positive. Colloquialisms, by definition, give a sense of familiarity and are often positive, while profanity can be a sign of hostility. Text data can also include emoticons, which carry inherent sentiments.
Understanding the polarity influence of individual words provides a basis for the bag-of-words (BoW) model of text. It considers a set of words, or vocabulary, and extracts measures of the presence of those words in the input text. The vocabulary is formed by processing text whose polarity is known, referred to as labeled training data. Features are extracted from this set of labeled data, the relationships among the features are analyzed, and labels are associated with the data.
The name "bag of words" illustrates what it does: individual words are considered without regard for spatial locality or context. A vocabulary is typically built from all words appearing in the training set and is often pruned afterward. Stop words, if not cleaned beforehand, are removed due to their high frequency and low contextual value. Rarely used words can also be removed, given the lack of information they provide for general input cases.
It is important to note, however, that you can (and should) go further and consider how often a word appears beyond its presence in a single instance of training data, known as term frequency (TF). You should also consider a word's counts across all instances of input data; typically, words that are infrequent among all documents are more notable, which is called the inverse document frequency (IDF). These metrics are bound to be mentioned in other articles and software packages on this subject, so an awareness of them can only help.
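As an illustration of TF-IDF weighting, here is a minimal sketch using scikit-learn's TfidfVectorizer (again an illustrative choice; the article itself does not prescribe a TF-IDF library):

    # Compute TF-IDF features; common words get lower IDF weights.
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "we do not like this war",
        "i hate rainy days but today is sunny",
        "it is not a matter of life and death",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)

    # Words appearing in many documents ("is") receive lower weights
    # than words distinctive to one document ("war", "sunny").
    for word, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
        print(f"{word}: {idf:.2f}")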
Bags of words are useful in many document classification applications. In sentiment analysis, however, things can be gamed by exploiting this lack of contextual awareness. Consider the following sentences:
We do not like this war.
I hate rainy days. The good news is that it is sunny today.
It's not a matter of life and death.
The sentiment of these phrases is questionable for human interpreters, and by strictly focusing on instances of individual vocabulary words, it is difficult for a machine interpreter as well.
Groupings of words, called n-grams, can also be considered in NLP. A bigram considers groups of two adjacent words instead of (or in addition to) the single bag of words. This should alleviate situations such as "do not like" above, but it remains problematic because the contextual meaning is still lost. Furthermore, in the second sentence above, the sentiment context of the second half can be understood as negating the first half; spatial locality of contextual cues is thus also lost in this approach.
Complicating matters from a practical standpoint is the sparsity of features extracted from a given input text. For a complete, large vocabulary, a count is maintained for each word, which can be considered an integer vector. Most documents will have a large number of zero counts in their vectors, which adds unnecessary space and time complexity to operations. While a number of clever approaches have been proposed for reducing this complexity, it remains an issue.
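Returning to the bigram idea, here is a minimal sketch, again using scikit-learn's CountVectorizer purely for illustration:

    # Extract unigrams and bigrams together via ngram_range.
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(["we do not like this war"])

    # Bigrams such as "not like" are now features of their own,
    # partially capturing the negation a plain bag of words misses.
    print(vectorizer.get_feature_names_out())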
Word embedding
A word embedding is a distributed representation that allows words with similar meaning to have similar representations. It is based on using real-valued vectors that relate to the context in which words appear. The focus is on the manner in which words are used, rather than simply on their presence or absence. In addition, a huge practical benefit of word embeddings is their focus on dense vectors. By moving away from a word-count model with commensurate amounts of zero-valued vector elements, word embeddings provide a more efficient computational paradigm in terms of both time and storage.
Here are two excellent word embedding methods.
Word2vec
The first is Word2vec, which was developed at Google. You will likely see this embedding method mentioned as you go deeper into your study of NLP and sentiment analysis. It utilizes either a continuous bag of words (CBOW) or a continuous skip-gram model. In CBOW, a word's context is learned during training based on the words surrounding it; continuous skip-gram learns which words tend to surround a given word. Although probably more than you need to know for tackling your own problem: if you are ever faced with generating your own word embeddings, the authors of Word2vec advocate the CBOW method for speed and assessment of frequent words, while the skip-gram approach is better suited for embeddings where rare words matter more.
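If you do end up training your own embeddings, one common route is the gensim library (not covered in this article; it is shown here only as one well-known implementation). A minimal sketch, assuming the gensim 4.x API, where the vector-size argument is named vector_size:

    # Train a tiny Word2vec model; sg=0 selects CBOW, sg=1 skip-gram.
    from gensim.models import Word2Vec

    sentences = [
        ["we", "do", "not", "like", "this", "war"],
        ["i", "hate", "rainy", "days"],
        ["today", "is", "sunny"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

    print(model.wv["sunny"])               # the dense vector for "sunny"
    print(model.wv.most_similar("sunny"))  # nearest neighbors in vector space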
GloVe
The second is Global Vectors for Word Representation (GloVe), which was developed at Stanford University. It is an extension of the Word2vec method that attempts to combine the information gained through classical global text statistical feature extraction with the local contextual information determined by Word2vec. In practice, GloVe outperforms Word2vec for some applications, while falling short of Word2vec's performance in others. Ultimately, the dataset targeted for your word embeddings will dictate which method is optimal; as such, it is good to be aware of their existence and their high-level mechanics, since you will likely come across them.
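Pretrained GloVe vectors are distributed as plain-text files by the Stanford NLP group. A minimal sketch of loading and comparing them, assuming you have downloaded the glove.6B.50d.txt file:

    # Load GloVe vectors into a dict of word -> numpy array.
    import numpy as np

    embeddings = {}
    with open("glove.6B.50d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # Similar words should have a high cosine similarity.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(embeddings["happy"], embeddings["glad"]))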
Creating and using word embeddings
It is useful to know how to obtain word embeddings. In Part 2, you will see that we stand on the shoulders of giants, as it were, by leveraging the substantial work of others in the community. One approach to acquiring a word embedding is to use an existing trained and proven model. Indeed, myriad models exist for English and other languages, and hopefully one of them does what your application needs out of the box!
If not, the opposite end of the spectrum in terms of development effort is training your own standalone model without consideration of your application. In essence, you would acquire substantial amounts of labeled training data and likely use one of the approaches above to train a model. Even then, you would only be at the point of understanding your input text data; you would still need to develop a model specific to your application (e.g., analyzing sentiment in software version-control messages), which in turn requires its own time and effort.
You could also train a word embedding on data specific to your application; while this could reduce time and effort, the embedding would be application-specific, which would reduce its reusability.
Available tool options
Given the time and computing power involved, you may wonder how it is ever possible to arrive at a working solution. Indeed, the complexity of developing reliable models can be daunting. The good news, however, is that many proven models, tools, and software libraries already exist that provide much of what we need. We will focus on Python, which conveniently provides a wealth of tooling for these applications.
SpaCy
SpaCy provides a number of language models for parsing input text data and extracting features. It is highly optimized and touted as the fastest library of its kind. Best of all, it's open source! SpaCy performs tokenization, part-of-speech classification, and dependency annotation. It contains word embedding models for performing this and other feature extraction operations for more than 46 languages. You will see how it can be used for text analysis and feature extraction in the second article in this series.
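Here is a minimal sketch of spaCy's tokenization, part-of-speech tagging, and dependency annotation. It assumes the small English model has been installed with: python -m spacy download en_core_web_sm

    # Parse a sentence and inspect each token's linguistic attributes.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I hate rainy days. The good news is that it is sunny today.")

    for token in doc:
        # token text, part of speech, and dependency relation to its head
        print(token.text, token.pos_, token.dep_)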
VaderSentiment
The vaderSentiment package provides a measure of positive, negative, and neutral sentiment. As the title of the original paper ("VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text") indicates, the models were developed and tuned specifically for social media text data. VADER was trained on a thorough set of human-labeled data that included common emoticons, UTF-8-encoded emojis, and colloquial terms and abbreviations (e.g., meh, lol, sux).
For given input text data, vaderSentiment returns a 3-tuple of polarity score percentages. It also provides a single scoring measure, referred to as vaderSentiment's compound metric. This is a real-valued measurement within the range [-1, 1], where sentiment is considered positive for values greater than 0.05, negative for values less than -0.05, and neutral otherwise.
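A minimal sketch of vaderSentiment in use (the package can be installed with pip install vaderSentiment):

    # Score a social-media-style sentence for sentiment polarity.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores("The good news is that it is sunny today! lol")

    # 'pos', 'neu', and 'neg' are the polarity proportions; 'compound'
    # is the single [-1, 1] metric described above.
    print(scores)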
That wraps up this look at how to use Python for social media sentiment analysis. I hope the material above proves helpful; if you found the article worthwhile, feel free to share it with others.