Positive and Negative Monitoring of Online Public Opinion Based on Semantic Features
Annie Qi
Research fellow of Youjie Cinda Technology
In the previous article, "Methods for identifying positive and negative information of online public opinion" (see http://www.eucita.com/blog for details), I drew on my research work at http://www.eucita.com to introduce "polarity classification" in sentiment analysis, which is closely tied to identifying positive and negative public opinion. This article continues that theme. It describes specific methods for identifying positive and negative information and analyzes their advantages and disadvantages, to help you understand how popular information processing systems on the market work, such as "public opinion monitoring", "word-of-mouth monitoring" and "consumer research".
To recap the previous article: technology-leading public opinion monitoring companies, including Youjie Cinda Technology, identify whether online reviews and other information are positive or negative through a step called polarity classification. Polarity classification first extracts the words that carry an emotional tendency, a step known as feature extraction. Put simply, the computer judges whether an article leans positive or negative by extracting the positive and negative words in its sentences.
According to research and surveys by UJT, the main feature extraction techniques currently used in the industry fall into two groups: those "based on semantics" and those "based on word occurrence and frequency". This article focuses on the semantic approach. The next article will introduce the approach based on word occurrence and frequency and discuss the advantages and disadvantages of each.
Feature extraction based on semantic features works from the meaning of words: it distinguishes positive from negative expressions in a sentence according to their literal meaning. Three methods are representative of this approach: manual construction of a sentiment lexicon, the PMI-IR algorithm, and the synonym and antonym method.
1. Manual construction of a sentiment lexicon
The feature extraction method proposed by Tetsuya Nasukawa and Jeonghee Yi in 2003 is one of the prototypes of the semantic approach. They determine sentiment by using natural language processing to identify the semantic relationship between a specific topic word and the sentiment expressions associated with it. The method works as follows:
As a first step, they manually constructed a sentiment lexicon with 3513 entries. Each entry records the polarity, the part-of-speech tag, and the sentiment word in its standard form (for example, "good" is marked as positive and "bad" is marked as negative). If the sentiment word is a verb and the sentiment is conveyed through that verb, the object of the verb is covered as well. For example, take the sentence "Youjie Cinda Technology is committed to meeting customer needs with high-tech products." If "committed" is in the lexicon and marked as positive, then "meeting customer needs with high-tech products" is treated as positive information.
Second, they used software tools (part-of-speech tagging and a sentence structure parser) to identify phrase boundaries and local dependencies. In the sentence "I like to play ball!", for example, the phrase boundaries can be identified as "play", "like to play" and "I like to play", and the dependencies between phrases can be analyzed: the object of "play" is "ball", and the object of "like" is "to play ball". For each sentence they extract only one representative sentiment word, which is a limitation when a sentence contains several sentiment words.
The third step is to look up the extracted sentiment word in the manually constructed lexicon to find the matching entry and its positive or negative polarity. This completes the polarity judgment for a text fragment.
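The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the four-entry `LEXICON`, the `NEGATIONS` set and the function name `sentence_polarity` are all hypothetical, the parser is replaced by simple whitespace tokenization, and negation is handled by a naive "previous word" rule.

```python
# Hypothetical miniature sentiment lexicon: word -> (part of speech, polarity).
# The lexicon described in the article had 3513 hand-built entries.
LEXICON = {
    "good": ("adjective", "positive"),
    "bad": ("adjective", "negative"),
    "like": ("verb", "positive"),
    "hate": ("verb", "negative"),
}

# Assumed negation words; a real system would rely on the parser instead.
NEGATIONS = {"not", "never", "no"}


def sentence_polarity(tokens):
    """Return the polarity of the first sentiment word found, or None.

    Only one sentiment word per sentence is used, mirroring the
    limitation noted in the text. A negation word directly before
    the sentiment word flips its polarity.
    """
    for i, token in enumerate(tokens):
        entry = LEXICON.get(token.lower())
        if entry is None:
            continue
        polarity = entry[1]
        if i > 0 and tokens[i - 1].lower() in NEGATIONS:
            polarity = "negative" if polarity == "positive" else "positive"
        return polarity
    return None  # no lexicon hit: the sentence stays unclassified


print(sentence_polarity("I like to play ball".split()))
print(sentence_polarity("The service was not good".split()))
print(sentence_polarity("The package arrived on Tuesday".split()))
```

The last call returns `None` because no word in the sentence is in the lexicon, which is the same mechanism that keeps recall low when the lexicon is incomplete.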
With this method, their experiments reached a precision of about 75%-95%, but recall was relatively low, only 20%-25%. In other words, the results the system returns are very accurate, but a large share of the relevant data is never captured.
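To make the gap between the two figures concrete, here is a small worked example in Python. The accuracy figure above is read here as precision, which is an assumption, and the counts are invented for illustration; they are not taken from the experiments.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from raw counts."""
    flagged = true_positives + false_positives    # everything the system labeled
    relevant = true_positives + false_negatives   # everything it should have labeled
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical corpus: 1000 sentences carry sentiment, the system labels
# 250 of them, and 225 of those labels are correct.
precision, recall = precision_recall(
    true_positives=225, false_positives=25, false_negatives=775
)
print(f"precision={precision:.1%}, recall={recall:.1%}")
```

With these made-up counts, nine out of ten labels are right, yet more than three quarters of the sentiment-bearing sentences are missed.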
Because it relies on a manually built sentiment lexicon, this algorithm can analyze the polarity of adjectives, adverbs, nouns and verbs. It can also handle negative and passive sentences. Moreover, it not only identifies positive and negative sentiment but also extracts the topic that the sentiment refers to.
However, the system also has several obvious weaknesses. First, it requires a great deal of manual work: building the lexicon by hand becomes a huge workload when large amounts of data need to be analyzed. Second, although it can handle negative and passive sentences, it may misjudge more complex syntactic structures such as double negatives. Third, because of the low recall, the system cannot reliably distinguish descriptions of objective facts from expressions of subjective emotion. Recall is low because the lexicon is entered manually, and it is impossible to enter every sentiment word by hand.
2. PMI-IR algorithm
The PMI-IR algorithm was designed by Turney in 2002. Compared with the first method of manually constructing a sentiment lexicon, its feature selection is basically the same, but it involves far less manual work. It can also classify a whole text, rather than just a small fragment, to extract positive and negative information about a topic.
Turney uses the PMI-IR algorithm to determine the positive or negative tendency of words and phrases. He evaluated 410 reviews and achieved an average accuracy of 74%. The basic idea is to extract subjective phrases whose polarity is still unknown and to measure how strongly each one is associated with the two polarities. The polarity assigned to an article as a whole depends on the average semantic orientation score (SO) of all the adjective and adverb phrases in the article.
The specific steps are as follows:
First, Turney tagged every review with part-of-speech tags. The tags of each pair of adjacent words are then matched against a set of rules (the full rule table is too complex to reproduce here), and a pair whose tags conform to a rule is extracted as a sentiment phrase.
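A rough idea of this extraction step can be given in Python. The sketch assumes the text has already been tagged with Penn Treebank style tags, and the three patterns in `PATTERNS` are a deliberately simplified stand-in chosen for illustration; they are not the full rule table, which also places conditions on the following word.

```python
# Simplified, assumed patterns: (prefix of first tag, prefix of second tag).
# JJ = adjective, NN = noun, RB = adverb, VB = verb. Prefix matching means
# that variants such as JJR, NNS or VBD are accepted as well.
PATTERNS = [("JJ", "NN"), ("RB", "JJ"), ("RB", "VB")]


def extract_phrases(tagged):
    """Return adjacent word pairs whose tags match one of PATTERNS.

    `tagged` is a list of (word, tag) tuples for one review.
    """
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if any(t1.startswith(a) and t2.startswith(b) for a, b in PATTERNS):
            phrases.append(f"{w1} {w2}")
    return phrases


# Hypothetical, hand-tagged fragment of a car review.
tagged_review = [
    ("the", "DT"), ("very", "RB"), ("comfortable", "JJ"),
    ("seats", "NNS"), ("impressed", "VBD"), ("me", "PRP"),
]
print(extract_phrases(tagged_review))
```

Here "very comfortable" (adverb plus adjective) and "comfortable seats" (adjective plus noun) are extracted, while the other adjacent pairs are ignored.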
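In the second step, the association between each extracted sentiment phrase and each reference word is measured with pointwise mutual information (PMI). Turney used "excellent" as the positive reference word and "poor" as the negative one. The pointwise mutual information of two words is calculated as follows:

$$\mathrm{PMI}(\text{word}_1, \text{word}_2) = \log_2 \frac{p(\text{word}_1 \,\&\, \text{word}_2)}{p(\text{word}_1)\, p(\text{word}_2)}$$

Here $p(\text{word}_1 \,\&\, \text{word}_2)$ is the probability that the two words occur together, and $p(\text{word}_1)$ and $p(\text{word}_2)$ are the probabilities that each occurs on its own.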
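In the third step, the semantic orientation score SO of a phrase $w$ is obtained as the difference between its PMI with the positive reference word "excellent" and its PMI with the negative reference word "poor":

$$\mathrm{SO}(w) = \mathrm{PMI}(w, \text{"excellent"}) - \mathrm{PMI}(w, \text{"poor"})$$

A score above zero marks the phrase as positive and a score below zero marks it as negative, which completes the automatic classification.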
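The score can be estimated from co-occurrence counts, for example hit counts returned by a search index. The Python sketch below is a minimal illustration under stated assumptions: the function names, the `smoothing` constant and all the counts are hypothetical, and the corpus size cancels out when the two PMI values are subtracted, so it does not appear in the code.

```python
import math


def semantic_orientation(hits_near_pos, hits_near_neg, hits_pos, hits_neg,
                         smoothing=0.01):
    """SO(w) = PMI(w, positive ref) - PMI(w, negative ref), from counts.

    hits_near_pos / hits_near_neg: how often the phrase co-occurs with the
    positive / negative reference word. hits_pos / hits_neg: how often each
    reference word occurs at all. `smoothing` is an assumed small constant
    that avoids division by zero when a co-occurrence count is 0.
    """
    return math.log2(
        ((hits_near_pos + smoothing) * hits_neg)
        / ((hits_near_neg + smoothing) * hits_pos)
    )


def classify_review(phrase_scores):
    """Label a review by the average SO of its extracted phrases."""
    if not phrase_scores:
        return "neutral", 0.0
    average = sum(phrase_scores) / len(phrase_scores)
    label = "positive" if average > 0 else "negative" if average < 0 else "neutral"
    return label, average


# Hypothetical counts for three phrases from one review.
scores = [
    semantic_orientation(800, 100, 1_000_000, 1_000_000),
    semantic_orientation(300, 150, 1_000_000, 1_000_000),
    semantic_orientation(50, 400, 1_000_000, 1_000_000),
]
print(classify_review(scores))
```

Because the result is a number rather than a label, the same average can also be read as a rough measure of how strongly positive or negative the review is.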
Turney's algorithm does not require any manual labeling. More importantly, because the semantic orientation score SO(w) is a numerical value, the algorithm can not only tell positive from negative sentiment by the sign of the score but also gauge its intensity: the higher the value, the stronger the positive sentiment. This helps customers evaluate how strongly positive or negative a piece of online public opinion is. The online public opinion and word-of-mouth monitoring of Youjie Cinda Technology uses this algorithm to help assess the intensity of public opinion.
However, the algorithm is computationally expensive and therefore needs substantial server resources. In the conclusion of his paper, Turney also points out that accuracy on film reviews is lower than on car reviews. The main reason is that the sentiment expressions in a film review do not all evaluate the quality of the film; some describe emotions in the plot itself, as in comedies and tragedies. This is really a problem of identifying what is being commented on, and Turney's method does not handle it well.
3. Synonyms and antonyms
The synonym and antonym method is an algorithm proposed by Minqing Hu and Bing Liu in 2004. It assigns a polarity to every subjective evaluation sentence or paragraph that the system extracts, and it effectively reduces the lookup burden on the network.
First, when they find a sentiment word in a sentence, they classify it by consulting a lexical database (WordNet). They search through the synonyms and antonyms of the word until they reach a word whose sentiment is already known. The newly discovered word is then given the same polarity as its synonyms and the opposite polarity to its antonyms. For example, suppose the system finds the sentiment word "doting". A database search shows that "love" is a synonym of "doting", and "love" is marked as positive, so "doting" is marked as positive too.
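The lookup described above amounts to a search through synonym and antonym links. The Python sketch below illustrates the idea under stated assumptions: the small `SYNONYMS`, `ANTONYMS` and `SEEDS` dictionaries are hypothetical stand-ins for a real lexical database, and the breadth-first search is one possible traversal, not necessarily the one the authors used.

```python
from collections import deque

# Hypothetical miniature thesaurus standing in for a lexical database.
SYNONYMS = {
    "doting": {"love"},
    "love": {"doting", "adore"},
    "adore": {"love"},
    "awful": {"terrible"},
    "terrible": {"awful"},
}
ANTONYMS = {
    "love": {"hate"},
    "hate": {"love"},
    "awful": {"wonderful"},
    "wonderful": {"awful"},
}
# Words whose polarity is already known: +1 positive, -1 negative.
SEEDS = {"love": 1, "terrible": -1}


def infer_polarity(word):
    """Return +1, -1, or 0 (unknown) by following synonym/antonym links.

    `sign` is flipped each time an antonym link is crossed, so the seed's
    polarity multiplied by `sign` gives the polarity of the starting word.
    """
    queue = deque([(word, 1)])
    seen = {word}
    while queue:
        current, sign = queue.popleft()
        if current in SEEDS:
            return sign * SEEDS[current]
        for nxt in SYNONYMS.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, sign))
        for nxt in ANTONYMS.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, -sign))
    return 0  # no known word reachable


for w in ("doting", "hate", "wonderful", "table"):
    print(w, infer_polarity(w))
```

In this toy data, "doting" inherits the positive polarity of its synonym "love", "hate" receives the opposite polarity as an antonym of "love", and "table" stays unknown because it has no links at all.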
Second, as in the two methods described earlier, they classify the polarity of each sentence from the sentiment words that appear in it. The semantic orientation of the whole sentence is a simple weighted average of the orientations of its sentiment words. In the example above, if "doting" appears in a sentence and there are no other sentiment words, the sentence can be considered positive from the perspective of online public opinion.
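A weighted average of this kind might look as follows in Python. The word polarities, the optional `weights` argument and the function name are all assumptions made for the illustration; with no weights given, every sentiment word counts equally.

```python
def sentence_orientation(tokens, word_polarity, weights=None):
    """Average the polarity (+1 / -1) of the sentiment words in a sentence.

    `word_polarity` maps known sentiment words to +1 or -1. `weights`
    optionally maps words to a positive weight (default 1.0 each).
    Returns a (label, score) pair.
    """
    weights = weights or {}
    scored = [(word_polarity[t], weights.get(t, 1.0))
              for t in tokens if t in word_polarity]
    total_weight = sum(w for _, w in scored)
    if not scored or total_weight == 0:
        return "neutral", 0.0
    score = sum(p * w for p, w in scored) / total_weight
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return label, score


# Hypothetical polarities, e.g. produced by a synonym/antonym lookup.
polarity = {"doting": 1, "love": 1, "awful": -1}

print(sentence_orientation("doting grandparents spoil the children".split(), polarity))
print(sentence_orientation("awful food but we love the view".split(), polarity))
```

The second sentence contains one negative and one positive word of equal weight, so the average is zero and the sentence is labeled neutral; passing a larger weight for one of the words would tip the result.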
The precision of this method is 56%-79%, and recall can reach 67%-80%. They improved only the way sentiment words are collected, not the way the semantic orientation score SO is calculated. Even so, because the method does not have to search for an exact match for every word and instead judges polarity through synonyms and antonyms, it greatly reduces the burden on the network.
The principle behind all three methods is simple: the computer determines whether the relevant words are positive or negative and then aggregates the results. However, this semantic approach has many problems that cannot be fully solved, the workload is large, and the experimental accuracy and recall are not high enough. For this reason, researchers have developed another kind of feature extraction: feature extraction based on the occurrence patterns of words. This method ignores the semantics of words and focuses on the polarity of words that occur more frequently. Although this statistical approach may seem counterintuitive, it has attracted growing attention in the industry because it performs very well on complex syntactic structures and complex expressions.
This method, which is harder to grasp but performs well, will be described in detail in the next article. You can also learn more by visiting http://www.eucita.com.