2025-04-10 Update From: SLTechnology News&Howtos shulou
Shulou(Shulou.com)06/02 Report--
This article explains what NLP can be used for by walking through a text analysis of a movie script with spaCy. The explanation is kept simple and easy to follow.
Processing data
The data used in the experiment (usually called the corpus in NLP) is a movie script. Before using this data, however, some filtering is needed. Descriptions of thoughts, actions and scenes, as well as the character name in front of each line (which only indicates the speaker and is not part of the text to analyse), are not the objects of this study. So sentences like "Thanos crushes the Tesseract, revealing the blue Space Stone..." have been deleted.
In addition, words such as "I", "you" and "an" that are marked as stop words (very common words, mostly articles, pronouns, prepositions, adverbs or conjunctions) are excluded during the spaCy processing step. Only the base form of each word, its lemma, is used in the experiment. For example, "talk", "talked" and "talking" are different forms of the same verb, so the lemma of each of them is "talk".
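The article does not show the filtering step itself. The following is a minimal sketch of one way to do it, assuming a hypothetical raw format in which every spoken line looks like `NAME: dialogue` and every other line is a scene or action description. The pattern and the helper name `clean_script` are illustrative; the real script's layout may differ.

```python
import re

# Assumed format: "SPEAKER NAME: spoken text". Anything else is treated
# as a scene/action description and dropped. This pattern is a guess,
# not the layout of the actual script.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z .'-]*):\s*(.+)$")


def clean_script(raw_text):
    """Keep only the spoken text, without the speaker names."""
    spoken = []
    for line in raw_text.splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if match:
            spoken.append(match.group(2))
    return "\n".join(spoken)


sample = (
    "Thanos crushes the Tesseract, revealing the blue Space Stone...\n"
    "THANOS: I'm sorry, little one.\n"
    "EBONY MAW: Hear me, and rejoice.\n"
)
print(clean_script(sample))
```

The description line is dropped because it does not start with an all-caps name followed by a colon; only the two spoken lines survive.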
To process a piece of text in spaCy, you first load a language model and then call the model on the text corpus. The result is a Doc object that holds all of the processed text.

import spacy

# load a medium-sized language model
nlp = spacy.load("en_core_web_md")

with open('cleaned-script.txt', 'r') as file:
    text = file.read()

doc = nlp(text)

Creating a Doc object in spaCy
The result is a processed corpus with a high proportion of useful information. Now the experiment can begin!
The top ten verbs, nouns, adverbs and adjectives most frequently used in the whole movie
Is it possible to infer the overall direction and plot of the film simply from the verbs that appear most frequently? The chart below suggests that it is.
"I know" and "you think" are the most common phrases.
The ten most frequent verbs are "know", "go", "come", "get", "think", "tell", "kill", "need", "stop" and "want". What can be inferred from this? Since the film was released in 2018, most audiences already know the story it tells. From these verbs alone, one can infer that Avengers: Infinity War is about knowing, thinking and working out how to stop something or someone.
The verbs can be counted with the following code:
import spacy

# load a medium-sized language model
nlp = spacy.load("en_core_web_md")

with open('cleaned-script.txt', 'r') as file:
    text = file.read()

doc = nlp(text)

# map each lemma to its frequency count
pos_count = {}
for token in doc:
    # ignore stop words
    if token.is_stop:
        continue
    # pos should be one of these:
    # 'VERB', 'NOUN', 'ADJ' or 'ADV'
    if token.pos_ == 'VERB':
        if token.lemma_ in pos_count:
            pos_count[token.lemma_] += 1
        else:
            pos_count[token.lemma_] = 1

print("top 10 VERBs {}".format(sorted(pos_count.items(), key=lambda kv: kv[1], reverse=True)[:10]))
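The same tally can be written more compactly with `collections.Counter` from the standard library and generalised to any part-of-speech tag. This is only a sketch: the helper name `top_lemmas` is hypothetical, and it relies solely on the `is_stop`, `pos_` and `lemma_` attributes that spaCy tokens expose. Stand-in tokens are used here so the example runs without a language model.

```python
from collections import Counter, namedtuple


def top_lemmas(tokens, pos, n=10):
    """Return the n most common lemmas with the given coarse POS tag.

    `tokens` can be a spaCy Doc or any iterable of objects exposing
    `is_stop`, `pos_` and `lemma_`.
    """
    counts = Counter(
        token.lemma_
        for token in tokens
        if not token.is_stop and token.pos_ == pos
    )
    return counts.most_common(n)


# Stand-in tokens for illustration only; with spaCy you would pass `doc`.
FakeToken = namedtuple("FakeToken", ["lemma_", "pos_", "is_stop"])
tokens = [
    FakeToken("know", "VERB", False),
    FakeToken("stone", "NOUN", False),
    FakeToken("know", "VERB", False),
    FakeToken("go", "VERB", False),
    FakeToken("have", "VERB", True),  # flagged as a stop word, so skipped
]
print(top_lemmas(tokens, "VERB"))
print(top_lemmas(tokens, "NOUN"))
```

Passing "NOUN", "ADJ" or "ADV" as the second argument gives the other three rankings discussed in this section without duplicating the loop.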
So do adverbs, which describe verbs, tell a similar story?
"I seriously don't know how you fit your head into that helmet." -- Doctor Strange
For a movie about stopping Thanos from destroying half the universe, there are surprisingly many positive adverbs, such as "right", "exactly" and "better".
Now that the actions and their descriptions are known, it's time to look at the nouns.
"You will pay for his life with yours. Thanos will have that stone." -- Proxima Midnight
Unsurprisingly, "stone" appears most often; after all, the whole film revolves around the stones. The second most frequent noun is "life", which Thanos wants to destroy, followed by "time", which the Avengers don't have much of (note: this may also be because "the Time Stone" is mentioned many times in the movie).
Before moving on to the next experiment, consider adjectives, the words that describe nouns. As with adverbs, there are positive words such as "good" and "right", and affirmative words such as "okay" and "sure".
"I'm sorry, little one." -- Thanos
Verbs and nouns most frequently used by specific characters
The previous charts list the most common verbs and nouns in the movie. Although these results give a certain sense of the overall feel and plot of the film, they do not say much about the individual experience of each character. Therefore, the same procedure as before is applied to the lines of individual characters to find the ten verbs and nouns each of them uses most.
Because there are many characters in the movie, only some of those with a large number of lines are selected for this experiment: Iron Man, Doctor Strange, Gamora, Thor, Rocket Raccoon, Star-Lord, Ebony Maw and Thanos. Sorry, Captain America didn't make the cut.
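Per-character analysis needs each character's lines collected separately. Here is a sketch of that grouping, again assuming a hypothetical `NAME: dialogue` layout for the raw script. The helper `lines_by_speaker` is illustrative and not taken from the author's code.

```python
from collections import defaultdict


def lines_by_speaker(raw_text):
    """Map each speaker to the list of lines they say.

    Assumes every spoken line has the form "NAME: dialogue"; lines
    without a colon are ignored.
    """
    grouped = defaultdict(list)
    for line in raw_text.splitlines():
        name, sep, dialogue = line.partition(":")
        if sep and dialogue.strip():
            grouped[name.strip()].append(dialogue.strip())
    return dict(grouped)


sample = (
    "THANOS: I'm sorry, little one.\n"
    "EBONY MAW: Hear me, and rejoice.\n"
    "GROOT: I am Groot.\n"
    "GROOT: I am Groot.\n"
)
for speaker, lines in lines_by_speaker(sample).items():
    print(speaker, len(lines))
```

Each character's list can then be joined into one text and passed to `nlp(...)`, after which the same counting procedure applies. A description line that happens to contain a colon would be misread as dialogue by this simple split, so a real script may need a stricter rule.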
The following figure shows the 10 nouns most frequently used by these characters.
Why on earth does Star-Lord call out Drax so often?
Unexpectedly, in most cases the nouns our dear heroes mention most are the names of their companions. For example, Iron Man says "kid" nine times, Rocket Raccoon calls Quill (Star-Lord) three times, and Star-Lord calls (actually yells at) Drax seven times.
On closer inspection, one can infer what matters most to each character. For Iron Man, for example, the counts show that Earth is very important to him. Gamora is similar: she keeps talking about larger entities such as "life", "universe" and "planet", and gives her life for them. Doctor Strange repeatedly mentions a goal that differs from the other heroes': protecting the Time Stone. Thor, given his deep grudge against Thanos, mentions Thanos by name as many as eight times, along with his new friend "rabbit", as he calls Rocket Raccoon. The chart for Thanos shows that he keeps talking about collecting all the stones and addresses his daughter many times.
Nouns carry meaning, but verbs do not characterise the speakers as clearly as nouns do. As the next chart shows, verbs are far less expressive than nouns: generic words such as "know", "want" and "get" appear frequently for everyone. However, Ebony Maw, Thanos's number one fan, may have the most distinctive verbs in the entire corpus. Ebony Maw is like a loyal servant: apart from trying to get the Time Stone, his main job is to preach his master's mission with words such as "hear" and "rejoice". Now that's flattery.
"Hear me, and rejoice. You have had the privilege of being saved by the Great Titan..." -- Ebony Maw
Finally, an Easter egg: what Groot says most is
"I am Groot."
Named entities
So far, we have explored the whole film and the verbs, nouns, adverbs and adjectives most commonly used by its heroes and villains. However, to fully understand the words studied so far, some context is needed, and that context comes from named entities.
According to the spaCy documentation, a named entity is a "real-world object" that is assigned a name, for example a person, a country, a product or a book title. So knowing these entities means knowing what the characters are talking about. In spaCy, each entity carries a predicted label that classifies it as a person, a product, a work of art and so on (https://spacy.io/api/annotation#named-entities), which provides an additional level of granularity and helps classify entities further. To keep things simple, however, this experiment uses the entity text itself rather than the entity label.
These are the 30 entities that occur most often.
"MATEFAYA HU" is the battle cry of Wakanda's Jabari tribe warriors.
First of all, since the whole movie is about Thanos, it makes sense that Thanos appears most often. He is followed by his daughter Gamora, one of the core characters in the film. In third place is Groot (no explanation needed), followed by Iron Man and the other Avengers, as well as locations such as New York, Asgard and Wakanda (Wakanda forever). Besides hero names and places, "six" (from "six Infinity Stones"), "the Time Stone" and "the Soul Stone" appear in 14th, 15th and 16th place respectively. Unexpectedly, the Mind Stone, which drew Thanos to Earth, is not in the top 30.
You can read the named entities of a Doc through its 'ents' attribute with the following code:

import spacy

# load a medium-sized language model
nlp = spacy.load("en_core_web_md")

with open('cleaned-script.txt', 'r') as file:
    text = file.read()

doc = nlp(text)

# create an entity frequency map
entities = {}
# named entities
for ent in doc.ents:
    # count each occurrence of the entity text
    if ent.text in entities:
        entities[ent.text] += 1
    else:
        entities[ent.text] = 1

print("top entities {}".format(sorted(entities.items(), key=lambda kv: kv[1], reverse=True)[:30]))
Similarity between the characters' lines
When discussing the verbs each character uses most, we noticed that the verbs are very similar across characters and convey the same feeling, unlike the nouns. Words like "go" and "come" suggest that the characters want to go to or arrive at a particular place, while verbs like "kill" and "stop" imply that there is a huge threat that must be stopped.
Given this result, the next step is to compute a score that measures how similar each character's dialogue is to the others'.
In NLP, similarity is a measure of how close two pieces of text are in structure, syntax or meaning. Usually the similarity score lies between 0 and 1, where 0 means completely different and 1 means completely similar (or that the two texts are identical). Technically, similarity is calculated by measuring the distance between word vectors (multidimensional representations of words). If you want to learn more about word vectors, a good starting point is word2vec, a common algorithm for generating them. The picture below is the similarity matrix of each character's dialogue.
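To make the distance idea concrete, here is a small sketch of cosine similarity, the measure spaCy's `Doc.similarity` uses by default over averaged word vectors. The three-dimensional vectors below are made up for illustration; real word vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1
v3 = [3.0, 0.0, -1.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0. In general cosine similarity can also be negative, which is why the 0 to 1 range above is described as the usual case rather than a guarantee.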
This picture shows once again that Ebony Maw really is the most distinctive character.
This result is both surprising and unsurprising. On the one hand, since the film has only one main plot, it is understandable that the shared subject matter pushes the similarity of all the characters' dialogue close to 1. On the other hand, the scores are unexpectedly close. The expectation was that at least Thanos's dialogue would be less similar to the heroes', since he is the villain the other characters keep talking about stopping. At least Spider-Man's similarity scores vary; after all, he's just a kid who was pulled in to save the world on his way to school, so that is not surprising.
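A similarity matrix like the one discussed above can be assembled by comparing every pair of characters. In this sketch the helper `similarity_matrix` and the toy word-overlap (Jaccard) scorer are stand-ins, so the example runs without a language model. With spaCy you would pass processed Docs and `lambda a, b: a.similarity(b)` instead.

```python
def similarity_matrix(items, similarity):
    """Build a symmetric {name: {name: score}} table for all pairs."""
    names = list(items)
    matrix = {name: {} for name in names}
    for i, first in enumerate(names):
        for second in names[i:]:
            score = similarity(items[first], items[second])
            matrix[first][second] = score
            matrix[second][first] = score
    return matrix


def word_overlap(a, b):
    """Toy stand-in scorer: Jaccard overlap of lower-cased words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


lines = {
    "Groot": "I am Groot.",
    "Ebony Maw": "Hear me, and rejoice.",
    "Thanos": "I'm sorry, little one.",
}
matrix = similarity_matrix(lines, word_overlap)
for name, row in matrix.items():
    print(name, row)
```

The helper computes each pair once and mirrors the score, which assumes the scoring function is symmetric. That holds for cosine similarity as well as for the toy scorer used here.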
The following code demonstrates how to calculate the similarity between two characters' lines in spaCy:

# for the full example on how I obtained all the similarities
# see the full code at: https://github.com/juandes/infinity-war-spacy/blob/master/script.py
import spacy

# load a medium-sized language model
nlp = spacy.load("en_core_web_md")

with open('tony-script.txt', 'r') as file:
    tony_lines = file.read()

with open('thor-script.txt', 'r') as file:
    thor_lines = file.read()

tony_doc = nlp(tony_lines)
thor_doc = nlp(thor_lines)

similarity_score = tony_doc.similarity(thor_doc)
print("Similarity between Tony's and Thor's docs is {}".format(similarity_score))

That concludes this look at what NLP can be used for. How well it serves a specific use still has to be verified in practice.