
How to learn the basic operations of the word vector model in NLP (natural language processing) with Python


This article shows you how to perform the basic operations of the word vector model in NLP (natural language processing) with Python. The content is concise and easy to understand, and I hope you will gain something from the detailed introduction that follows.

Overview

Starting today, we begin a journey into natural language processing (NLP). NLP lets machines process, understand, and use human language, bridging the gap between machine language and human language.

Word vector

Let's first talk about what a word vector is. When we hand text to an algorithm, the computer cannot understand the raw text we enter; this is the problem word vectors were born to solve. Simply put, a word vector converts a word into a vector of numbers.

When we describe a person, we use indicators such as height and weight, and these indicators can serve as a vector. Once we have vectors, we can use different methods to compute similarity.
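For illustration, here is a minimal sketch (not from the original article) of computing the cosine similarity of two such feature vectors with NumPy; the "person" vectors are made-up numbers:

import numpy as np

# Two made-up "person" vectors: [height in cm, weight in kg]
person_a = np.array([180.0, 75.0])
person_b = np.array([175.0, 70.0])

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(person_a, person_b) / (np.linalg.norm(person_a) * np.linalg.norm(person_b))
print(cos_sim)  # close to 1.0, i.e. the two "people" are very similar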

So how do we describe the characteristics of a language? We split the text into words and then construct features at the word level.

Word vector dimension

The higher the dimension of the word vector, the more information it can carry, and the more reliable the computed results become.
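The dimension is a choice you make at training time. As a point of reference, a minimal sketch with gensim (assuming gensim 4.x, where the parameter is named vector_size; in 3.x it was called size):

from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences
sentences = [["yellow", "river", "mother", "river"],
             ["yangtze", "river", "china"]]

# vector_size sets the dimension of every word vector (here 50)
model = Word2Vec(sentences, vector_size=50, min_count=1)
print(model.wv["river"].shape)  # (50,)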

50-dimensional word vector:

Shown with a heat map:

From the heat map, we can see that similar words have similar feature representations, which shows that these word features are meaningful.

Word2Vec

Word2Vec is a shallow, two-layer neural network that helps us convert words into vectors. Word2Vec offers two learning methods: CBOW and Skip-Gram.

CBOW model

CBOW (Continuous Bag-of-Words) predicts the middle word from the context words around it. As shown in the figure:
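To make the idea concrete in code, here is a small illustrative sketch (my own, not the article's code) that builds CBOW-style training examples, where the surrounding context words are the input and the middle word is the target:

def cbow_pairs(tokens, window=2):
    # Yield (context_words, center_word) training examples
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        yield context, center

for context, center in cbow_pairs(["the", "yellow", "river", "is", "long"]):
    print(context, "->", center)  # e.g. ['the', 'river', 'is'] -> yellow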

Skip-Gram model

Skip-Gram does the opposite: it uses the current word to predict the words within a certain range before and after it in the same sentence.

Training data set required by Skip-Gram:
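Such pairs can be generated with a small sketch like the following (again illustrative, not the article's code), pairing the current word with each nearby word:

def skipgram_pairs(tokens, window=2):
    # Yield (center_word, context_word) training examples
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

for center, context in skipgram_pairs(["the", "yellow", "river", "is", "long"]):
    print(center, "->", context)  # e.g. yellow -> the, yellow -> river, ...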

Negative sampling model

Once the corpus gets even slightly large, there are simply too many possible outputs. The last layer of the word vector model is effectively a softmax (converting scores into probabilities) over the entire vocabulary, which is very expensive to compute.

Instead, we can change the input into two words and ask whether those two words really form a corresponding input-output pair; that is, we turn training into a binary classification task.

But this raises a problem: a training set built this way has labels that are all 1, so the model cannot be trained properly. This is where the negative sampling model comes in handy: for each true pair, a few randomly sampled words are added as label-0 examples (the default number is 5).
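A minimal sketch of the idea (the real word2vec draws negatives from a smoothed unigram distribution; uniform sampling is used here only to show the principle): each true pair keeps label 1, and k random words are added with label 0:

import random

def with_negative_samples(pairs, vocab, k=5):
    # Turn (center, context) pairs into labeled binary examples
    data = []
    for center, context in pairs:
        data.append((center, context, 1))      # true pair -> label 1
        for _ in range(k):
            neg = random.choice(vocab)
            while neg == context:              # do not sample the true context word
                neg = random.choice(vocab)
            data.append((center, neg, 0))      # random pair -> label 0
    return data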

The training process of the word vector

1. Initialize the word vector matrix

2. Backpropagate through the neural network

Updates are computed by backpropagation through the network. Unusually, not only the weight parameter matrix W but also the input word vectors themselves are updated.
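A sketch of one such update for a single (center, context, label) example under the negative-sampling objective with sigmoid scoring (illustrative assumptions, not gensim internals); note how the input word vector is updated alongside the output weights:

import numpy as np

def sgd_step(W_in, W_out, center, context, label, lr=0.025):
    # One gradient step; rows of W_in are word vectors, rows of W_out are context vectors
    v = W_in[center]                             # input word vector
    u = W_out[context]                           # output (context) vector
    score = 1.0 / (1.0 + np.exp(-np.dot(v, u)))  # sigmoid of the dot product
    g = lr * (score - label)                     # gradient of the logistic loss
    grad_in, grad_out = g * u, g * v             # compute both gradients first
    W_in[center] -= grad_in                      # the input word vector is updated too
    W_out[context] -= grad_out                   # along with the output weights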

Word vector model in practice

Format:

Word2Vec(tokenized, sg=1, window=5, min_count=2, negative=1, sample=0.001, hs=1, workers=4)

Parameters:

sg: 1 selects the Skip-Gram algorithm, which is more sensitive to low-frequency words. The default is sg=0, the CBOW algorithm.

window: the maximum distance between the current word and the target word within a sentence. window=3 means looking at 3-b words before the target word and b words after it (b is random between 0 and 3).

min_count: filters words; words with a total frequency lower than min_count are ignored. The default is 5.

The training model

import jieba
from gensim.models import Word2Vec

# Load the stop word list (note: reading the whole file into set() yields a set of characters)
with open("../stop_words/cn_stopwords.txt", encoding="utf-8") as file:
    stop_word = set(file.read())
print("stop words:", stop_word)  # debug output

# Define the corpus (the original corpus is Chinese text; it is shown here in translation)
content = [
    "The Yangtze River is the largest river in China, with a total length of 6,397 km (taking the Tuotuo River as its source), commonly given as 6,300 km. Its basin covers more than 1.8 million square kilometers, and on average more than 960 billion cubic meters of water flow into the sea each year. By main-stream length and volume of water entering the sea, the Yangtze River ranks third in the world.",
    "The Yellow River, in ancient China simply called 'the River', originates in the Bayan Har Mountains in Qinghai Province of the People's Republic of China, flows through the nine provinces and regions of Qinghai, Sichuan, Gansu, Ningxia, Inner Mongolia, Shaanxi, Shanxi, Henan, and Shandong, and finally empties into the Bohai Sea in Kenli County, Dongying, Shandong Province. Its main stream is 5,464 km long, second only to the Yangtze River and the second-longest river in China. The Yellow River is also the fifth-longest river in the world.",
    "The Yellow River is the mother river of the Chinese nation. As the birthplace of Chinese civilization, it carries the bloodline of the descendants of Yan and Huang and is a symbol of the national spirit and emotion of the Chinese nation.",
    "The Yellow River is known as the mother river of Chinese civilization. The Huaxia people formed and multiplied in the Central Plains along the Yellow River more than 2,000 years BC.",
    "The section of the Yellow River above Hekou Town, Tuoketuo County, Inner Mongolia is the upper reaches of the Yellow River.",
    "According to the different characteristics of the river course, the upper reaches of the Yellow River can be divided into three parts: the source section, the canyon section, and the alluvial plain section.",
    "The Yellow River is the mother river of the Chinese nation.",
]

# Segment each sentence into words
seg = [jieba.lcut(sentence) for sentence in content]

# Remove stop words and punctuation (union | rather than intersection &, so both sets filter)
tokenized = []
for sentence in seg:
    words = []
    for word in sentence:
        if word not in stop_word | {"(", ")"}:
            words.append(word)
    tokenized.append(words)
print(tokenized)  # debug output

# Create the model
model = Word2Vec(tokenized, sg=1, window=5, min_count=2,
                 negative=1, sample=0.001, hs=1, workers=4)

# Save the model
model.save("model")

Output result:

Building prefix dict from the default dictionary...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 1.641 seconds.
Prefix dict has been built successfully.
stop words: {...}  (the full Chinese stop word set, abridged here)
[['Yangtze River', 'is', 'China', 'largest', 'river', ...], ['Yellow River', 'China', 'ancient', ...], ..., ['Yellow River', 'is', 'Chinese nation', 'mother river']]  (the tokenized sentences, abridged)

Use the model

from gensim.models import Word2Vec

# Load the saved model
model = Word2Vec.load("model")

# Judge the similarity between two words
# (in the original code the tokens are Chinese words; they are shown here in translation)
sim1 = model.wv.similarity("Yellow River", "Yangtze River")
print(sim1)
sim2 = model.wv.similarity("Yellow River", "Yellow River")
print(sim2)

# Find the words closest to "Yellow River" + "Mother River" - "Yangtze River"
most_similar = model.wv.most_similar(positive=["Yellow River", "Mother River"],
                                     negative=["Yangtze River"])
print(most_similar)

Output result:

0.20415045
0.99999994
[('km', 0.15817636251449585), ('upstream', 0.15374179184436798), ('entering the sea', 0.15248821675777435), ('mainstream', 0.15130287408828735), ('yes', 0.14548806846141815), ('yes', 0.11208685487508774), ('paragraph', 0.09545847028493881), ('for', 0.0872812420129776), ('Yu', 0.0529477047423172), ('Changhe', 0.02978350967168808)]

The above is how to learn the basic operations of the word vector model in NLP with Python. Have you picked up any new knowledge or skills? If you want to learn more or enrich your knowledge, you are welcome to follow the industry information channel.
