This article introduces how to use the Python Chinese word segmentation library jieba, working through the situations that commonly cause trouble in practice. I hope you read it carefully and learn something useful!
Installing the Python Chinese word segmentation library jieba
Method 1: run conda install jieba in the Anaconda Prompt
Method 2: run pip3 install jieba in a terminal
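To check that the installation succeeded, a minimal sanity check:
import jieba
print(jieba.__version__)  # e.g. '0.42.1'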
1. Word segmentation
1.1 The cut function
cut(sentence, cut_all=False, HMM=True)
Returns a generator; iterate over it to get the segmentation results
lcut(sentence)
Returns the segmentation results directly as a list
import jieba
sentence = 'I love natural language processing'
# create a generator object with jieba.cut
generator = jieba.cut(sentence)
# iterate over the generator and join the segmentation results
words = '/'.join(generator)
print(words)
print result
I/love/natural language/processing
import jieba
print(jieba.lcut('I love Nanhai Middle School'))
print result
"'Me, '' Love','Nanhai Middle School'"
1.2 Segmentation modes
Precise mode: cuts the sentence as precisely as possible
Full mode: outputs every word that could possibly be formed; fast, but redundant
Search engine mode: starts from precise mode, then segments long words again; suited to search engine indexing
import jieba
sentence = 'order data analysis'
print('Precise mode:', jieba.lcut(sentence))
print('Full mode:', jieba.lcut(sentence, cut_all=True))
print('Search engine mode:', jieba.lcut_for_search(sentence))
print result
Precise mode: ['order', 'data analysis']
Full mode: ['order', 'odd number', 'data', 'data analysis', 'analysis']
Search engine mode: ['order', 'data', 'analysis', 'data analysis']
1.3 Part-of-speech tagging
jieba.posseg
import jieba.posseg as jp
sentence = 'I love Python data analysis'
posseg = jp.cut(sentence)
for i in posseg:
    print(i.__dict__)
    # print(i.word, i.flag)
print result
{'word': 'I', 'flag': 'r'}
{'word': 'love', 'flag': 'v'}
{'word': 'Python', 'flag': 'eng'}
{'word': 'data analysis', 'flag': 'l'}
part-of-speech tagging table
a adjective; ad adverbial adjective; ag adjective morpheme; an adjective with nominal function
b discriminative word
c conjunction
d adverb; df adverb (e.g. "do not"); dg adverbial morpheme
e interjection
f locative word
g morpheme
h prefix component
i idiom
mq numeral-classifier compound
n noun; ng noun morpheme; nr person name; nrfg person name; nrt transliterated person name; ns place name; nt organization name; nz other proper noun
o onomatopoeia
p preposition
q quantifier
r pronoun; rg pronoun morpheme
tg time morpheme
u auxiliary; ud structural auxiliary; ug aspect auxiliary; uj possessive auxiliary; ul tense auxiliary; uv adverbial auxiliary; uz aspect auxiliary
v verb; vd adverbial verb; vg verb morpheme; vi intransitive verb; vn nominal verb; vq verb
z status word
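As a quick use of these tags, a minimal sketch that keeps only the noun-like tokens (those whose flag starts with 'n'), reusing the earlier example sentence:
import jieba.posseg as jp
sentence = 'I love Python data analysis'
# keep only tokens whose flag begins with 'n' (the noun-like tags in the table above)
nouns = [w.word for w in jp.cut(sentence) if w.flag.startswith('n')]
print(nouns)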
1.4 Finding where words appear
jieba.tokenize(sentence)
import jieba
sentence = 'order data analysis'
generator = jieba.tokenize(sentence)
for position in generator:
    print(position)
print result
('order', 0, 2)
('data analysis', 2, 6)
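jieba.tokenize also accepts a mode parameter; in 'search' mode the shorter words inside long tokens are emitted with their spans as well. A minimal sketch:
import jieba
sentence = 'order data analysis'
# 'search' mode additionally yields the sub-words of long tokens
for word, start, end in jieba.tokenize(sentence, mode='search'):
    print(word, start, end)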
2. Dictionary
2.1 Default dictionary
import jieba, os, pandas as pd
# dictionary location
print(jieba.__file__)
jieba_dict = os.path.dirname(jieba.__file__) + r'\dict.txt'
# read the dictionary
df = pd.read_table(jieba_dict, sep=' ', header=None)[[0, 2]]
print(df.head())
# turn the dictionary into a dict
dt = dict(df.values)
print(dt.get('Jinan University'))
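Besides reading dict.txt off disk, the frequencies already loaded into memory can be queried with jieba.get_FREQ, which returns None for words not in the dictionary; the queried word below just reuses this article's example:
import jieba
jieba.initialize()  # force the default dictionary to load now
print(jieba.get_FREQ('Jinan University'))  # frequency, or None if absent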
2.2 Adding and deleting words
Add a word to the dictionary:
add_word(word, freq=None, tag=None)
Delete a word (equivalent to add_word(word, freq=0)):
del_word(word)
import jieba
sentence = 'The sky lasts forever and sometimes ends, this hatred will never end'
# add a word
jieba.add_word('time end', 999, 'nz')
print('Add [time end]:', jieba.lcut(sentence))
# delete the word
jieba.del_word('time end')
print('Delete [time end]:', jieba.lcut(sentence))
print result
Add [time end]: ['eternity', 'has', 'time end', 'this hatred endures', 'no', 'end']
Delete [time end]: ['eternity', 'sometimes', 'ends', 'this hatred endures', 'no', 'end']
2.3 Loading a custom dictionary
Create a dictionary file, one entry per line in the format [word frequency part-of-speech], saved with UTF-8 encoding
Load the dictionary with the load_userdict function
import os, jieba
# create a custom dictionary file
my_dict = 'my_dict.txt'
with open(my_dict, 'w', encoding='utf-8') as f:
    f.write('Murong Ziying 9 nr\nYun Tianhe 9 nr\nTianhe Sword 9 nz')
# load the dictionary and test
sentence = 'Murong Ziying forged the Tianhe Sword for Yun Tianhe'
print('Before loading:', jieba.lcut(sentence))
jieba.load_userdict(my_dict)
print('After loading:', jieba.lcut(sentence))
os.remove(my_dict)
print result
Before loading: ['Murong', 'Ziying', 'for', 'Yun', 'Tianhe', 'forged', 'Tianhe', 'Sword']
After loading: ['Murong Ziying', 'for', 'Yun Tianhe', 'forged', 'Tianhe Sword']
2.4 Joining or splitting characters within a word
suggest_freq(segment, tune=False)
import jieba
sentence = 'The sky above is poor and below are the yellow springs, and in two vast places nothing is seen'
print('Before correction:', ' | '.join(jieba.cut(sentence)))
# tell jieba this span should be split into 'biluo' + 'down'
jieba.suggest_freq(('biluo', 'down'), True)
print('After correction:', ' | '.join(jieba.cut(sentence)))
print result
Before correction: upper poor | bi | fall down | yellow springs | , | two | vast | all | gone
After correction: upper poor | biluo | down | yellow springs | , | two | vast | all | gone
3. How jieba segmentation works
Based on the dictionary, scan the sentence to build a directed acyclic graph (DAG) of all possible word formations.
Compute the maximum-probability path backwards over the DAG with dynamic programming (log probabilities prevent underflow and turn multiplication into addition).
Read the maximum-probability segmentation sequence off that path.
import jieba
sentence = 'Central Primary School Holiday'
DAG = jieba.get_DAG(sentence)
print(DAG)
route = {}
jieba.calc(sentence, DAG, route)
print(route)
DAG
{0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3], 4: [4, 5], 5: [5]}
maximum probability path
{6: (0, 0), 5: (-9.4, 5), 4: (-12.6, 5), 3: (-20.8, 3), 2: (-22.5, 3), 1: (-30.8, 1), 0: (-29.5, 3)}
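To see how this route table becomes the final segmentation, here is a minimal sketch of the walk jieba's precise mode performs internally, assuming the sentence and route from the block above; route[i][1] is the end index of the most probable word starting at position i:
# walk the route table from position 0 to recover the best segmentation
words, i = [], 0
while i < len(sentence):
    j = route[i][1] + 1  # one past the last character of the best word at i
    words.append(sentence[i:j])
    i = j
print(words)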
4. Recognizing words that contain spaces
Example: make a word that contains a space, such as Blade Master, come out as a single token
import jieba, re
sentence = 'Blade Master Gale Assassination Archmage'
jieba.add_word('Blade Master')  # add the word
print('Before modification:', jieba.lcut(sentence))
jieba.re_han_default = re.compile('(.+)', re.U)  # modify the pattern so spaces stay inside a block
print('After modification:', jieba.lcut(sentence))
print result
Before modification: ['Blade', 'Master', 'Gale', 'Assassination', 'Archmage']
After modification: ['Blade Master', 'Gale', 'Assassination', 'Archmage']
5. Other features
5.1 Parallel word segmentation
Runtime environment: Linux
Enable parallel segmentation mode, where the parameter n is the number of worker processes (see the sketch below): jieba.enable_parallel(n)
Disable parallel segmentation mode: jieba.disable_parallel()
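A minimal usage sketch, assuming a Linux machine; the input file name is just a placeholder:
import jieba
jieba.enable_parallel(4)  # 4 worker processes
text = open('big_corpus.txt', encoding='utf-8').read()  # placeholder file
words = jieba.lcut(text)
jieba.disable_parallel()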
5.2 Keyword extraction
Based on TF-IDF: jieba.analyse.extract_tags
Based on TextRank: jieba.analyse.textrank
import jieba.analyse as ja, jieba
text = "Liu Mengli cast a spell to break the Fox Fairy's spell"
jieba.add_word('Liu Mengli', tag='nr')
keywords1 = ja.extract_tags(text, allowPOS=('n', 'nr', 'ns', 'nt', 'nz'))
print('Based on TF-IDF:', keywords1)
keywords2 = ja.textrank(text, allowPOS=('n', 'nr', 'ns', 'nt', 'nz'))
print('Based on TextRank:', keywords2)
print result
Based on TF-IDF: ['Liu Mengli', 'Fox Fairy', 'spell']
Based on TextRank: ['Fox Fairy', 'Liu Mengli', 'spell']
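extract_tags also takes topK and withWeight parameters: topK caps the number of keywords returned, and withWeight=True returns (word, weight) pairs. A minimal sketch reusing the text above:
import jieba.analyse as ja
text = "Liu Mengli cast a spell to break the Fox Fairy's spell"
# topK limits the number of keywords; withWeight=True also returns each TF-IDF weight
for word, weight in ja.extract_tags(text, topK=5, withWeight=True):
    print(word, round(weight, 3))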
5.3 Modifying HMM parameters
import jieba
text = "Liu Mengli's dream interpretation C method"
print(jieba.lcut(text, HMM=False))  # HMM disabled
print(jieba.lcut(text))  # HMM enabled: the new word 'Liu Mengli' is recognized
jieba.finalseg.emit_P['B']['C'] = -1e-9  # Begin state
print(jieba.lcut(text))
jieba.finalseg.emit_P['M']['Meng'] = -100  # Middle state
print(jieba.lcut(text))
jieba.finalseg.emit_P['S']['Meng'] = -.1  # Single state
print(jieba.lcut(text))
jieba.finalseg.emit_P['E']['Meng'] = -.01  # End state
print(jieba.lcut(text))
jieba.del_word('Liu Meng')  # Force_Split_Words
print(jieba.lcut(text))
print result
"'Liu ', ' Meng','Li',' Dream interpretation','C',' Fa ']
"'Liu Mengli','Interpreting Dreams',' C','Fa']
"'Liu Mengli','Interpreting Dreams',' C','Fa']
"'Liu ', ' Mengli','Interpreting Dreams',' C','Fa']
"'Liu ', ' Meng','Li',' Dream interpretation','C',' Fa ']
"'Liu Meng','Li',' Interpreting Dreams','C',' Fa ']
"'Liu ', ' Meng','Li',' Dream interpretation','C',' Fa ']
"How to use python Chinese participle library jieba" content is introduced here, thank you for reading. If you want to know more about industry-related knowledge, you can pay attention to the website. Xiaobian will output more high-quality practical articles for everyone!