
How to use the Python Chinese word segmentation library jieba


This article introduces how to use the Python Chinese word segmentation library jieba. It walks through installation, segmentation, dictionaries, the underlying algorithm, and a few extras, with runnable examples for the situations you are likely to meet in practice. I hope you read carefully and learn something!

Install the Python Chinese word segmentation library jieba

Method 1: in the Anaconda Prompt, run conda install jieba

Method 2: in a terminal, run pip3 install jieba
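Either way, a quick import confirms the installation (a minimal sketch; the version string shown in the comment is only an example):

import jieba

# Print the installed version to confirm the package is importable
print(jieba.__version__)  # e.g. '0.42.1'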

1. Word segmentation

1.1 The cut function

cut(sentence, cut_all=False, HMM=True)

Returns a generator; iterate over it to get the segmentation result.

lcut(sentence)

Returns the segmentation result as a list.

import jieba

sentence = 'I love natural language processing'
# Create a generator object with jieba.cut
generator = jieba.cut(sentence)
# Traverse the generator to get the segmentation result
words = '/'.join(generator)
print(words)

print result

I/Love/Natural Language/Processing

import jieba

print(jieba.lcut('I love Nanhai Middle School'))

print result

"'Me, '' Love','Nanhai Middle School'"

1.2 Segmentation modes

Precise mode: cuts the sentence exactly, with no overlapping words

Full mode: cuts out every possible word; fast

Search engine mode: based on precise mode, re-segments long words

import jieba

sentence = 'order data analysis'
print('Precise mode:', jieba.lcut(sentence))
print('Full mode:', jieba.lcut(sentence, cut_all=True))
print('Search engine mode:', jieba.lcut_for_search(sentence))

print result

Precise mode: ['order', 'data analysis']

Full mode: ['order', 'orders', 'odds', 'data', 'data analysis', 'analysis']

Search engine mode: ['order', 'data', 'analysis', 'data analysis']

1.3 Part-of-speech tagging

import jieba.posseg as jp

sentence = 'I love Python data analysis'
posseg = jp.cut(sentence)
for i in posseg:
    print(i.__dict__)
    # print(i.word, i.flag)

print result

{'word': 'I', 'flag': 'r'}
{'word': 'love', 'flag': 'v'}
{'word': 'Python', 'flag': 'eng'}
{'word': 'data analysis', 'flag': 'l'}

Part-of-speech tag table (common flags)

a adjective; ad adverbial adjective; ag adjective morpheme; an adjective with nominal function
b distinguishing word
c conjunction
d adverb; df negative adverb (e.g. 不要); dg adverbial morpheme
e interjection
f locative word
g morpheme
h preceding component
i idiom
mq numeral-classifier compound
n noun; ng noun morpheme; nr person name; nrfg person name; nrt transliterated person name; ns place name; nt organization name; nz other proper noun
o onomatopoeia
p preposition
q classifier
r pronoun; rg pronoun morpheme
tg time morpheme
u auxiliary; ud auxiliary 得; ug auxiliary 过; uj auxiliary 的; ul auxiliary 了; uv auxiliary 地; uz auxiliary 着
v verb; vd verb used adverbially; vg verb morpheme; vi intransitive verb; vn verb with nominal function; vq verb
z status word

1.4 Locating where words appear

jieba.tokenize(sentence)

Returns a generator of (word, start, end) tuples.

import jieba

sentence = 'order data analysis'
generator = jieba.tokenize(sentence)
for position in generator:
    print(position)

print result

('order', 0, 2)
('data analysis', 2, 6)

2. Dictionary

2.1 Default dictionary

import os

import jieba
import pandas as pd

# Dictionary location
print(jieba.__file__)
jieba_dict = os.path.dirname(jieba.__file__) + r'\dict.txt'

# Read the dictionary: each line is [word, frequency, part of speech]
df = pd.read_table(jieba_dict, sep=' ', header=None)[[0, 2]]
print(df.head())

# Turn it into a {word: flag} dict
dt = dict(df.values)
print(dt.get('Jinan University'))

2.2 Adding and deleting words

Add a word to the dictionary:

add_word(word, freq=None, tag=None)

Delete a word (equivalent to add_word(word, freq=0)):

del_word(word)

import jieba

sentence = 'The sky lasts forever and sometimes ends, this hatred will never end'
# Add a word
jieba.add_word('time end', 999, 'nz')
print('Add [time end]:', jieba.lcut(sentence))
# Delete the word
jieba.del_word('time end')
print('Delete [time end]:', jieba.lcut(sentence))

print result

Add [time end]: ['Eternity', 'Yes', 'End of Time', 'This Regret Continues', 'Nothing', 'End of Time']

Delete [time end]: ['Everlasting', 'Sometimes', 'End', 'This hatred continues', 'Nothing', 'End']

2.3 Custom dictionary loading

Create a dictionary file in which each line is a word followed by its frequency and part of speech, and save it encoded as UTF-8.

Load the dictionary with the load_userdict function.

import os

import jieba

# Create a custom dictionary file
my_dict = 'my_dict.txt'
with open(my_dict, 'w', encoding='utf-8') as f:
    f.write('Murong Ziying 9 nr\nYun Tianhe 9 nr\nTianhe Sword 9 nz')

# Load the dictionary and test
sentence = 'Murong Ziying built Tianhe Sword for Yun Tianhe'
print('Before loading:', jieba.lcut(sentence))
jieba.load_userdict(my_dict)
print('After loading:', jieba.lcut(sentence))
os.remove(my_dict)

print result

Before loading: ['Murong', 'Zi', 'Ying', 'Wei', 'Yun', 'Tianhe', 'Build', 'Liao', 'Tianhe', 'Jian']

After loading: ['Murong Ziying', 'Wei', 'Yun Tianhe', 'Forged', 'Liao', 'Tianhe Sword']

2.4 Joining or splitting the characters of a word

suggest_freq(segment, tune=False)

Pass a single word to make its characters join, or a tuple of parts to make them split; with tune=True the internal word frequency is actually updated.

import jieba

sentence = 'The sky above is poor and the sky below is yellow, and there are no two places to see'
print('Before correction:', '|'.join(jieba.cut(sentence)))
# Split [fall] into [fall] and [down]
jieba.suggest_freq(('fall', 'down'), True)
print('After correction:', '|'.join(jieba.cut(sentence)))

print result

Before correction: upper poor|bi|fall|yellow spring|,|two|vast|all|gone

After correction: upper poor|biluo|under|yellow spring|,|two|vast|all|gone

3. How jieba segmentation works

Based on the dictionary, scan the sentence to build a word graph: a directed acyclic graph (DAG) of every possible word formation.

Compute the maximum-probability path over the DAG backwards (dynamic programming; using log probabilities turns multiplication into addition and prevents underflow).

Recover the segmentation sequence with the maximum probability from that path.

import jieba

sentence = 'Central Primary School Holiday'
DAG = jieba.get_DAG(sentence)
print(DAG)
route = {}
jieba.calc(sentence, DAG, route)
print(route)

DAG

{0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3], 4: [4, 5], 5: [5]}

Each key is a character index where a word can start, and its list holds the possible end indices; 0: [0, 1, 3] means the words starting at character 0 can end at index 0, 1, or 3.

maximum probability path

{6: (0, 0), 5: (-9.4, 5), 4: (-12.6, 5), 3: (-20.8, 3), 2: (-22.5, 3), 1: (-30.8, 1), 0: (-29.5, 3)}

route[x] holds (best log probability from position x to the end of the sentence, end index of the word chosen at x), filled in from right to left.
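To recover the highest-probability segmentation (step 3 above), walk route forward from index 0. This is a minimal sketch mirroring what jieba's internal cut loop does, not a public API:

import jieba

sentence = 'Central Primary School Holiday'
DAG = jieba.get_DAG(sentence)
route = {}
jieba.calc(sentence, DAG, route)

# route[x][1] is the end index of the word chosen at position x
x = 0
while x < len(sentence):
    y = route[x][1] + 1  # the chosen word spans sentence[x:y]
    print(sentence[x:y])
    x = y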

4. Recognizing words that contain spaces

Example: make words that contain spaces, such as Blade Master, recognizable as single tokens.

import re

import jieba

sentence = 'Blade Master Archmage'
jieba.add_word('Blade Master')  # add the word containing a space
print('Before modification:', jieba.lcut(sentence))
# The default pattern excludes spaces from words; widen it
jieba.re_han_default = re.compile('(.+)', re.U)
print('After modification:', jieba.lcut(sentence))

print result

Before modification: ['Blade', 'Master', 'Gale', 'Assassination', 'Archmage']

After modification: ['Blade Master', 'Gale', 'Assassination', 'Archmage']

5. Others

5.1 Parallel segmentation

Operating environment: Linux.

Enable parallel segmentation, where the parameter n is the number of worker processes: jieba.enable_parallel(n)

Disable parallel segmentation: jieba.disable_parallel()
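A minimal sketch of toggling parallel mode (assuming a Linux machine; the sample sentence is reused from section 1):

import jieba

# Segment with 4 worker processes
jieba.enable_parallel(4)
print(jieba.lcut('I love natural language processing'))
# Return to single-process segmentation
jieba.disable_parallel()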

5.2 Keyword extraction

Based on TF-IDF: jieba.analyse.extract_tags

Based on TextRank: jieba.analyse.textrank

import jieba
import jieba.analyse as ja

text = "Liu Mengli cast a spell to crack Fox Fairy's spell"
jieba.add_word('Liu Mengli', tag='nr')
keywords1 = ja.extract_tags(text, allowPOS=('n', 'nr', 'ns', 'nt', 'nz'))
print('Based on TF-IDF:', keywords1)
keywords2 = ja.textrank(text, allowPOS=('n', 'nr', 'ns', 'nt', 'nz'))
print('Based on TextRank:', keywords2)

print result

Based on TF-IDF: ['Liu Mengli', 'Fox Fairy', 'Spell']

Based on TextRank: ['Fox Fairy', 'Liu Mengli', 'Spell']

5.3 Modify HMM parameters

import jieba

text = '柳梦璃解梦C法'  # "Liu Mengli's dream-interpretation C method"
print(jieba.lcut(text, HMM=False))  # ['Liu', 'Meng', 'Li', 'Dream interpretation', 'C', 'Fa']
print(jieba.lcut(text))             # ['Liu Mengli', 'Dream interpretation', 'C', 'Fa']

# finalseg.emit_P holds the HMM emission log-probabilities for the four states
# B (begin of word), M (middle), E (end), S (single-character word);
# the keys are single characters, here 梦 ('Meng' / 'dream').
jieba.finalseg.emit_P['B']['梦'] = -1e-9  # begin
print(jieba.lcut(text))
jieba.finalseg.emit_P['M']['梦'] = -100   # middle
print(jieba.lcut(text))
jieba.finalseg.emit_P['S']['梦'] = -.1    # single
print(jieba.lcut(text))
jieba.finalseg.emit_P['E']['梦'] = -.01   # end
print(jieba.lcut(text))
jieba.del_word('柳梦')  # Force_Split_Words
print(jieba.lcut(text))

print result

"'Liu ', ' Meng','Li',' Dream interpretation','C',' Fa ']

"'Liu Mengli','Interpreting Dreams',' C','Fa']

"'Liu Mengli','Interpreting Dreams',' C','Fa']

"'Liu ', ' Mengli','Interpreting Dreams',' C','Fa']

"'Liu ', ' Meng','Li',' Dream interpretation','C',' Fa ']

"'Liu Meng','Li',' Interpreting Dreams','C',' Fa ']

"'Liu ', ' Meng','Li',' Dream interpretation','C',' Fa ']

"How to use python Chinese participle library jieba" content is introduced here, thank you for reading. If you want to know more about industry-related knowledge, you can pay attention to the website. Xiaobian will output more high-quality practical articles for everyone!
