This article is a walk through the source code of jieba, the Chinese word segmentation library for Python.
Preface
Jieba is one of the most popular Chinese word segmentation tools in Python. To understand how a word segmentation tool works and how it is implemented, I read the jieba source code in detail.
Before reading the code, I had a few questions in mind:
What are the steps in implementing a word segmentation tool?
The jieba documentation says it uses an HMM model, but how exactly is the HMM used during segmentation? And where does the model come from?
Almost all segmentation tools let users add a custom dictionary, but what role does the user dictionary play in the segmentation process?
Brief introduction
Jieba supports three segmentation modes, and the official documentation gives the following example:
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)   # "I came to Tsinghua University in Beijing"
print("Full Mode: " + "/ ".join(seg_list))                 # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))              # precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")                # "He came to the NetEase Hangyan Building"; precise mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))
Given space limitations, I will only walk through the default (precise) mode in detail, that is, the full implementation behind jieba.cut. Some algorithmic background will come up along the way and is not explained in depth here.
First, a word graph (a directed acyclic graph) is built and used to find the maximum probability path. The construction of the graph depends entirely on the dictionary, and the solution of the maximum probability path also depends on the word frequencies in the dictionary. Finally, an HMM model is used to handle unregistered words (Out Of Vocabulary, OOV), so the whole process can work even without the model, as long as you have a good dictionary. There are many ways to solve for the maximum probability path; as I recall, HanLP's solution uses a shortest path algorithm.
Coarse segmentation
First, the text is split into blocks with a regular expression. Which regex is used depends on whether default mode or full mode is selected. The patterns are:
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)", re.U)
re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
What is the difference? I wrote a quick test:
test_str = u'我在重庆abc，他也在重庆？1234你在重庆吗'  # "I'm in Chongqing abc, he is also in Chongqing? 1234 are you in Chongqing"
print(re_han_default.split(test_str))
print(re_han_cut_all.split(test_str))
Output:
['', '我在重庆abc', '，', '他也在重庆', '？', '1234你在重庆吗', '']
['', '我在重庆', 'abc，', '他也在重庆', '？1234', '你在重庆吗', '']
Each element of the lists above is called a block.
Fine segmentation
A block that is not matched by re_han (such as 'abc，' in full mode, or punctuation) is yielded as-is; blocks that do match, i.e. those containing Chinese, move on to the next stage of fine segmentation.
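Roughly how this dispatch works, as a simplified sketch (the structure follows jieba's Tokenizer.cut, but cut_block here is just a placeholder for the fine segmentation step, and some details differ between jieba versions):

import re

re_han = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)", re.U)
re_skip = re.compile("(\r\n|\s)", re.U)

def cut_sketch(sentence, cut_block):
    # cut_block stands in for the fine segmentation step (__cut_DAG)
    for blk in re_han.split(sentence):
        if not blk:
            continue
        if re_han.match(blk):
            # blocks matched by re_han go on to fine segmentation
            for word in cut_block(blk):
                yield word
        else:
            # everything else is split on whitespace; leftover symbols are
            # yielded character by character (as in precise mode)
            for x in re_skip.split(blk):
                if not x:
                    continue
                if re_skip.match(x):
                    yield x
                else:
                    for ch in x:
                        yield ch

# e.g. list(cut_sketch(u"我在重庆abc，他也在重庆", lambda blk: [blk]))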
DAG construction
The first step of fine segmentation is to build the DAG, a directed acyclic graph. The core of the construction is the following code:
def get_DAG(self, sentence):
    self.check_initialized()  # lazily load the dictionary on first use
    DAG = {}
    N = len(sentence)
    for k in xrange(N):
        tmplist = []
        i = k
        frag = sentence[k]
        while i < N and frag in self.FREQ:
            if self.FREQ[frag]:
                tmplist.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not tmplist:
            tmplist.append(k)
        DAG[k] = tmplist
    return DAG
For example, for the sentence "我来到北京清华大学" ("I came to Tsinghua University in Beijing"), the resulting DAG is:
{0: [0], 1: [1, 2], 2: [2], 3: [3, 4], 4: [4], 5: [5, 6, 8], 6: [6, 7], 7: [7, 8], 8: [8]}
A dict is used to store the graph. Each key is the index of a character in the sentence, and its value is a list of reachable end positions, i.e. the indices at which a dictionary word starting at that character can end. For example, 1: [1, 2] means that both "来" ("come") and "来到" ("came to") exist in the dictionary. The other entries follow the same pattern.
The construction of the graph depends entirely on the variable self.FREQ, which holds the dictionary as a dict mapping each word (key) to its occurrence count (value). The quality of the dictionary therefore has a great influence on the segmentation result: a bad dictionary either connects paths that should not exist or fails to connect paths that should. The HMM step that comes later cannot make up for this; the maximum probability path does resolve part of the first problem, but only to a limited extent. So the dictionary is critical.
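To make this concrete, here is a standalone toy version of get_DAG with a tiny hand-made FREQ. The frequencies are made up; the real FREQ is built from dict.txt, and it also contains every prefix of every word with a frequency of 0, which is what lets the while loop walk past "清华大" to reach "清华大学".

# toy dictionary: word -> (made-up) frequency; prefixes get frequency 0
FREQ = {
    u"我": 1000, u"来": 800, u"来到": 500, u"到": 900,
    u"北": 300, u"北京": 2000, u"京": 200,
    u"清": 100, u"清华": 800, u"清华大": 0, u"清华大学": 600,
    u"华": 400, u"华大": 50, u"大": 700, u"大学": 1500, u"学": 600,
}

def get_DAG(sentence):
    DAG = {}
    N = len(sentence)
    for k in range(N):
        tmplist = []
        i = k
        frag = sentence[k]
        while i < N and frag in FREQ:
            if FREQ[frag]:           # zero-frequency prefixes extend the scan but are not edges
                tmplist.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not tmplist:
            tmplist.append(k)
        DAG[k] = tmplist
    return DAG

print(get_DAG(u"我来到北京清华大学"))
# {0: [0], 1: [1, 2], 2: [2], 3: [3, 4], 4: [4], 5: [5, 6, 8], 6: [6, 7], 7: [7, 8], 8: [8]}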
Maximum probability path solution
With the DAG above, the next step is to find the maximum probability path through it. There are many ways to solve this; jieba uses dynamic programming. Rather than explaining what dynamic programming is, just look at the code:
def calc(self, sentence, DAG, route):
    N = len(sentence)
    route[N] = (0, 0)
    logtotal = log(self.total)
    for idx in xrange(N - 1, -1, -1):
        route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                          logtotal + route[x + 1][0], x) for x in DAG[idx])
The real work is just these few lines; the key is the expression inside max. I will not unfold the dynamic programming here, but one small trick is worth pointing out: when multiplying very small numbers, Python floats can underflow to zero. For example:
b = 0.0000001
print(b ** 100)
This prints 0.0, so one way around it is to work with logarithms, turning products of probabilities into sums of log probabilities; the trick is useful in many places. The calc code also relies on tuple comparison: max compares the (log probability, end index) pairs element by element.
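A quick illustration of both tricks (plain Python, nothing jieba-specific):

from math import log

b = 0.0000001
print(b ** 100)        # 0.0 -- the product has underflowed
print(100 * log(b))    # about -1611.8 -- the log-space sum is still representable

# max over tuples compares the first element, then the second; that is how the
# max(...) in calc keeps the best score and its end index in one expression:
candidates = [(-3.5, 1), (-2.1, 2), (-2.1, 0)]
print(max(candidates))  # (-2.1, 2)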
If the parameters passed at segmentation time disable the HMM, the process ends with this solution. The result looks like the dict below: each key is again a character index, and the value is a tuple of (log probability of the best path from that index, end index of the chosen word). For example, the entry for index 1 ends at index 2, representing the word "来到" ("came to").
{0: (-32.587853155857076, 0), 1: (-27.379629658355885, 2), ...}
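Walking this route from position 0 yields the final words. A simplified sketch of what jieba's __cut_DAG_NO_HMM does with it (ignoring its special buffering of consecutive ASCII characters):

def route_to_words(sentence, route):
    words = []
    x = 0
    N = len(sentence)
    while x < N:
        y = route[x][1] + 1          # end of the word chosen at position x
        words.append(sentence[x:y])
        x = y
    return words

# For "我来到北京清华大学" this gives ['我', '来到', '北京', '清华大学'].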
Unknown words
The maximum probability path above resolves ambiguity to some extent, but segmentation has another problem: unknown words, also called OOV (Out Of Vocabulary). Jieba uses an HMM to identify them. For example, the maximum probability path generated for the sentence "I came to Yu Xun Technology" (where the company name is not in the dictionary) is:
{0: (-42.29693140266269, 0), 1: (-37.0887079051615, 2), 2: (-33.93639839927486, 2), 3: (-28.257272562332492, 3), 4: (-17.872935353951055, 4), 5: (-8.250710549196151, 6), 6: (-10.881580216048834, 6), 7: (0, 0)}
Notice that positions 3 and 4 are each left as single characters; without the HMM they would be output as two separate one-character words. The logic is to merge consecutive single characters into a new block and hand it to the HMM for segmentation. So how exactly does the HMM split it?
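You can see the effect by toggling the HMM flag of jieba.cut on a sentence containing an out-of-vocabulary name (the name below is made up for illustration, and the exact output depends on the dictionary and jieba version):

import jieba

sentence = u"我来到韩小冰的家"  # "韩小冰" is a fabricated personal name, unlikely to be in the dictionary
print("/".join(jieba.cut(sentence, HMM=False)))  # the OOV name tends to fall apart into single characters
print("/".join(jieba.cut(sentence, HMM=True)))   # the HMM step may join them back into one word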
HMM
You can look up the definition of the HMM (hidden Markov model) yourself; even after reading it you may not immediately see how it applies to word segmentation, but you certainly will not see it without reading it. Before training, the corpus is tagged; there are several tagging schemes, the most common being BMES: B (Begin) for the first character of a word, M (Middle) for characters inside a word, E (End) for the last character, and S (Single) for a single-character word. The HMM has several concepts, and they map onto the word segmentation problem as follows:
State sequence: the sequence of BMES tags.
Observation sequence: the sentence to be segmented; its characters form the sequence.
The problem is then: given the observation sequence, find the state sequence. But first we need the HMM model itself. An HMM has three basic components: the initial state probability distribution π, the state transition matrix A, and the observation (emission) probability distribution B.
Once these three components are fixed, the HMM is fully determined. (The HMM also rests on several assumptions, which I omit here.) How did jieba obtain these three components? This is the learning problem of the HMM. Given the tagged corpus described above, maximum likelihood estimation can be used to estimate all three parameters. Here again the corpus is the key factor: its quality determines the parameters. The estimation itself, principles aside, is just counting. Jieba stores the three components in three .py files:
prob_start.py: initial state probabilities
prob_trans.py: state transition matrix
prob_emit.py: observation (emission) probabilities
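The counting is easy to picture. Here is a minimal sketch of how the initial and transition probabilities could be estimated by maximum likelihood from a tiny BMES-tagged toy corpus (an illustration of the idea only, not jieba's actual training code, which also produces the emission probabilities the same way):

from math import log
from collections import defaultdict

# a toy BMES-tagged corpus: each sentence is a list of (character, tag) pairs
corpus = [
    [(u"我", "S"), (u"来", "B"), (u"到", "E"), (u"北", "B"), (u"京", "E")],
    [(u"清", "B"), (u"华", "M"), (u"大", "M"), (u"学", "E")],
]

start_count = defaultdict(int)
trans_count = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    tags = [tag for _, tag in sent]
    start_count[tags[0]] += 1                 # which tag starts a sentence
    for prev, cur in zip(tags, tags[1:]):
        trans_count[prev][cur] += 1           # tag-to-tag transitions

# maximum likelihood estimates, stored as log probabilities like prob_start.py / prob_trans.py
total_starts = float(sum(start_count.values()))
prob_start = {t: log(c / total_starts) for t, c in start_count.items()}
prob_trans = {p: {c: log(n / float(sum(nxt.values()))) for c, n in nxt.items()}
              for p, nxt in trans_count.items()}

print(prob_start)
print(prob_trans)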
Take a look at prob_start.py:
P = {'B': -0.26268660809250016,
     'E': -3.14e+100,
     'M': -3.14e+100,
     'S': -1.4652633398537678}
-3.14e+100 stands in for log(0), i.e. effectively impossible: the first character of a sentence can never be tagged E or M. The probabilities for B and S are estimated from frequencies in the corpus.
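With the three tables in hand, segmenting an OOV block becomes a decoding problem: find the most probable BMES sequence for the observed characters, which jieba solves with the Viterbi algorithm in its finalseg module. Below is a minimal, generic log-space Viterbi with fabricated toy parameters, slightly simplified compared with jieba's version (which also restricts which states may precede which):

from math import log

MIN_FLOAT = -3.14e100  # plays the role of log(0)

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][y]: best log probability of any tag path ending in state y at position t
    V = [{y: start_p.get(y, MIN_FLOAT) + emit_p.get(y, {}).get(obs[0], MIN_FLOAT)
          for y in states}]
    path = {y: [y] for y in states}
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}
        for y in states:
            em = emit_p.get(y, {}).get(obs[t], MIN_FLOAT)
            prob, prev = max((V[t - 1][y0] + trans_p.get(y0, {}).get(y, MIN_FLOAT) + em, y0)
                             for y0 in states)
            V[t][y] = prob
            newpath[y] = path[prev] + [y]
        path = newpath
    prob, last = max((V[-1][y], y) for y in "ES")  # a word can only end with E or S
    return path[last]

# toy usage with fabricated parameters; every E or S in the result closes a word
start_p = {"B": log(0.6), "S": log(0.4)}
trans_p = {"B": {"M": log(0.3), "E": log(0.7)}, "M": {"M": log(0.4), "E": log(0.6)},
           "E": {"B": log(0.5), "S": log(0.5)}, "S": {"B": log(0.5), "S": log(0.5)}}
emit_p = {"B": {u"某": log(0.5)}, "M": {u"人": log(0.1)},
          "E": {u"人": log(0.5)}, "S": {u"某": log(0.2), u"人": log(0.2)}}
print(viterbi(u"某人", "BMES", start_p, trans_p, emit_p))  # ['B', 'E'] -> "某人" is one word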
At this point you should have a deeper understanding of the jieba source code; the best way to consolidate it is to read the code and experiment yourself.