
How to use NLP basic tool jieba

2025-07-03 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to use jieba, a basic NLP tool for Chinese text. It summarizes the segmentation modes, installation, basic usage, and the algorithm behind the library.

jieba is an open source library developed by Baidu engineer Sun Junyi. It is very popular and widely used on GitHub.

GitHub link: https://github.com/fxsjy/jieba

jieba is best known for word segmentation (its introduction page describes it as "jieba", literally "stutter", Chinese word segmentation), but it can also do keyword extraction, word frequency statistics, and more.

jieba supports four word segmentation modes:

- Precise mode: tries to cut the sentence as accurately as possible and outputs only the maximum-probability combination.

- Search engine mode: builds on precise mode by segmenting long words again to improve recall, which makes it suitable for search engine word segmentation.

- Full mode: scans out every sequence in the sentence that can form a word.

- Paddle mode: uses the PaddlePaddle deep learning framework to train a sequence labeling network (bidirectional GRU) that performs word segmentation. Part-of-speech tagging is also supported.

(The original code listings and their output are not reproduced here.)

As can be seen from the above example:

- Precise mode is the common, default way of word segmentation.

- Search engine mode segments in more detail, also producing shorter words such as Tsinghua University, Huada University, University, China, Science, College, etc.

- Full mode is more exhaustive than search engine mode and lists all possibilities.

- Paddle mode gives results close to precise mode.

In addition, jieba supports:

- Traditional Chinese word segmentation

- Custom dictionaries

Installation:

pip install jieba (pip3 install jieba or easy_install jieba also work)

Use:

import jieba # import jieba

import jieba.posseg as pseg # part-of-speech tagging

import jieba.analyse as anls # keyword extraction

Algorithm

Based on a prefix dictionary, jieba scans the sentence efficiently and builds a directed acyclic graph (DAG) of all the possible words that the Chinese characters in the sentence can form.
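The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not jieba's actual source: the tiny dictionary and its frequencies are made up, and the function names are hypothetical.

```python
def build_prefix_dict(word_freqs):
    """Map every word to its frequency and every proper prefix to 0."""
    freq = {}
    for word, count in word_freqs.items():
        freq[word] = count
        for end in range(1, len(word)):
            freq.setdefault(word[:end], 0)  # prefix marker, not a word itself
    return freq


def build_dag(sentence, freq):
    """For each start index, list the end indices of dictionary words."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        end = start
        frag = sentence[start]
        # Keep extending while the fragment is still a known prefix.
        while end < n and frag in freq:
            if freq[frag] > 0:
                ends.append(end)
            end += 1
            frag = sentence[start:end + 1]
        # A character with no dictionary word becomes a one-character edge.
        dag[start] = ends or [start]
    return dag


# Made-up frequencies, for illustration only.
toy_freqs = {"北京": 100, "清华": 30, "清华大学": 20, "华大": 5, "大学": 60}
prefix_dict = build_prefix_dict(toy_freqs)
dag = build_dag("北京清华大学", prefix_dict)
print(dag)
```

Each key is a start position and each value lists the positions where a word starting there can end, so position 2 maps to both "清华" and "清华大学".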

Dynamic programming is then used to find the maximum-probability path through the DAG, which gives the most probable segmentation based on word frequency.
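A minimal sketch of that dynamic programming step follows, working from the end of the sentence backwards over log probabilities. The DAG, the frequencies, and the total count are made-up toy values, and best_path is a hypothetical name rather than a jieba function.

```python
import math


def best_path(sentence, dag, freq, total):
    """Return the segmentation with the highest summed log probability."""
    n = len(sentence)
    route = {n: (0.0, 0)}  # route[i] = (best log prob from i to the end, end index)
    log_total = math.log(total)
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (
                # Unknown or zero-frequency fragments are treated as frequency 1.
                math.log(freq.get(sentence[start:end + 1]) or 1)
                - log_total
                + route[end + 1][0],
                end,
            )
            for end in dag[start]
        )
    # Follow the stored end indices from the start of the sentence.
    words, pos = [], 0
    while pos < n:
        end = route[pos][1]
        words.append(sentence[pos:end + 1])
        pos = end + 1
    return words


# Toy inputs (made up): a DAG keyed by start index, plus word frequencies.
sentence = "北京清华大学"
toy_dag = {0: [1], 1: [1], 2: [3, 5], 3: [4], 4: [5], 5: [5]}
toy_freqs = {"北京": 100, "清华": 30, "清华大学": 20, "华大": 5, "大学": 60}

segmentation = best_path(sentence, toy_dag, toy_freqs, total=1000)
print(segmentation)  # ['北京', '清华大学']
```

With these numbers the single word "清华大学" (20/1000) beats the split "清华" + "大学" (30/1000 × 60/1000), which is why the longer word is kept.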

For unknown (out-of-vocabulary) words, an HMM based on the word-forming ability of Chinese characters is used, and the Viterbi algorithm decodes the most likely tag sequence.
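The following sketch shows Viterbi decoding over the four position tags commonly used for this task: B (begin), M (middle), E (end), and S (single-character word). All probabilities are invented toy values for a three-character vocabulary; a real model would use trained tables, and the function names are hypothetical.

```python
import math

STATES = "BMES"
MIN_LOG = -1e9  # stands in for log(0)
# Which tags may precede each tag (for example, E can only follow B or M).
PREV = {"B": "ES", "M": "MB", "E": "BM", "S": "SE"}


def to_log(table):
    return {key: math.log(value) for key, value in table.items()}


def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for the observed characters."""
    scores = {s: start_p.get(s, MIN_LOG) + emit_p[s].get(obs[0], MIN_LOG)
              for s in STATES}
    paths = {s: [s] for s in STATES}
    for char in obs[1:]:
        new_scores, new_paths = {}, {}
        for s in STATES:
            emit = emit_p[s].get(char, MIN_LOG)
            score, prev = max(
                (scores[p] + trans_p[p].get(s, MIN_LOG) + emit, p)
                for p in PREV[s]
            )
            new_scores[s] = score
            new_paths[s] = paths[prev] + [s]
        scores, paths = new_scores, new_paths
    # A sentence must finish on the end of a word or on a single-character word.
    _, last = max((scores[s], s) for s in "ES")
    return paths[last]


def tags_to_words(text, tags):
    words, begin = [], 0
    for i, tag in enumerate(tags):
        if tag == "B":
            begin = i
        elif tag == "E":
            words.append(text[begin:i + 1])
        elif tag == "S":
            words.append(text[i])
    return words


# Made-up toy parameters, for illustration only.
start_p = to_log({"B": 0.6, "S": 0.4})
trans_p = {
    "B": to_log({"E": 0.7, "M": 0.3}),
    "M": to_log({"E": 0.6, "M": 0.4}),
    "E": to_log({"B": 0.5, "S": 0.5}),
    "S": to_log({"B": 0.5, "S": 0.5}),
}
emit_p = {
    "B": to_log({"小": 0.5, "明": 0.1, "来": 0.4}),
    "M": to_log({"小": 0.3, "明": 0.4, "来": 0.3}),
    "E": to_log({"小": 0.1, "明": 0.6, "来": 0.3}),
    "S": to_log({"小": 0.2, "明": 0.1, "来": 0.7}),
}

text = "小明来"
tags = viterbi(text, start_p, trans_p, emit_p)
print(tags, tags_to_words(text, tags))  # ['B', 'E', 'S'] ['小明', '来']
```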

