NLP Lesson 2: jieba and HanLP, Sharp Tools for Chinese Word Segmentation


Preface

Starting from this article, we move on to the practical part. First, we obtain a corpus, following the first step of the Chinese natural language processing workflow, and then focus on Chinese word segmentation. There are many Chinese word segmenters, such as NLPIR, LTP, THULAC, the Stanford segmenter, HanLP, jieba, and IKAnalyzer. Here we introduce the use of jieba and HanLP for Chinese word segmentation in different scenarios.

jieba installation

(1) There are three ways to install jieba under Python 2.x, as follows:

Fully automatic installation: run easy_install jieba, or pip install jieba / pip3 install jieba.

Semi-automatic installation: download jieba first, unzip it and run python setup.py install.

Manual installation: Place the jieba directory in the current directory or site-packages directory.

After installation, verify that the installation is successful by importing jieba.
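For example, a quick sanity check from the Python interpreter:

import jieba
print(jieba.__version__)   # prints the installed version if the installation succeeded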

(2) Python 3.x installation method.

The Python 3.x version of jieba (the jieba3k branch) is on GitHub at https://github.com/fxsjy/jieba/tree/jieba3k.

Download it locally with the command git clone https://github.com/fxsjy/jieba.git (or download and unzip the archive), then enter the directory from the command line and run python setup.py install to complete the installation.

jieba word segmentation algorithm

The algorithm has three main components:

A prefix dictionary is built from a statistical dictionary; the sentence is scanned against this prefix dictionary to obtain all possible segmentations, and a directed acyclic graph (DAG) is constructed from the segmentation positions (see the sketch after this list).

Based on the DAG, dynamic programming is used to find the maximum-probability path (the most likely segmentation result), and the sentence is segmented along that path.

For new words (words not in the dictionary), an HMM model of the word-forming ability of Chinese characters is used to segment them.
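To make the first component concrete, here is a minimal sketch of building such a DAG from a prefix dictionary; the toy dictionary and sentence are made up for illustration and this is not jieba's actual implementation:

# Toy prefix-dictionary DAG construction; words and frequencies are illustrative only.
freq = {"机器": 8, "机器学习": 12, "学习": 9}

def build_dag(sentence, freq):
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        for end in range(start + 1, n + 1):
            word = sentence[start:end]
            # single characters are always candidates; longer spans must be dictionary words
            if end - start == 1 or word in freq:
                ends.append(end)
        dag[start] = ends
    return dag

print(build_dag("机器学习", freq))   # {0: [1, 2, 4], 1: [2], 2: [3, 4], 3: [4]}

Each key is a starting position and each value lists the positions where a candidate word can end; the maximum-probability path is then found over this graph with dynamic programming.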

jieba word segmentation

The first step is to import jieba and define the corpus:

import jieba

content = "Today, machine learning and deep learning drive the rapid development of artificial intelligence and achieve great success in the fields of image processing and speech recognition."

(1) Precise mode

Precise mode attempts to segment the sentence as accurately as possible; it is also the default segmentation mode.

segs_1 = jieba.cut(content, cut_all=False)
print("/".join(segs_1))

The results are:

Now/,/machine/learning/and/deep/learning/drive/artificial intelligence/rapid/development/,/and/in/pictures/processing/,/speech/recognition/field/achieve/great success/.

(2) Full mode

Full mode scans out all possible words in the sentence; it is very fast but cannot resolve ambiguity.

segs_3 = jieba.cut(content, cut_all=True)
print("/".join(segs_3))

The results are:

Today/Today//Machine/Learning/And/Deep/Learning/Driving/Moving/Artificial/Artificial Intelligence/Intelligence/Rapid/Development///And/In/Picture/Processing///Speech/Recognition/Field/Achieving/Huge/Huge Success/Success/Success/

(3) Search engine mode

Search engine mode: on top of precise mode, long words are segmented again to improve recall, which makes it suitable for search-engine segmentation.

segs_4 = jieba.cut_for_search(content)
print("/".join(segs_4))

The results are:

Today/Today/,/Machine/Learning/and/Deep/Learning/Driving/Artificial/Intelligent/AI/Rapid/Development/,/And/In/Picture/Processing/,/Voice/Recognition/Domain/Achieving/Huge/Success/Huge Success/.

(4) Create a list with lcut

jieba.cut and jieba.cut_for_search return iterable generators from which each segmented word (a Unicode string) can be obtained. jieba.lcut wraps the result of cut into a list (the l stands for list); similarly, jieba.lcut_for_search returns a list directly.

segs_5 = jieba.lcut(content)
print(segs_5)

The results are:

['Now', ',', 'machine', 'learning', 'and', 'deep', 'learning', 'driving', 'artificial intelligence', 'rapidly', 'development', ',', 'and', 'in', 'picture', 'processing', ',', 'speech', 'identification', 'field', 'attain', 'great success', '.']

(5) Obtaining part of speech

jieba can easily obtain Chinese part-of-speech tags; part-of-speech tagging is provided by the jieba.posseg module.

import jieba.posseg as psg

print([(x.word, x.flag) for x in psg.lcut(content)])

The results are:

[('now', 't'), (',', 'x'), ('machine', 'n'), ('learning', 'v'), ('and', 'c'), ('depth', 'ns'), ('learning', 'v'), ('drive', 'v'), ('AI', 'n'), ('rapid', 'n'), ('of', 'uj'), ('development', 'vn'), (',', 'x'), ('and', 'c'), ('in', 'p'), ('picture', 'n'), ('process', 'v'), (',', 'x'), ('speech', 'n'), ('identification', 'v'), ('domain', 'n'), ('get', 'v'), ('great success', 'nr'), ('.', 'x')]

(6) Parallel segmentation

The principle of parallel segmentation: the target text is split by lines, the lines are distributed to multiple Python processes to be segmented in parallel, and the results are merged at the end.

Usage:

jieba.enable_parallel(4)     # enable parallel segmentation mode; the argument is the number of parallel processes
jieba.disable_parallel()     # disable parallel segmentation mode

Note: parallel segmentation only supports the default segmenters jieba.dt and jieba.posseg.dt, and Windows is currently not supported.
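A minimal usage sketch, assuming a local UTF-8 text file named corpus.txt (the file name is hypothetical):

import time
import jieba

jieba.enable_parallel(4)                  # 4 worker processes; POSIX systems only

with open("corpus.txt", "rb") as f:       # corpus.txt is a hypothetical input file
    text = f.read().decode("utf-8")

start = time.time()
words = list(jieba.cut(text))
print("segmented %d tokens in %.2f s" % (len(words), time.time() - start))

jieba.disable_parallel()                  # back to single-process mode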

(7) Get the top n words in the segmentation result

from collections import Counter

top5 = Counter(segs_5).most_common(5)
print(top5)

The results are:

[(',', 2), ('learning', 2), ('now', 1), ('machine', 1), ('and', 1)]

(8) Custom words and dictionaries

With the default segmenter, the new word "Tiejia.com" in the following sentence is not recognized; here we use a user dictionary to improve segmentation accuracy.

txt = "Tiejia. com is China's largest construction machinery trading platform. " print(jieba.lcut(txt))

The results are:

["armor","net is","china","largest","of","engineering machinery","trading platform",". ']

If you add a word to the dictionary, the result is different.

jieba.add_word("armor net") print(jieba.lcut(txt))

The results are:

["Iron Armor Net","Yes","China","Largest","of","Engineering Machinery","Trading Platform",". ']

However, if you want to add many words, adding them one by one is not efficient. In that case you can define a dictionary file and load it with the load_userdict() function, as follows (a sample dictionary file is shown after the result below):

jieba.load_userdict('user_dict.txt')
print(jieba.lcut(txt))

The results are:

["Iron Armor Net","Yes","China","Largest","of","Engineering Machinery","Trading Platform",". ']

Notes:

The jieba.cut method accepts three input parameters: the string to be segmented; the cut_all parameter, which controls whether full mode is used; and the HMM parameter, which controls whether the HMM model is used (illustrated in the sketch after these notes).

The jieba.cut_for_search method takes two parameters: the string to be segmented and whether to use the HMM model. This method is suitable for building a search engine's inverted index; its granularity is fine.
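A short sketch of passing these parameters explicitly; setting HMM=False simply turns off HMM-based new-word discovery:

# precise mode without the HMM model
print("/".join(jieba.cut(content, cut_all=False, HMM=False)))

# search engine mode without the HMM model
print("/".join(jieba.cut_for_search(content, HMM=False)))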

HanLP word segmentation

pyhanlp installation

pyhanlp is the Python interface for HanLP; it supports automatic download and upgrade of HanLP and is compatible with Python 2 and Python 3.

The installation command is pip install pyhanlp, use the command hanlp to verify the installation.

pyhanlp currently uses the Python package jpype1 to call HanLP. If you encounter the following error:

building '_jpype' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

it is recommended to use the lightweight Miniconda to install a pre-compiled jpype1:

conda install -c conda-forge jpype1
pip install pyhanlp

If Java is not installed, you will see an error like:

jpype._jvmfinder.JVMNotFoundException: No JVM shared library file (jvm.dll) found. Try setting up the JAVA_HOME environment variable properly.

The HanLP main project is developed in Java, so a Java runtime environment is required; please install the JDK.

Command line interactive word segmentation mode

In the command line interface, use the command hanlp segment to enter interactive segmentation mode; type a sentence and press Enter, and HanLP will output the segmentation result:

(screenshots: HanLP segmentation 1, HanLP segmentation 2)

As you can see, pyhanlp's segmentation results include part-of-speech tags.

Server mode

Start the built-in HTTP server via hanlp serve. The default local access address is http://localhost:8765.

(screenshot: HanLP segmentation 3)

You can also visit the official demo page at http://hanlp.hankcs.com/.

Call common interfaces via the HanLP tool class

Calling common interfaces through the HanLP tool class is probably the most common way to use it in projects.

(1) Word segmentation

from pyhanlp import *

content = "Today, machine learning and deep learning drive the rapid development of artificial intelligence and achieve great success in the fields of image processing and speech recognition."
print(HanLP.segment(content))

The results are:

[Now/t, ,/w, machine learning/gi, and/cc, depth/n, learning/v, driving/v, artificial intelligence/n, flying/d, of/ude1, developing/vn, ,/w, and/cc, in/p, pictures/n, processing/vn, ,/w, speech/n, recognition/vn, field/n, acquiring/v, huge/a, success/a, ./w]
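If the word and the tag are needed separately, each element returned by HanLP.segment is a Term object whose word and nature attributes can be read directly, for example:

for term in HanLP.segment(content):
    print(term.word, term.nature)   # the token text and its part-of-speech tag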

(2) Custom dictionary segmentation

First, segmentation without a custom dictionary:

txt = "Tiejia. com is China's largest construction machinery trading platform. " print(HanLP.segment(txt))

The results are:

[Tiejia/n, net/n, is/vshi, China/ns, largest/gm, of/ude1, construction/n, machinery/n, trading/vn, platform/n, ./w]

Add custom words:

CustomDictionary.add("Iron Net") CustomDictionary.insert("Construction Machinery", "nz1024") CustomDictionary.add("Trading Platform", "nz 1024 n 1") print(HanLP.segment(txt))

The results are:

[Tiejia.com/nz, is/vshi, China/ns, largest/gm, of/ude1, construction machinery/nz, trading platform/nz, ./w]

Of course, jieba and pyhanlp can do much more: keyword extraction, automatic summarization, dependency parsing, sentiment analysis, and so on. We will cover these in later chapters, so they are not detailed here.

This article comes from Superman's blog.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Database

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report