What are the Python word segmentation tools and how to use them?


This article introduces the main Python Chinese word segmentation tools and shows how to use each one through short, practical code examples. The methods are simple, fast, and easy to apply, and I hope this article helps you solve the problem.

1. jieba

"stuttering" word segmentation, GitHub's most popular word segmentation tool, is determined to be the best Python Chinese word segmentation component, supporting multiple word segmentation modes and supporting custom dictionaries.

GitHub stars: 26k

Code example

# encoding=utf-8
import jieba

# Paddle mode requires the paddlepaddle package; enable it before first use.
jieba.enable_paddle()
strs = ["我来到北京清华大学", "乒乓球拍卖完了", "中国科学技术大学"]
for s in strs:
    seg_list = jieba.cut(s, use_paddle=True)  # use paddle mode
    print("Paddle Mode: " + '/'.join(list(seg_list)))

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Precise Mode: " + "/".join(seg_list))  # precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # the default is precise mode
print("New word recognition: " + ", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print("Search Engine Mode: " + ", ".join(seg_list))

Output:

[Full mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Precise mode]: 我/ 来到/ 北京/ 清华大学

[New word recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (here "杭研" is not in the dictionary, but it is still recognized by the Viterbi algorithm)

[Search engine mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
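The custom-dictionary support mentioned above loads extra entries before segmenting. A minimal sketch, assuming a hypothetical userdict.txt where each line follows jieba's "word [frequency] [POS tag]" format (the file name and entries are illustrative, not from the original article):

import jieba

# Hypothetical user dictionary file: one entry per line in the
# "word frequency POS-tag" format that jieba expects, e.g. "杭研 10 nz".
jieba.load_userdict("userdict.txt")

# Single entries can also be added at runtime without a file.
jieba.add_word("杭研", freq=10, tag="nz")

seg_list = jieba.cut("他来到了网易杭研大厦")
print("/".join(seg_list))  # "杭研" now segments as a single token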

2. pkuseg

pkuseg is an open-source word segmentation tool from the Language Computing and Machine Learning Research Group at Peking University. Its distinguishing feature is multi-domain segmentation: it currently provides pre-trained models for the news, web, medicine, and tourism domains plus a mixed-domain default, and users are free to choose among them (a domain-model sketch follows the output below). Compared with general-purpose segmentation tools, its accuracy in those domains is higher.

GitHub stars: 5.4k

Code example

import pkuseg

seg = pkuseg.pkuseg()  # load the model with the default configuration
text = seg.cut('python是一门很棒的语言')  # perform word segmentation
print(text)

Output:

['python', '是', '一', '门', '很', '棒', '的', '语言']
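To pick one of the domain models mentioned above, pass model_name when constructing the segmenter (pkuseg downloads the model on first use); part-of-speech tagging is enabled the same way. A minimal sketch using pkuseg's documented constructor arguments (the medicine-domain sentence is an illustrative example, not from the original article):

import pkuseg

# Load the medicine-domain model instead of the default mixed-domain one;
# other documented choices include 'news', 'web', and 'tourism'.
seg = pkuseg.pkuseg(model_name='medicine')
print(seg.cut('患者服用阿司匹林后症状缓解'))

# postag=True returns (word, POS-tag) pairs instead of bare words.
seg_tag = pkuseg.pkuseg(postag=True)
print(seg_tag.cut('python是一门很棒的语言'))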

3. FoolNLTK

Trained on a BiLSTM model, FoolNLTK is said to be possibly the most accurate open-source Chinese word segmentation tool, and it also supports user-defined dictionaries (a user-dictionary sketch follows the example below).

GitHub stars: 1.6k

Code example

import fool

text = "一个傻子在北京"
print(fool.cut(text))
# ['一个', '傻子', '在', '北京']
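The user-defined dictionary support mentioned above is loaded with fool.load_userdict. A minimal sketch, assuming a hypothetical my_dict.txt where each line is a "word weight" pair (the file name and entries are illustrative, not from the original article):

import fool

# Hypothetical dictionary file: one "word weight" pair per line,
# e.g. "傻子 100". Higher-weight entries take precedence when segmenting.
fool.load_userdict('my_dict.txt')
print(fool.cut('一个傻子在北京'))

# Remove the custom entries when they are no longer needed.
fool.delete_userdict()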

4. THULAC

THULAC is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory at Tsinghua University. It performs part-of-speech tagging as well as segmentation, so it can tell whether a word is a noun, a verb, an adjective, and so on (a POS-filtering sketch follows code sample 2 below).

GitHub stars: 1.5k

Code sample 1

import thulac

thu1 = thulac.thulac()  # default mode
text = thu1.cut("我爱北京天安门", text=True)  # segment a single sentence
print(text)  # 我_r 爱_v 北京_ns 天安门_ns

Code sample 2

thu1 = thulac.thulac(seg_only=True)  # segmentation only, no part-of-speech tagging
thu1.cut_f("input.txt", "output.txt")  # segment the contents of input.txt and write the result to output.txt
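Because THULAC tags parts of speech, calling cut without text=True returns [word, tag] pairs, which makes it easy to filter by word class. A minimal sketch (the noun filter is an illustrative addition, not from the original article):

import thulac

thu = thulac.thulac()  # default mode: segmentation plus POS tagging
pairs = thu.cut("我爱北京天安门")  # returns a list of [word, tag] pairs

# Keep only nouns: THULAC uses 'n' for common nouns and 'ns' for place names.
nouns = [word for word, tag in pairs if tag.startswith('n')]
print(nouns)  # ['北京', '天安门']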

This concludes "what are the Python word segmentation tools and how to use them?". Thank you for reading.
