Basic usage of jieba ("stuttering") word segmentation in Python and R


This article introduces the basic usage of the jieba ("stuttering") word segmentation library in Python and R. The editor finds it very practical and shares it here, hoping you will learn something from it.

People do not speak one character at a time, and articles are built out of sentences. It is not impossible for a machine to recognize fine prose and appreciate the breadth and depth of the Chinese language, but the text must first be transformed into a form the machine can recognize: words. Word segmentation is the lowest-level and most basic module in natural language processing (NLP), and its accuracy directly affects the results of any downstream text analysis.

This article introduces the well-known jieba ("stuttering") segmentation method and its basic usage in Python and in R.

jieba word segmentation in Python

jieba offers three commonly used modes for Chinese word segmentation:

Precise mode

Full mode

Search engine mode

All three modes use hidden Markov model (HMM) segmentation by default; in addition, jieba supports segmenting traditional Chinese characters and custom dictionaries.

Import module: import jieba

(1) Precise mode:

> test = '十堰有道教发源地武当山'   # 'Shiyan has Wudang Mountain, the birthplace of Taoism'

> cut1 = jieba.cut(test)

> type(cut1)

> print('Precise mode result:', ' '.join(cut1))

cut1 is a generator and cannot be viewed directly; ' '.join(cut1) joins its elements with spaces into a single string, which print() can then display.

Precise mode result: 十堰 有 道教 发源地 武当山
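Since cut() returns a one-shot generator, jieba also provides lcut(), which returns a plain list; a minimal sketch of the difference:

import jieba

test = '十堰有道教发源地武当山'
cut1 = jieba.cut(test)      # generator: can only be traversed once
print(' '.join(cut1))       # joining consumes it; cut1 is now exhausted
words = jieba.lcut(test)    # lcut() returns a list directly
print(words)                # a list can be inspected and reused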

(2) Full mode:

> cut2 = jieba.cut(test, cut_all=True)

> print('Full mode result:', ' '.join(cut2))

Full mode result: 十堰 有 道教 发源 发源地 武当 武当山

All possible words are taken into account. "Precise mode" is in fact just the default behaviour, with an implicit cut_all = False. Because full mode outputs every candidate word regardless of whether the result is ambiguous, it is not suitable for text analysis; it is only useful for quickly listing all possible words.

(3) Search engine mode:

> cut3 = jieba.cut_for_search(test)

> print('Search engine mode result:', ' '.join(cut3))

Search engine mode result: 十堰 有 道教 发源 发源地 武当 武当山

The search engine mode also lists the possible sub-words: it builds on precise mode but re-segments long words to improve recall for search indexing. For words that do not exist in the dictionary, such as rare words and new words, the HMM can still produce a reasonable segmentation.
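That out-of-vocabulary behaviour comes from the HMM: cut() and cut_for_search() accept an HMM argument (True by default), and switching it off shows the difference on unseen words. A minimal sketch, where the person's name is an invented word assumed not to be in the dictionary:

import jieba

text = '韩立明天去上海'                         # '韩立' is an invented name
print(' '.join(jieba.cut(text, HMM=True)))    # HMM on (default): new words can be grouped
print(' '.join(jieba.cut(text, HMM=False)))   # HMM off: unknown words break into single characters

The exact output depends on the dictionary version, so treat this only as an illustration of the switch.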

Add a custom dictionary

path = 'dictionary path'

jieba.load_userdict(path)

Segmentation can then proceed with the custom words recognized.
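For reference, each line of the user dictionary is "word frequency part-of-speech", where the frequency and the part-of-speech tag are optional; individual words can also be added in code. A minimal sketch (the file name and entries here are made up):

import jieba

# userdict.txt, saved as UTF-8, one entry per line, e.g.:
#   武当山 5 ns
#   道教文化 3 n
jieba.load_userdict('userdict.txt')           # hypothetical dictionary file
jieba.add_word('道教文化', freq=3, tag='n')   # or add a single word directly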

Extract keywords

To extract keywords from an article, for example the top 5:

jieba.analyse.extract_tags(dat, topK=5)
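Note that extract_tags() lives in the jieba.analyse submodule, which must be imported, and it can also return the TF-IDF weight of each keyword. A minimal sketch with a made-up sample text:

import jieba.analyse

dat = '十堰有道教发源地武当山，武当山是道教名山。'   # sample document
for word, weight in jieba.analyse.extract_tags(dat, topK=5, withWeight=True):
    print(word, weight)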

Note: the custom dictionary is generally a .txt file, which is often saved in ANSI/ASCII encoding by default; it should be saved as UTF-8, because it contains Chinese. This is explained in the help documentation.

For more usage, see help('jieba').

jieba word segmentation in R

The R version of jieba, the jiebaR package, supports the maximum probability method, the hidden Markov model, the query (index) model, and the mixed model, and provides part-of-speech tagging, keyword extraction, Simhash text-similarity comparison, and more.

Download and install the package:

> install.packages('jiebaRD')

> install.packages('jiebaR')

> library(jiebaRD)

> library(jiebaR)

Word segmentation

> test <- '革命尚未成功，同志仍须努力'   # 'The revolution is not yet successful; comrades must still work hard'

> seg <- worker()

> segment(test, seg)

[1] "革命" "尚未" "成功" "同志" "仍"   "须"   "努力"

That is, there are two ways to write it:

(1) seg <= test

(2) segment(test, seg)

Part-of-speech tagging: create the worker with the tagging engine, e.g. seg <- worker('tag'), and each word is labelled with its part of speech:

    vn     d     a     n    zg     v    ad
"革命" "尚未" "成功" "同志"  "仍"  "须" "努力"

Keyword extraction: here a keyword worker is used, e.g. seg2 <- worker('keywords', topn = 1), and then:

> keywords(test, seg2)

6.13553
 "同志"

The number above the word is its keyword weight.

Simhash and Hamming distance:

A simhash value is computed for a Chinese document. Simhash is an algorithm Google uses for de-duplicating text, and it is now widely used in text processing. The simhash worker first segments the text and extracts keywords, then computes the simhash value and the Hamming distance between fingerprints.
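To illustrate the idea (this is a sketch of the general algorithm, not jiebaR's implementation), a minimal Python version of simhash over weighted keywords, plus the Hamming distance between two fingerprints:

import hashlib

def simhash(weighted_words, bits=64):
    # weighted_words: (word, weight) pairs, e.g. from keyword extraction
    v = [0.0] * bits
    for word, weight in weighted_words:
        h = int(hashlib.md5(word.encode('utf8')).hexdigest(), 16)
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    # each fingerprint bit is the sign of the accumulated weight
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count('1')   # number of differing bits

a = simhash([('同志', 6.1), ('努力', 6.0)])
b = simhash([('同志', 6.1), ('成功', 5.0)])
print(hamming(a, b))   # a small distance means similar documents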

> seg3 <- worker('simhash', topn = 2)

> simhash(test, seg3)

$simhash

[1] "13489182016966018967"

$keyword

6.13553  6.0229
 "同志"  "努力"

List segmentation:

Multiple texts (each list element is one text) can be segmented in a single call.

> test2 <- '十堰有道教发源地武当山'

> apply_list(list(test, test2), seg)

[[1]]

    vn     d     a     n    zg     v    ad
"革命" "尚未" "成功" "同志"  "仍"  "须" "努力"

[[2]]

    ns     v     n      n     ns
"十堰"  "有" "道教" "发源地" "武当山"

(seg here is the tagging worker, so the output keeps the part-of-speech labels.)

Remove stop words

A stop-word file can be supplied when the worker is created, for example:

> seg <- worker(stop_word = 'path to the stop-word file')
