How to generate a word cloud with Python pkuseg


This article explains how to use Python's pkuseg library to generate a word cloud. The method is simple, fast, and practical, so let's work through it step by step.

Install pkuseg:

pip3 install pkuseg

The first step is to download the speech, save it as a txt file, and then load its content into memory:

content = []
with open("yanjiang.txt", encoding="utf-8") as f:
    content = f.read()

A quick count shows that the transcript contains 32,546 characters in total.
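That number is easy to reproduce: it is simply the length of the string we just loaded (a one-line check, reusing content from the snippet above):

print(len(content))  # total number of characters; 32546 for this transcript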

Next, we use pkuseg to segment the content and count the 20 words with the highest frequency.

import pkuseg
from collections import Counter
import pprint

content = []
with open("yanjiang.txt", encoding="utf-8") as f:
    content = f.read()

seg = pkuseg.pkuseg()                    # load the default segmentation model
text = seg.cut(content)                  # split the transcript into words
counter = Counter(text)                  # tally word frequencies
pprint.pprint(counter.most_common(20))   # print the 20 most frequent words

Output result:

At first glance the list is a mess. Don't worry: in the field of word segmentation there is a concept called stop words. Stop words are words that carry no specific meaning in context, such as "this", "that", "you", "me", "him", the particle 地 (de), punctuation marks, and so on. Since nobody searches with these meaningless words, we should filter them out before counting so the segmentation results are more useful. A ready-made stop-word list is easy to find online.

The second version of the code:

import pkuseg
from collections import Counter
import pprint

content = []
with open("yanjiang.txt", encoding="utf-8") as f:
    content = f.read()

seg = pkuseg.pkuseg()
text = seg.cut(content)

# Load the stop-word list (one word per line) into a set; a set gives fast
# membership tests and avoids the substring matching you would get by
# testing against one big raw string.
stopwords = set()
with open("stopword.txt", encoding="utf-8") as f:
    stopwords = set(f.read().split())

new_text = []
for w in text:
    if w not in stopwords:
        new_text.append(w)

counter = Counter(new_text)
pprint.pprint(counter.most_common(20))

Printed result:

[('WeChat', 163),
 ('user', 112),
 ('products', 89),
 ('friends', 81),
 ('tools', 56),
 ('program', 55),
 ('socialize', 55),
 ('circle', 47),
 ('video', 40),
 ('hope', 39),
 ('time', 39),
 ('game', 36),
 ('read', 33),
 ('content', 32),
 ('platform', 31),
 ('article', 30),
 ('information', 29),
 ('team', 27),
 ('AI', 27),
 ('APP', 26)]

This looks much better than the first attempt, because the stop words have been filtered out. But the top 20 high-frequency words are still not accurate: some terms that should be kept whole, such as "moments", "official account", and "Mini Program", have been split into pieces, even though each of them is really a single unit.

For these proper nouns, we only need to specify a user dictionary; during segmentation, the words in the user dictionary are kept intact and never split.

lexicon = ['小程序', '朋友圈', '公众号']   # Mini Program, moments, official account
seg = pkuseg.pkuseg(user_dict=lexicon)   # load the model with the user dictionary
text = seg.cut(content)
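As a quick sanity check, here is a small demonstration of the effect (a sketch; the sample sentence is made up for illustration):

import pkuseg

sample = '我在朋友圈分享了一个小程序'      # made-up example sentence
seg_default = pkuseg.pkuseg()
seg_custom = pkuseg.pkuseg(user_dict=['小程序', '朋友圈', '公众号'])
print(seg_default.cut(sample))   # '朋友圈' and '小程序' may be split apart
print(seg_custom.cut(sample))    # kept as single tokens by the user dictionary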

With the user dictionary in place, the top 50 high-frequency words come out like this:

163 WeChat
112 users
89 products
72 moments
56 tools
55 socializing
53 Mini Program
40 video
39 hope
39 time
36 games
33 reading
32 content
31 friends
31 platform
30 articles
29 information
27 team
27 AI
26 APP
25 official account
25 service
24 good friends
22 photos
21 era
21 record
20 mobile phone
20 recommendation
20 enterprises
19 motivation
18 function
18 true
18 life
17 flow
16 computer
15 space
15 discovery
15 creativity
15 embodiment
15 companies
15 value
14 version
14 sharing
14 future
13 Internet
13 release
13 ability
13 discussion
13 dynamic
12 design

Zhang Xiaolong's most frequent words are users, friends, motivation, value, sharing, creativity, discovery, and so on: "users" appears 112 times, "hope" 39 times, and "friends" 31 times. These words capture the spirit of the Internet. If we turn them into a word cloud, the effect should be even more striking.
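As a final step, here is a minimal sketch of turning the frequencies into an actual word cloud image. It assumes the third-party wordcloud package (pip3 install wordcloud) and a CJK-capable font file; the font path below is a placeholder you will need to adjust for your system, and text and stopwords are reused from the code above:

from collections import Counter
from wordcloud import WordCloud

# Reuse `text` (segmented with the user dictionary) and `stopwords` from above.
counter = Counter(w for w in text if w not in stopwords)

wc = WordCloud(
    font_path="msyh.ttc",        # placeholder: any font that covers Chinese
    width=800,
    height=600,
    background_color="white",
)
wc.generate_from_frequencies(dict(counter.most_common(50)))
wc.to_file("wordcloud.png")      # write the word cloud image to disk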

At this point, you should have a deeper understanding of how to generate a word cloud with Python and pkuseg. The best way to consolidate it is to try it out on a text of your own.
