How python crawls bilibili's on-screen comment to make word cloud

2025-02-23 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains in detail how to use Python to crawl bilibili's on-screen comments (danmaku) and turn them into a word cloud. The editor finds it very practical and shares it here for reference; I hope you get something out of it.

If you need the cid, open the browser developer tools with F12, refresh the page with F5, find the cid, and splice it into the url.

You can also write code that parses the response to get the cid and then splices the url together.
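The cid lookup can be scripted: a minimal sketch, assuming the video page embeds a "cid" field in its inline JSON (the sample string below is a hypothetical stand-in for a real page's HTML):

```python
import re

def extract_cid(page_html):
    """Return the first cid found in a video page's HTML, or None."""
    m = re.search(r'"cid":\s*(\d+)', page_html)
    return int(m.group(1)) if m else None

# hypothetical snippet of the inline JSON a video page embeds
sample = '{"bvid":"BV1xx411c7mD","cid":6315651,"page":1}'
print(extract_cid(sample))  # 6315651
```

The extracted value can then be spliced into the comment url, e.g. 'http://comment.bilibili.com/%d.xml' % cid.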

You can use either requests or urllib; I am using requests to request the link and get the xml file.

Code to get the xml:

import requests

def get_data():
    res = requests.get('http://comment.bilibili.com/6315651.xml')
    res.encoding = 'utf8'
    with open('gugongdanmu.xml', 'a', encoding='utf8') as f:
        f.writelines(res.text)

Parsing the xml:

import re

def analyze_xml():
    f1 = open("gugongdanmu.xml", "r", encoding='utf8')
    f2 = open("tanmu2.txt", "w", encoding='utf8')
    count = 0
    # regular expression matching the xml tags so they can be stripped
    dr = re.compile(r'<[^>]+>', re.S)
    while 1:
        line = f1.readline()
        if not line:
            break
        # replace each matched tag with an empty string
        dd = dr.sub('', line)
        # dd = re.findall(dr, line)
        count = count + 1
        f2.writelines(dd)
    print(count)
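Since the danmaku file is well-formed XML in which each comment lives in a <d> element, the standard library's ElementTree can extract the text more robustly than a tag-stripping regex. A sketch, assuming the usual layout of a root <i> element with <d> children:

```python
import xml.etree.ElementTree as ET

def danmaku_from_xml(xml_text):
    """Extract the comment text from every <d> element of a danmaku file."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter('d') if d.text]

sample = '<i><d p="1.0,1,25">first comment</d><d p="2.0,1,25">second</d></i>'
print(danmaku_from_xml(sample))  # ['first comment', 'second']
```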

Get rid of the useless symbols and numbers and keep only the Chinese characters:

def analyze_hanzi():
    f1 = open("tanmu2.txt", "r", encoding='utf8')
    f2 = open("tanmu3.txt", "w", encoding='utf8')
    count = 0
    # dr = re.compile(r'<[^>]+>', re.S)
    # all Chinese characters [\u4e00-\u9fa5]
    dr = re.compile(r'[\u4e00-\u9fa5]+', re.S)
    while 1:
        line = f1.readline()
        if not line:
            break
        # drop the useless symbols and numbers
        # dd = dr.sub('', line)
        dd = re.findall(dr, line)
        count = count + 1
        f2.writelines(dd)
    print(count)

# pattern = re.compile(r'[\u4e00-\u9fa5]+')
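The character class that keeps only Chinese characters is [\u4e00-\u9fa5]+, the common CJK Unified Ideographs range; everything else (digits, punctuation, Latin letters) falls out. A quick demonstration on a made-up danmaku line:

```python
import re

hanzi = re.compile(r'[\u4e00-\u9fa5]+')

line = '故宫666 so beautiful 太美了!'
print(hanzi.findall(line))  # ['故宫', '太美了']
```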

Use jieba word segmentation and generate the word cloud:

import jieba
import numpy
import pandas as pd
import matplotlib.pyplot as plt
from imageio import imread
from wordcloud import WordCloud, ImageColorGenerator

def show_sign():
    # read_txt_file() (its definition is not shown in the article) returns the text to segment
    content = read_txt_file()
    segment = jieba.lcut(content)
    words_df = pd.DataFrame({'segment': segment})
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep=" ", names=['stopword'], encoding='utf-8')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    print(words_df)
    print('-')
    words_stat = words_df.groupby(by=['segment'])['segment'].agg(numpy.size)
    words_stat = words_stat.to_frame()
    words_stat.columns = ['count']
    words_stat = words_stat.reset_index().sort_values(by=["count"], ascending=False)
    # set the word-cloud attributes
    color_mask = imread('ciyun.png')
    wordcloud = WordCloud(font_path="simhei.ttf",  # font that can display Chinese
                          background_color="white",  # background color
                          max_words=1000,  # maximum number of words shown in the cloud
                          mask=color_mask,  # background image
                          max_font_size=100,  # maximum font size
                          random_state=42,
                          width=1000, height=860, margin=2
                          # default image size; when a mask image is used, the saved
                          # image follows the mask's size; margin is the spacing
                          # around each word
                          )
    # generate the cloud: either feed the raw text to generate(), or compute the
    # word frequencies first and call generate_from_frequencies()
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    print(word_frequence)
    # for key, value in word_frequence:
    #     write_txt_file(word_frequence)
    word_frequence_dict = {}
    for key in word_frequence:
        word_frequence_dict[key] = word_frequence[key]
    wordcloud.generate_from_frequencies(word_frequence_dict)
    # generate the color values from the background image
    image_colors = ImageColorGenerator(color_mask)
    # recolor the cloud to match the image
    wordcloud.recolor(color_func=image_colors)
    # save the picture
    wordcloud.to_file('output.png')
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
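The pandas groupby above only builds a word-to-count mapping, and generate_from_frequencies() accepts any such dict, so the same frequencies can also be computed with the standard library's collections.Counter. A minimal sketch, where the token list stands in for jieba's output after stopword filtering:

```python
from collections import Counter

# stand-in for jieba.lcut() output after stopword filtering
segments = ['故宫', '美', '故宫', '厉害', '美', '故宫']

# top-1000 words by count, mirroring words_stat.head(1000)
word_frequence = dict(Counter(segments).most_common(1000))
print(word_frequence)  # {'故宫': 3, '美': 2, '厉害': 1}
```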

Run the program. [The original article shows screenshots here of the generated word cloud and the word-count statistics.]

Done!

If pip is too slow, switch it to a mirror source, then install whichever of the libraries above you don't already have.

This is the end of the article on how Python crawls bilibili's on-screen comments to make a word cloud. I hope the content above is of some help to you and lets you learn something new. If you think the article is good, please share it so more people can see it.
