This article explains in detail how to crawl bilibili's on-screen comments (danmaku) with Python and turn them into a word cloud. The editor thinks it is very practical and shares it here for reference; I hope you get something out of it.
First you need the video's cid: open the developer tools with F12, refresh the page with F5, find the cid among the network requests, and splice it into the comment URL.
You can also write code that parses a response to get the cid and then builds the URL, as sketched below.
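A minimal sketch of fetching the cid programmatically, assuming bilibili's pagelist endpoint (api.bilibili.com/x/player/pagelist) still returns the cid for a given bvid; the bvid shown is only a placeholder, and in practice you may also need to send request headers:

import requests

def get_cid(bvid):
    # query the pagelist API for the video's parts and take the first cid
    res = requests.get('https://api.bilibili.com/x/player/pagelist', params={'bvid': bvid})
    return res.json()['data'][0]['cid']

# cid = get_cid('BV1xx411c7mD')   # placeholder bvid, replace with a real one
# xml_url = 'http://comment.bilibili.com/{}.xml'.format(cid)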
You can use either requests or urllib; here I use requests to download the XML file of comments.
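None of the snippets in the article show their imports. Assuming the libraries used throughout (requests, re, jieba, numpy, pandas, matplotlib and wordcloud), a script would need something like the following at the top; the source of imread is not shown in the article, so matplotlib's is used here as one option:

import re
import requests
import numpy
import pandas as pd
import jieba
import matplotlib.pyplot as plt
from matplotlib.pyplot import imread   # assumption: the article's imread could also come from another library
from wordcloud import WordCloud, ImageColorGenerator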
Code to get the XML:

def get_data():
    # download the raw danmaku XML for the video (6315651 is the cid)
    res = requests.get('http://comment.bilibili.com/6315651.xml')
    res.encoding = 'utf8'
    with open('gugongdanmu.xml', 'a', encoding='utf8') as f:
        f.writelines(res.text)
Parse the XML:

def analyze_xml():
    f1 = open("gugongdanmu.xml", "r", encoding='utf8')
    f2 = open("tanmu2.txt", "w", encoding='utf8')
    count = 0
    # regular expression that matches the XML tags wrapped around each comment
    dr = re.compile(r'<[^>]+>', re.S)
    while True:
        line = f1.readline()
        if not line:
            break
        # replace each matched tag with an empty string, keeping only the comment text
        dd = dr.sub('', line)
        # dd = re.findall(dr, line)
        count = count + 1
        f2.writelines(dd)
    print(count)
    f1.close()
    f2.close()
Strip out the useless symbols and digits and keep only the Chinese characters:

def analyze_hanzi():
    f1 = open("tanmu2.txt", "r", encoding='utf8')
    f2 = open("tanmu3.txt", "w", encoding='utf8')
    count = 0
    # dr = re.compile(r'<[^>]+>', re.S)
    # match runs of Chinese characters only
    dr = re.compile(r'[\u4e00-\u9fa5]+', re.S)
    while True:
        line = f1.readline()
        if not line:
            break
        # drop the useless symbols and digits
        # dd = dr.sub('', line)
        # findall keeps only the Chinese fragments of each line
        dd = re.findall(dr, line)
        count = count + 1
        f2.writelines(dd)
    print(count)
    f1.close()
    f2.close()
    # pattern = re.compile(r'[\u4e00-\u9fa5]+')
Use jieba word segmentation to generate word clouds
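The function below calls read_txt_file(), which the article never defines. A minimal sketch, assuming it simply reads back the cleaned text produced in the previous step (tanmu3.txt):

def read_txt_file():
    # read the cleaned, Chinese-only text produced by analyze_hanzi()
    with open("tanmu3.txt", "r", encoding='utf8') as f:
        return f.read()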
def show_sign():
    content = read_txt_file()
    # cut the text into words with jieba
    segment = jieba.lcut(content)
    words_df = pd.DataFrame({'segment': segment})
    # load the stop-word list and filter those words out
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    print(words_df)
    print('-')
    # count how often each word appears
    words_stat = words_df.groupby(by=['segment'])['segment'].agg(numpy.size)
    words_stat = words_stat.to_frame()
    words_stat.columns = ['count']
    words_stat = words_stat.reset_index().sort_values(by=["count"], ascending=False)
    # set the word cloud attributes
    color_mask = imread('ciyun.png')
    wordcloud = WordCloud(font_path="simhei.ttf",    # set a font that can display Chinese
                          background_color="white",  # background color
                          max_words=1000,            # maximum number of words shown in the cloud
                          mask=color_mask,           # background image used as the mask
                          max_font_size=100,         # maximum font size
                          random_state=42,
                          width=1000, height=860, margin=2
                          # default image size; if a mask image is used, the saved image
                          # follows that image's size, and margin is the spacing between words
                          )
    # generate the word cloud: generate() takes raw text, or generate_from_frequencies()
    # can be used after computing the word frequencies ourselves
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    print(word_frequence)
    # for key, value in word_frequence:
    # write_txt_file(word_frequence)
    word_frequence_dict = {}
    for key in word_frequence:
        word_frequence_dict[key] = word_frequence[key]
    wordcloud.generate_from_frequencies(word_frequence_dict)
    # generate color values from the background image
    image_colors = ImageColorGenerator(color_mask)
    # recolor the cloud with those colors
    wordcloud.recolor(color_func=image_colors)
    # save the picture
    wordcloud.to_file('output.png')
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
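To run the whole pipeline end to end, the four functions can simply be called in order; a minimal driver, using only the function names defined above:

if __name__ == '__main__':
    get_data()        # download the danmaku XML
    analyze_xml()     # strip the XML tags
    analyze_hanzi()   # keep only the Chinese characters
    show_sign()       # segment, count and draw the word cloud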
Run the program: the generated word cloud is saved as output.png and the word-count statistics are printed to the console. Done!
Tip: if pip is too slow, switch to a faster mirror, then install whichever libraries you are still missing (requests, jieba, wordcloud, numpy, pandas, matplotlib).
This is the end of the article on how to crawl bilibili's on-screen comments with Python and make a word cloud. I hope the above content was helpful and that you learned something; if you think the article is good, please share it so more people can see it.