2025-01-15 Update From: SLTechnology News & Howtos
Shulou(Shulou.com)06/02 Report--
This article explains in detail how to crawl a video site's danmaku (on-screen "bullet" comments) with Python and turn them into a word cloud. I hope you come away with a working understanding of the technique after reading.
Hello, everyone. Lately there has been so much online gossip that it caught us off guard; hot topics arrive nonstop, and as programmers we may have to work overtime for them at any moment.
Reaction videos of every kind flood the network, home pages are full of trending topics, and even well-known uploaders (UP主) stagger their video releases to avoid peak traffic.
That made me wonder: is there a general way to analyze such hot topics and content? The answer: scrape the danmaku or the comments.
Let's take the danmaku of a Wang Bingbing vlog on Bilibili as our example.
I. Acquisition methods
1. Web page parsing: the page structure may change at any time, so the scraper may need occasional fixes.
2. Python third-party APIs: wrapper libraries may be unmaintained or out of date.
After a quick comparison, I chose the first method.
II. Web page analysis
The key to scraping the danmaku is obtaining the video's cid (also called oid in some places). The cid is easy to find in the browser's developer tools. With it, all of a video's danmaku can be fetched from https://comment.bilibili.com/{cid}.xml.
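As a minimal sketch, the endpoint above can be built from the cid like this (the cid value is only an illustration; substitute the one you find in developer tools):

```python
def danmaku_url(cid: int) -> str:
    """Build the public endpoint that returns all danmaku of a video as XML."""
    return f'https://comment.bilibili.com/{cid}.xml'

# Hypothetical cid for illustration:
print(danmaku_url(283851334))  # https://comment.bilibili.com/283851334.xml
```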
III. Downloading and parsing the danmaku
Since the danmaku are all contained in an XML file, we download that file and parse it with XPath.
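To see what the XPath step does in isolation, here is a small sketch against a hand-made sample in the same shape as Bilibili's danmaku XML (the sample content is hypothetical):

```python
from lxml import etree

# A tiny sample shaped like Bilibili's danmaku XML (hypothetical content).
sample = b'''<?xml version="1.0" encoding="UTF-8"?>
<i>
  <d p="1.2,1,25,16777215,0,0,0,0">first danmaku</d>
  <d p="3.4,1,25,16777215,0,0,0,0">second danmaku</d>
</i>'''

root = etree.fromstring(sample)
# Every <d> element holds the text of one danmaku.
texts = root.xpath('//d/text()')
print(texts)  # ['first danmaku', 'second danmaku']
```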
```python
from lxml import etree
import requests
import time
import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud as wc


class Bilibili():
    """docstring for Bilibili"""

    def __init__(self, oid):
        self.headers = {
            'Host': 'api.bilibili.com',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/69.0.3497.92 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
                      'image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            # The original article also sent a Cookie header; its value was
            # garbled in extraction and is omitted here.
        }
        self.url = 'https://api.bilibili.com/x/v1/dm/list.so?oid=' + str(oid)
        self.barrage_result = self.get_page()

    # Download the xml file that contains the danmaku
    def get_page(self):
        try:
            # small delay to avoid being blocked for crawling too fast
            time.sleep(0.5)
            response = requests.get(self.url, headers=self.headers)
        except Exception as e:
            print('failed to get xml content, %s' % e)
            return False
        else:
            if response.status_code == 200:
                with open('bilibili.xml', 'wb') as f:
                    f.write(response.content)
                return True
            return False

    # Parse the downloaded xml file
    def param_page(self):
        time.sleep(1)
        if self.barrage_result:
            # every <d> tag in the file holds one danmaku's text
            html = etree.parse('bilibili.xml', etree.HTMLParser())
            results = html.xpath('//d//text()')
            return results
```

IV. Deduplicating the danmaku
Repeated danmaku are grouped together, and a new group is created for any comment that has not appeared before. This prepares the data for word-frequency statistics and the word cloud.
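The same grouping can also be done with the standard library's `collections.Counter`, shown here as an alternative sketch on a hypothetical danmaku list:

```python
from collections import Counter

# Hypothetical danmaku list for illustration.
danmaku = ['6666', 'awsl', '6666', '6666', 'awsl', 'front row']
counts = Counter(danmaku)

# Danmaku that appear more than once, with their total counts:
repeated = {text: n for text, n in counts.items() if n > 1}
print(repeated)  # {'6666': 3, 'awsl': 2}
```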
```python
    # Separate repeated danmaku from unique ones (method of class Bilibili)
    def remove_double_barrage(self):
        '''
        double_barrage: every repeated occurrence of a danmaku
        results: the danmaku list with duplicates removed
        barrage: a set holding each repeated danmaku once
        '''
        double_barrage = []
        results = []
        barrage = set()
        for result in self.param_page():
            if result not in results:
                results.append(result)
            else:
                double_barrage.append(result)
                barrage.add(result)
        return double_barrage, results, barrage
```

V. Counting repeats and producing the word cloud
We take a photo of Wang Bingbing from the Internet and lightly process it to serve as the outline (mask) of the word cloud.
```python
    # Count repeats and draw the word cloud (method of class Bilibili)
    def make_wordCould(self):
        double_barrages, results, barrages = self.remove_double_barrage()
        # write each repeated danmaku and its total count to a file
        with open('barrages.txt', 'w') as f:
            for barrage in barrages:
                amount = double_barrages.count(barrage)
                f.write(barrage + ':' + str(amount + 1) + '\n')
        # punctuation to strip before segmentation (the original list was
        # garbled in extraction; these are typical choices)
        stop_words = ['【', '】', '，', '。', '!', '？']
        words = []
        if results:
            for result in results:
                for stop in stop_words:
                    result = ''.join(result.split(stop))
                words.append(result)
        # join the list into one string, then segment it with jieba
        words = ' '.join(words)
        words = ' '.join(jieba.cut(words))
        # the photo serves as the mask (outline) of the word cloud
        bingbing = np.array(Image.open('Bing.jpg'))
        w = wc(font_path='SIMYOU.TTF',  # YouYuan font; needed for Chinese text
               background_color='white',
               width=900,
               height=600,
               max_font_size=15,
               min_font_size=1,
               max_words=3000,
               mask=bingbing)
        w.generate(words)
        w.to_file('bingbing.jpg')


b = Bilibili(283851334)  # the video's cid
b.make_wordCould()       # draw the word cloud
```
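To inspect the statistics afterwards, a small sketch can sort the repeat counts, assuming the `text:count` line format written to barrages.txt above (the helper name and sample lines are hypothetical):

```python
def top_barrages(lines, k=3):
    """Return the k most-repeated danmaku as (text, count) pairs,
    assuming each line looks like 'some danmaku:42'."""
    pairs = []
    for line in lines:
        # rpartition tolerates colons inside the danmaku text itself
        text, _, amount = line.rstrip('\n').rpartition(':')
        if text:
            pairs.append((text, int(amount)))
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:k]

print(top_barrages(['6666:30', 'awsl:12', 'front row:2']))
# [('6666', 30), ('awsl', 12), ('front row', 2)]
```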
Statistical results:
That's all on how to crawl a video site's danmaku with Python and turn it into a word cloud. I hope the content above is helpful and teaches you something new. If you found the article worthwhile, feel free to share it with others.
Welcome to subscribe to "Shulou Technology Information" to get the latest news, interesting stories, and hot topics in the IT industry, and to keep up with the newest Internet, technology, and industry trends.