
How to crawl Bilibili's bullet comments with Python

Updated 2025-01-19. From: SLTechnology News&Howtos (shulou)


Shulou (Shulou.com) 06/01 report

This article explains how to crawl Bilibili's bullet comments (the comments that scroll across a video while it plays, also known as danmaku) with Python. It should be a useful reference for anyone interested in the topic.

The text and images in this article come from the internet and are for learning and exchange only, not for any commercial use. If you have any questions, please contact us promptly.

Preface

Presumably everyone is familiar with Bilibili (affectionately nicknamed the "little broken site"), which is full of remarkable videos and whose bullet comments have become a source of fun for many people. There are plenty of crawlers for it, but most are limited by the size of the bullet comment pool and can only fetch a small portion of the comments.

So, is there any way to crawl more of Bilibili's bullet comments?

(1) Crawling bullet comments (limited quantity)

First, we need to find the API that serves a video's bullet comments. Using the browser's F12 developer tools, we can find that the endpoint is:

https://api.bilibili.com/x/v1/dm/list.so?oid={oid/cid}

In fact, besides this endpoint, there is another one that also returns bullet comments:

https://comment.bilibili.com/{oid/cid}.xml

Here "oid" and "cid" are the ID numbers Bilibili assigns to each video. What we usually see in the browser address bar, however, is the "bvid", so a conversion is needed.

Several APIs can do this conversion. For example, you can define the following function:

import requests

def get_cid(bvid):
    '''
    Get a video's cid from its bvid.
    Input: video bvid
    Output: video cid
    '''
    url = 'https://api.bilibili.com/x/player/pagelist?bvid=%s&jsonp=jsonp' % bvid
    res = requests.get(url)
    data = res.json()
    return data['data'][0]['cid']
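The function above returns only the first entry of `data`. If a video is split into several parts, the same response presumably carries one entry per part. A minimal sketch, assuming every entry has a `cid` key as in the code above; `all_cids` is a hypothetical helper name and the sample payload is made up for illustration:

```python
def all_cids(payload):
    """Return the cid of every entry in a pagelist-style response dict."""
    entries = payload.get('data') or []  # tolerate a missing or null 'data'
    return [entry['cid'] for entry in entries]


# Made-up payload, shaped like the response that get_cid reads
sample = {'data': [{'cid': 111}, {'cid': 222}]}
print(all_cids(sample))
```

Each cid in the returned list could then be passed to the bullet comment endpoint in turn.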

With the "cid", we can crawl the bullet comments through the API found above. The code is as follows:

oid = get_cid(bvid)  # cid and oid are the same here
url = 'https://api.bilibili.com/x/v1/dm/list.so?oid=%d' % oid
res = requests.get(url)
res.encoding = 'utf-8'
text = res.text

Note: you need to set the encoding of res to utf-8, otherwise the text will come out garbled.
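To see why the encoding matters, here is a small offline illustration. As far as the requests documentation describes, when a text response declares no charset the library falls back to ISO-8859-1, so UTF-8 bytes get decoded with the wrong table. The sample string is arbitrary:

```python
original = '弹幕'                     # text as the server means it
raw = original.encode('utf-8')        # bytes as they travel over the wire

garbled = raw.decode('iso-8859-1')    # wrong guess: every byte becomes one character
fixed = raw.decode('utf-8')           # right encoding: the text is restored

print(garbled == original)
print(fixed == original)
```

Setting `res.encoding = 'utf-8'` before reading `res.text` corresponds to the second decode.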

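The response we get is an XML file in which each bullet comment is a <d> element: its p attribute holds a list of comma-separated fields, and its text is the comment itself.

So all the fields can be extracted with a regular expression: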

import re
import time

def parse_dm(text):
    '''
    Parse a video's bullet comments.
    Input: raw bullet comment data of the video
    Output: parsed bullet comments
    '''
    result = []  # stores the parsed results
    data = re.findall('<d p="(.*?)">(.*?)</d>', text)
    for d in data:
        item = {}  # one bullet comment record
        dm = d[0].split(',')  # comment details: time of appearance, user, etc.
        item['occurrence time'] = float(dm[0])
        item['mode'] = int(dm[1])
        item['font'] = int(dm[2])
        item['color'] = int(dm[3])
        item['comment time'] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(int(dm[4])))
        item['on-screen comment pool'] = int(dm[5])
        item['user ID'] = dm[6]  # not the real user ID, but the hexadecimal result of a CRC32 checksum
        item['rowID'] = dm[7]  # ID in the database, used by the "historical bullet comments" feature
        item['on-screen comment'] = d[1]
        result.append(item)
    return result
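A regular expression leaves XML escapes such as `&amp;` in the comment text untouched. As an alternative sketch (not the author's method), the standard library's `xml.etree.ElementTree` can parse the same data and unescape the text. `parse_dm_xml` is a hypothetical name, only three of the fields are kept for brevity, and the sample XML is made up:

```python
import xml.etree.ElementTree as ET


def parse_dm_xml(text):
    """Parse bullet comment XML with an XML parser instead of a regex."""
    root = ET.fromstring(text)
    result = []
    for d in root.iter('d'):
        fields = d.get('p', '').split(',')
        if len(fields) < 8:  # skip entries that lack the expected fields
            continue
        result.append({
            'occurrence time': float(fields[0]),
            'rowID': fields[7],
            'on-screen comment': d.text or '',
        })
    return result


# Made-up sample in the <d p="...">text</d> shape described above
sample = '<i><d p="12.5,1,25,16777215,1600000000,0,abcd1234,42">A &amp; B</d></i>'
print(parse_dm_xml(sample))
```

The trade-off is that a strict XML parser raises an error on malformed input, whereas the regex silently skips anything it cannot match.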

Parsing the response obtained with requests gives the corresponding bullet comments:

dms = parse_dm(text)  # parse the bullet comments
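The parsed values are raw numbers, so a little post-processing can make them easier to read. The sketch below assumes that 'color' is a decimal RGB value and that 'occurrence time' is an offset in seconds from the start of the video; neither is stated in the article, and both helper names are hypothetical:

```python
def color_to_hex(color):
    """Format a decimal color value as #rrggbb (assumes decimal RGB)."""
    return '#%06x' % color


def seconds_to_clock(seconds):
    """Format a playback offset in seconds as mm:ss."""
    minutes, secs = divmod(int(seconds), 60)
    return '%02d:%02d' % (minutes, secs)


# Made-up record using the same keys as parse_dm
item = {'occurrence time': 83.7, 'color': 16777215}
print(color_to_hex(item['color']), seconds_to_clock(item['occurrence time']))
```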

However, this approach is limited by the bullet comment pool and can only fetch a small portion at a time. When a video has a large number of bullet comments, it clearly falls short.

So we have to find another way.

(2) Crawling bullet comments (no quantity limit)

Through further analysis, we can find another endpoint that returns the bullet comment history for a given day:

https://api.bilibili.com/x/v2/dm/history?type=1&oid={oid/cid}&date=xx-xx

Here "date" is the date, and more bullet comments can be obtained by iterating over dates. Note that this API requires login, so cookies must be sent with the request. The following function can be defined:

def get_history(bvid, date):
    '''
    Get a video's historical bullet comments.
    Input: video bvid, date
    Output: the video's bullet comments starting from the given date
    '''
    oid = get_cid(bvid)
    url = 'https://api.bilibili.com/x/v2/dm/history?type=1&oid=%d&date=%s' % (oid, date)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Origin': 'https://www.bilibili.com',
        'Connection': 'keep-alive',
        'Referer': 'https://www.bilibili.com/video/BV1k54y1U79J',
        'TE': 'Trailers',
    }
    # this API requires login, so cookies are needed
    cookies = {}
    res = requests.get(url, headers=headers, cookies=cookies)
    res.encoding = 'utf-8'
    text = res.text
    dms = parse_dm(text)  # parse the bullet comments
    return dms
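The `cookies` dict in the function above is left empty for the reader to fill in. One possible way, sketched here, is to log in with a browser, copy the value of the Cookie request header from the developer tools, and split it into a dict. `cookies_from_header` is a hypothetical helper, and the cookie names below are placeholders rather than real Bilibili cookie names:

```python
def cookies_from_header(header):
    """Turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for pair in header.split(';'):
        if '=' not in pair:
            continue  # ignore empty or malformed fragments
        name, value = pair.split('=', 1)  # split once: values may contain '='
        cookies[name.strip()] = value.strip()
    return cookies


raw = 'name_a=value_a; name_b=x=y'  # placeholder cookie string
cookies = cookies_from_header(raw)
print(cookies)
```

The resulting dict can be passed to `requests.get` through its `cookies` argument, as the function above already does.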

If you want more bullet comments, you can request the data for every single day, but that has two problems: it is inefficient, and it returns duplicate data. Because each request usually returns data spanning several days, we do not need to visit every day; instead, we can step backwards gradually based on the results:

import datetime

# get_info, get_dm and del_repeat are helper functions not shown in this article
def get_dms(bvid):
    '''
    Get a video's bullet comments (this method fetches more of them).
    Input: video bvid
    Output: the video's bullet comments
    '''
    print('video parsing...')
    info = get_info(bvid)
    print('video parsing complete!')
    print('[video title]: %s\n[video playback]: %d\n[number of on-screen comments]: %d\n[upload date]: %s' % (info[0], info[1], info[2], info[3]))
    dms = get_dm(bvid)  # stores the bullet comments
    if len(dms) >= info[2]:  # all bullet comments have already been fetched
        return dms
    else:
        dms = []
        date = time.strftime('%Y-%m-%d', time.localtime(time.time()))  # start from today
        while True:
            dm = get_history(bvid, date)
            dms.extend(dm)
            print('"%s" on-screen comment crawl completed! (%d)' % (date, len(dm)))
            if len(dm) == 0:  # nothing returned
                break
            end = dm[-1]['comment time'].split()[0]  # date of the last bullet comment
            if end == date:  # if the last one is still on the same day, go back one day
                end = (datetime.datetime.strptime(end, '%Y-%m-%d') - datetime.timedelta(days=1)).strftime('%Y-%m-%d')
            if end == info[3]:  # the upload day has been reached
                break
            else:
                date = end
        dm = get_history(bvid, info[3])  # avoid missing some data from the upload day
        dms.extend(dm)
        print('on-screen comment crawling is complete! (total %d)' % len(dms))
        print('data de-duplicating...')
        dms = del_repeat(dms, 'rowID')  # de-duplicate the bullet comments by rowID
        print('data de-duplication complete! (total %d)' % len(dms))
        return dms
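The function above calls `del_repeat`, whose definition is not shown in the article. A minimal sketch of what such a helper might look like, assuming it only needs to drop records that share a key value while keeping the original order; this is an illustrative guess, not the author's implementation:

```python
def del_repeat(items, key):
    """Drop dicts whose value under `key` was already seen, keeping order."""
    seen = set()
    unique = []
    for item in items:
        value = item[key]
        if value in seen:
            continue  # duplicate of an earlier record
        seen.add(value)
        unique.append(item)
    return unique


# Made-up records: the second and third share a rowID
records = [
    {'rowID': '1', 'on-screen comment': 'a'},
    {'rowID': '2', 'on-screen comment': 'b'},
    {'rowID': '2', 'on-screen comment': 'b'},
]
print(len(del_repeat(records, 'rowID')))
```

Using a set for the seen values keeps the pass linear in the number of records, which matters when overlapping daily results add many duplicates.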

Run the main function:

import pandas as pd

if __name__ == '__main__':
    dms = get_dms('BV1HJ411L7DP')
    dms = pd.DataFrame(dms)
    dms.to_csv('all the way north.csv', index=False)
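If pandas is not available, the same list of dicts can be saved with the standard library's `csv` module. A minimal sketch, where `write_dms_csv` is a hypothetical helper and the rows are made up; it writes to an in-memory buffer so the result can be inspected directly:

```python
import csv
import io


def write_dms_csv(dms, fileobj):
    """Write a list of bullet comment dicts as CSV rows with a header line."""
    if not dms:
        return  # nothing to write
    writer = csv.DictWriter(fileobj, fieldnames=list(dms[0].keys()))
    writer.writeheader()
    writer.writerows(dms)


# Made-up rows; the second comment contains a comma and gets quoted
rows = [
    {'rowID': '1', 'on-screen comment': 'hello'},
    {'rowID': '2', 'on-screen comment': 'a,b'},
]
buf = io.StringIO()
write_dms_csv(rows, buf)
print(buf.getvalue())
```

To write to disk instead, pass a file opened with `open(path, 'w', newline='', encoding='utf-8-sig')`; the `utf-8-sig` encoding adds a byte order mark, which some spreadsheet programs need in order to display Chinese text correctly.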

The crawling process is as follows:
