
How to use Python to obtain iQiyi TV series on-screen comment data


This article explains how to use Python to obtain the bullet-screen (on-screen comment) data of iQiyi TV series. The explanation is simple and clear, and easy to learn and understand. Please follow the editor's train of thought and work through it step by step!

Finding the on-screen comment data

iQiyi's on-screen comment data is stored in .z compressed files. First, follow the steps below to find the comment URL and the tvid list, then download the compressed files. Finally, decompress, process, store, and analyze them.

To crawl multiple pages, you need to work out the URL pattern, then loop over requests built from that pattern to fetch the required content.

The URL of one such on-screen comment file is

https://cmts.iqiyi.com/bullet/93/00/6024766870349300_300_1.z

The general form of the URL is

url = 'https://cmts.iqiyi.com/bullet/{}/{}/{}_300_{}.z'

where the first placeholder is the 4th- and 3rd-from-last digits of the tvid, the second placeholder is the last two digits of the tvid, the third placeholder is the full tvid, and the fourth placeholder is the sub-file serial number. The number of sub-files is not unlimited and varies from one TV series to another.
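As a quick sanity check on this pattern, here is a minimal sketch that rebuilds the example URL above from its tvid; the helper name build_bullet_url is purely illustrative, not from iQiyi:

# A minimal sketch of the URL pattern described above. build_bullet_url is
# an illustrative helper name; the tvid is the example from this article.
def build_bullet_url(tvid, part):
    base_url = 'https://cmts.iqiyi.com/bullet/{}/{}/{}_300_{}.z'
    return base_url.format(tvid[-4:-2], tvid[-2:], tvid, part)

print(build_bullet_url('6024766870349300', 1))
# -> https://cmts.iqiyi.com/bullet/93/00/6024766870349300_300_1.z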

Entering such a URL in the browser downloads a compressed file containing the on-screen comments.

Use the compression library zlib to decompress and inspect the downloaded file.

import zlib
from bs4 import BeautifulSoup

with open(r"C:\Users\HP\Downloads\6024766870349300", 'rb') as fin:
    content = fin.read()
btArr = bytearray(content)
xml = zlib.decompress(btArr).decode('utf-8')  # decompress the .z payload
bs = BeautifulSoup(xml, "xml")
bs

Output

So as long as you have the tvid, you can easily obtain the on-screen comment data for a TV series.

import zlib
from bs4 import BeautifulSoup
import pandas as pd
import requests

def get_data(tv_name, tv_id):
    """
    Get the on-screen comment data of one episode by tvid
    :param tv_name: episode label, e.g. Episode 1, Episode 2, ...
    :param tv_id: tvid of the episode
    :return: DataFrame, the final data
    """
    base_url = 'https://cmts.iqiyi.com/bullet/{}/{}/{}_300_{}.z'
    # create a new DataFrame with only the headers
    head_data = pd.DataFrame(columns=['uid', 'contentsId', 'contents', 'likeCount'])
    for i in range(1, 20):
        url = base_url.format(tv_id[-4:-2], tv_id[-2:], tv_id, i)
        print(url)
        res = requests.get(url)
        if res.status_code == 200:
            btArr = bytearray(res.content)
            xml = zlib.decompress(btArr).decode('utf-8')  # decompress the file
            bs = BeautifulSoup(xml, "xml")  # parse the XML
            data = pd.DataFrame(columns=['uid', 'contentsId', 'contents', 'likeCount'])
            data['uid'] = [i.text for i in bs.findAll('uid')]
            data['contentsId'] = [i.text for i in bs.findAll('contentId')]
            data['contents'] = [i.text for i in bs.findAll('content')]
            data['likeCount'] = [i.text for i in bs.findAll('likeCount')]
        else:
            break
        head_data = pd.concat([head_data, data], ignore_index=True)
    head_data['tv_name'] = tv_name
    return head_data
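As a usage sketch, assuming the example tvid above belongs to the first episode (the episode label here is illustrative), a single call might look like:

# Illustrative call; the tvid is the example used earlier in this article.
episode1 = get_data('Episode 1', '6024766870349300')
print(episode1.head())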

Get tvid

Above, the comment-file data was obtained through the tvid, so the next problem is how to obtain the tvid itself. Don't worry, let's continue the analysis: search the page source for tvid directly with Ctrl + F.

You can then extract the tvid directly from the returned result with a regular expression.

from requests_html import HTMLSession, UserAgent
from bs4 import BeautifulSoup
import re

def get_tvid(url):
    """
    Get the tvid
    :param url: request URL
    :return: str, tvid
    """
    session = HTMLSession()          # create an HTML session object
    user_agent = UserAgent().random  # create a random request header
    header = {"User-Agent": user_agent}
    res = session.get(url, headers=header)
    res.encoding = 'utf-8'
    bs = BeautifulSoup(res.text, "html.parser")
    pattern = re.compile(r".*?tvid.*?(\d{16}).*?")  # define the regular expression
    text_list = bs.find_all(text=pattern)  # get the content matching the regular expression
    for t in range(len(text_list)):
        res_list = pattern.findall(text_list[t])
        if not res_list:
            pass
        else:
            tvid = res_list[0]
    return tvid
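For example, calling it on the episode page URL that appears later in this article (assuming the page source does contain a 16-digit tvid) would look like:

# Illustrative call using the episode page URL from the Selenium example below.
tvid = get_tvid("https://www.iqiyi.com/v_1meaw5kgh4s.html")
print(tvid)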

That solves the tvid problem: each episode has its own tvid, and once you have an episode's URL you can request the page and extract the tvid from the response. So the remaining question is how to get the URL of each episode.

Get the URL for each episode

Use the browser's element-selection tool to locate the episode-selection area. Because this information is loaded dynamically by JavaScript, it is obtained with a Selenium-driven browser.

Some friends will say that you can get this href URL directly from the returned page content; feel free to try it yourself.

The result of Yunduojun's attempt was href="javascript:void(0);", so one way to solve this problem is to use Selenium to simulate a browser and obtain the dynamically loaded JavaScript content.

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_javascript0_links(url, class_name, class_name_father, sleep_time=0.02):
    """
    Selenium simulates user clicks to crawl urls
    :param url: target page
    :param class_name: class of the element to click
    :param class_name_father: parent class of class_name, also clicked
    :param sleep_time: time left for the page to load and go back
    :return: list, the hyperlinks reached by clicking the class_name elements
    """
    def wait(locator, timeout=15):
        """wait until the element has finished loading"""
        WebDriverWait(driver, timeout).until(EC.presence_of_element_located(locator))

    options = Options()
    # options.add_argument("--headless")  # headless mode; comment this line out to watch the browser
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    locator = (By.CLASS_NAME, class_name)
    wait(locator)
    element = driver.find_elements_by_class_name(class_name_father)
    elements = driver.find_elements_by_class_name(class_name)
    link = []
    linkNum = len(elements)
    for j in range(len(element)):
        wait(locator)
        driver.execute_script("arguments[0].click();", element[j])  # simulate a user click
        for i in range(linkNum):
            print(i)
            wait(locator)
            # re-locate the elements to prevent StaleElementReferenceException
            elements = driver.find_elements_by_class_name(class_name)
            driver.execute_script("arguments[0].click();", elements[i])  # simulate a user click
            time.sleep(sleep_time)
            link.append(driver.current_url)
            time.sleep(sleep_time)
            driver.back()
    driver.quit()
    return link

if __name__ == "__main__":
    url = "https://www.iqiyi.com/v_1meaw5kgh4s.html"
    class_name = "qy-episode-num"
    link = get_javascript0_links(url, class_name, class_name_father="tab-bar")
    for i, _link in enumerate(link):
        print(i, _link)

Main function

Next, all the steps above are strung together in a main function.
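The article does not reproduce main() itself, so here is a minimal sketch, assuming the page URL and class names from the Selenium example above and an "Episode N" labelling scheme; the real main() may differ in its details:

import pandas as pd

def main():
    # Sketch only: collect every episode URL, extract each episode's tvid,
    # then download and concatenate its on-screen comment data.
    url = "https://www.iqiyi.com/v_1meaw5kgh4s.html"
    links = get_javascript0_links(url, class_name="qy-episode-num",
                                  class_name_father="tab-bar")
    all_data = []
    for i, link in enumerate(links, start=1):
        tvid = get_tvid(link)  # tvid of this episode
        all_data.append(get_data('Episode {}'.format(i), tvid))  # its comments
    return pd.concat(all_data, ignore_index=True)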

The data obtained are as follows:

>>> data = main()
>>> data.info()
"""
RangeIndex: 246716 entries, 0 to 246715
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   tv_name     246716 non-null  object
 1   uid         246716 non-null  object
 2   contentsId  246716 non-null  object
 3   contents    246716 non-null  object
 4   likeCount   246716 non-null  object
dtypes: object(5)
memory usage: 9.4+ MB
"""
>>> data.sample(10)

Word cloud

First, word segmentation

Use the Chinese word-segmentation library jieba to segment the text and remove stop words.

import jieba

def get_cut_words(content_series):
    """
    :param content_series: the text series to be segmented
    :return: list, the segmented words after filtering
    """
    # read the stop-word list
    stop_words = []
    with open("stop_words.txt", 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            stop_words.append(line.strip())
    # add keywords
    my_words = ['Ni Ni', 'Liu Shishi', 'Lock', 'Jiang is three years old', 'Chen Daoming']
    for i in my_words:
        jieba.add_word(i)
    # custom stop words
    my_stop_words = ['ha', 'haha', 'really']
    stop_words.extend(my_stop_words)
    # word segmentation
    word_num = jieba.lcut(content_series.str.cat(sep='。'), cut_all=False)
    # conditional filtering: drop stop words and single characters
    word_num_selected = [i for i in word_num if i not in stop_words and len(i) >= 2]
    return word_num_selected

Then, plotting

Use stylecloud, an upgraded word-cloud library, to visualize the on-screen comment results.

import stylecloud
from IPython.display import Image

text1 = get_cut_words(content_series=data.contents)
stylecloud.gen_stylecloud(text=' '.join(text1),
                          collocations=False,
                          font_path=r'C:\Windows\Fonts\msyh.ttc',
                          icon_name='fas fa-rocket',
                          size=400,
                          output_name='golden years-word cloud.png')
Image(filename='golden years-word cloud.png')

Thank you for reading! The above is the content of "how to use Python to obtain iQiyi TV series bullet screen data". After studying this article, I believe you have a deeper understanding of the topic, though the specific usage still needs to be verified in practice. The editor will push more articles on related knowledge points for you, welcome to follow!
