
How to Crawl Trending Short Videos with Python + Selenium

Updated 2025-02-25. From: SLTechnology News&Howtos, Shulou (Shulou.com), 05/31 report.

This article explains how to crawl trending short videos with Python and Selenium. It covers the analysis of the target pages and the core crawler code, including the problems that commonly come up in practice, such as login pop-ups that block the page.

Knowledge points involved

1. Selenium is a browser automation and testing tool that simulates the actions of a user operating the browser. The points relevant here are:

Selenium supports eight element locator strategies: ID, Name, ClassName, CSS Selector, LinkText, Partial LinkText, XPath, and TagName.

Selenium can return a single element (find_element) or a list of elements (find_elements).

Once an element is located, you can set or read its values, or trigger events on it (for example, click).

2. Requests is an HTTP request library. After Selenium obtains the URL of the video, the requests library fetches the video stream, which is then saved as a local video file.

3. Browser developer tools let you inspect the HTML identifiers (tag, id, class) of buttons, links, and other elements on the page.
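The difference between find_element and find_elements in item 1 matters in practice: find_element raises NoSuchElementException when nothing matches, while find_elements returns an empty list. The following is a minimal sketch of two helpers built on that behavior. The names first_or_none and exists are illustrative and do not come from the article; root may be a driver or an already located element, and by is a locator strategy such as By.CLASS_NAME.

```python
def first_or_none(root, by, value):
    """Return the first element matching the locator, or None.

    find_elements returns a list that is empty when nothing matches,
    so the "not found" case needs no exception handling.
    """
    matches = root.find_elements(by=by, value=value)
    return matches[0] if matches else None


def exists(root, by, value):
    """Report whether at least one element matches the locator."""
    return first_or_none(root, by, value) is not None
```

With a live driver this could be called as exists(driver, By.CLASS_NAME, 'BHgRhxNh'). The checkIsExistsByClass method that the article's code calls is not shown, and it may work along these lines.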

Target analysis

Before crawling, analyze the structure of the target pages. The analysis falls into three steps:

1. Analyze the hot list directory

The hot list directory is a ul tag, and each hot list entry is an li child tag that contains the heat value, the title, and other content. Clicking the title link opens the video playback page.
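The structure described above can be turned into a small parsing helper. This is a sketch under stated assumptions, not the article's method: parse_hot_item and the heat_class parameter are hypothetical names, li is an already located li element, and the locator strings "tag name" and "class name" are the string values behind By.TAG_NAME and By.CLASS_NAME, written out so the sketch has no import dependency.

```python
def parse_hot_item(li, heat_class):
    """Turn one <li> of the hot list into a dict, or None if it has no link.

    `heat_class` is the class name of the block holding the heat value;
    it is supplied by the caller because it depends on the live page.
    """
    # "tag name" / "class name" equal By.TAG_NAME / By.CLASS_NAME.
    links = li.find_elements(by="tag name", value="a")
    if not links:
        return None  # no title link, so nothing to crawl

    info = {"url": links[0].get_attribute("href"), "text": links[0].text}

    # An entry may carry no heat value; only add the key when it is found.
    for block in li.find_elements(by="class name", value=heat_class):
        spans = block.find_elements(by="tag name", value="span")
        if spans:
            info["value"] = spans[0].text
            break
    return info
```

Checking for the heat block with find_elements avoids special-casing an entry by its position in the list.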

2. Analyze the video playback page

The video plays in a video tag, and the real address of the short video is in a source child tag of video. To ensure playback quality, the video tag contains three source tags; any one of them can be used.
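Since any of the source tags can be used, one option is to take the first one that actually carries an address. The helper below is a minimal sketch; pick_video_src is a hypothetical name, video is an already located video element, and "tag name" is the string value behind By.TAG_NAME.

```python
def pick_video_src(video):
    """Return the first non-empty src among the <source> children of a <video>.

    Returns None when the element has no usable <source> child.
    """
    for source in video.find_elements(by="tag name", value="source"):
        src = source.get_attribute("src")
        if src:  # skip sources whose src is missing or empty
            return src
    return None
```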

3. Analyze the pop-up box

During crawling, a pop-up window that asks the user to log in may appear. Close it promptly; otherwise page elements may not be found and the crawl will fail.
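One way to close such a pop-up without relying on exception handling is to look it up with find_elements and click only when it is present. This is a sketch, not the article's implementation: dismiss_popup is a hypothetical name, and container and close_button are (by, value) locator pairs chosen by the caller.

```python
def dismiss_popup(driver, container, close_button):
    """Click the close control of a pop-up if the pop-up is present.

    `container` and `close_button` are (by, value) locator pairs.
    Returns True when a click was issued, False when nothing was found.
    """
    for panel in driver.find_elements(by=container[0], value=container[1]):
        buttons = panel.find_elements(by=close_button[0], value=close_button[1])
        if buttons:
            buttons[0].click()
            return True
    return False
```

With the identifiers used in the article's code, a call might look like dismiss_popup(driver, (By.ID, 'login-pannel'), (By.CLASS_NAME, 'dy-account-close')). The click itself can still raise if the button is not interactable, which this sketch does not handle.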

Core code

After the above analysis, you can write the crawler code, as follows:

1. Traverse the hotspot directory

Parse the directory of hot videos by obtaining the corresponding information on the page, as shown below:

    # Imports used by the snippets in this section
    import json
    import os
    import time

    import requests
    from selenium.webdriver.common.by import By

    # Fragment of a method on the crawler class
    self.__driver.get(self.__url)
    self.close_popup_window()
    # Maximize the window
    self.__driver.maximize_window()
    time.sleep(self.__wait_sec)
    # Once the page is open, get the li elements under the ul by class=BHgRhxNh
    if self.checkIsExistsByClass(cls='BHgRhxNh'):
        hots = self.__driver.find_elements(by=By.CLASS_NAME, value='BHgRhxNh')
        hot_infos = []
        index = 0
        for hot in hots:
            hot_info = {}
            a = hot.find_element(by=By.TAG_NAME, value='a')
            href = a.get_attribute("href")
            text = a.text
            hot_info['url'] = href
            hot_info['text'] = text
            # The first entry is skipped when reading the heat value
            if index > 0:
                div = hot.find_element(by=By.CLASS_NAME, value='GsuT_hjh')
                if div is not None:
                    hot_value = div.find_element(by=By.TAG_NAME, value='span').text
                    hot_info['value'] = hot_value
            hot_infos.append(hot_info)
            index = index + 1
        print(hot_infos)

2. Get the real short video URL

Open the url of a single hot video, and parse the playback url of the real short video, as shown below:

    def open_video_html(self, url):
        """Open the specific video page and return the video source URL."""
        self.__driver.get(url=url)
        time.sleep(1)
        self.close_popup_window()  # close the pop-up window
        video = self.__driver.find_element(by=By.TAG_NAME, value='video')
        source = video.find_element(by=By.TAG_NAME, value='source')
        src = source.get_attribute('src')
        return src

3. Download video

After obtaining the real url, you can download it, as shown below:

    def download_video(self, url, video_name):
        """Download the video from its source address."""
        if os.path.exists(video_name):
            # Already downloaded, so there is no need to download it again
            return
        else:
            with open(video_name, 'wb') as fp:
                fp.write(requests.get(url).content)

4. Close the pop-up login window

During crawling, a login overlay often pops up and must be closed, as follows:

    def close_popup_window(self):
        """Close the login pop-ups if they are present."""
        try:
            login = self.__driver.find_element(by=By.ID, value='login-pannel')
            if login is not None:
                login.find_element(by=By.CLASS_NAME, value='dy-account-close').click()
        except Exception:
            # Pop-up not present; narrowed from BaseException so Ctrl+C still works
            pass
        try:
            login = self.__driver.find_element(by=By.CLASS_NAME, value='GaDkStRD')
            if login is not None:
                btns = login.find_elements(by=By.TAG_NAME, value='button')
                for btn in btns:
                    # Button label as given in the article; the live page may differ
                    if btn.text == 'cancel':
                        btn.click()
                        break
        except Exception:
            pass

5. Save the log

After the crawling is successful, save the relevant contents of the crawled short video, as shown below:

    def save_data(self, hot_infos):
        """
        Save data
        :param hot_infos: content to save
        :return:
        """
        t = time.strftime("%Y-%m-%d", time.localtime())
        with open(f'logs[{t}].json', 'a+', encoding='utf-8') as f:
            res_list_json = json.dumps(hot_infos, ensure_ascii=False)
            f.write(res_list_json)

Sample screenshot

Once development is complete, run the program to crawl the hot list.

The crawled videos are saved in the download directory.
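For larger videos, reading the whole response into memory with requests.get(url).content may be undesirable. The following sketch streams the response to disk in chunks instead and renames the file only once it is complete, so an interrupted download is not mistaken for a finished one. The name download_stream, the get parameter, and the .part suffix are assumptions made for illustration; get defaults to requests.get and exists only so that the function can be exercised with a stand-in.

```python
import os


def download_stream(url, dest, get=None, chunk_size=64 * 1024):
    """Stream `url` to `dest` in chunks; skip the work if `dest` exists.

    Returns True when a download happened, False when it was skipped.
    """
    if os.path.exists(dest):
        return False
    if get is None:
        import requests  # imported lazily; a stand-in can be passed as `get`
        get = requests.get

    # Create the target directory (for example "download") if needed.
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    part = dest + ".part"
    with get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        with open(part, "wb") as fp:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fp.write(chunk)
    os.replace(part, dest)  # rename only after the full body is written
    return True
```

A call such as download_stream(src, os.path.join('download', 'video.mp4')) would then place the file in the download directory; the file name here is only an example.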

