
How to Implement Scheduled Crawling of the Douyin Hot Search List with Python


Many readers may not be familiar with the topics covered here, so this article walks through how to implement scheduled crawling of the Douyin hot search list with Python in detail, with clear steps and practical reference value. I hope you get something out of it; let's take a look.

The Douyin hot search list

The full hot list contains 50 entries, and this crawl collects four fields for each one: rank, heat, title, and link.

Crawling with requests

requests is the simplest approach; since the page has no anti-crawling measures, a plain GET request is enough to fetch it.

import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
}
url = 'https://tophub.today/n/K7GdaMgdQy'
page_text = requests.get(url=url, headers=headers).text
page_text

As you can see, the data is easily obtained with just a few lines of code.

Crawling with Selenium

Run Selenium as a headless browser, open the target URL, and grab the page source.

from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument('--headless')
driver = webdriver.Chrome(options=option)
url = 'https://tophub.today/n/K7GdaMgdQy'
driver.get(url)
page_text = driver.page_source

Both approaches retrieve the data successfully, but requests is simpler and the code runs faster. If the page data is not loaded dynamically by JavaScript, requests is the more convenient choice, as the quick check below illustrates.
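One quick way to tell whether requests alone will do (an illustrative sketch, not part of the original article; the markup checked for is an assumption): fetch the page and see whether the ranking table already appears in the raw HTML. If it does, the content is server-rendered and no browser is needed.

import requests

url = 'https://tophub.today/n/K7GdaMgdQy'
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get(url, headers=headers).text

# Rough heuristic: if the table markup shows up in the static HTML,
# the hot list is server-rendered and requests is sufficient.
print('<tbody>' in html)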

Data parsing

Next, use the lxml library to parse the page we crawled and save the results to Excel.

from lxml import etree
import pandas as pd

tree = etree.HTML(page_text)
tr_list = tree.xpath('//*[@id="page"]/div[2]/div[2]/div[1]/div[2]/div/div[1]/table/tbody/tr')
df = pd.DataFrame(columns=['rank', 'heat', 'title', 'link'])
for index, tr in enumerate(tr_list):
    hot = tr.xpath('./td[3]/text()')[0]
    title = tr.xpath('./td[2]/a/text()')[0]
    article_url = tr.xpath('./td[2]/a/@href')[0]
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat on newer versions
    df = df.append({'rank': index + 1, 'heat': hot, 'title': title, 'link': article_url}, ignore_index=True)
df['link'] = 'https://tophub.today' + df['link']
df
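The snippet builds the DataFrame but stops short of the Excel export mentioned above. A minimal sketch of that step (the timestamped filename is an assumption that suits hourly runs; writing .xlsx files requires the openpyxl package):

import datetime

# save one file per run, named by timestamp, e.g. douyin_hot_20240601_1200.xlsx
filename = f"douyin_hot_{datetime.datetime.now():%Y%m%d_%H%M}.xlsx"
df.to_excel(filename, index=False)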

Running result: a DataFrame with 50 rows of rank, heat, title, and link (screenshot omitted).

Set up a scheduled run

At this point, the crawling code is complete. If you want it to run automatically every hour, you can use the Windows Task Scheduler.

Open Task Scheduler and choose "Create Task".

Enter a name; any name will do.

Select "Triggers" > "New" and set the trigger time.

Select "Actions" > "New" and choose the program to run.

Finally, confirm. The task will run automatically at the scheduled time, or you can right-click it and run it manually.
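If you prefer to stay inside Python rather than rely on the Windows Task Scheduler, here is a minimal sketch using the third-party schedule library (assumptions: schedule has been installed with pip, and the crawl-and-export code above has been wrapped in a function named crawl_hot_search):

import time
import schedule

def crawl_hot_search():
    # placeholder: run the requests + lxml parsing code above and export to Excel
    pass

schedule.every(1).hours.do(crawl_hot_search)  # queue the job to run every hour

while True:
    schedule.run_pending()  # run any job whose scheduled time has arrived
    time.sleep(60)          # poll once a minute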

That covers how to implement scheduled crawling of the Douyin hot search list with Python. I hope the content shared here has been helpful; if you want to learn more about related topics, feel free to keep following along.
