How to Crawl Douyin Hot Searches on a Schedule with Python

Many readers are unfamiliar with this topic, so this article summarizes the whole process with detailed content and clear steps. I hope you find it a useful reference.
The Douyin hot search list
The hot list contains 50 entries in total. This crawl collects four fields for each entry: rank, heat, title, and link.
Crawling with requests
Requests is the simplest approach: the page has no anti-crawling measures, so a single GET request returns the full page.
import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
}
url = 'https://tophub.today/n/K7GdaMgdQy'
page_text = requests.get(url=url, headers=headers).text
page_text
As you can see, the data is easily obtained with just a few lines of code.
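For an unattended hourly job, it can also help to fail loudly when the fetch goes wrong. Here is a minimal sketch of that; the 10-second timeout and the encoding guard are my additions, not part of the original article:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'}
url = 'https://tophub.today/n/K7GdaMgdQy'

# give up after 10 seconds and raise on 4xx/5xx responses so a scheduled
# run fails visibly instead of silently parsing an error page
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
response.encoding = response.apparent_encoding  # guard against mis-detected encoding
page_text = response.text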
Crawling with Selenium
Configure Selenium as a headless browser, open the target URL, and read the page source.
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument('--headless')
driver = webdriver.Chrome(options=option)
url = 'https://tophub.today/n/K7GdaMgdQy'
driver.get(url)
page_text = driver.page_source
Both methods retrieve the data successfully, but requests is simpler and the code runs faster. If the page data is not dynamically loaded, requests is the more convenient choice.
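Conversely, if the table were rendered by JavaScript, reading driver.page_source immediately after get() could return an incomplete page. Here is a sketch of an explicit wait, assuming the rows live under a table/tbody as in the XPath used in the parsing step below; the 10-second limit is an assumption:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

option = webdriver.ChromeOptions()
option.add_argument('--headless')
driver = webdriver.Chrome(options=option)
driver.get('https://tophub.today/n/K7GdaMgdQy')

# block until at least one ranking row is present, up to 10 seconds
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//table/tbody/tr'))
)
page_text = driver.page_source
driver.quit()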
Data parsing
Now use the lxml library to parse the page we crawled and save the results to Excel.
from lxml import etree

tree = etree.HTML(page_text)
tr_list = tree.xpath('//*[@id="page"]/div[2]/div[2]/div[1]/div[2]/div/div[1]/table/tbody/tr')

rows = []
for index, tr in enumerate(tr_list):
    hot = tr.xpath('./td[3]/text()')[0]           # heat value
    title = tr.xpath('./td[2]/a/text()')[0]       # entry title
    article_url = tr.xpath('./td[2]/a/@href')[0]  # relative link
    rows.append({'rank': index + 1, 'hot': hot, 'title': title, 'link': article_url})

# building the DataFrame from a list of dicts avoids the df.append method,
# which was removed in pandas 2.0
df = pd.DataFrame(rows, columns=['rank', 'hot', 'title', 'link'])
df['link'] = 'https://tophub.today' + df['link']
df
Running result
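To actually write the DataFrame to Excel, as mentioned above, one option is a timestamped filename so hourly runs do not overwrite each other; the filename pattern is illustrative, not from the original:

from datetime import datetime

# e.g. douyin_hot_2022-01-18_14-00.xlsx; writing .xlsx requires the openpyxl package
filename = f"douyin_hot_{datetime.now():%Y-%m-%d_%H-%M}.xlsx"
df.to_excel(filename, index=False)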
Setting up scheduled runs
At this point the crawling code is complete. To run it automatically every hour, you can use the Windows Task Scheduler.
Open Task Scheduler and click "Create Task".
Enter a name; any name will do.
Select "Triggers" > "New" and set the trigger time.
Select "Actions" > "New" > "Start a program" and point it at the Python script.
Finally, click "OK" to confirm. The task will run automatically at the scheduled time, or you can right-click it and run it manually.
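As a cross-platform alternative to Task Scheduler, the script can also schedule itself. Here is a minimal sketch using only the standard library; crawl_once is a hypothetical wrapper around the requests + lxml code above:

import time

def crawl_once():
    # wrap the requests fetch, lxml parsing, and Excel export from above
    ...

# repeat once per hour; stop with Ctrl+C
while True:
    crawl_once()
    time.sleep(60 * 60)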
That covers crawling the Douyin hot search list on a schedule with Python. I hope the content shared here is helpful to you.