How to crawl Douyu TV live-stream information with a Python web crawler based on Selenium

This article explains how to use a Python web crawler built on Selenium to crawl live-stream information from Douyu TV. The approach is practical, so it is shared here as a reference; follow along below.
I. Third-party packages and tools used in this article
Python 3.8
Google Chrome
Selenium 3.141.0 (pip install selenium==3.141.0); note that the 4.x series and the 3.x series differ in their method names
Browser driver (matching your browser version; a quick environment check follows this list)
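As a quick sanity check of the environment, here is a minimal sketch that pins the library version and confirms the driver starts. It assumes chromedriver is already reachable and that Douyu's homepage is https://www.douyu.com (the article's class stores the URL in self.url but never shows its value):

# pip install selenium==3.141.0   (the 3.x API is what the code in this article uses)
from selenium import webdriver

bro = webdriver.Chrome()          # assumes chromedriver.exe is on PATH or next to this script
bro.get('https://www.douyu.com')  # Douyu homepage, assumed here
print(bro.title)                  # prints the page title if browser and driver versions match
bro.quit()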
II. Introduction to Selenium and installing the browser driver
1. Selenium
Selenium is an automated testing tool for the web. It can easily simulate a real user operating the browser, and it supports all of the mainstream browsers: IE, Chrome, Firefox, Safari, Opera and so on. You can use Selenium for web testing or for crawlers; automatic ticket grabbing and automatic order placing can also be done with it.
2. Installing the browser driver
There are plenty of installation guides online that you can find yourself, so only one note here: this article uses Google Chrome, so the browser driver must be the Chrome driver, and the driver version must match your browser version. The download address for the Chrome driver is below; pick the build that corresponds to your own browser version.
http://chromedriver.storage.googleapis.com/index.html
After the download completes, remember to configure the environment variables. If you do not configure them, either write the path of your chromedriver.exe file in the code, or omit the path and put chromedriver.exe in the same directory as your .py file.
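For example, the two ways of wiring up the driver look like this under Selenium 3 (the Windows path below is only a placeholder, not a path from the article):

from selenium import webdriver

# Option 1: chromedriver.exe is on PATH (or next to the .py file), so no path is needed
bro = webdriver.Chrome()

# Option 2: pass the full path to the driver explicitly
bro = webdriver.Chrome(executable_path=r'D:\tools\chromedriver.exe')  # placeholder path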
III. Analysis of the code approach
Go to Douyu's official website and click "Live". The list of live streams shown there is the information we need to crawl.

You can see that each entry has a title, a category, a streamer name, and a popularity count. These are the four fields we will crawl.

Then scroll to the bottom; the "next page" button here is what we use to control how many pages are crawled.

Note: when we enter the page, even though there is a scroll bar, all of the live-stream information has already been loaded; it is not loaded lazily via Ajax as you scroll. So there is no need to write any scrolling logic in the code, and you can extract the data of the whole page directly, as the quick check below shows.
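A quick way to verify this is to count the list items right after the page opens, without scrolling. This is a small check sketch; it reuses the listAll XPath that the parsing function below relies on, and bro stands for the webdriver instance:

from time import sleep

sleep(2)  # give the page a moment to finish rendering
li_list = bro.find_elements_by_xpath('//*[@id="listAll"]/section[2]/div[2]/ul/li')
print(len(li_list))  # if this already matches the number of cards on the page, no scrolling is needed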
1. Function for parsing the data

# function for parsing the data
def parse(self):
    # force a two-second wait so the page data has time to load
    sleep(2)
    li_list = self.bro.find_elements_by_xpath('//*[@id="listAll"]/section[2]/div[2]/ul/li')
    # print(len(li_list))
    data_list = []
    for li in li_list:
        dic_data = {}
        dic_data['title'] = li.find_element_by_xpath('./div/a/div[2]/div[1]/h4').text
        dic_data['name'] = li.find_element_by_xpath('./div/a/div[2]/div[2]/h3/div').text
        dic_data['art_type'] = li.find_element_by_xpath('./div/a/div[2]/div[1]/span').text
        dic_data['hot'] = li.find_element_by_xpath('./div/a/div[2]/div[2]/span').text
        data_list.append(dic_data)
    return data_list
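Note that the find_element_by_xpath / find_elements_by_xpath helpers used above exist in the pinned 3.141.0 release but were removed in the Selenium 4 series. If you are on 4.x instead, the same lookups would look roughly like this (a sketch of the equivalent calls, not the article's original code):

from selenium.webdriver.common.by import By

li_list = self.bro.find_elements(By.XPATH, '//*[@id="listAll"]/section[2]/div[2]/ul/li')
title = li.find_element(By.XPATH, './div/a/div[2]/div[1]/h4').text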
2. Function for saving the data
(1) Save as a txt file
# function to save the data as a txt file in the current directory
def save_data(self, data_list, i):
    with open('./douyu.txt', 'w', encoding='utf-8') as fp:
        for data in data_list:
            data = str(data)
            fp.write(data + '\n')
    print("Page %d saved!" % i)
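Because the file is opened with 'w', douyu.txt is rewritten on every call, so only the last crawled page ends up in the file. If the goal is to keep every page, a small variation (an assumption about the intended behaviour, not the original code) is to open the file in append mode:

# variant: append each page to the same file instead of overwriting it
def save_data(self, data_list, i):
    with open('./douyu.txt', 'a', encoding='utf-8') as fp:
        for data in data_list:
            fp.write(str(data) + '\n')
    print("Page %d saved!" % i)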
(2) Save as a json file
# function to save the data as a json file
def save_data(self, data_list, i):
    with open('./douyu.json', 'w', encoding='utf-8') as fp:
        # the data contains Chinese, so note ensure_ascii=False
        data = json.dumps(data_list, ensure_ascii=False)
        fp.write(data)
    print("Page %d saved!" % i)

3. Main function design

# main function
def run(self):
    # read the number of pages to crawl; if a negative integer is entered,
    # convert it to its absolute value
    page_num = abs(int(input("Please enter the number of pages you want to crawl: ")))
    # initialize the page counter to 1
    i = 1
    # check that the entered number is an integer
    if isinstance(page_num, int):
        # instantiate the browser object; if chromedriver.exe has been added to
        # the environment variables, the executable_path argument can be omitted
        self.bro = webdriver.Chrome(executable_path='../../executable file/chromedriver.exe')
        self.bro.get(self.url)
        while i  # ... the article text breaks off here; a possible continuation is sketched below
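Since the source text is cut off inside the while loop, the pagination logic is not shown. A minimal sketch of how the loop is commonly finished with the methods defined above follows; the next-page selector and the final quit() call are assumptions, not recovered from the article:

        while i <= page_num:
            data_list = self.parse()
            self.save_data(data_list, i)
            # click the "next page" button; this selector is assumed, check it against the live page
            self.bro.find_element_by_xpath('//li[contains(@class, "dy-Pagination-next")]').click()
            i += 1
        self.bro.quit()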