
Python Crawler Playbook 1: crawling web page content that only appears N seconds after the page loads


Xiaosheng blog: http://xsboke.blog.51cto.com

Thank you for reading. If you have any questions, you are welcome to get in touch.

Introduction:

When the content you need to crawl does not appear until 5 seconds after the page loads, it is difficult to get it with Python's requests module. The difference between requests and selenium: requests works by simulating HTTP requests, while Selenium drives a real browser through the browser's API, enabling browser automation. Since Selenium works by controlling a browser, and the Linux command line cannot open a graphical browser, we rely on a feature of Chrome and Firefox: headless mode, which lets you control the browser from the Linux command line without a display. The goal below is to use Chrome's headless mode on the Linux command line to retrieve the content that appears 5 seconds after visiting a page.
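To see concretely why requests alone falls short here, consider this minimal sketch (my own illustration, not from the original post; the URL is hypothetical): requests returns only the HTML the server sends, so anything JavaScript renders afterwards never shows up in the response.

import requests

resp = requests.get('https://example.com/js-rendered-page')  # hypothetical JS-rendered page
print(resp.text)  # the initial HTML only; content injected by JavaScript later is absent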

I. Environment preparation

1. Install the Linux version of the Google Chrome browser and its dependencies.

[root@localhost ~]# wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
[root@localhost ~]# yum localinstall google-chrome-stable_current_x86_64.rpm --nogpgcheck

2. Download the driver that selenium uses to control Chrome (download the version that matches your Chrome).

[root@localhost ~]# google-chrome --version
Google Chrome 72.0.3626.109
[root@localhost ~]# wget https://chromedriver.storage.googleapis.com/72.0.3626.69/chromedriver_linux64.zip
[root@localhost ~]# unzip chromedriver_linux64.zip
[root@localhost ~]# chmod +x chromedriver

3. Install the Python modules required (beautifulsoup4 is an HTML/XML parser).

[root@localhost ~]# pip install beautifulsoup4 selenium
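As a quick sanity check (my own addition, not part of the original walkthrough), you can confirm from Python that both modules import and that the chromedriver binary is in place and executable; adjust the driver path if yours differs.

import os
import selenium
import bs4

print('selenium', selenium.__version__)        # proves the module imports
print('beautifulsoup4', bs4.__version__)
assert os.access('/root/chromedriver', os.X_OK), 'chromedriver missing or not executable'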

II. Getting the page content after 5 seconds

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import time
from bs4 import BeautifulSoup
from selenium import webdriver

opt = webdriver.ChromeOptions()    # create a Chrome options object
opt.add_argument('--no-sandbox')   # disable the sandbox; required on Linux, otherwise Chrome fails with: unknown error: DevToolsActivePort file doesn't exist
opt.add_argument('--disable-gpu')  # disable GPU use; needed on Linux deployments to avoid a known bug
opt.add_argument('--headless')     # run Chrome in headless mode, on both Windows and Linux

driver = webdriver.Chrome(executable_path=r'/root/chromedriver', options=opt)  # point selenium at the chromedriver binary and pass the options
driver.get('https://baidu.com')    # visit the page
time.sleep(5)                      # wait 5 seconds for the delayed content
content = driver.page_source       # the rendered page source
soup = BeautifulSoup(content, features='html.parser')  # parse it into a BeautifulSoup object
driver.close()
print(soup.body.get_text())        # print the text of the page body
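The fixed time.sleep(5) always pays the full wait even when the content arrives sooner. As an alternative (my own sketch, not part of the original post), selenium's explicit waits poll for an element and return as soon as it appears, or raise after a timeout; the CSS selector '#delayed-content' below is a placeholder for whatever element signals that the delayed content has loaded.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

opt = webdriver.ChromeOptions()
opt.add_argument('--no-sandbox')
opt.add_argument('--disable-gpu')
opt.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r'/root/chromedriver', options=opt)
driver.get('https://baidu.com')
# Wait up to 10 seconds for the placeholder element to appear, then continue immediately.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#delayed-content'))
)
content = driver.page_source
driver.close()

This version fails loudly with a TimeoutException if the content never loads, which is usually preferable to silently scraping an incomplete page.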


