2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article explains how to build an automated crawler with Python and Selenium. The methods introduced here are simple, fast and practical, so interested readers may wish to follow along.
A brief introduction:
Selenium is an automated testing tool for the Web, originally developed for automated testing of websites. Selenium drives real browsers directly and supports all major browsers, including PhantomJS, which has no interface (its developers said in 2018 that they were suspending development, and chromedriver can achieve the same function). Selenium can receive instructions and have the browser load pages automatically, extract the data you need, and even take screenshots of pages.
1. Install
    pip install selenium -i https://pypi.tuna.tsinghua.edu.cn/simple
2. Download browser driver
Google Chrome is used here.
http://npm.taobao.org/mirrors/chromedriver/
Check your browser version and download the matching driver.
Put the unzipped driver in the same directory as python.exe.
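Matching the driver to the browser mostly comes down to comparing version numbers. As a rough sketch, assuming (this article does not say so) that a driver is suitable when it shares the browser's major version, a hypothetical helper `same_major_version` could look like this. The version strings in the demo are made up for illustration.

```python
def major_version(version: str) -> int:
    """Return the leading (major) number of a dotted version string."""
    return int(version.strip().split(".")[0])


def same_major_version(browser_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


if __name__ == "__main__":
    # Illustrative version strings, not taken from the article.
    print(same_major_version("114.0.5735.199", "114.0.5735.90"))
    print(same_major_version("114.0.5735.199", "73.0.3683.68"))
```

The browser version can be read from the browser's "About" page and the driver version from the name of the folder on the download mirror.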
3. Example
3.1 Download the corresponding version of the browser driver
http://npm.taobao.org/mirrors/chromedriver/
Put the unzipped driver in the same directory as python.exe.
3.2 Test code: open a web page and get the title of the page
    from selenium.webdriver import Chrome

    if __name__ == '__main__':
        web = Chrome()
        web.get("https://baidu.com")
        print(web.title)
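The test script above leaves the browser open when it finishes. A possible variant, sketched here with a hypothetical helper `fetch_title`, wraps the same two calls in try/finally so that `quit()` always runs, even if loading the page fails. The helper accepts any callable that returns a driver, so `Chrome` itself can be passed in.

```python
def fetch_title(make_driver, url):
    """Open url with a fresh driver, return the page title, and always quit."""
    web = make_driver()  # e.g. selenium.webdriver.Chrome
    try:
        web.get(url)
        return web.title
    finally:
        web.quit()  # closes every window and ends the driver process


# Usage with Selenium (needs a browser and a matching driver, so not run here):
#   from selenium.webdriver import Chrome
#   print(fetch_title(Chrome, "https://baidu.com"))
```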
3.3 A small sample
    from selenium.webdriver import Chrome

    if __name__ == '__main__':
        web = Chrome()
        url = 'https://ac.nowcoder.com/acm/home'
        web.get(url)
        # get the a tag to click on
        el = web.find_element_by_xpath('/html/body/div/div[3]/div[1]/div/a')
        # click it
        el.click()
        # "/html/body/div/div[3]/div[1]/div[2]/div[1]/h5/a"
        # crawl the desired content
        lists = web.find_elements_by_xpath("/html/body/div/div[3]/div[1]/div[2]/div[@class='platform-item js-item']/div[2]/div[1]/h5/a")
        print(len(lists))
        for i in lists:
            print(i.text)
3.4 Automatically enter text and jump to the results
    from selenium.webdriver import Chrome
    from selenium.webdriver.common.keys import Keys
    import time

    if __name__ == '__main__':
        web = Chrome()
        url = 'https://ac.nowcoder.com/acm/home'
        web.get(url)
        el = web.find_element_by_xpath('/html/body/div/div[3]/div[1]/div/a')
        el.click()
        time.sleep(1)  # give the new page a moment to load
        input_el = web.find_element_by_xpath('/html/body/div/div[3]/div[1]/form/input[1]')
        # type the keyword, then press Enter to submit the search
        input_el.send_keys('Niuke', Keys.ENTER)
        # do something
4. Turn on headless mode
Headless mode decides whether the browser runs with a visible interface.
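One way to make this choice switchable is to build the list of Chrome switches in a small function and add them to the options object afterwards. This is only a sketch: the helper name `chrome_arguments` is hypothetical, and the `--window-size` switch is an addition not used in this article, included on the assumption that a headless browser has no window to maximize and therefore needs an explicit size.

```python
def chrome_arguments(headless=True, window_size=(1920, 1080)):
    """Build the command-line switches to hand to Chrome."""
    args = []
    if headless:
        args.append("--headless")
        # Assumption: set an explicit size because there is no visible window.
        width, height = window_size
        args.append(f"--window-size={width},{height}")
    return args


# Usage with Selenium (not executed here):
#   option = Options()
#   for arg in chrome_arguments(headless=True):
#       option.add_argument(arg)
```

Calling `chrome_arguments(headless=False)` returns an empty list, so the same script can be run with a visible browser while debugging.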
    from selenium.webdriver import Chrome
    from selenium.webdriver.chrome.options import Options

    option = Options()  # instantiate the option object
    option.add_argument("--headless")  # add the headless parameter

    if __name__ == '__main__':
        # Pass in the driver path and the option object; without executable_path,
        # the driver is looked up in the python interpreter directory.
        web = Chrome(executable_path=r'D:\PyProject\spider\venv\Scripts\chromedriver.exe', options=option)
        web.get("https://baidu.com")
        print(web.title)
5. Save page screenshot
    from selenium.webdriver import Chrome

    if __name__ == '__main__':
        web = Chrome()
        web.maximize_window()  # maximize the browser window
        web.get("https://baidu.com")
        print(web.title)
        web.save_screenshot('baidu.png')  # save a screenshot of the current page to the current folder
        web.close()  # close the current page
6. Simulate input and click
    from selenium.webdriver import Chrome

    if __name__ == '__main__':
        web = Chrome()
        web.maximize_window()  # maximize the browser window
        web.get("https://baidu.com")
        el = web.find_element_by_id('kw')  # the search input box
        el.send_keys('Harris-H')
        btn = web.find_element_by_id('su')  # the search button
        btn.click()
        # web.close()  # close the current web page
It seems that Baidu can now detect Selenium and may still ask for an image CAPTCHA.
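The `find_element_by_*` helpers used throughout this article were removed in Selenium 4.3; newer releases use `find_element(by, value)` instead. The sketch below repeats the Baidu search from section 6 in that style. The function name `baidu_search` is hypothetical, and the locator strategy is written as the plain string `"id"`, which is the value of `By.ID` in `selenium.webdriver.common.by`, so the function itself needs no Selenium import.

```python
def baidu_search(web, keyword):
    """Type keyword into Baidu's search box and click the search button.

    `web` is any Selenium 4 style driver, e.g. selenium.webdriver.Chrome().
    """
    web.get("https://baidu.com")
    box = web.find_element("id", "kw")  # same as find_element(By.ID, "kw")
    box.send_keys(keyword)
    btn = web.find_element("id", "su")  # same as find_element(By.ID, "su")
    btn.click()
    return web.current_url


# Usage with Selenium (needs a browser and a matching driver, so not run here):
#   from selenium.webdriver import Chrome
#   print(baidu_search(Chrome(), "Harris-H"))
```

In real scripts it is more common to import `By` and write `By.ID`, `By.XPATH` and so on; the XPath calls in section 3 translate the same way, as `find_element(By.XPATH, ...)` and `find_elements(By.XPATH, ...)`.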
6.1 Look up a node by its text value
    # find the link whose text is exactly "百度一下" ("Baidu it")
    driver.find_element_by_link_text("百度一下")
    # get a list of elements whose link text contains the given text (fuzzy matching)
    driver.find_elements_by_partial_link_text("度")
6.2 Get the text of the current node
    ele.text  # get the text of the current node
    ele.get_attribute("data-click")  # get the value of the given attribute
6.3 Print some information about the current web page
    print(driver.page_source)  # print the source code of the web page
    print(driver.get_cookies())  # print the cookies
    print(driver.current_url)  # print the URL of the current web page
6.4 Close the browser
    driver.close()  # close the current web page
    driver.quit()  # close the browser directly
6.5 Simulate mouse scrolling
    from selenium.webdriver import Chrome
    import time

    if __name__ == '__main__':
        driver = Chrome()
        driver.get("https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=78000241_12_hao_pg&wd=selenium%20js%E6%BB%91%E5%8A%A8&fenlei=256&rsv_pq=8215ec3a00127601&rsv_t=a763fm%2F7SHtPeSVYKeWnxKwKBisdp%2FBe8pVsIapxTsrlUnas7%2F7Hoo6FnDp6WsslfyiRc3iKxP2s&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=31&rsv_sug1=17&rsv_sug7=100&rsv_sug2=0&rsv_btype=i&inputT=9266&rsv_sug4=9770")
        # 1. Scroll down the page (sets the scroll position to 1000 px from the top)
        js = "document.documentElement.scrollTop=1000"
        driver.execute_script(js)  # execute the js
        time.sleep(2)
        # 2. Scroll back to the top
        js = "document.documentElement.scrollTop=0"
        driver.execute_script(js)  # execute the js
        time.sleep(2)
        driver.close()
7. ChromeOptions
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--proxy-server=http://110.52.235.176:9999")  # add a proxy
    options.add_argument("--headless")  # headless mode
    options.add_argument("--lang=en-US")  # show web pages in English
    # a value of 2 blocks the resource: do not load images or stylesheets
    prefs = {"profile.managed_default_content_settings.images": 2, 'permissions.default.stylesheet': 2}
    options.add_experimental_option("prefs", prefs)
    driver = webdriver.Chrome(executable_path=r"D:\ProgramApp\chromedriver\chromedriver73.exe", options=options)
    driver.get("http://httpbin.org/ip")
8. Verify slider movement
Target: sliding CAPTCHA
1. Locate the button
2. Press and hold the slider
3. Slide the button
    import time
    from selenium import webdriver

    if __name__ == '__main__':
        chrome_obj = webdriver.Chrome()
        chrome_obj.get('https://www.helloweba.net/demo/2017/unlock/')
        # 1. Locate the slide button
        click_obj = chrome_obj.find_element_by_xpath('//div[@class="bar1 bar"]/div[@class="slide-to-unlock-handle"]')
        # Get the button's width and height
        size_ = click_obj.size
        # The width of the slide track (298) minus the width of the button
        # is the distance to move along the x-axis (to the right)
        width_ = 298 - size_['width']
        print(width_)
        # Create an action chain object; the parameter is the browser object
        action_obj = webdriver.ActionChains(chrome_obj)
        # 2. Press and hold the button, 3. slide it to the right, 4. release it;
        # nothing happens until perform() runs the whole chain
        action_obj.click_and_hold(click_obj).move_by_offset(width_, 0).release().perform()
        time.sleep(6)
        chrome_obj.quit()
9. Open multiple windows and page switching
Sometimes a browser window holds several child tabs, and you need to switch between them. Selenium provides driver.switch_to.window for this; the handle of the page to switch to can be found in driver.window_handles.
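To illustrate how the handle list can be used, here is a minimal sketch with a hypothetical helper `switch_to_new_window`. It remembers the handles that existed before a tab was opened and switches to whichever handle is new. It relies only on `driver.window_handles` and `driver.switch_to.window`.

```python
def switch_to_new_window(driver, known_handles):
    """Switch to the first window handle not in known_handles and return it."""
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    raise LookupError("no new window was opened")


# Usage with Selenium (not executed here):
#   before = driver.window_handles
#   driver.execute_script("window.open('https://www.douban.com/')")
#   switch_to_new_window(driver, before)
```

Comparing against the earlier handles avoids depending on the position of the new tab in the list.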
    from selenium import webdriver

    if __name__ == '__main__':
        driver = webdriver.Chrome()
        driver.get("https://www.baidu.com/")
        driver.implicitly_wait(2)
        driver.execute_script("window.open('https://www.douban.com/')")  # open a second tab
        driver.switch_to.window(driver.window_handles[1])  # switch to the new tab
        print(driver.page_source)
10. Cookie operation
    # 1. Get all cookies:
    for cookie in driver.get_cookies():
        print(cookie)
    # 2. Get a cookie by its key (returns a dict describing the cookie, or None):
    value = driver.get_cookie(key)
    # 3. Delete all cookies:
    driver.delete_all_cookies()
    # 4. Delete one cookie:
    driver.delete_cookie(key)
    # 5. Add a cookie:
    driver.add_cookie({"name": "password", "value": "111111"})
11. Simulated login
Here is a simulated login to our school's Academic Affairs Office site:
    from selenium.webdriver import Chrome

    if __name__ == '__main__':
        web = Chrome()
        web.get('http://bkjx.wust.edu.cn/')
        username = web.find_element_by_id('userAccount')
        username.send_keys('xxxxxxx')  # fill in your student ID here
        password = web.find_element_by_id('userPassword')
        password.send_keys('xxxxxxx')  # enter your own password
        btn = web.find_element_by_xpath('//*[@id="ul1"]/li[4]/button')
        btn.click()
        # do something
Because there is no slider or other verification step, this login is very simple. After logging in, you can carry out your own operations.
12. Advantages and disadvantages
Advantage: Selenium can execute the JavaScript on the page, so handling JS-rendered data and simulated logins is very easy.
Disadvantage: Selenium is very inefficient, because the browser sends many requests while loading a page, so it should be used only where it is really needed.
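One way to limit that cost, offered here as an assumption rather than something this article prescribes, is to use Selenium only for the step that needs a real browser (such as logging in) and then hand the session cookies to a plain HTTP client. The hypothetical helper `cookie_header` turns the list returned by `driver.get_cookies()` into a `Cookie` header value. The cookie names and values in the demo are made up.

```python
import urllib.request


def cookie_header(cookies):
    """Join Selenium cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def build_request(url, cookies):
    """Create a urllib request that carries the browser's cookies."""
    return urllib.request.Request(url, headers={"Cookie": cookie_header(cookies)})


if __name__ == "__main__":
    # Sample data in the shape returned by driver.get_cookies()
    sample = [{"name": "sessionid", "value": "abc"}, {"name": "lang", "value": "en"}]
    print(cookie_header(sample))
```

A request built this way can be sent with `urllib.request.urlopen`. Whether the site accepts it depends on the site, since some also check headers such as the user agent.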
At this point, you should have a deeper understanding of how to build an automated crawler with Python and Selenium. The best next step is to try it out in practice.