
How to use python+selenium for crawler operation

2025-03-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article mainly explains how to use python+selenium for crawler operations. The content is simple and clear and easy to learn; follow the ideas below step by step to study the topic in depth.

1. Declare browser objects

Selenium supports many browsers, including mobile browsers and PhantomJS. Here is a simple example of creating a Chrome browser object.

from selenium import webdriver

# download the chromedriver that matches your Chrome version and set its path
chrome_driver = r"C:\Users\zhongchengbin\Documents\chromedriver\chromedriver.exe"
browser = webdriver.Chrome(executable_path=chrome_driver)

2. Request a page

# use the get method to request the Baidu page
browser.get('https://www.baidu.com')

# the page_source attribute holds the page's source code, which you can then parse with regular expressions, CSS selectors, XPath, or BeautifulSoup
print(browser.page_source)

browser.close()

3. Find a single node and multiple nodes

There are many ways to find a single node; the returned result is a WebElement object.

browser.find_element_by_id()

browser.find_element_by_name()

browser.find_element_by_xpath()

browser.find_element_by_tag_name()

browser.find_element_by_link_text()

browser.find_element_by_class_name()

browser.find_element_by_css_selector()

browser.find_element_by_partial_link_text()

To find multiple nodes, add an s after element (find_elements_*); the result is a list.
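The single-match versus list distinction can be illustrated without a browser. Below is a minimal sketch using Python's standard html.parser; the HTML snippet is invented for illustration and this is only an analogy, not selenium's internals:

```python
from html.parser import HTMLParser

# collect the id attribute of every <input> tag, mimicking find_elements-style matching
class InputCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            self.ids.append(dict(attrs).get('id'))

html = '<form><input id="kw"><input id="su"></form>'  # made-up snippet
p = InputCollector()
p.feed(html)

first = p.ids[0]   # find_element_* analogue: the first match only
all_ = p.ids       # find_elements_* analogue: a list of all matches
print(first, all_)
```

With selenium itself, browser.find_element_by_tag_name('input') would return one WebElement, while browser.find_elements_by_tag_name('input') returns a list of WebElements.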

To locate Baidu's search box, inspect the element: the input node has class, id, and other attributes. Locating by any of these attributes finds the desired node.

from selenium import webdriver

# download the chromedriver that matches your Chrome version and set its path
chrome_driver = r"C:\Users\zhongchengbin\Documents\chromedriver\chromedriver.exe"

browser = webdriver.Chrome(executable_path=chrome_driver)

browser.get('https://www.baidu.com')

input = browser.find_element_by_id('kw')

browser.close()

4. Simulate browser operations

After opening the browser, we often need to enter text into a search box, clear it, click buttons, and so on. The following methods cover these operations:

send_keys(): enter text

clear(): clear text

click(): click a button

For example, let's open Baidu, enter a keyword, clear it, type another keyword, and then press Enter to search.

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # needed for Keys.ENTER below

chrome_driver = r"C:\Users\zhongchengbin\Documents\chromedriver\chromedriver.exe"

browser = webdriver.Chrome(executable_path=chrome_driver)

browser.get('https://www.baidu.com')

input = browser.find_element_by_id('kw')

input.send_keys('Xu Song')

time.sleep(3)

input.clear()

input.send_keys('python')

input.send_keys(Keys.ENTER)

# button = browser.find_element_by_class_name('btn self_btn')

# button.click()

browser.close()

5. Simulate mouse movement, key presses, and other actions with no specific target object

When driving a browser we may need drag operations, such as dragging one element onto another; for this we use an action chain.

First open the example page http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable, select the node to drag and the drop target, create an ActionChains object, then call the drag_and_drop method followed by the perform method to execute the drag.

from selenium import webdriver

from selenium.webdriver import ActionChains

browser = webdriver.Chrome()

browser.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')

browser.switch_to.frame('iframeResult')

yuanlai = browser.find_element_by_css_selector('#draggable')  # the node to drag

mubiao = browser.find_element_by_css_selector('#droppable')  # the drop target

a = ActionChains(browser)

a.drag_and_drop(yuanlai, mubiao)

a.perform()

6. Drag the slider bar

When crawling, you will often meet pages that load more content as you scroll: a loading indicator appears and shortly afterwards a new batch of results is appended. Crawling such a page directly often yields only the first page or the first few pages of information.

Here you can run JavaScript directly with execute_script() to scroll to the bottom of the page; you can also pop up a prompt box the same way.

from selenium import webdriver

browser = webdriver.Chrome()

browser.get('https://www.toutiao.com/search/?keyword=street shooting')

# execute_script() pulls the scroll bar to the bottom, then pops up a prompt box

browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')

browser.execute_script('alert("at the bottom")')

7. Get node information

There are two ways to obtain node information. The first is to grab the page's source code via the page_source attribute and then extract information with regular expressions, CSS selectors, XPath, BeautifulSoup, or similar tools. The second is to use selenium's own methods and attributes directly.
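For the first approach, page_source is just a string, so any HTML parsing tool works on it. A minimal sketch with the standard re module; the HTML fragment below is invented for illustration, not fetched from a real page:

```python
import re

# pretend this string came from browser.page_source
page_source = ('<div class="result"><a href="https://example.com/a">First</a></div>'
               '<div class="result"><a href="https://example.com/b">Second</a></div>')

# extract every link's href and text with a non-greedy pattern
links = re.findall(r'<a href="(.*?)">(.*?)</a>', page_source)
print(links)
```

For anything more complex than this, a real parser such as BeautifulSoup or lxml is more robust than regular expressions.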

① Get attributes

First select the node to parse, then call the get_attribute method to read one of its attributes.

from selenium import webdriver

browser = webdriver.Chrome()

browser.get('https://www.baidu.com')

input = browser.find_element_by_id('kw')

print(input)

print(input.get_attribute('class'))

② Get text

Again open the URL first, locate the target node, then use the text attribute to get its text.

from selenium import webdriver

browser = webdriver.Chrome()

browser.get('https://www.baidu.com')

input = browser.find_element_by_id('kw')

print(input)

print(input.text)

③ Get id, location, tag name, and size

The method is similar to the previous two: call the relevant attribute directly to get the value.

from selenium import webdriver

browser = webdriver.Chrome()

browser.get('https://www.baidu.com')

input = browser.find_element_by_id('kw')

print(input)

# get the node id

print(input.id)

# get the node's relative position on the page

print(input.location)

# get the node's tag name

print(input.tag_name)

# get the node's size

print(input.size)

8. Switch frames

Web pages very commonly contain a node called iframe, which is a child frame: a page embedded inside the page.

Its structure is the same as that of an ordinary web page. When selenium opens a page, it operates on the parent page by default, so it cannot reach nodes inside a child frame; you must first call switch_to.frame() to jump into the frame and then perform the operation there.

Below, while inside the child frame, we try to get a node that belongs to the parent page: it fails with an error, so we switch back to the parent frame, fetch the node there, and succeed.

from selenium import webdriver

from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Chrome()

browser.get('https://www.toutiao.com')

browser.switch_to.frame('tanxssp-tuwen-iframemm')  # use whatever id the frame node actually has

try:
    title = browser.find_element_by_id('tanx-a-mm_32479643_3494618_81668314')
except NoSuchElementException:
    print('no node')
    browser.switch_to.parent_frame()
    title = browser.find_element_by_id('tanx-a-mm_32479643_3494618_81668314')

print(title)

print(title.text)

9. Delayed waiting

Sometimes a page makes extra Ajax requests that do not load quickly, so we have to wait for them.

There are two kinds of waiting: implicit waits and explicit waits.

With an implicit wait, selenium waits a fixed amount of time whenever it looks up a node; if the node still has not appeared when the time is up, it raises a NoSuchElementException.

from selenium import webdriver

browser = webdriver.Chrome()

browser.implicitly_wait(10)

browser.get('https://www.baidu.com')

input = browser.find_element_by_id('kw')

print(input)

An explicit wait is more flexible: you specify the maximum time to wait and a condition to wait for. If the condition is met within that time, the element is returned and execution continues; if the time runs out, a TimeoutException is raised.

# explicit wait

from selenium import webdriver

from selenium.webdriver.common.by import By

from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()

browser.get('https://www.taobao.com')

# create a WebDriverWait object to specify the maximum waiting time

wait = WebDriverWait(browser, 10)

# call the until method with a wait condition, meaning "wait until this node appears".
# The argument is the node's locator tuple, here the search box with id 'q'

# the element is returned on success; a TimeoutException is raised on failure

input = wait.until(EC.presence_of_element_located((By.ID, 'q')))

button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))

print(input, button)

'''
All the waiting conditions:

title_is                                the title is a given value
title_contains                          the title contains a given string
presence_of_element_located             node is loaded into the DOM; pass a locator tuple
visibility_of_element_located           node is visible; pass a locator tuple
visibility_of                           node is visible; pass a node object
presence_of_all_elements_located        all matching nodes are loaded
text_to_be_present_in_element           a node's text contains the given text
text_to_be_present_in_element_value     a node's value attribute contains the given text
frame_to_be_available_and_switch_to_it  frame is loaded and switched to
invisibility_of_element_located         node is not visible
element_to_be_clickable                 node is clickable
staleness_of                            whether a node is still in the DOM; can tell whether the page has refreshed
element_to_be_selected                  node is selectable; pass a node object
element_located_to_be_selected          node is selectable; pass a locator tuple
element_selection_state_to_be           pass a node object and a state; True if they match, otherwise False
element_located_selection_state_to_be   pass a locator tuple and a state; True if they match, otherwise False
alert_is_present                        whether an alert box is present
'''
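Under the hood, an explicit wait is essentially a polling loop: the condition is called repeatedly until it returns a truthy value or the timeout expires. A simplified pure-Python sketch of that idea follows; the MiniWait class and its names only mirror selenium's API for illustration, this is not selenium's actual implementation (which raises TimeoutException rather than the built-in TimeoutError used here):

```python
import time

class MiniWait:
    """Illustrative stand-in for WebDriverWait: poll a condition until truthy or timeout."""
    def __init__(self, driver, timeout, poll_frequency=0.5):
        self.driver = driver
        self.timeout = timeout
        self.poll = poll_frequency

    def until(self, condition):
        end = time.time() + self.timeout
        while True:
            value = condition(self.driver)  # e.g. EC.presence_of_element_located(...) in real selenium
            if value:
                return value
            if time.time() > end:
                raise TimeoutError('condition not met within %s seconds' % self.timeout)
            time.sleep(self.poll)

# a trivial condition: becomes truthy on the third call
calls = {'n': 0}
def ready(driver):
    calls['n'] += 1
    return 'found' if calls['n'] >= 3 else False

result = MiniWait(driver=None, timeout=5, poll_frequency=0.01).until(ready)
print(result)
```

This is why an expected_conditions object is passed to until uncalled with its locator: until invokes it with the driver on every poll.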

10. Go back and forward.

All browsers can go forward and backward; in selenium, back() goes back and forward() goes forward.

# go back and forward

import time

from selenium import webdriver

browser = webdriver.Chrome()

browser.get('https://www.baidu.com')

browser.back()

time.sleep(3)

browser.forward()

browser.close()

11. Get cookies

from selenium import webdriver

browser = webdriver.Chrome()

browser.get('https://www.baidu.com')

print(browser.get_cookies())

browser.add_cookie({'name': 'name', 'domain': 'www.baidu.com', 'value': 'germey'})

print(browser.get_cookies())

browser.delete_all_cookies()

print(browser.get_cookies())

12. Tab management

import time

from selenium import webdriver

browser = webdriver.Chrome()

browser.get('https://www.baidu.com')

# window.open() opens a new tab

browser.execute_script('window.open()')

# window_handles holds all currently open tabs, returned as a list of handles

print(browser.window_handles)

# switch_to_window is used to switch tabs

browser.switch_to_window(browser.window_handles[1])

browser.get('https://www.taobao.com')

time.sleep(3)

browser.switch_to_window(browser.window_handles[0])

browser.get('https://python.org')

Thank you for reading. That is the content of "how to use python+selenium for crawler operation". After studying this article you should have a deeper understanding of the topic, though the specifics still need to be verified in practice.
