2025-04-10 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/01 Report
This article explains how to use selenium + chromedriver + XPath to crawl dynamically loaded information. It is quite detailed and has reference value; interested readers are encouraged to read on.
Selenium can be used to crawl dynamically rendered pages. Selenium is a browser automation testing framework, originally a tool for testing web applications. It runs directly in the browser and can drive the browser to perform specified actions, such as clicking, scrolling, filling in form data, and deleting cookies. It can also retrieve the source code of the page the browser is currently rendering, just as a user would see it. Supported browsers include Internet Explorer, Mozilla Firefox, and Google Chrome.
Install the selenium module
First open the Anaconda Prompt command-line window, enter the command "pip install selenium" (if you do not have Anaconda installed, you can run the same command in a cmd window instead), and press Enter, as shown below:
Note
Selenium bindings are available for many languages, such as Java, Ruby, and Python.
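To confirm that the installation actually succeeded, you can check whether the module is importable using only the standard library. This is a small sketch of ours, not part of the original article; the helper name is our own:

```python
import importlib.util

def is_module_installed(name: str) -> bool:
    """Return True if the named module can be found in this environment."""
    return importlib.util.find_spec(name) is not None

# After running "pip install selenium", this should report True.
print("selenium installed:", is_module_installed("selenium"))
```

If this prints False, re-run the pip command in the same environment (Anaconda Prompt or cmd) that your Python interpreter belongs to.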
Download browser driver
After installing the selenium module, you still need to choose a browser and download the corresponding browser driver; only then can you control the browser through the selenium module. Here we choose Chrome Version 98.0.4758.80 (Official Build) (64-bit), and download the matching driver from http://chromedriver.storage.googleapis.com/index.html?path=98.0.4758.80/. As shown below:
Note
When downloading the Chrome driver, choose the build that matches your computer's operating system.
Using the selenium module
After downloading the Chrome driver, place the chromedriver executable (chromedriver.exe on Windows) in a directory on your PATH, for example the directory containing the python executable (on Linux, /usr/bin is such a path). Then load the driver from Python code so that you can start it and control the browser.
Different browsers require different drivers. The browsers and their corresponding drivers are listed in the table below:
Browser            | Driver                  | Download link
Chrome             | chromedriver(.exe)      | http://chromedriver.storage.googleapis.com/index.html
Internet Explorer  | IEDriverServer.exe      | http://selenium-release.storage.googleapis.com/index.html
Edge               | MicrosoftWebDriver.msi  | http://go.microsoft.com/fwlink/?LinkId=619687
Firefox            | geckodriver(.exe)       | https://github.com/mozilla/geckodriver/releases/
PhantomJS          | phantomjs(.exe)         | http://phantomjs.org/
Opera              | operadriver(.exe)       | https://github.com/operasoftware/operachromiumdriver/releases
Safari             | SafariDriver.safariextz | http://selenium-release.storage.googleapis.com/index.html
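ChromeDriver releases track Chrome's major version number, so a quick sanity check before crawling is to compare the major version of your browser with that of the driver. The following is a pure-Python sketch of ours (the function name and logic are illustrative, not from the original article):

```python
def driver_matches_browser(browser_version: str, driver_version: str) -> bool:
    """ChromeDriver releases follow Chrome's major version, so comparing the
    first dotted component of each version string is a reasonable check."""
    return browser_version.split(".")[0] == driver_version.split(".")[0]

# Chrome 98.0.4758.80 works with any 98.x ChromeDriver build:
print(driver_matches_browser("98.0.4758.80", "98.0.4758.102"))  # True
print(driver_matches_browser("98.0.4758.80", "97.0.4692.71"))   # False
```

A mismatched major version typically makes webdriver.Chrome() raise a SessionNotCreatedException at startup, so it is worth checking early.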
Get the product information of JD.com. The sample code is as follows:
# _*_ coding: utf-8 _*_
# author: liuxiaowei
# creation time: 2-7-22 6:43 PM
# file: get_jd_commodity_info.py
# IDE: PyCharm
from selenium import webdriver                                    # browser driver module
from selenium.webdriver.support.wait import WebDriverWait         # explicit wait class
from selenium.webdriver.support import expected_conditions as EC  # wait conditions
from selenium.webdriver.common.by import By                       # node locating

try:
    # create the Chrome driver options object
    chrome_options = webdriver.ChromeOptions()
    # do not load images
    prefs = {"profile.managed_default_content_settings.images": 2}
    chrome_options.add_experimental_option("prefs", prefs)
    # use headless (no-UI) browser mode
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    # load the Chrome driver
    driver = webdriver.Chrome(options=chrome_options, executable_path='chromedriver')
    # request the product page
    driver.get('https://item.jd.com/12353915.html')
    wait = WebDriverWait(driver, 10)  # wait up to 10 seconds
    # wait until the node whose class name is m-item-inner has loaded;
    # this node contains the commodity information
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "m-item-inner")))
    # get the relevant div nodes inside the commodity information node
    name_div1 = driver.find_element(By.XPATH, '//div[@class="sku-name"]')
    name_div2 = driver.find_element(By.XPATH, '//div[@class="news"]/div[@class="item hide"]')
    name_div3 = driver.find_element(By.XPATH, '//div[@class="p-author"]')
    summary_price = driver.find_element(By.XPATH, '//div[@class="summary-price J-summary-price"]')
    print('The extracted product title is as follows:')
    print(name_div1.text)       # print the product title
    print('The extracted product slogan is as follows:')
    print(name_div2.text)       # print the slogan
    print('The extracted editor information is as follows:')
    print(name_div3.text)       # print the editor information
    print('The extracted price information is as follows:')
    print(summary_price.text.strip('price reduction notice'))  # print the price information
    driver.quit()               # exit the browser driver
except Exception as e:
    print('An exception occurred!', e)
The running result of the program is as follows:
The title of the extracted product is as follows:
Zero basic Python (Python3.9 full color version) (programming entry project practice synchronization video)
The product slogans extracted are as follows:
Color codes are easier to learn. Python programming from introduction to practice books, web crawlers, game development, data analysis and other in-depth learning. Gift full video + source code + after-class questions + physical wallchart + learning application map + e-book + book answer
The extracted editing information is as follows:
Tomorrow's science and technology book
The extracted price information is as follows:
JD.com price
¥72.00 [9.03% discount] [pricing ¥79.80]
Common methods of selenium Module
The selenium module supports a variety of methods to obtain web page nodes, among which the more commonly used methods are as follows:
Common methods for obtaining web-page nodes with the selenium module
Method                                | Description
driver.find_element_by_id()           | Gets a node by id; the parameter is the string id value
driver.find_element_by_name()         | Gets a node by name; the parameter is the string name value
driver.find_element_by_xpath()        | Gets a node by XPath; the parameter is a string XPath expression
driver.find_element_by_link_text()    | Gets a node by its link text; the parameter is the string link text
driver.find_element_by_tag_name()     | Gets a node by tag name; the parameter is the string tag name
driver.find_element_by_class_name()   | Gets a node by class; the parameter is the string class value
driver.find_element_by_css_selector() | Gets a node by CSS selector; the parameter is a string CSS selector
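Be aware that recent Selenium releases (4.3 and later) removed these find_element_by_* helpers in favor of the find_element(By...., value) form used elsewhere in this article. The mapping below is an illustrative summary of ours, not part of the original article:

```python
# Illustrative mapping: legacy selenium helpers -> modern By-based equivalents.
# The right-hand sides are shown as strings for display purposes only.
LEGACY_TO_MODERN = {
    "find_element_by_id": 'find_element(By.ID, value)',
    "find_element_by_name": 'find_element(By.NAME, value)',
    "find_element_by_xpath": 'find_element(By.XPATH, value)',
    "find_element_by_link_text": 'find_element(By.LINK_TEXT, value)',
    "find_element_by_tag_name": 'find_element(By.TAG_NAME, value)',
    "find_element_by_class_name": 'find_element(By.CLASS_NAME, value)',
    "find_element_by_css_selector": 'find_element(By.CSS_SELECTOR, value)',
}

for legacy, modern in LEGACY_TO_MODERN.items():
    print(f"driver.{legacy}(value)  ->  driver.{modern}")
```

If you are on Selenium 4.3+, use only the modern form; on older versions both forms work, but the legacy one emits deprecation warnings.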
Note
All the methods in the table above return a single node. If you need all the nodes that match the criteria, add an "s" after "element" in the corresponding method name (for example, find_elements_by_xpath()).
In addition to the methods above, you can use the driver.find_element() method to get a single node and the driver.find_elements() method to get multiple nodes. Both methods take a by parameter, which specifies the locating strategy, and a value parameter, which is the value used by that strategy (it can be thought of as the matching condition). The sample code is as follows:
# get all div nodes inside the commodity information node
name_div = driver.find_element(By.XPATH, '//div[@class="itemInfo-wrap"]').find_elements(By.TAG_NAME, 'div')
# extract and output the content of individual div nodes
print('The extracted product title is as follows:')
print(name_div[0].text)   # print the product title
print('The extracted product slogan is as follows:')
print(name_div[1].text)   # print the product slogan
The running result of the program is as follows:
The title of the extracted product is as follows:
Zero basic Python (Python3.9 full color version) (programming entry project practice synchronization video)
The product slogans extracted are as follows:
Color codes are easier to learn. Python programming from introduction to practice books, web crawlers, game development, data analysis and other in-depth learning. Gift full video + source code + after-class questions + physical wallchart + learning application map + e-book + book answer
Tomorrow's science and technology book
Note
In the code above, we first use the find_element() method to get the whole node whose class value is "itemInfo-wrap", then get all the nodes named div inside it through the find_elements() method, and finally read the text of the first and second div via name_div[0].text and name_div[1].text.
The other attributes of By and their uses are listed below:
By attribute         | Usage
By.ID                | Gets the corresponding single or multiple nodes by id value
By.LINK_TEXT         | Gets nodes by the full link text
By.PARTIAL_LINK_TEXT | Gets nodes by partial link text
By.NAME              | Gets nodes by name value
By.TAG_NAME          | Gets nodes by tag (node) name
By.CLASS_NAME        | Gets nodes by class value
By.CSS_SELECTOR      | Gets nodes by CSS selector; the value is a CSS selector string
By.XPATH             | Gets nodes by XPath; the value is an XPath expression string
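The XPath expressions used throughout this article can also be tried out without a browser: Python's standard library xml.etree.ElementTree supports a useful subset of XPath, including the attribute predicates used here. The HTML fragment below is a made-up stand-in for the JD.com page structure, purely for illustration:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment mimicking the product-page structure.
html = """
<html><body>
  <div class="itemInfo-wrap">
    <div class="sku-name">Zero basic Python (Python3.9 full color version)</div>
    <div id="p-author"><a href="https://book.jd.com/writer/example.html">author</a></div>
  </div>
</body></html>
"""

root = ET.fromstring(html)
# Locate a div by its class attribute, analogous to By.XPATH in selenium.
title = root.find('.//div[@class="sku-name"]').text
# Extract an attribute value, analogous to get_attribute('href') in selenium.
href = root.find('.//div[@id="p-author"]/a').get('href')
print(title)
print(href)
```

Note that ElementTree requires well-formed XML, so it is only a sandbox for testing expressions; real pages should still be handled by selenium or an HTML parser.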
When you need the value of an attribute of a node obtained with the selenium module, you can use the get_attribute() method. The sample code is as follows:
# obtain the href address from the node located by XPath
href = driver.find_element(By.XPATH, '//div[@id="p-author"]/a').get_attribute('href')
print('The address information in the specified node is as follows:')
print(href)
The running result of the program is as follows:
The address information in the specified node is as follows:
https://book.jd.com/writer/%E6%98%8E%E6%97%A5%E7%A7%91%E6%8A%80_1.html
Summary
The important thing to note in this example is that when loading the browser driver, you must specify the path to chromedriver. The syntax is as follows:
# load the Chrome browser driver
driver = webdriver.Chrome(options=chrome_options, executable_path='chromedriver')
# in this example the driver sits in the same path as the crawler script
Close the browser page
driver.close(): closes the current page.
driver.quit(): exits the entire browser.

The above is all the content of "how to use selenium+chromedriver+xpath to crawl dynamically loaded information". Thank you for reading! I hope it has been helpful; for more related knowledge, welcome to follow the industry information channel.
© 2024 shulou.com SLNews company. All rights reserved.