Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to realize the crawler function by Selenium+PhantomJS+python

2025-01-18 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)05/31 Report--

This article mainly explains "how to achieve the crawler function in Selenium+PhantomJS+python". The content in the article is simple and clear, and it is easy to learn and understand. Please follow the editor's train of thought to study and learn "how to achieve the crawler function in Selenium+PhantomJS+python".

I. brief introduction

Selenium is a tool for automated testing of Web applications. The test runs directly in the browser, just like a real user is operating.

Selenium2 supports driving real browsers (FirfoxDriver,IternetExplorerDriver,OperaDriver,ChromeDriver)

Selenium2 supports driving an interface-less browser (HtmlUnit,PhantomJs)

II. Installation

Windows

The first method is to download the source code installation, extract it and put the entire directory under C:\ Python27\ Lib\ site-packages

The second method is to directly enter the command under C:\ Python27\ Scripts to install pip install-U selenium.

Sudo pip install selenium

PhantomJS

I. brief introduction

PhantomJS is a server-side JavaScript API based on WebKit (WebKit is an open source browser engine that Chrome,Safari uses). The main application scenarios are: browser-free Web testing, page access automation, screen capture, network monitoring

II. Installation

Windows

Download the source code to install, extract and add the extracted path to the environment variable. I put my own under C:\ Python27\ Scripts

Linux

Sudo apt-get install PhantomJS

Selenium + PhantomJS + python simply realizes the function of crawler

Python can use selenium to execute javascript,selenium, which allows browsers to load pages automatically and get the data they need. Selenium does not come with a browser, it can use a third-party browser such as Firefox,Chrome, or it can be executed in the background using a headless browser such as PhantomJS.

There is a problem in work. When loading a URL on a mobile phone, it will not load. We need to set a User-Agent in the request header, and then we can open it (execute under Windows, then do not need to add executable_path='C:\ Python27\ Scripts\ phantomjs.exe').

From selenium import webdriverfrom selenium.webdriver.common.desired_capabilities import DesiredCapabilities dcap = dict (DesiredCapabilities.PHANTOMJS) # set userAgentdcap ["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9) Rv:25.0) Gecko/20100101 Firefox/25.0 ") obj = webdriver.PhantomJS (executable_path='C:\ Python27\ Scripts\ phantomjs.exe',desired_capabilities=dcap) # load URL obj.get ('http://wap.95533pc.com')# open URL obj.save_screenshot (" 1.png ") # screenshot save obj.quit () # close the browser When an exception occurs, remember to close PhantomJS in the task browser, because there will be multiple PhantomJS running, affecting the performance of the computer.

I. timeout setting

There are three time-related methods in the webdriver class:

1.pageLoadTimeout sets the timeout for full loading of the page, that is, the complete rendering is completed, and both synchronous and asynchronous scripts are executed.

2.setScriptTimeout sets the timeout for asynchronous scripts

Intelligent waiting time of objects identified by 3.implicitlyWait

Next, we take the school flower network title as an example to verify the effect, because there are more pictures in the school flower network, so it takes longer to load and take more time to achieve our effect (another reason I won't say, so that we can learn to be energetic, ! )

From selenium import webdriverobj = webdriver.PhantomJS (executable_path= "D:\ Python27\ Scripts\ phantomjs.exe") obj.set_page_load_timeout (5) try: obj.get ('http://www.xiaohuar.com') print obj.titleexcept Exception as e: print e

Second, the positioning of elements

Object positioning is achieved through attribute positioning, which is just like a person's ID card information, or some other information to find the object, so let's introduce some common positioning methods provided by Webdriver.

The above is the input box of Baidu. We can find that we can use id to locate the tag, and then we can do the later operation.

From selenium import webdriverobj = webdriver.PhantomJS (executable_path= "D:\ Python27\ Scripts\ phantomjs.exe") obj.set_page_load_timeout (5) try: obj.get ('http://www.baidu.com') obj.find_element_by_id (' kw') # locate obj.find_element_by_class_name ('kw') through ID) # locate obj.find_element_by_name (' wd') through class attribute # locate obj.find_element_by_tag_name by tag name attribute ('input') # locate obj.find_element_by_css_selector by tag attribute (' # kw') # locate obj.find_element_by_xpath by css ("/ / input [@ id='kw']") # locate obj.find_element_by_link_text by xpath ("Tieba") # pass Locate print obj.find_element_by_id ('kw') .tag_name # through xpath to get the type of tag except Exception as e: print e

III. Operation of the browser

1. The browser started by calling is not full-screen, and sometimes it will affect some of our operations, so we can set up full-screen.

From selenium import webdriverobj = webdriver.PhantomJS (executable_path= "D:\ Python27\ Scripts\ phantomjs.exe") obj.set_page_load_timeout (5) obj.maximize_window () # set full screen try: obj.get ('http://www.baidu.com') obj.save_screenshot (' 11.png') # capture full screen and save except Exception as e: print e

2. Set browser width and height

From selenium import webdriverobj = webdriver.PhantomJS (executable_path= "D:\ Python27\ Scripts\ phantomjs.exe") obj.set_page_load_timeout (5) obj.set_window_size ('480 phantomjs.exe) # set browser width 480, high 800try: obj.get (' http://www.baidu.com') obj.save_screenshot ('12.png') # capture full screen, and save except Exception as e: print e

3. Operate the browser forward and backward

From selenium import webdriverobj = webdriver.PhantomJS (executable_path= "D:\ Python27\ Scripts\ phantomjs.exe") try: obj.get ('http://www.baidu.com') # visits Baidu home page obj.save_screenshot (' 1.png') obj.get ('http://www.sina.com.cn') # visits Sina home page obj.save_screenshot (' 2.png') obj.back () # back to Baidu home page obj. Save_screenshot ('3.png') obj.forward () # advances to Sina homepage obj.save_screenshot (' 4.png') except Exception as e: print e

Fourth, operate the test object

After locating the element, we should perform some operations on the corresponding object in order to achieve some of our specific purposes, so let's introduce some common operations provided by Webdriver.

From selenium import webdriverobj = webdriver.PhantomJS (executable_path= "D:\ Python27\ Scripts\ phantomjs.exe") obj.set_page_load_timeout (5) try: obj.get ('http://www.baidu.com') print obj.find_element_by_id ("cp"). Text # gets the text information of the element obj.find_element_by_id (' kw'). Clear () # clears the content obj.find_element_ of the input box By_id ('kw'). Send_keys (' Hello') # enter Hello obj.find_element_by_id ('su'). Click () # in the input box to click the button obj.find_element_by_id (' su'). Submit () # to submit the form content except Exception as e: print e

5. Keyboard events

1. Usage of keyboard keys

From selenium.webdriver.common.keys import Keysobj = webdriver.PhantomJS (executable_path= "D:\ Python27\ Scripts\ phantomjs.exe") obj.set_page_load_timeout (5) try: obj.get ('http://www.baidu.com') obj.find_element_by_id (' kw'). Send_keys (Keys.TAB) # to clear the contents of the input box Equivalent to clear () obj.find_element_by_id ('kw'). Send_keys (' Hello') # enter Hello obj.find_element_by_id ('su'). Send_keys (Keys.ENTER) # in the input box through the positioning button Replace click () except Exception as e: print e with enter (enter)

2. Use of keyboard key combinations

From selenium import webdriverfrom selenium.webdriver.common.keys import Keysobj = webdriver.PhantomJS (executable_path= "D:\ Python27\ Scripts\ phantomjs.exe") obj.set_page_load_timeout (5) try: obj.get ('http://www.baidu.com') obj.find_element_by_id (' kw'). Send_keys (Keys.TAB) # to clear the contents of the input box Equivalent to clear () obj.find_element_by_id ('kw'). Send_keys (' Hello') # enter Hello obj.find_element_by_id ('kw'). Send_keys (Keys.CONTROL,'a') # ctrl + a full selection input box content obj.find_element_by_id (' kw'). Send_keys (Keys.CONTROL,'x') # ctrl + x cut input box content except Exception as e: print e

VI. The problem of Chinese garbled code

Selenium2 will report an error when typing Chinese into python's send_keys (). In fact, it can be done by adding a u to unicode in front of the Chinese.

VII. Mouse events

1. Right-click

From selenium import webdriverfrom selenium.webdriver.common.action_chains import ActionChainsobj = webdriver.PhantomJS (executable_path= "D:\ Python27\ Scripts\ phantomjs.exe") try: obj.get ("http://pan.baidu.com") obj.find_element_by_id ('TANGRAM__PSP_4__userName'). Send_keys (' 13201392325') # locate and enter the user name obj.find_element_by_id ('TANGRAM__PSP_4__password'). Send_ Keys ('18399565576lu') # locate and enter the password obj.find_element_by_id (' TANGRAM__PSP_4__submit'). Submit () # submit form content f = obj.find_element_by_xpath ('/ html/body/div/div [2] / div [2] /.) # navigate to the label ActionChains (obj). Context_click (f). Perform () # Right-click on the located element except Exception as e: print e

2. Double-click the mouse

From selenium import webdriverfrom selenium.webdriver.common.action_chains import ActionChainsobj = webdriver.PhantomJS (executable_path= "D:\ Python27\ Scripts\ phantomjs.exe") try: obj.get ("http://pan.baidu.com") obj.find_element_by_id ('TANGRAM__PSP_4__userName'). Send_keys (' 13201392325') # locate and enter the user name obj.find_element_by_id ('TANGRAM__PSP_4__password'). Send_ Keys ('18399565576lu') # locate and enter the password obj.find_element_by_id (' TANGRAM__PSP_4__submit'). Submit () # submit form content f = obj.find_element_by_xpath ('/ html/body/div/div [2] / div [2] /.) # navigate to the label ActionChains (obj). Double_click (f). Perform () # Double-click the located element except Exception as e: print e Thank you for reading The above is the content of "how to achieve the crawler function of Selenium+PhantomJS+python". After the study of this article, I believe you have a deeper understanding of how to achieve the crawler function of Selenium+PhantomJS+python, and the specific use needs to be verified in practice. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report