A crawler is a very effective way to obtain the data we need quickly, and the first step of any crawler is to request a remote server and have it return the web content we want. Normally, in a browser, we only need to enter the correct uniform resource locator (URL) to open the page we want to see. Likewise, when writing a Python crawler, we can call the appropriate library and set a few parameters to handle the HTTP request for us. For static web pages, libraries such as urllib, urllib2, and requests are commonly used; with them it is easy to ask the server for the content of a specific address. But when we hit a dynamic page whose content is loaded by JavaScript, the previous approach often fails to return what we want. At that point we can call on the powerful automated-testing tool Selenium and its partner PhantomJS to team up and take on the harder job.
(1) urllib and urllib2. For example:
import urllib2
response = urllib2.urlopen("http://www.baidu.com")
print response.read()
Given a URL such as Baidu's, the urllib2 library alone can read the page source for that address, and the code is very concise. In a real crawler, however, we usually need a few more lines, whether to cope with the other side's anti-crawler measures, to allow for network response time, or to attach extra information to the request; the aim is to make the server believe, as far as possible, that the request comes from a normal visitor. To keep the program logic clear, we can build a Request object and pass it to urlopen, for example:
import urllib
import urllib2

# the target url
url = 'xxx'
request = urllib2.Request(url)
# to simulate browser behaviour and avoid being identified as a crawler, you can add a Headers attribute.
# For example, the following User-Agent sets the request identity:
user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
headers = {'User-Agent': user_agent}
request = urllib2.Request(url, headers=headers)
# some sites require information such as a user name and password before they can be visited; you can do this:
values = {"username": "yourname", "password": "yourpassword"}
data = urllib.urlencode(values)
request = urllib2.Request(url, data, headers)
# when the network is unreliable, timeout sets how long to wait before giving up
response = urllib2.urlopen(request, timeout=18)
print response.read()
For more information, see the official documentation for urllib and urllib2.
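Because anti-crawler measures and unreliable networks mean requests often fail, it also helps to catch the exceptions urlopen can raise. The snippet below is only an illustrative sketch that reuses the request object built above; urllib2.HTTPError and urllib2.URLError are the two standard exception types.
try:
    response = urllib2.urlopen(request, timeout=18)
    html = response.read()
except urllib2.HTTPError as e:
    # the server responded, but with an error status such as 403 or 404
    print 'HTTP error:', e.code
except urllib2.URLError as e:
    # the server could not be reached at all (DNS failure, refused connection, timeout, ...)
    print 'failed to reach the server:', e.reason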
(2) requests, a simple, elegant, and friendly third-party library.
All of requests' functionality can be reached through the seven methods below, and each of them returns an instance of the Response object.
# create and send a request
requests.request(method, url, **kwargs)
Parameters:
method -- method for the new Request object.
url -- URL for the new Request object.
params -- (optional) dictionary or bytes to be sent in the query string of the Request.
data -- (optional) dictionary, bytes, or file-like object to send in the body of the Request.
json -- (optional) JSON data to send in the body of the Request.
headers -- (optional) dictionary of HTTP headers to send with the Request.
cookies -- (optional) dict or CookieJar object to send with the Request.
files -- (optional) dictionary of 'name': file-like-objects (or {'name': file-tuple}) for multipart encoding upload. file-tuple can be a 2-tuple ('filename', fileobj), a 3-tuple ('filename', fileobj, 'content_type'), or a 4-tuple ('filename', fileobj, 'content_type', custom_headers), where 'content_type' is a string defining the content type of the given file and custom_headers is a dict-like object containing additional headers to add for the file.
auth -- (optional) auth tuple to enable Basic/Digest/Custom HTTP auth.
timeout (float or tuple) -- (optional) how long to wait for the server to send data before giving up, as a float, or a (connect timeout, read timeout) tuple.
allow_redirects (bool) -- (optional) boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
proxies -- (optional) dictionary mapping protocol to the URL of the proxy.
verify -- (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to True.
stream -- (optional) if False, the response content will be immediately downloaded.
cert -- (optional) if string, path to an SSL client cert file (.pem). If tuple, a ('cert', 'key') pair.
# for example:
import requests

url = 'xxxx'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
proxies = {'http': '127.0.0.1:8118'}
response = requests.request('GET', url, timeout=20, proxies=proxies, headers=headers)
# the return type is requests.Response
In addition:
# send a HEAD request
requests.head(url, **kwargs)
# send a GET request
requests.get(url, params=None, **kwargs)
# send a POST request
requests.post(url, data=None, json=None, **kwargs)
# send a PUT request
requests.put(url, data=None, **kwargs)
# send a PATCH request
requests.patch(url, data=None, **kwargs)
# send a DELETE request
requests.delete(url, **kwargs)
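As a rough illustration of how these convenience methods and the returned Response object are typically used (the URL and payload below are placeholders, not taken from the original example):
import requests

url = 'xxxx'
# GET with query-string parameters
r = requests.get(url, params={'page': 1}, timeout=20)
print r.status_code              # numeric HTTP status code
print r.headers['content-type']  # response headers behave like a dict
print r.text                     # body decoded to text
# POST with a form-encoded body; pass json= instead to send a JSON body
r = requests.post(url, data={'username': 'yourname', 'password': 'yourpassword'}, timeout=20)
# if the server returns JSON, it can be parsed directly with r.json()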
For more details, please refer to the official requests documentation.
(3) Selenium + PhantomJS, an efficient combination for handling dynamic web pages.
With the methods above we can easily obtain a page's HTML, but things get troublesome when the content we need is rendered by JavaScript. We therefore need a tool that can process JS-rendered pages the way a browser does. PhantomJS is a WebKit-based web-interaction tool that exposes a JavaScript API and can reproduce browser functions such as automated browsing and page screenshots. Selenium is an automated testing tool that supports mainstream browsers such as Firefox, Chrome, and Safari; with its help we can simulate all kinds of human actions on a page, such as opening a browser, typing text, clicking, and turning pages. Since PhantomJS is a headless browser, does Selenium get along with it? Very well indeed: PhantomJS can do the work of a full browser while being comparatively more efficient. For example:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

phantomjs_path = '/data/opt/brew/lib/node_modules/phantomjs/lib/phantom/bin/phantomjs'
driver = webdriver.PhantomJS(executable_path=phantomjs_path)
url = 'xxxx'
driver.get(url)
# the source of the JS-rendered page is now easy to obtain
page_source = driver.page_source.encode('utf8')
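Since screenshots were mentioned above as one of PhantomJS's capabilities, the example can be extended with two standard WebDriver calls, shown here only as an optional illustration:
# capture the fully rendered page as an image file
driver.save_screenshot('page.png')
# when the whole session is finished, driver.quit() shuts down the PhantomJS process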
Sometimes we need page interaction, that is, to simulate how a person clicks, types, and moves the mouse in the browser, and for this we first have to locate the page elements involved. WebDriver provides several ways to locate an element:
# methods that locate a single element
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
# methods that locate multiple elements and return a list
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
# for example, suppose the page source contains a login form like the following:
# <form id="loginForm">
#   <input name="username" type="text" />
#   <input name="password" type="password" />
# </form>
# the form can be located like this
login_form = driver.find_element_by_id('loginForm')
# the username and password elements can be located like this
username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')
# if you use xpath to locate username, any of the following works
username = driver.find_element_by_xpath("//form[input/@name='username']")
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
username = driver.find_element_by_xpath("//input[@name='username']")
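Once elements are located, they can be driven much as a person would use them. The lines below are only a small illustrative continuation of the login-form example above (the credentials are placeholders): send_keys types text into an input, submit sends the form, and click presses a located element.
username.send_keys('yourname')
password.send_keys('yourpassword')
# either submit the form directly...
login_form.submit()
# ...or locate a button and click it, e.g.
# driver.find_element_by_css_selector("input[type='submit']").click()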
More information can be found in the Selenium Python bindings documentation and the PhantomJS documentation.
This article has briefly introduced various ways of obtaining a page's source: urllib, urllib2, and requests, which are commonly used for static pages, and Selenium with PhantomJS, which is often used for dynamic pages. During crawling we usually need to extract and keep the useful information in a page, so the next article will introduce how to extract useful information from the page source.