
What is the function of python crawler's urllib library?


This article explains what the urllib library does in a Python crawler. The methods introduced here are simple, fast, and practical, so if you are interested, read on and learn what the urllib library can do for a Python crawler.

Contents

1. What is the urllib library?

2. Using the urllib library

The urllib.request module

The urllib.parse module

Handling timeouts with try-except

Status codes & getheaders()

Breaking through anti-crawling measures

1. What is the urllib library?

The urllib library is used to work with URLs and to crawl the contents of web pages.

The urllib package contains the following modules:

urllib.request - opens and reads URLs.

urllib.error - contains the exceptions raised by urllib.request.

urllib.parse - parses URLs.

urllib.robotparser - parses robots.txt files.

Python crawlers mainly use the request and parse modules of the urllib library; the error and robotparser modules are touched on briefly below.
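Here is a minimal sketch of the error and robotparser modules, added for orientation rather than taken from the original article; the URLs are placeholders only, and whether can_fetch() returns True depends on the site's actual robots.txt.

import urllib.request
import urllib.error
import urllib.robotparser

# Parse a site's robots.txt and ask whether a given path may be crawled
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.baidu.com/"))  # True or False, depending on the robots.txt rules

# urllib.error contains the exceptions raised by urllib.request
try:
    urllib.request.urlopen("http://www.baidu.com/some-page-that-does-not-exist")
except urllib.error.HTTPError as e:  # HTTP-level failure, carries the status code
    print("HTTP error:", e.code)
except urllib.error.URLError as e:   # network-level failure (DNS, refused connection, timeout...)
    print("URL error:", e.reason)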

2. Using the urllib library

Let's look at the basic use of the two commonly used modules in detail.

The urllib.request module

urllib.request defines the functions and classes that open URLs, and handles things such as authentication, redirections, and browser cookies.

The syntax is as follows:

urllib.request.urlopen(url, data=None, [timeout,] *, cafile=None, capath=None, cadefault=False, context=None)

url: the URL address.

data: additional data to send to the server; defaults to None.

timeout: sets the access timeout in seconds.

cafile and capath: cafile is a CA certificate file and capath is the directory of CA certificates; they are used when requesting HTTPS links.

cadefault: deprecated.

context: an ssl.SSLContext object used to specify the SSL settings.
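To make the parameter list concrete, here is a small sketch of my own (not from the original article) that passes data, timeout and context in one call; it assumes https://httpbin.org/post is reachable. Note that in recent Python versions cafile and capath are deprecated along with cadefault, and context is the recommended way to configure SSL.

import ssl
import urllib.request
import urllib.parse

context = ssl.create_default_context()  # default SSL settings for HTTPS
body = urllib.parse.urlencode({"hello": "world"}).encode('utf-8')  # form data as bytes

# data turns the call into a POST, timeout limits how long we wait, context controls SSL verification
response = urllib.request.urlopen("https://httpbin.org/post",
                                  data=body,
                                  timeout=5,
                                  context=context)
print(response.status)  # 200 if the request succeeded

The simplest case is a plain GET request with none of the optional parameters: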

# -*- coding = utf-8 -*-
# @Author : Y-peak
# @Time : 2021-9-2
# @FileName : testUrllib.py
# Software : PyCharm
import urllib.request

# GET request
response = urllib.request.urlopen("http://www.baidu.com")  # returns an object holding the web page data
# print(response)  # you can try printing the response object itself
print(response.read().decode('utf-8'))  # read the data with read() and decode it as utf-8 to avoid garbled characters

If you write the printed content into an html file and open it, it looks just like Baidu.

# -*- coding = utf-8 -*-
# @Author : Y-peak
# @Time : 2021-9-2
# @FileName : testUrllib.py
# Software : PyCharm
import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")  # returns the response object
data = response.read().decode('utf-8')  # read the data with read() and decode it as utf-8 to avoid garbled characters
# print(data)
with open("index.html", 'w', encoding='utf-8') as wfile:  # you can also open the file without a with-block, but then remember to close() it
    wfile.write(data)
print("read end")

The urllib.parse module

Sometimes a crawler needs to simulate a browser for operations such as user login, and for that we need to make POST requests.

A POST request needs a server that will accept it and send back a response. A free test server is available at http://httpbin.org/; it is mainly used for testing HTTP and HTTPS requests.

We can try it out and look at the response we get.

From the command line on Linux you can send a request to http://httpbin.org/post (for example with curl) and look at the JSON response it returns.

We can also do the same thing from a crawler.

# -*- coding = utf-8 -*-
# @Author : Y-peak
# @Time : 2021-9-2
# @FileName : testUrllib.py
# Software : PyCharm
import urllib.request
import urllib.parse  # parser

# Encode the key-value pairs (sometimes a username and password are sent like this) and convert them to a bytes body, encoded as utf-8
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding='utf-8')
response = urllib.request.urlopen("http://httpbin.org/post", data=data)  # sending data makes this a POST request
print(response.read().decode('utf-8'))  # read the data with read() and decode it as utf-8 to avoid garbled characters

Comparing the two responses, they are almost the same.

This is equivalent to simulating a POST request, and in this way some websites that require a login can also be crawled.
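Besides urlencode(), the urllib.parse module also has helpers for taking URLs apart and escaping unsafe characters. A small sketch of my own (not from the original article):

import urllib.parse

url = "http://httpbin.org/get?hello=world&page=1"

parts = urllib.parse.urlparse(url)             # split the URL into scheme, netloc, path, query...
print(parts.netloc, parts.path, parts.query)   # httpbin.org /get hello=world&page=1
print(urllib.parse.parse_qs(parts.query))      # {'hello': ['world'], 'page': ['1']}

print(urllib.parse.quote("python crawler"))    # percent-encode characters that are unsafe in a URL
print(urllib.parse.urlencode({"q": "python crawler", "page": 2}))  # build a query string from a dict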

Handling timeouts with try-except

When crawling, we cannot wait for a response forever. Sometimes the network is bad, or the page has anti-crawling measures, or something else keeps it from loading quickly. In that case we want to move on to the next page and keep crawling, and the timeout parameter lets us do exactly that.

# -*- coding = utf-8 -*-
# @Author : Y-peak
# @Time : 2021-9-2
# @FileName : testUrllib.py
# Software : PyCharm
import urllib.request
import urllib.error

try:
    # timeout=0.01 means: if the server does not respond within 0.01s, raise an error instead of waiting
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)
    print(response.read().decode('utf-8'))  # read the data with read() and decode it as utf-8
except urllib.error.URLError as e:
    print("timeout\t\t error is:", e)

Status codes & getheaders()

The status code:

200 is returned: the response is correct and the page can be crawled.

404 error: the web page was not found.

418 error: the site is effectively telling you "I know you're a crawler."

getheaders(): gets the response headers.

You can also obtain the value of a single header xx with getheader("xx"); for example, getheader("Content-Encoding") might return gzip.
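Here is a minimal sketch of my own that shows status, getheaders() and getheader() together; it assumes http://httpbin.org/get responds normally.

import urllib.request

response = urllib.request.urlopen("http://httpbin.org/get")

print(response.status)                     # 200 means the page was fetched successfully
print(response.getheaders())               # all response headers as a list of (name, value) tuples
print(response.getheader("Content-Type"))  # the value of a single header, e.g. application/json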

Breaking through anti-crawling measures

First open any web page, press F12, find the Request Headers in the Network panel, and scroll down to User-Agent. Copy and save it; we will use it to get past the anti-crawling check.

Next, let's try crawling Douban directly. It returns 418 straight away, which means it knows we are a crawler, so we need to disguise ourselves.

Why 418? Because when the request is made directly, the default User-Agent that urllib sends (something like Python-urllib/3.x) tells the server outright that we are a crawler. We need to disguise it.

# -*- coding = utf-8 -*-
# @Author : Y-peak
# @Time : 2021-9-2
# @FileName : testUrllib.py
# Software : PyCharm
import urllib.request

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"}
request = urllib.request.Request("http://douban.com", headers=headers)  # build the request we send, disguised as a browser
response = urllib.request.urlopen(request)  # returns the object storing the web page data
data = response.read().decode('utf-8')  # read the data with read() and decode it as utf-8 to avoid garbled characters
with open("index.html", 'w', encoding='utf-8') as wfile:  # you can also open the file without a with-block, but then remember to close() it
    wfile.write(data)

Of course, anti-crawling is usually not this simple. The POST request mentioned above is also a very common way to get around anti-crawling measures, and if that is not enough, the whole set of request headers can be imitated (a sketch of that follows after the example below). Here is another example for reference, using the same URL that we POSTed to above.

Browser access result

Crawler access result

# -*- coding = utf-8 -*-
# @Author : Y-peak
# @Time : 2021-9-3
# @FileName : testUrllib.py
# Software : PyCharm
import urllib.request
import urllib.parse

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"}
url = "http://httpbin.org/post"
data = bytes(urllib.parse.urlencode({"account": "password"}), encoding='utf-8')
request = urllib.request.Request(url, data=data, headers=headers, method='POST')  # build the request
response = urllib.request.urlopen(request)  # returns the response object
data = response.read().decode('utf-8')  # read the data with read() and decode it as utf-8 to avoid garbled characters
print(data)
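As mentioned above, when a single User-Agent is not enough, more of the browser's request headers can be copied. The sketch below is my own illustration, not from the original article; the header values are typical examples taken from a browser session, and the Cookie line is only a placeholder.

import urllib.request

# Copy more of the browser's request headers from the F12 Network panel; the values here are examples
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Referer": "http://douban.com/",
    # "Cookie": "paste your own cookie string here if the site requires a logged-in session",
}

request = urllib.request.Request("http://douban.com", headers=headers)
response = urllib.request.urlopen(request)
print(response.status)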

At this point, you should have a deeper understanding of what the urllib library does in a Python crawler. You might as well try it out in practice, and keep exploring the related topics to continue learning!
