
What are the differences between urllib3 and urllib in Python crawlers

2025-01-17 Update From: SLTechnology News&Howtos


This article mainly introduces the differences between urllib3 and urllib in Python crawlers. Many people have doubts about this question in their daily work, so the editor has consulted various materials and sorted out simple, easy-to-use methods. I hope it helps answer the question "What are the differences between urllib3 and urllib in Python crawlers?" Now, please follow the editor and start studying!

The urllib library

urllib is a Python standard library for handling network requests. It contains four modules:

urllib.request --- the request module, used to send network requests

urllib.parse --- the parsing module, used to parse URLs

urllib.error --- the exception handling module, used to handle exceptions raised by requests

urllib.robotparser --- used to parse robots.txt files

The urllib.request module

The request module is responsible for constructing and sending network requests, and for adding headers and proxies to them. It can be used to simulate how a browser issues requests. With it you can:

Initiate network requests

Manipulate cookies

Add headers

Use proxies

Parameters of urllib.request.urlopen

urllib.request.urlopen(url, data=None, [timeout,] *, cafile=None, capath=None, cadefault=False, context=None)

urlopen is the simplest way to send a network request. It receives a URL as a string, sends a request to that URL, and returns the result.

Let's start with a simple example:

From urllib import requestresponse = request.urlopen (url= "http://www.httpbin.org/get")print(response.read().decode())

urlopen sends a GET request by default and switches to a POST request when the data parameter is passed. The data parameter must be bytes, a file-like object, or an iterable.

From urllib import requestresponse = request.urlopen (url= "http://www.httpbin.org/post", data=b" username=q123&password=123 ") print (response.read () .decode ())

The timeout parameter sets a time limit: if the request exceeds it, an exception is raised. If timeout is not specified, the system default is used. timeout only works for HTTP, HTTPS and FTP connections and is measured in seconds; for example, timeout=0.1 sets the timeout to 0.1 s.

From urllib import requestresponse = request.urlopen (url= "https://www.baidu.com/",timeout=0.1)Request object

You can make the most basic requests with urlopen, but these simple parameters are not enough to build a complete request. You can use the more powerful Request object to build a more complete request.

1. Adding request headers

Requests sent through urllib carry a default header "User-Agent": "Python-urllib/3.6", which reveals that the request was sent by urllib. When we encounter websites that check the User-Agent, we need to customize the headers to disguise the request.

from urllib import request

headers = {
    "Referer": "https://www.baidu.com/s?ie=utf-8&f=3&rsv_bp=1&tn=baidu&wd=python%20urllib%E5%BA%93&oq=python%2520urllib%25E5%25BA%2593&rsv_pq=947af0af001c94d0&rsv_t=66135egC273yN5Uj589q%2FvA844PvH9087sbPe9ZJsjA8JA10Z2b3%2BtWMpwo&rqlang=cn&rsv_enter=0&prefixsug=python%2520urllib%25E5%25BA%2593&rsp=0",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
# build a Request object with custom headers, then open it
req = request.Request(url="https://www.baidu.com/", headers=headers)
response = request.urlopen(req)
print(response.read().decode())

2. Manipulating cookies

When developing a crawler, handling cookies is very important. urllib handles cookies as follows:

from urllib import request
from http import cookiejar

# create a cookie jar object
cookie = cookiejar.CookieJar()
# create a cookie processor
cookies = request.HTTPCookieProcessor(cookie)
# pass it as a parameter to create an opener object
opener = request.build_opener(cookies)
# use this opener to send a request
res = opener.open("https://www.baidu.com/")
print(cookies.cookiejar)

3. Setting up a proxy

When running a crawler, the IP address often gets blocked, so we need to use an IP proxy. In urllib, a proxy is set as follows:

from urllib import request

url = "http://httpbin.org/ip"
# proxy address
proxy = {"http": "172.0.0.1:3128"}
# proxy handler
proxies = request.ProxyHandler(proxy)
# create an opener object
opener = request.build_opener(proxies)
res = opener.open(url)
print(res.read().decode())

After sending a network request, the classes and methods of the urllib library return a response object. It contains the result data of the request, along with some attributes and methods for processing the returned result:

read() gets the data returned by the response; it can only be read once.

readline() reads one line.

info() gets the response header information.

geturl() gets the URL that was accessed.

getcode() returns the status code.
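The article lists these methods without a combined example; a minimal sketch (assuming http://www.httpbin.org is reachable) that exercises them might look like this:

from urllib import request

response = request.urlopen("http://www.httpbin.org/get")
print(response.getcode())        # status code, e.g. 200
print(response.geturl())         # the URL that was actually accessed
print(response.info())           # the response headers
print(response.read().decode())  # the body; read() can only be consumed once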

The urllib.parse module

When sending a request, we often need to pass many parameters, and concatenating them with string methods is cumbersome. The parse.urlencode() method is used to build URL query strings.

from urllib import parse

params = {"wd": "测试", "code": 1, "height": 188}  # "测试" means "test"
res = parse.urlencode(params)
print(res)

The print result is wd=%E6%B5%8B%E8%AF%95&code=1&height=188

It can also be converted back to a dictionary with the parse.parse_qs() method:

print(parse.parse_qs("wd=%E6%B5%8B%E8%AF%95&code=1&height=188"))

The urllib.error module

The error module is mainly responsible for handling exceptions. If a request goes wrong, we can use the error module to handle it; it mainly provides URLError and HTTPError.

URLError: the base exception class of the error module. Exceptions raised by the request module can be caught with this class.

HTTPError: a subclass of URLError, with three main attributes:

code: the status code of the request

reason: the cause of the error

headers: the response headers

from urllib import request, error

try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("request successful")

The urllib.robotparser module

The robotparser module is mainly responsible for parsing the crawler protocol file, robots.txt.

The Robots protocol (also known as the crawler protocol or robot protocol) is formally called the Robots Exclusion Protocol. Through it, a website tells search engines which pages may be crawled and which may not.
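The article stops at the description, but a minimal robotparser sketch (the robots.txt URL here is only an illustration) might look like this:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()
# check whether a given user agent is allowed to fetch a URL
print(rp.can_fetch("*", "https://www.baidu.com/s?wd=python"))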

The urllib3 network library

urllib3 is more powerful than the urllib library, and many native systems have started to use it.

urllib3 has the following advantages:

Support for HTTP and SOCKS proxies

Support for compressed encoding

100% test coverage

Connection pooling

Thread safety

Client-side SSL/TLS verification

Helps with retrying requests and handling HTTP redirects (see the sketch after this list)

File uploads using multipart encoding
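The retry and redirect support listed above is exposed through the retries parameter of request(). Assuming urllib3 is already installed (installation is covered next), a minimal sketch might look like this; the URL and retry counts are only illustrative:

import urllib3
from urllib3.util import Retry

http = urllib3.PoolManager()
# retry up to 3 times and follow at most 2 redirects for this request
response = http.request("GET", "http://httpbin.org/redirect/1",
                        retries=Retry(total=3, redirect=2))
print(response.status)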

Since urllib3 is not part of the Python standard library, we need to install it before using it. The commands are as follows:

pip install urllib3
# or
conda install urllib3

Next, let's explain how to use the urllib3 library.

Network requests

GET request

First, when using the urllib3 library to make network requests, we need to create an instance of the PoolManager class, which manages the connection pool.

Next, we will visit Baidu through urllib3 and print the result of the query. An example is as follows:

import urllib3

http = urllib3.PoolManager()
url = "http://www.baidu.com/s"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = http.request("GET", url, fields={"wd": "Machine Learning"}, headers=headers)
result = response.data.decode("UTF-8")
print(result)

After running, the returned page content is printed.

Here, the GET query parameters are specified through the fields parameter. A note about the request headers: Baidu has a security mechanism, and readers can try removing the headers parameter; the request will then return Baidu's security verification page instead.

POST request

If you need to submit a form or more complex data to the server, you need a POST request. POST requests are straightforward: simply change the first argument of request() to "POST".

Examples are as follows:

import urllib3

http = urllib3.PoolManager()
url = "http://httpbin.org/post"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = http.request("POST", url, fields={"username": "name", "age": "123456"}, headers=headers)
result = response.data.decode("UTF-8")
print(result)

After running, httpbin returns the submitted form data, echoed back in its response.

HTTP response header

The HTTPResponse object returned by urllib3 carries some attributes and methods by default, including the info() method, which returns the response header data, as shown in the following example:

import urllib3

http = urllib3.PoolManager()
url = "http://www.baidu.com/s"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = http.request("POST", url, fields={"wd": "Machine Learning"}, headers=headers)
for key in response.info().keys():
    print("key:", response.info()[key])

After running, the response header values are printed.

Upload files

First, we need a simple server that accepts file uploads. Here we use Flask to build a simple server program, as follows:

import flask
import os

UPLOAD_FILE = "uploads"
app = flask.Flask(__name__)

@app.route("/", methods=["POST"])
def upload_file():
    file = flask.request.files["file"]
    if file:
        file.save(os.path.join(UPLOAD_FILE, os.path.basename(file.filename)))
        return "File upload successful"
    else:
        return "File upload failed"

if __name__ == "__main__":
    app.run()

After running, it waits for the client to upload the file.

Now let's see how urllib3 uploads a file. An example is as follows:

import urllib3

http = urllib3.PoolManager()
with open("1.jpg", "rb") as f:
    fileData = f.read()
url = "http://127.0.0.1:5000"
response = http.request("POST", url, fields={"file": ("1.jpg", fileData, "image/jpeg")})
print(response.data.decode("UTF-8"))

A Flask server listens on port 5000 by default, so it is accessed through 127.0.0.1:5000. After running, a 1.jpg image appears in the uploads folder.

At the same time, the console prints "File upload successful", and the server returns status code 200.

Here, the uploaded file is passed as a key-value pair, where file is the field name the server expects. In the value tuple, fileData is the binary content of the file, and "image/jpeg" is the MIME type of the uploaded file (which can be omitted).

Timeout processing

The HTTP layer of the urllib3 library is implemented on top of sockets, and socket timeouts fall into two kinds: connection timeouts and read timeouts.

A connection timeout is the exception raised when a connection cannot be established, for example because of a server problem or a wrong domain name.

A read timeout is the exception raised when reading data from the server takes too long, usually because of a server problem.

There are usually two places to set a timeout: per request through http.request(..., timeout=...), or globally through the PoolManager() connection pool. An example is as follows:

from urllib3 import *

# global timeout set on the connection pool
http = PoolManager(timeout=Timeout(connect=2.0, read=2.0))
with open("1.jpg", "rb") as f:
    fileData = f.read()
url = "http://127.0.0.1:5000"
try:
    # per-request timeout overrides the pool-level setting
    response = http.request("POST", url,
                            fields={"file": ("1.jpg", fileData, "image/jpeg")},
                            timeout=Timeout(connect=2.0, read=4.0))
    print(response.data.decode("UTF-8"))
except Exception as e:
    print(e)

Note that the timeout set on the PoolManager connection pool is the global timeout and is used by default when a request does not set its own. If a timeout is also set on the request, the request-level setting takes precedence.

At this point, the study of "What are the differences between urllib3 and urllib in Python crawlers" is over. I hope it has resolved your doubts. Combining theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow the site; the editor will keep working hard to bring you more practical articles!
