This article mainly introduces the difference between urllib3 and urllib in Python crawlers. Many people have doubts about this in daily work, so the editor has consulted various materials and sorted out simple, easy-to-use methods of operation. I hope it helps answer the question "what is the difference between urllib3 and urllib in Python crawlers?" Please follow the editor and study!
The urllib library
urllib is a Python standard library for handling network requests. It contains four modules:
urllib.request --- the request module, used to initiate network requests
urllib.parse --- the parsing module, used to parse URLs
urllib.error --- the exception handling module, used to handle exceptions raised by requests
urllib.robotparser --- used to parse robots.txt files
The urllib.request module
The request module is mainly responsible for constructing and sending network requests, and for adding Headers, proxies and so on to them. It can be used to simulate the way a browser initiates requests, namely to:
Initiate a network request
Manipulate cookies
Add Headers
Use a proxy
The urllib.request.urlopen parameters
urllib.request.urlopen(url, data=None, [timeout,] *, cafile=None, capath=None, cadefault=False, context=None)
urlopen is a simple way to send a network request. It accepts a URL as a string, sends a network request to that URL, and returns the result.
Let's start with a simple example:
from urllib import request

response = request.urlopen(url="http://www.httpbin.org/get")
print(response.read().decode())
urlopen sends a GET request by default and initiates a POST request when the data parameter is passed. The data parameter must be a bytes object, a file-like object, or an iterable.
from urllib import request

response = request.urlopen(url="http://www.httpbin.org/post", data=b"username=q123&password=123")
print(response.read().decode())
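Beyond hand-written bytes, a common pattern (a sketch, not from the original article) is to build the data parameter from a dict with parse.urlencode and encode it to the bytes that urlopen expects:

from urllib import parse, request

# Hypothetical form fields for illustration
form = {"username": "q123", "password": "123"}
data = parse.urlencode(form).encode("utf-8")  # dict -> "k1=v1&k2=v2" -> bytes
response = request.urlopen(url="http://www.httpbin.org/post", data=data)
print(response.read().decode())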
The timeout parameter sets a timeout: if the request takes longer than the set time, an exception is raised. If timeout is not specified, the system default is used. timeout only works for HTTP, HTTPS and FTP connections. It is measured in seconds; for example, timeout=0.1 sets the timeout to 0.1 s.
from urllib import request

response = request.urlopen(url="https://www.baidu.com/", timeout=0.1)

The Request object
You can make the most basic requests with urlopen, but these few parameters are not enough to build a complete request. You can use the more powerful Request object to build more complete requests.
1. Adding request headers
Requests sent through urllib carry a default header, "User-Agent": "Python-urllib/3.6", which reveals that the request was sent by urllib. So when we encounter websites that check the User-Agent, we need to customize the headers to disguise our crawler.
from urllib import request

headers = {
    "Referer": "https://www.baidu.com/s?ie=utf-8&f=3&rsv_bp=1&tn=baidu&wd=python%20urllib%E5%BA%93&oq=python%2520urllib%25E5%25BA%2593&rsv_pq=947af0af001c94d0&rsv_t=66135egC273yN5Uj589q%2FvA844PvH9087sbPe9ZJsjA8JA10Z2b3%2BtWMpwo&rqlang=cn&rsv_enter=0&prefixsug=python%2520urllib%25E5%25BA%2593&rsp=0",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
req = request.Request(url="https://www.baidu.com/", headers=headers)
response = request.urlopen(req)
print(response.read().decode())

2. Manipulating cookies
In crawler development, handling cookies is very important. urllib handles cookies as follows:
from urllib import request
from http import cookiejar

# Create a cookie jar
cookie = cookiejar.CookieJar()
# Create a cookie processor
cookies = request.HTTPCookieProcessor(cookie)
# Pass it as a parameter to create an opener object
opener = request.build_opener(cookies)
# Use this opener to send a request
res = opener.open("https://www.baidu.com/")
print(cookies.cookiejar)

3. Setting up a proxy
When running a crawler, IP bans often occur, so we need to use an IP proxy. urllib sets up an IP proxy as follows:
from urllib import request

url = "http://httpbin.org/ip"
# Proxy address
proxy = {"http": "172.0.0.1:3128"}
# Proxy handler
proxies = request.ProxyHandler(proxy)
# Create an opener object
opener = request.build_opener(proxies)
res = opener.open(url)
print(res.read().decode())
The Response object
After a network request is sent, the classes and methods in the urllib library return a response object. It contains the data returned by the request, along with some properties and methods for processing the result; a combined sketch follows the list:
read() gets the data returned by the response; it can only be read once
readline() reads one line
info() gets the response header information
geturl() gets the URL that was accessed
getcode() returns the status code
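Here is that sketch (assuming http://www.httpbin.org is reachable):

from urllib import request

response = request.urlopen("http://www.httpbin.org/get")
print(response.getcode())        # status code, e.g. 200
print(response.geturl())         # the URL that was actually accessed
print(response.info())           # the response headers
print(response.read().decode())  # the body; a second read() returns an empty result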
The urllib.parse module
parse.urlencode(): when sending a request, we often need to pass many parameters, and concatenating them with string methods is troublesome. The parse.urlencode() method concatenates URL parameters for us.
from urllib import parse

params = {"wd": "测试", "code": 1, "height": 188}
res = parse.urlencode(params)
print(res)
The printed result is wd=%E6%B5%8B%E8%AF%95&code=1&height=188.
The string can also be converted back to a dictionary with the parse.parse_qs() method:
print(parse.parse_qs("wd=%E6%B5%8B%E8%AF%95&code=1&height=188"))

The urllib.error module
The error module is mainly responsible for handling exceptions. If an error occurs in a request, we can use the error module to handle it. It mainly contains URLError and HTTPError.
URLError: the base class of the error exception module. Exceptions raised by the request module can be handled by this class.
HTTPError: a subclass of URLError, containing three main attributes:
code: the status code of the request
reason: the cause of the error
headers: the response headers
from urllib import request, error

try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("request succeeded")

The urllib.robotparser module
The robotparser module is mainly responsible for parsing robots.txt, the crawler protocol file.
The full name of the Robots protocol (also known as the crawler protocol or robot protocol) is the Robots Exclusion Protocol. Through the Robots protocol, websites tell search engines which pages may be crawled and which may not.
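The article gives no code for this module, so here is a minimal sketch (assuming the target site's robots.txt is reachable) of how robotparser answers "may I crawl this URL?":

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()  # download and parse robots.txt
# Check whether a crawler with the given User-Agent may fetch a URL
print(rp.can_fetch("*", "https://www.baidu.com/s?wd=python"))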
The urllib3 network library
urllib3 is more powerful than the urllib library, and many projects have begun to use urllib3.
urllib3 has the following advantages:
Support for HTTP and SOCKS proxies
Support for compressed encodings
100% test coverage
Connection pooling
Thread safety
Client-side SSL/TLS verification
Helpers for retrying requests and handling HTTP redirects (see the sketch after this list)
File uploads with multipart encoding
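To illustrate the retry and redirect helpers, here is a minimal sketch (using httpbin.org as an assumed test endpoint; the Retry helper is exported at urllib3's top level):

import urllib3

http = urllib3.PoolManager()
# Retry the request up to 3 times and follow at most 2 redirects
response = http.request(
    "GET",
    "http://httpbin.org/redirect/1",
    retries=urllib3.Retry(total=3, redirect=2),
)
print(response.status)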
Since urllib3 is not part of the Python standard library, we need to install it before using it. The commands are as follows:

pip install urllib3
# or
conda install urllib3
Next, let's explain how to use the urllib3 library.
Network requests
GET request
First, when we use the urllib3 library to make network requests, we need to create an instance of the PoolManager class, which manages the connection pool.
Next, we will search Baidu through urllib3 and print the result of the query. An example is as follows:
import urllib3

http = urllib3.PoolManager()
url = "http://www.baidu.com/s"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = http.request("GET", url, fields={"wd": "Machine Learning"}, headers=headers)
result = response.data.decode("UTF-8")
print(result)
After running, the HTML of the search results page is printed.
Here, we specify the GET query parameters through the fields parameter. A note about the request headers: Baidu has a security mechanism, and readers can remove the headers parameter to try it; Baidu's security verification page will be returned instead.
POST request
If you need to submit forms or more complex data to the server, you need a POST request. POST requests are relatively simple: just change the first parameter of request() to "POST".
Examples are as follows:
import urllib3

http = urllib3.PoolManager()
url = "http://httpbin.org/post"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = http.request("POST", url, fields={"username": "name", "age": "123456"}, headers=headers)
result = response.data.decode("UTF-8")
print(result)
After running, httpbin echoes back the submitted form data.
HTTP response headers
Network access through the urllib3 library returns an HTTPResponse object. It carries some properties and methods by default, including the info() method, which returns the response header data, as shown in the following example:
import urllib3

http = urllib3.PoolManager()
url = "http://www.baidu.com/s"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = http.request("POST", url, fields={"wd": "Machine Learning"}, headers=headers)
for key in response.info().keys():
    print("key:", response.info()[key])
After running, the response header data is printed.
Upload files
First of all, we need a simple server that accepts uploaded files. Here we use Flask to build a simple server; the Python program is as follows:
import flask
import os

UPLOAD_FILE = "uploads"
app = flask.Flask(__name__)

@app.route("/", methods=["POST"])
def upload_file():
    file = flask.request.files["file"]
    if file:
        file.save(os.path.join(UPLOAD_FILE, os.path.basename(file.filename)))
        return "File upload successful"
    else:
        return "File upload failed"

if __name__ == "__main__":
    app.run()
After running, it waits for the client to upload the file.
Now let's implement the file upload with urllib3. An example is as follows:
import urllib3

http = urllib3.PoolManager()
with open("1.jpg", "rb") as f:
    fileData = f.read()
url = "http://127.0.0.1:5000"
response = http.request("POST", url, fields={"file": ("1.jpg", fileData, "image/jpeg")})
print(response.data.decode("UTF-8"))
The server Flask builds by default listens on port 5000, that is, it is accessed via 127.0.0.1:5000. After running, a 1.jpg image is created in the uploads folder.
At the same time, the console outputs "File upload successful", and the server returns status code 200.
Here, the uploaded file is given as a key-value pair, where file is the field name the server expects. In the value tuple, fileData is the binary content of the file and "image/jpeg" is the MIME type of the uploaded file (which can be omitted).
Timeout processing
The HTTP layer of the urllib3 library is implemented on top of sockets, and socket timeouts are divided into connection timeouts and read timeouts.
A connection timeout is an exception raised when a connection cannot be established because of a server problem or a wrong domain name.
A read timeout is an exception raised when reading data from the server takes too long because of a server problem.
Usually we can set the timeout in two ways: per request through http.request(timeout=...), or globally through the PoolManager() connection pool. Examples are as follows:
from urllib3 import *

http = PoolManager(timeout=Timeout(connect=2.0, read=2.0))
with open("1.jpg", "rb") as f:
    fileData = f.read()
url = "http://127.0.0.1:5000"
try:
    response = http.request("POST", url, timeout=Timeout(connect=2.0, read=4.0))
    print(response.data.decode("UTF-8"))
except Exception as e:
    print(e)
Note that the timeout set through the PoolManager connection pool is the global timeout; it is used by default even if a later request does not set one. If a timeout is also set on the individual request, the request's timeout takes precedence.
At this point, the study of "what is the difference between urllib3 and urllib in Python crawlers" is over. I hope it has resolved your doubts. Combining theory with practice helps you learn better, so go and try it! If you want to continue to learn more related knowledge, please keep following the website; the editor will continue to work hard to bring you more practical articles!