This article explains what the urllib library does in a Python crawler. The approach described here is simple, fast and practical, so let's walk through it.
Contents

1. What is the urllib library?
2. The use of the urllib library
urllib.request module
urllib.parse module
Using try-except to handle timeouts
Status codes & getheaders()
Breaking through anti-crawling
1. What is the urllib library?
The urllib library is used to work with web page URLs and to fetch (crawl) the contents of web pages.
The urllib package contains the following modules:

urllib.request - opens and reads URLs.
urllib.error - contains the exceptions raised by urllib.request.
urllib.parse - parses URLs.
urllib.robotparser - parses robots.txt files.

Python crawlers mainly use the request and parse modules of the urllib library; a short sketch of the other two modules follows below.
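As a quick, illustrative sketch of those other two modules (this is not part of the original walkthrough, and example.com is just a placeholder URL): urllib.error catches the exceptions raised by urllib.request, and urllib.robotparser answers whether robots.txt allows a path to be crawled.

import urllib.request
import urllib.error
import urllib.robotparser

# urllib.error: catch the exceptions raised by urllib.request
try:
    urllib.request.urlopen("http://example.com/does-not-exist")
except urllib.error.HTTPError as e:    # the server answered with an error status code
    print("HTTP error:", e.code)
except urllib.error.URLError as e:     # network problem, bad URL, timeout, ...
    print("URL error:", e.reason)

# urllib.robotparser: check robots.txt before crawling a path
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://example.com/some/page"))  # True if crawling this path is allowed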
2. The use of the urllib library
Let's explain in detail the basic application of these two commonly used modules.
urllib.request module
urllib.request defines the functions and classes for opening URLs, including authorization, redirection, browser cookies and so on.
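The cookie handling mentioned here goes through handler objects rather than through urlopen() itself. A rough sketch of keeping cookies across requests (my own illustration, not from the original article, using httpbin.org's cookie test endpoint):

import http.cookiejar
import urllib.request

# build an opener that remembers cookies between requests, like a browser session
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

opener.open("http://httpbin.org/cookies/set?name=value")  # the server sets a cookie
for cookie in cookie_jar:
    print(cookie.name, "=", cookie.value)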
The syntax of urlopen() is as follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
url: the URL address.
data: additional data to send to the server; defaults to None.
timeout: sets the access timeout in seconds.
cafile and capath: cafile is the CA certificate and capath is the path to the CA certificates; needed when using HTTPS.
cadefault: deprecated.
context: an ssl.SSLContext object used to specify SSL settings.
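The examples later in this article cover data and timeout; as a hedged sketch of the remaining parameter, an HTTPS request can supply its own ssl.SSLContext via context (cafile, capath and cadefault are deprecated in recent Python versions in favour of context):

import ssl
import urllib.request

# HTTPS request with an explicit SSL context (uses the system's trusted CA certificates)
ctx = ssl.create_default_context()
response = urllib.request.urlopen("https://httpbin.org/get", context=ctx, timeout=5)
print(response.status)  # 200 if the request succeeded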
A basic GET request looks like this:

# -*- coding: utf-8 -*-
# @Author: Y-peak
# @Time: 2021-9-2
# @FileName: testUrllib.py
# Software: PyCharm
import urllib.request

# GET request
response = urllib.request.urlopen("http://www.baidu.com")  # returns an object that stores the web page data
# print(response)  # you can try printing the object itself
print(response.read().decode('utf-8'))  # read the data and decode it as utf-8 to avoid garbled characters
If you write the printed content into an html file and open it, it looks just like Baidu.
# -*- coding: utf-8 -*-
# @Author: Y-peak
# @Time: 2021-9-2
# @FileName: testUrllib.py
# Software: PyCharm
import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")  # returns an object that stores the web page data
data = response.read().decode('utf-8')  # read the data and decode it as utf-8 to avoid garbled characters
# print(data)
with open("index.html", 'w', encoding='utf-8') as wfile:  # you can also open the file the ordinary way, but remember to close()
    wfile.write(data)
print("read end")
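As a side note (not part of the original walkthrough), urllib.request also provides a small helper, urlretrieve(), which downloads a URL straight into a file and achieves much the same thing as the manual read/write above:

import urllib.request

# download the page directly to index.html; returns the local filename and the response headers
filename, headers = urllib.request.urlretrieve("http://www.baidu.com", "index.html")
print("saved to", filename)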
urllib.parse module
Sometimes a crawler needs to simulate a browser for operations such as user login, which means sending a POST request.

A POST request needs a server that will answer it. A free test server you can use is http://httpbin.org/, which is mainly intended for testing http and https.

We can try sending a request to it and look at the response we get.

You can initiate a request to http://httpbin.org/post from the Linux command line (for example with curl) and get the response below.

We can do the same thing from a crawler.
# -*- coding: utf-8 -*-
# @Author: Y-peak
# @Time: 2021-9-2
# @FileName: testUrllib.py
# Software: PyCharm
import urllib.request
import urllib.parse  # parser

# urlencode the key-value pairs and convert them to bytes, encoded as utf-8
# (a login form such as username: password is sent in the same way)
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding='utf-8')
response = urllib.request.urlopen("http://httpbin.org/post", data=data)  # passing data makes this a POST request
print(response.read().decode('utf-8'))  # read the data and decode it as utf-8 to avoid garbled characters
Comparing the two responses, they are almost the same.

This is equivalent to simulating a POST request, and in this way some websites that require logging in can also be crawled.
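To check that the simulated POST really carried our data, the JSON that http://httpbin.org/post echoes back can be parsed; a minimal sketch (the "form" and "headers" fields are part of httpbin's response format):

import json
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding='utf-8')
response = urllib.request.urlopen("http://httpbin.org/post", data=data)

result = json.loads(response.read().decode('utf-8'))  # httpbin echoes the request back as JSON
print(result["form"])     # {'hello': 'world'} -- the form data the server received
print(result["headers"])  # the headers our crawler actually sent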
Using try-except to handle timeouts
When crawling, we cannot wait forever for a response. Sometimes the network is bad, the page has anti-crawling measures, or something else goes wrong and the page simply cannot be fetched quickly. In that case we want to move on and crawl the next page; the timeout parameter makes this possible.
# -*- coding: utf-8 -*-
# @Author: Y-peak
# @Time: 2021-9-2
# @FileName: testUrllib.py
# Software: PyCharm
import urllib.request
import urllib.error

try:
    # timeout=0.01 means: if the server does not respond within 0.01 s, raise an error instead of waiting
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)  # returns an object that stores the web page data
    print(response.read().decode('utf-8'))  # read the data and decode it as utf-8 to avoid garbled characters
except urllib.error.URLError as e:
    print("timeout\t\t error is:", e)

Status codes & getheaders()
status:

Returns 200: a normal response; the page can be crawled.
Error 404: the web page was not found.
Error 418: the site is telling you it knows you are a crawler.
getheaders(): gets the response headers.

You can also obtain the value of a single header xx with getheader("xx"); for example, getheader("Content-Encoding") might return gzip.
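A minimal sketch of reading these values from the response object (getheader() is the actual method name on the response; the URL here is just the httpbin test server used earlier):

import urllib.request

response = urllib.request.urlopen("http://httpbin.org/get")

print(response.status)                      # 200 means a normal response that can be crawled
print(response.getheaders())                # the full response headers as a list of (name, value) tuples
print(response.getheader("Content-Type"))   # a single header, e.g. 'application/json'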
Breaking through anti-crawling
First open any web page, press F12, find the request headers in the developer tools, and scroll to the bottom to find User-Agent. Copy it and save it; we will use it to get around the anti-crawling check.

Next, let's try crawling Douban directly. We get a 418 straight away: the site knows we are a crawler, so we have to disguise ourselves.

Why 418? Because when the request is sent directly, the User-Agent it carries is urllib's default one, which tells the site straight away that we are a crawler. We need to disguise it.
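An easy way to see what a bare urllib request presents is to ask http://httpbin.org/get to echo the request headers back; by default urllib identifies itself with a User-Agent such as "Python-urllib/3.x", which many sites treat as a bot (a minimal sketch):

import json
import urllib.request

# no headers supplied: the server sees urllib's default User-Agent
response = urllib.request.urlopen("http://httpbin.org/get")
headers = json.loads(response.read().decode('utf-8'))["headers"]
print(headers["User-Agent"])  # something like 'Python-urllib/3.9' -- an obvious crawler signature

Replacing this default with a real browser string, as the next example does, is what gets past the 418.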
# -*- coding: utf-8 -*-
# @Author: Y-peak
# @Time: 2021-9-2
# @FileName: testUrllib.py
# Software: PyCharm
import urllib.request

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"}
request = urllib.request.Request("http://douban.com", headers=headers)  # build the request disguised as a browser
response = urllib.request.urlopen(request)  # send it; returns an object that stores the web page data
data = response.read().decode('utf-8')  # read the data and decode it as utf-8 to avoid garbled characters
with open("index.html", 'w', encoding='utf-8') as wfile:  # you can also open the file the ordinary way, but remember to close()
    wfile.write(data)
Of course, anti-crawling is usually not this simple. The POST request mentioned above is also a very common way to get past anti-crawling, and if that is not enough, the whole set of request headers can be imitated. Here is another example for reference, visiting the same URL as the POST example above.
Browser access result
Crawler access result
# -*- coding: utf-8 -*-
# @Author: Y-peak
# @Time: 2021-9-3
# @FileName: testUrllib.py
# Software: PyCharm
import urllib.request
import urllib.parse

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"}
url = "http://httpbin.org/post"
data = bytes(urllib.parse.urlencode({"account": "password"}), encoding='utf-8')
request = urllib.request.Request(url, data=data, headers=headers, method='POST')  # build the request
response = urllib.request.urlopen(request)  # send it; returns the response object
data = response.read().decode('utf-8')  # read the data and decode it as utf-8 to avoid garbled characters
print(data)
At this point, you should have a deeper understanding of what the urllib library does for a Python crawler. The best way to consolidate it is to try the examples out in practice.