This article is about how to use the urllib library in Python crawlers. It is quite practical, so it is shared here as a reference; follow along for a closer look.
I. Description
The urllib library is an HTTP request library built into Python, and the requests library is developed on top of it. Although requests is more convenient to use, urllib is the most basic request library, so it is worth understanding its principles and usage.
II. urllib consists of four modules
urllib.request
The request module (works like typing a URL into the browser and pressing enter).
urllib.error
The exception handling module (exceptions raised by failed requests can be caught here).
urllib.parse
The URL parsing module.
urllib.robotparser
The robots.txt parsing module, which determines which pages of a site may be crawled and which may not; it is rarely used (see the sketch after this list).
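Since robotparser does not come up again below, here is a minimal sketch of how it can be used, assuming the target site serves a robots.txt:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')  # point at the site's robots.txt
rp.read()                                      # download and parse it
# can_fetch() tells whether the given user agent may crawl the url
print(rp.can_fetch('*', 'http://www.baidu.com/homepage/'))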
There is a difference between Python 2 and Python 3.

In Python 2:

import urllib2
response = urllib2.urlopen('http://www.baidu.com')

In Python 3:

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

III. urllib.request

1. urlopen function
urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)
url parameter

from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
data parameter

When there is no data parameter, a GET request is sent; when the data parameter is added, the request becomes a POST (using http://httpbin.org as the test URL).

import urllib.request
import urllib.parse

data1 = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data1)
print(response.read())
The data parameter must be of type bytes, so the bytes() function is used to encode it; the first argument of bytes() must be a str, so urllib.parse.urlencode is used to convert the dictionary into a string first.
timeout parameter

Sets a timeout period; if no response arrives within it, an exception is raised.

import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.001)
    print(response.read())
except:
    print('error')

With the timeout set to 0.001 seconds, no response arrives in time, so 'error' is printed.
2. Response

Response type:

import urllib
from urllib import request

response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))
Status code and response headers

import urllib
from urllib import request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
read method

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
content = response.read()  # the body can only be read once
print(type(content))
print(content.decode('utf-8'))

response.read() returns the data as bytes, so it needs to be decoded with decode('utf-8'). Note that the response body can only be read once, so store it in a variable if it is used more than once.
3. Request object

If we need to send more complex requests, we need to use a Request object from the urllib library.

import urllib.request

# declare a Request object, passing the url directly as a parameter
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Declare a Request object, pass the url to it as an argument, and then pass the object to the urlopen function.
For more complex requests, add headers:

# implement a POST request using a Request object
import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
data = {'word': 'hello'}
data = bytes(urllib.parse.urlencode(data), encoding='utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

The above request contains the request method, url, request headers, and request body; the logic is clear.
The Request object also has an add_header method, which can add headers one key-value pair at a time, as sketched below.
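A minimal sketch of add_header (the url and header values here are only placeholders):

import urllib.request

req = urllib.request.Request('http://httpbin.org/get')
# add_header takes one key-value pair per call
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64)')
req.add_header('Accept-Language', 'zh-CN,zh;q=0.9')
response = urllib.request.urlopen(req)
print(response.status)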
4. Advanced request methods

Setting up a proxy

Many websites detect the number of visits from a given IP within a certain period (through traffic statistics, system logs, etc.). If the visit count does not look like that of a normal user, access from that IP is blocked. ProxyHandler (a handler that sets a proxy) lets you change your own IP address.

from urllib import request  # import the request module

url = 'http://httpbin.org'  # url address
# create a proxy handler with the request module's ProxyHandler class
handler = request.ProxyHandler({'http': '122.193.244.243:9999'})
# paid proxy with authentication:
# handler = request.ProxyHandler({'http': 'account:password@122.193.244.243:9999'})
opener = request.build_opener(handler)  # create an opener with the handler
resp = opener.open(url)  # send the request with opener.open()
print(resp.read())  # print the returned result
Cookie

import urllib.request
import urllib.parse

url = 'https://weibo.cn/5273088553/info'

# normal access
# headers = {
#     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
# }

# access carrying a cookie
headers = {
    # 'GET https': '//weibo.cn/5273088553/info HTTP/1.1',  # the request line from dev tools, not a real header
    'Host': 'weibo.cn',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    # 'Referer': 'https://weibo.cn/',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': '_T_WM=c1913301844388de10cba9d0bb7bbf1e; SUB=_2A253Wy_dDeRhGeNM7FER-CbJzj-IHXVUp7GVrDV6PUJbkdANLXPdkW1NSesPJZ6v1GA5MyW2HEUb9ytQW3NYy19U; SUHB=0bt8SpepeGz439; SCF=Aua-HpSw5-z78-02NmUv8CTwXZCMN4XJ91qYSHkDXH4W9W0fCBpEI6Hy5E6vObeDqTXtfqobcD2D32r0Oroom5jSRk.; SSOLoginState=1516199821',
}

request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)

# print everything
# print(response.read().decode('gbk'))

# write the content to a file
with open('weibo.html', 'wb') as fp:
    fp.write(response.read())

IV. urllib.error
Three kinds of exceptions can be caught: URLError, HTTPError (a subclass of URLError), and ContentTooShortError.

URLError has only one attribute: reason.

HTTPError has three attributes: code, reason, and headers.
import urllib.request
from urllib import error

try:
    response = urllib.request.urlopen('http://123.com')
except error.URLError as e:
    print(e.reason)

import urllib
from urllib import request
from urllib import error

# catch the HTTP exception first, then the URL exception
try:
    response = urllib.request.urlopen('http://123.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('RequestSuccess')

V. URL parsing: urllib.parse
urlparse function

This function splits the given url into several parts and assigns a value to each part.

import urllib
from urllib import parse

result = urllib.parse.urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)
The result shows the url was conveniently split:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

As can be seen from the output, the parts are: protocol type (scheme), domain name (netloc), path, parameters (params), query, and fragment.
urlparse takes several parameters: url, scheme, and allow_fragments.

When calling urlparse, you can specify a default protocol type with the parameter scheme='http'. If the url already has a protocol type, the scheme parameter does not take effect, as the sketch below shows.
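A minimal sketch of the scheme parameter's behavior:

from urllib.parse import urlparse

# the url has no protocol, so the default scheme applies
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result.scheme)  # https

# the url already has a protocol, so the scheme parameter is ignored
result = urlparse('http://www.baidu.com/index.html', scheme='https')
print(result.scheme)  # http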
urlunparse function

The opposite of the urlparse function: it splices the parts back into a url, as sketched below.
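A minimal sketch: urlunparse takes an iterable of the six parts in the same order urlparse produces them:

from urllib.parse import urlunparse

# scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'id=5', 'comment']
print(urlunparse(data))
# http://www.baidu.com/index.html;user?id=5#comment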
urljoin function

Used to splice urls: it resolves the second url against the first (base) url, as sketched below.
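A minimal sketch of urljoin (the urls are illustrative):

from urllib.parse import urljoin

# resolve a relative path against a base url
print(urljoin('http://www.baidu.com', 'index.html'))
# http://www.baidu.com/index.html

# an absolute second url replaces the base entirely
print(urljoin('http://www.baidu.com/about.html', 'https://example.com/FAQ.html'))
# https://example.com/FAQ.html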
urlencode function

Converts a dictionary into GET request parameters, as sketched below.
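A minimal sketch of urlencode (the base url is illustrative):

from urllib.parse import urlencode

params = {'name': 'bob', 'age': 23}
base_url = 'http://httpbin.org/get?'
url = base_url + urlencode(params)
print(url)
# http://httpbin.org/get?name=bob&age=23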
Thank you for reading! That concludes this article on "how to use the urllib library in Python crawlers". I hope the content above is of some help and lets you learn more. If you think the article is good, feel free to share it so more people can see it!