
How to Use the urllib Library in Python Crawlers


This article explains how to use the urllib library in Python crawlers. The editor finds it very practical and shares it here as a reference; follow along to have a look.

I. Description

The urllib library is an HTTP request library built into Python, and the requests library is developed on top of it. Although the requests library is more convenient to use, urllib is the most basic request library, so it is worth understanding its principles and usage.

II. urllib consists of four modules

urllib.request: the request module (like typing a URL into the browser and pressing Enter)

urllib.error: the exception handling module (errors raised during a request can be caught here)

urllib.parse: the URL parsing module

urllib.robotparser: the robots.txt parsing module, used to determine which sites may be crawled and which may not; it is rarely used (see the sketch below).
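Since robotparser gets no example later in the article, here is a minimal sketch of how it can be used; the target site and path are only illustrative assumptions.

from urllib import robotparser

# Point the parser at a site's robots.txt and download it
rp = robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()
# Check whether a given user agent is allowed to crawl a given path
print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))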

Usage differs between Python 2 and Python 3.

In Python 2:

import urllib2
response = urllib2.urlopen('http://www.baidu.com')

In Python 3:

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

III. urllib.request

1. The urlopen function

urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

The url parameter

from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

The data parameter

Without the data parameter, a GET request is sent; once the data parameter is supplied, the request becomes a POST (here http://httpbin.org is used as the test URL).

import urllib.request
import urllib.parse

data1 = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data1)
print(response.read())

The data parameter must be of type bytes, so it is encoded with the bytes() function. The first argument of bytes() must be a str, so urllib.parse.urlencode is used to convert the dictionary into a string first.

The timeout parameter

Sets a timeout period; if no response arrives within that time, an exception is raised.

import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.001)
    print(response.read())
except Exception:
    print('error')

The timeout is set to 0.001 seconds; no response arrives within that time, so 'error' is printed.
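If you want to tell a timeout apart from other failures instead of catching everything, a minimal sketch is shown below; the url and the deliberately tiny timeout value are only illustrative, and depending on where the timeout happens a socket.timeout may also be raised directly rather than wrapped.

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
    print(response.status)
except urllib.error.URLError as e:
    # When the connection times out, urlopen usually wraps the socket timeout in URLError.reason
    if isinstance(e.reason, socket.timeout):
        print('request timed out')
    else:
        print('request failed:', e.reason)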

2. The response type

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))

Status code and response headers

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

The read method

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
data = response.read()
print(type(data))
print(data.decode('utf-8'))

response.read() returns the data as bytes, so it has to be decoded with decode('utf-8'). Since the body can only be read once, it is stored in a variable first.

3. Request object

To send more complex requests, use the Request object from the urllib library.

import urllib.request

# Declare a Request object, passing the url directly as a parameter
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Declare a Request object, pass the url to it as an argument, and then pass that object to the urlopen function.

For more complex requests, add headers

# Implement a POST request using a Request object
import urllib.request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
data = {'word': 'hello'}
data = bytes(str(data), encoding='utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

The request above specifies the request method, the url, the request headers, and the request body, so the logic is clear.

The Request object also has an add_header method, which adds headers one key-value pair at a time.
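A minimal sketch of add_header; the url and header values here are only illustrative assumptions.

import urllib.request

req = urllib.request.Request('http://httpbin.org/get')
# Add headers one key-value pair at a time
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')
req.add_header('Accept-Language', 'zh-CN,zh;q=0.9')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))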

4. Advanced request methods

Setting a proxy

Many websites track how many times a given IP visits within a certain period (through traffic statistics, system logs, and so on). If the access pattern does not look like a normal user, the site will block that IP. ProxyHandler (a handler that sets a proxy) lets you change the IP address your requests appear to come from.

from urllib import request                      # Import the request module

url = 'http://httpbin.org'                      # The url address

# Create a proxy handler using the ProxyHandler class of the request module
handler = request.ProxyHandler({'http': '122.193.244.243:9999'})
# Paid proxy mode
# handler = request.ProxyHandler({'http': 'account:password@122.193.244.243:9999'})

opener = request.build_opener(handler)          # Build an opener with the handler
resp = opener.open(url)                         # Use opener.open() to send the request
print(resp.read())                              # Print the returned result

Cookie

import urllib.request
import urllib.parse

url = 'https://weibo.cn/5273088553/info'

# Normal access
# headers = {
#     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
# }

# Access carrying a cookie
headers = {
    # This first entry is the raw request line copied from the browser's dev tools
    'GET https': '//weibo.cn/5273088553/info HTTP/1.1',
    'Host': 'weibo.cn',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    # 'Referer': 'https://weibo.cn/',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': '_T_WM=c1913301844388de10cba9d0bb7bbf1e; SUB=_2A253Wy_dDeRhGeNM7FER-CbJzj-IHXVUp7GVrDV6PUJbkdANLXPdkW1NSesPJZ6v1GA5MyW2HEUb9ytQW3NYy19U; SUHB=0bt8SpepeGz439; SCF=Aua-HpSw5-z78-02NmUv8CTwXZCMN4XJ91qYSHkDXH4W9W0fCBpEI6Hy5E6vObeDqTXtfqobcD2D32r0Oroom5jSRk.; SSOLoginState=1516199821',
}

request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
# Print the whole page
# print(response.read().decode('gbk'))
# Write the content to a file
with open('weibo.html', 'wb') as fp:
    fp.write(response.read())

IV. urllib.error

Three kinds of exceptions can be caught: URLError, HTTPError (a subclass of URLError), and ContentTooShortError.

URLError has only one attribute: reason.

HTTPError has three attributes: code, reason, and headers.

import urllib.request
from urllib import error

try:
    response = urllib.request.urlopen('http://123.com')
except error.URLError as e:
    print(e.reason)


import urllib.request
from urllib import error

# Catch the HTTP exception first, then catch the URL exception
try:
    response = urllib.request.urlopen('http://123.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('Request Success')

V. URL parsing: urllib.parse

The urlparse function

This function divides the incoming url into several parts and assigns values to each part.

from urllib import parse

result = parse.urlparse('http://www,baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)

The result shows that the url has been neatly split:

ParseResult(scheme='http', netloc='www,baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

As can be seen from the output, these parts are: protocol type, domain name, path, parameters, query, and fragment.

urlparse takes several parameters: url, scheme, and allow_fragments.

When calling urlparse, you can set a default protocol type with the parameter scheme='http'. If the url already contains a protocol, the scheme parameter has no effect.
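A small sketch of the scheme parameter; the urls are only examples.

from urllib.parse import urlparse

# The url carries no protocol, so the default scheme takes effect
print(urlparse('www.baidu.com/index.html?id=5', scheme='https').scheme)         # https
# The url already carries a protocol, so the scheme parameter is ignored
print(urlparse('http://www.baidu.com/index.html?id=5', scheme='https').scheme)  # http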

The urlunparse function

The opposite of the urlparse function: it assembles the parts back into a url.
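A minimal sketch; the component values are made up for illustration.

from urllib.parse import urlunparse

# The six components: scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'id=5', 'comment']
print(urlunparse(data))   # http://www.baidu.com/index.html;user?id=5#comment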

The urljoin function

Used to join urls, combining a base url with another (possibly relative) url.
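A minimal sketch; the urls are only examples.

from urllib.parse import urljoin

# A relative path is resolved against the base url
print(urljoin('http://www.baidu.com', 'index.html'))
# http://www.baidu.com/index.html

# If the second url is already absolute, it wins
print(urljoin('http://www.baidu.com/about.html', 'https://httpbin.org/get'))
# https://httpbin.org/get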

The urlencode function

Converts a dictionary into GET request parameters (a query string).
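A minimal sketch; the parameter names and values are made up for illustration.

from urllib.parse import urlencode

params = {'name': 'shulou', 'id': 5}
base_url = 'http://www.baidu.com?'
print(base_url + urlencode(params))   # http://www.baidu.com?name=shulou&id=5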

Thank you for reading! This is the end of the article on "How to Use the urllib Library in Python Crawlers". I hope the content above has been helpful and that you have learned something new. If you think the article is good, share it so that more people can see it!
