2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article introduces the commonly used functions of Python's urllib library. The content is detailed and easy to understand, and the operations are simple and quick to try, so it should make a useful reference. Let's take a look.
I. What is the urllib library?
Python 3 merged the urllib and urllib2 libraries of Python 2 into a single urllib library, so when we talk about urllib today we generally mean the one in Python 3. So what is the urllib library, how do you use it, and what are its common functions?
urllib is divided into the following four functional modules:
urllib.request (request module)
urllib.parse (parsing module)
urllib.error (exception handling module)
urllib.robotparser (robots.txt parsing module)
urllib is Python's built-in HTTP request library; it can be used directly without installation and is a staple of crawler developers. This article summarizes the basic usage of its most common functions.
II. Explanation of urllib usage
1. The urllib.request.urlopen() function
Creates a file-like object representing the remote url; you can then operate on this object much like a local file to fetch the remote data. The syntax is as follows:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
url: the URL to request
data: the request body. If this value is set, the request becomes a POST request.
timeout: the timeout, in seconds, for accessing the site
cafile and capath: used in HTTPS requests to set the CA certificate and its path
Example
from urllib import request
response = request.urlopen('http://www.baidu.com')  # GET request
print(response.read().decode('utf-8'))  # read the response body and decode it
The object returned by urlopen() provides the following methods and attributes:
read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data
info(): returns an HTTPMessage object holding the headers returned by the remote server
getcode(): returns the HTTP status code
geturl(): returns the requested URL
getheaders(): the response headers
getheader('Server'): returns the value of the named response header, here Server
status: the status code, as an attribute
reason: the status description
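The response-object methods above can be tried without depending on an external site by standing up a throwaway local HTTP server. Everything below (the handler, the port choice, the 'hello' body) is invented for the demonstration; urlopen() itself is used exactly as described:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

class HelloHandler(BaseHTTPRequestHandler):
    """Minimal handler: every GET returns a 5-byte plain-text body."""

    def do_GET(self):
        body = b'hello'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging to keep the output clean

server = HTTPServer(('127.0.0.1', 0), HelloHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/' % server.server_port
response = request.urlopen(url)
status = response.getcode()                        # 200
content_type = response.getheader('Content-Type')  # 'text/plain'
body = response.read().decode('utf-8')             # 'hello'
print(status, content_type, body)
server.shutdown()
```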
2. The urllib.request.urlretrieve() function
This function conveniently saves a file from the web to the local disk. The syntax is as follows:
urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)
url: the address of the remote data
filename: the path to save the file to. If empty, the data is downloaded to a temporary file.
reporthook: a hook function called once when the server connection is established and once for each data block downloaded. It receives three arguments: the number of blocks downloaded so far, the block size, and the total file size, and can be used to display download progress.
data: data to POST to the server
Example
from urllib import request
request.urlretrieve('http://www.baidu.com/', 'baidu.html')  # download Baidu's home page to a local file
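One way to watch the reporthook fire without touching the network is to "download" a local temporary file through the file: scheme. The temporary file and the progress function below are made up for the demonstration; a real http:// URL works the same way:

```python
import os
import tempfile
from urllib import request

# Create a 10 000-byte local file to stand in for a remote resource.
src = tempfile.NamedTemporaryFile(delete=False, suffix='.txt')
src.write(b'x' * 10000)
src.close()

def progress(block_num, block_size, total_size):
    # Called once when the "connection" opens (block_num == 0)
    # and once after each block is read.
    downloaded = min(block_num * block_size, total_size)
    print('%d / %d bytes' % (downloaded, total_size))

dest = src.name + '.copy'
url = 'file:' + request.pathname2url(src.name)  # file URL for the local file
request.urlretrieve(url, dest, reporthook=progress)
print(os.path.getsize(dest))  # 10000
```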
3. The urllib.parse.urlencode() function
urlencode() converts dictionary data into URL-encoded data. The syntax is as follows:
urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)
query: the query parameters
doseq: whether sequence values are converted element by element
safe: characters that should not be quoted; defaults to ''
encoding: the character encoding
errors: the encoding error handler
quote_via: the function that str components, together with safe, encoding, and errors, are passed to for quoting. The default is quote_plus(); the stricter quote() can be used instead.
Example
from urllib import parse
data = {'姓名': 'W3CSchool', '问好': 'Hello W3CSchool', '年龄': 100}
qs = parse.urlencode(data)
print(qs)
# %E5%A7%93%E5%90%8D=W3CSchool&%E9%97%AE%E5%A5%BD=Hello+W3CSchool&%E5%B9%B4%E9%BE%84=100
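The less obvious parameters, doseq and quote_via, can be illustrated with some made-up query data:

```python
from urllib import parse

params = {'q': 'python urllib', 'tags': ['web', 'http']}

# Default: a list value is encoded as its str() representation.
print(parse.urlencode(params))
# q=python+urllib&tags=%5B%27web%27%2C+%27http%27%5D

# doseq=True: one key=value pair per sequence element.
print(parse.urlencode(params, doseq=True))
# q=python+urllib&tags=web&tags=http

# quote_via=parse.quote: spaces become %20 instead of +.
print(parse.urlencode({'q': 'python urllib'}, quote_via=parse.quote))
# q=python%20urllib
```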
4. The urllib.parse.parse_qs() function
Decodes URL-encoded parameters. The syntax is as follows:
urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')
keep_blank_values: whether to keep keys whose value is blank. Defaults to False.
strict_parsing: a flag controlling how parsing errors are handled. If False (the default), errors are silently ignored; otherwise a ValueError is raised.
Example
from urllib import parse
data = {'姓名': 'W3CSchool', '问好': 'hello W3CSchool', '年龄': 100}
qs = parse.urlencode(data)
print(qs)
# %E5%A7%93%E5%90%8D=W3CSchool&%E9%97%AE%E5%A5%BD=hello+W3CSchool&%E5%B9%B4%E9%BE%84=100
print(parse.parse_qs(qs))
# {'姓名': ['W3CSchool'], '问好': ['hello W3CSchool'], '年龄': ['100']}
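The effect of keep_blank_values can be seen with a query string that contains an empty value (the string below is invented for the demonstration):

```python
from urllib import parse

qs = 'name=W3CSchool&age=100&nickname='

# By default, keys with blank values are dropped.
print(parse.parse_qs(qs))
# {'name': ['W3CSchool'], 'age': ['100']}

# keep_blank_values=True keeps them as empty strings.
print(parse.parse_qs(qs, keep_blank_values=True))
# {'name': ['W3CSchool'], 'age': ['100'], 'nickname': ['']}
```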
5. The urllib.parse.parse_qsl() function
The usage is the same as the parse_qs() function, except that urllib.parse.parse_qs() returns a dictionary while urllib.parse.parse_qsl() returns a list. The syntax is as follows:
urllib.parse.parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')
Example
from urllib import parse
data = {'姓名': 'W3CSchool', '问好': 'hello W3CSchool', '年龄': 100}
qs = parse.urlencode(data)
print(parse.parse_qsl(qs))
# [('姓名', 'W3CSchool'), ('问好', 'hello W3CSchool'), ('年龄', '100')]
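One practical consequence of the list form: duplicate keys survive, so a parse_qsl() result round-trips cleanly through urlencode(), while parse_qs() folds duplicates into a single list-valued key. A small sketch with an invented query string:

```python
from urllib import parse

qs = 'tag=python&tag=urllib&page=1'

pairs = parse.parse_qsl(qs)
print(pairs)  # [('tag', 'python'), ('tag', 'urllib'), ('page', '1')]

# The list of pairs keeps duplicate keys, so encoding it reproduces the input.
print(parse.urlencode(pairs))  # tag=python&tag=urllib&page=1

# parse_qs() collapses duplicates into one key with a list of values.
print(parse.parse_qs(qs))  # {'tag': ['python', 'urllib'], 'page': ['1']}
```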
6. The urllib.parse.urlparse() and urllib.parse.urlsplit() functions
When you have a url and want to split it into its components, you can use urlparse() or urlsplit(). Their syntax is as follows:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)
urlparse() and urlsplit() are almost identical.
The only difference is that the result of urlparse() has a params attribute, while the result of urlsplit() does not.
Example
from urllib import parse
url = 'http://www.baidu.com/index.html;user?id=S#comment'
result = parse.urlparse(url)
# result = parse.urlsplit(url)
print(result)
print(result.scheme)
print(result.netloc)
print(result.path)
print(result.params)  # the params attribute exists on urlparse() results but not on urlsplit() results
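To see the difference side by side, the same URL can be run through both functions:

```python
from urllib import parse

url = 'http://www.baidu.com/index.html;user?id=S#comment'

p = parse.urlparse(url)
s = parse.urlsplit(url)

print(p)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=S', fragment='comment')
print(s)
# SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=S', fragment='comment')

# urlparse() splits the params off the last path segment;
# urlsplit() leaves them attached to the path.
print(p.path, '|', p.params)  # /index.html | user
print(s.path)                 # /index.html;user
```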
7. The urllib.error module
The error module of urllib defines the exceptions raised by urllib.request. If a request fails, urllib.request raises one of the error module's exceptions.
URLError
The URLError class comes from urllib's error module, inherits from OSError, and is the base class of the module's exceptions. It has a reason attribute that gives the cause of the error.
Example
from urllib import request, error
try:
    resp = request.urlopen('https://w3cschool.c/index.html')  # unresolvable host
except error.URLError as e:
    print(e.reason)
# [Errno 11001] getaddrinfo failed
HTTPError
HTTPError is a subclass of URLError, used specifically to handle HTTP request errors, and has the following three attributes:
code: the HTTP status code
reason: the cause of the exception
headers: the response headers
Example
from urllib import request, error
try:
    response = request.urlopen('http://www.baidu.com/no-such-page.html')  # a page that does not exist
except error.HTTPError as e:
    print(e.code)
# 404
Of course, most of the time URLError and HTTPError are combined for exception handling: HTTPError is caught first to get the status code, exception reason, response headers, and so on; if the error is not of that type, URLError is caught and the cause is printed; finally, an else clause handles the normal flow.
Example
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request('http://www.baidu.cnc/')
try:
    response = urlopen(req)
except HTTPError as e:
    print('(www.baidu.cnc) server could not complete the request.')
    print('error code:', e.code)
except URLError as e:
    print('We cannot connect to the server.')
    print('cause:', e.reason)
else:
    print('Link succeeded!')
    print(response.read().decode('utf-8'))
These are the functions commonly used in the urllib library. Combining theory with practice is the best way to learn, so do try these examples yourself. Recommended reading: Python static crawlers, Python Scrapy web crawlers.
Finally, let's summarize the common meanings of various status codes:
200: the request succeeded and the server returned data normally
301: permanent redirect. For example, accessing www.jingdong.com redirects to www.jd.com.
302: temporary redirect. For example, visiting a page that requires login while not logged in redirects you to the login page.
404: the requested url cannot be found on the server; in other words, the request url is wrong
403: the server refused access; insufficient permissions
500: internal server error; there may be a bug on the server
This concludes the article on the common functions of the urllib library. Thank you for reading!