Updated 2025-02-24. From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article shows how to use urllib to parse links in Python 3. Many readers may not be familiar with this part of the library, so the examples below are shared for reference.
The urllib library also provides the parse module, which defines a standard interface for working with URLs, such as extracting, combining, and converting their components. It supports URLs of the following schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais.
1. urlparse()
The urlparse() method identifies a URL and splits it into its components. Let's look at an example:
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
Here we use the urlparse() method to parse the URL, printing first the type of the parsing result and then the result itself.
Running result:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
As you can see, the returned result is an object of type ParseResult, which contains six parts, namely scheme, netloc, path, params, query, and fragment.
Look at the URL used in this example:
http://www.baidu.com/index.html;user?id=5#comment
The urlparse() method splits it into six parts. Looking at the result, you can see that parsing relies on specific delimiters: whatever precedes :// is the scheme, which represents the protocol; whatever precedes the first / after that is the netloc, that is, the domain name; the part after the semicolon ; is params, which represents the parameters; the part after the question mark ? is the query; and the part after the hash sign # is the fragment.
So we arrive at a standard link format:
scheme://netloc/path;parameters?query#fragment
A standard URL will conform to this rule, and we can parse it with the urlparse () method.
Besides this most basic way of parsing, does the urlparse() method offer any other options? Let's look at its API:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
You can see that it has three parameters:
urlstring is required. It is the URL to be parsed.
scheme is the default protocol (such as http or https). If the link carries no protocol information, this value is used as the default.
Let's try it with an example:
from urllib.parse import urlparse
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
Running result:
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
You can see that the URL we provided does not start with any scheme information, but because we specified the default scheme parameter, the returned scheme is https. Note also that netloc is empty: without a leading //, the host name is parsed as part of path.
What if the URL already includes a scheme?
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
The result is as follows:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
It can be seen that the scheme parameter takes effect only if the URL does not contain scheme information. If there is scheme information in the URL, the parsed scheme is returned.
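One detail of the scheme-less example above is worth a closer look: its netloc came back empty and the host name ended up in path. The following is a minimal sketch, not part of the original article, showing that a netloc is only recognized after a leading //. The shortened URLs are chosen purely for illustration.

```python
from urllib.parse import urlparse

# Without '//', the host name is treated as part of the path.
bare = urlparse('www.baidu.com/index.html', scheme='https')

# With a leading '//', the host name is recognized as netloc,
# and the default scheme is still applied.
prefixed = urlparse('//www.baidu.com/index.html', scheme='https')

print(repr(bare.netloc), bare.path)
print(repr(prefixed.netloc), prefixed.path)
```

If you receive links without a protocol and want the host in netloc, prefixing them with // before parsing is one possible approach.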
allow_fragments controls whether the fragment is recognized. If it is set to False, the fragment part is not parsed separately; it is treated as part of the path, parameters, or query, and the fragment field is left empty.
Let's try it with an example:
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)
Running result:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')
What if the URL contains neither params nor query?
Let's take another example:
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
Running result:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
You can see that when the URL contains neither params nor query, the fragment is parsed as part of path.
The returned ParseResult is actually a tuple, so its fields can be read either by index or by attribute name. An example is as follows:
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result.scheme, result[0], result.netloc, result[1], sep='\n')
Here we read scheme and netloc both by attribute name and by index. The running result is as follows:
http
http
www.baidu.com
www.baidu.com
The two ways give the same results, and both work.
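Because the result behaves like a named tuple, a few more operations follow naturally. The sketch below is an illustration rather than part of the original article; the port 8080 is an assumption added only so that the port attribute has something to show.

```python
from urllib.parse import urlparse

# The ':8080' port is a made-up value for illustration.
result = urlparse('http://www.baidu.com:8080/index.html;user?id=5#comment')

# Tuple unpacking works because ParseResult is a named tuple.
scheme, netloc, path, params, query, fragment = result

# Convenience attributes derived from netloc.
print(result.hostname)  # host name without the port
print(result.port)      # port as an int, or None if absent

# _replace() returns a modified copy; geturl() reassembles the URL string.
secure = result._replace(scheme='https')
print(secure.geturl())
```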
2. urlunparse()
Where there is urlparse(), there is also its opposite, urlunparse().
It accepts an iterable, but its length must be 6; otherwise it raises an error about too few or too many values.
Let's first try it with an example:
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
The data argument here is a list, but you can also use other types, such as a tuple or a specific data structure.
The running result is as follows:
http://www.baidu.com/index.html;user?a=6#comment
In this way, we have successfully constructed a URL.
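As a rough check of the two claims above (urlunparse() is the inverse of urlparse(), and the length must be 6), here is a small sketch. The exact exception type is not stated in the article, so the code catches both ValueError and TypeError as an assumption rather than relying on one of them.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?a=6#comment'

# Parsing and then unparsing gives back the original string for this URL.
rebuilt = urlunparse(urlparse(url))
print(rebuilt == url)

# Passing five components instead of six is rejected.
try:
    urlunparse(['http', 'www.baidu.com', 'index.html', 'a=6', 'comment'])
    length_error = None
except (ValueError, TypeError) as exc:
    length_error = exc
print(type(length_error).__name__)
```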
3. urlsplit()
This method is very similar to urlparse(), except that it does not parse the params part separately and returns only five results. The params from the example above are merged into path. Let's try it with an example:
from urllib.parse import urlsplit
result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
Running result:
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
You can see that the returned result is a SplitResult, which is also a tuple type, so its fields can be read either by attribute or by index. An example is as follows:
from urllib.parse import urlsplit
result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result[0])
Running result:
http http
4. urlunsplit()
Like urlunparse(), this method combines the parts of a link into a complete link. It takes an iterable, such as a list or a tuple; the only difference is that the length must be 5.
Let's try it with an example:
from urllib.parse import urlunsplit
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))
Running result:
http://www.baidu.com/index.html?a=6#comment
This also assembles the link successfully.
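To make the difference between the two families concrete, the following sketch parses the same URL with both urlparse() and urlsplit(), then edits one field and reassembles it with urlunsplit(). The replacement query value id=6 is an arbitrary choice for illustration.

```python
from urllib.parse import urlparse, urlsplit, urlunsplit

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parsed = urlparse(url)  # six fields, params separated out
split = urlsplit(url)   # five fields, params left inside path

print(parsed.path, parsed.params)
print(split.path)

# Swap the query string and reassemble with urlunsplit().
updated = urlunsplit(split._replace(query='id=6'))
print(updated)
```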
5. urljoin()
With the urlunparse() and urlunsplit() methods we can assemble links, but only if we have an object of a specific length in which each part of the link is clearly separated.
There is another way to generate links: the urljoin() method. We pass a base_url (the base link) as the first argument and the new link as the second. The method analyzes the scheme, netloc, and path of base_url, uses them to fill in whatever the new link is missing, and returns the result.
Let's try it with a few examples:
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
Running result:
http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2
As you can see, base_url provides three items: scheme, netloc, and path. If any of these is missing from the new link, it is filled in from base_url; if the new link already has it, the new link's own part is used. The params, query, and fragment in base_url have no effect.
With the functions above, we can easily parse, join, and generate links.
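The examples above mostly join a bare host with a file name or an absolute link. A further sketch, using a hypothetical base path /docs/guide/index.html that does not appear in the original article, shows how relative references of different shapes are resolved against the base link's path.

```python
from urllib.parse import urljoin

# Hypothetical base link with a nested path, for illustration only.
base = 'http://www.baidu.com/docs/guide/index.html'

sibling = urljoin(base, 'FAQ.html')                     # same directory
parent = urljoin(base, '../about.html')                 # one directory up
from_root = urljoin(base, '/about.html')                # from the site root
other_host = urljoin(base, '//cuiqingcai.com/FAQ.html') # keeps only the scheme

for link in (sibling, parent, from_root, other_host):
    print(link)
```

This is the typical use in a crawler: the page address serves as base_url and each href found on the page is the second argument.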
6. urlencode()
Next is a commonly used method, urlencode(), which is very useful when constructing GET request parameters. Let's try it with an example:
from urllib.parse import urlencode
params = {'name': 'germey', 'age': 22}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
We first declare a dictionary that represents the parameters, then call the urlencode() method to serialize it into standard GET request parameters.
Running result:
http://www.baidu.com?name=germey&age=22
You can see that the parameters are successfully converted from a dictionary into GET request parameters.
This method is used very often. To make parameters easier to construct, we sometimes hold them in a dictionary first and then simply call this method to convert them into URL parameters.
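The example above uses only plain values. The sketch below, with made-up parameter names and values, illustrates three behaviors of urlencode() that tend to matter in practice: escaping of reserved characters, list values with doseq=True, and switching the space encoding with quote_via.

```python
from urllib.parse import urlencode, quote

# Reserved characters are escaped; spaces become '+' by default.
escaped = urlencode({'wd': 'python urllib', 'tag': 'a&b'})
print(escaped)

# doseq=True expands a list value into repeated keys.
repeated = urlencode({'tag': ['x', 'y']}, doseq=True)
print(repeated)

# quote_via=quote encodes spaces as %20 instead of '+'.
percent = urlencode({'wd': 'python urllib'}, quote_via=quote)
print(percent)
```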
7. parse_qs()
Where there is serialization, there must be deserialization. If we have a string of GET request parameters, we can use the parse_qs() method to turn it back into a dictionary. Let's try it with an example:
from urllib.parse import parse_qs
query = 'name=germey&age=22'
print(parse_qs(query))
Running result:
{'name': ['germey'], 'age': ['22']}
You can see that the string is successfully converted back into a dictionary. Note that each value is a list, because a key may appear more than once in a query string.
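The values come back as lists, which is easiest to understand with a repeated key. The following sketch uses invented query strings to show repeated keys, how blank values are handled, and one simple way (an illustrative choice, not the only one) to flatten the result when every key is expected to appear once.

```python
from urllib.parse import parse_qs

# A repeated key collects all of its values in one list.
repeated = parse_qs('name=germey&tag=a&tag=b')
print(repeated)

# Blank values are dropped by default and kept with keep_blank_values=True.
default_blank = parse_qs('name=germey&age=')
kept_blank = parse_qs('name=germey&age=', keep_blank_values=True)
print(default_blank)
print(kept_blank)

# Flatten by keeping only the first value of each key.
flat = {key: values[0] for key, values in repeated.items()}
print(flat)
```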
8. parse_qsl()
There is also a parse_qsl() method, which converts the parameters into a list of tuples, as shown in the following example:
from urllib.parse import parse_qsl
query = 'name=germey&age=22'
print(parse_qsl(query))
Running result:
[('name', 'germey'), ('age', '22')]
You can see that the result is a list in which each element is a tuple: the first item of the tuple is the parameter name and the second is the parameter value.
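Since urlencode() also accepts a sequence of two-element tuples, the list returned by parse_qsl() can be edited and serialized again. The sketch below illustrates that round trip; the extra page parameter is a made-up addition.

```python
from urllib.parse import parse_qsl, urlencode

query = 'name=germey&age=22'
pairs = parse_qsl(query)

# The list of pairs converts directly into a plain dictionary.
as_dict = dict(pairs)
print(as_dict)

# Serializing the unchanged pairs gives back the original string.
round_trip = urlencode(pairs)
print(round_trip == query)

# Append a hypothetical parameter and serialize again; order is preserved.
pairs.append(('page', '2'))
extended = urlencode(pairs)
print(extended)
```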
9. quote()
The quote() method converts content into URL-encoded format. When a URL carries Chinese parameters, garbled characters can sometimes result, so we can use this method to convert the Chinese characters into URL encoding. An example is as follows:
from urllib.parse import quote
keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)
Here we declare a Chinese search keyword and URL-encode it with the quote() method. The result is as follows:
https://www.baidu.com/s?wd=%E...
In this way, we have successfully converted the text into URL encoding.
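Two details of quote() are easy to miss: by default it leaves / unescaped, and it encodes a space as %20 rather than +. The sketch below uses invented sample strings to illustrate the safe parameter and the related quote_plus() function, which is also part of urllib.parse.

```python
from urllib.parse import quote, quote_plus

# '/' is considered safe by default, so it is left as-is.
default_safe = quote('/a b/')
print(default_safe)

# safe='' escapes '/' as well.
nothing_safe = quote('/a b/', safe='')
print(nothing_safe)

# quote_plus() encodes spaces as '+' and also escapes '/'.
plus = quote_plus('a b/c')
print(plus)

# Non-ASCII text is encoded as UTF-8 bytes by default.
chinese = quote('壁纸')
print(chinese)
```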
10. unquote()
Alongside quote() there is, of course, unquote(), which decodes URL-encoded text. An example is as follows:
from urllib.parse import unquote
url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))
This is the URL-encoded result obtained above, and here we restore it with the unquote() method. The result is as follows:
https://www.baidu.com/s?wd=壁纸
You can see that decoding is easy with the unquote() method.
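One caveat when decoding: unquote() does not turn + back into a space, while unquote_plus() does. The sketch below, using invented sample strings, shows the difference along with a quote()/unquote() round trip.

```python
from urllib.parse import quote, unquote, unquote_plus

encoded = 'python+urllib%20parse'

# unquote() decodes %20 but leaves '+' untouched.
plain = unquote(encoded)
print(plain)

# unquote_plus() also converts '+' into a space.
with_plus = unquote_plus(encoded)
print(with_plus)

# Encoding and then decoding returns the original text.
text = 'a b/c?d=壁纸'
restored = unquote(quote(text))
print(restored == text)
```

As a rule of thumb, unquote_plus() pairs with query strings produced by urlencode() or quote_plus(), and unquote() pairs with quote().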
That concludes this article on how to use urllib to parse links in Python 3. Thank you for reading; I hope it has been helpful.