2025-02-21 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article walks through an example of using Python regular expressions to scrape commodity information from JD.com. Many beginners are at a loss with this kind of task, so the article summarizes the problem and its solution step by step; hopefully it helps you solve it yourself.
JD.com is China's largest self-operated e-commerce company, holding a 56.3 percent share of China's self-operated B2C e-commerce market in the first quarter of 2015. On a site this large, the amount of product information is massive. Today we will use regular expressions to build a topic crawler that collects product data based on an input keyword.
First, go to JD.com's website and enter the product you want to query. Here we use the keyword "dog food" (狗粮) as the search term, which produces the following URL: https://search.jd.com/Search?keyword=%E7%8B%97%E7%B2%AE&enc=utf-8. The parameter %E7%8B%97%E7%B2%AE is simply the URL-encoded form of "dog food". So the plan is clear: take the input keyword, encode it to build the target URL, request the page, get the response, and then use a selector to carry out the next step of precise extraction.
[Screenshot: source code of the dog-food product listings on JD.com's site]
Without further ado, here is the code, as shown in the figure below. This example uses Python 3, and Python 3 is recommended going forward. URL encoding converts each character that needs encoding into the form %xx. It is generally based on UTF-8, although some details depend on the browser platform. Python's urllib library provides the quote function, which percent-encodes a string so it can be placed in a URL and used to reach the corresponding page.
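The encoding step described above can be sketched as follows. The article's own code is only shown as a screenshot, so `build_search_url` is a hypothetical helper name, not the original function:

```python
from urllib.parse import quote  # Python 3; in Python 2 this lived at urllib.quote

def build_search_url(keyword):
    """Percent-encode the keyword as UTF-8 and build the JD search URL."""
    return "https://search.jd.com/Search?keyword={}&enc=utf-8".format(quote(keyword))

print(build_search_url("狗粮"))
# https://search.jd.com/Search?keyword=%E7%8B%97%E7%B2%AE&enc=utf-8
```

Note that `quote` encodes to UTF-8 by default, which is exactly what JD's `enc=utf-8` parameter expects.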
A regular expression (English: Regular Expression, often abbreviated regex, regexp, or RE in code) is a powerful tool for pattern matching and substitution. After building the target URL, call the urlopen function in urllib to open the page and obtain its source code, then use regular expressions to extract the target information precisely.
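The fetch-and-extract step can be sketched like this. The article's actual regex appears only in a screenshot, so the pattern below (product text inside `<em>` tags, using the non-greedy full wildcard discussed later) is illustrative, and the browser-like `User-Agent` header is an assumption; JD tends to reject bare requests:

```python
import re
from urllib.request import Request, urlopen

def fetch_html(url):
    """Open the page and return its source code as a string.
    The User-Agent header is an assumption to look like a browser."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")

def extract_items(html):
    """Hypothetical extraction: pull the text of every <em>...</em> span,
    matching across newlines with the [\s\S]+? full wildcard."""
    return re.findall(r"<em>([\s\S]+?)</em>", html)
```

For example, `extract_items("<em>dog food A</em><em>dog\nfood B</em>")` returns both entries, even though the second spans a newline.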
[Screenshot: using regular expressions to extract the target information]
The regular expressions in this program are genuinely complicated and span many lines, but the main patterns used are [\w\W]+? and [\s\S]+?.
[\s\S] and [\w\W] are full wildcards. \s matches whitespace, including spaces, line breaks, and tab indentation, while \S matches exactly the opposite; put the positive and negative together and they cover every character without exception. Likewise, \w matches word characters and \W matches non-word characters, so [\w\W] has exactly the same effect. The square brackets [] define a character class matching any single character listed inside, in any order; for example, the rule [ace]* matches any run of the letters a, c, and e.
So [\s\S] matches any character, blank or not, and this combination matches more than "." does, because "." does not match line breaks by default. That is why, when the match needs to span a newline, people habitually reach for the full-wildcard classes [\s\S] or [\w\W].
The final output looks like this:
[Screenshot: output of the crawler]
With that, you can obtain the dog-food commodity information. Of course, this is just a starting point: the example matches only four fields and collects only a single page. If you need more data, you can adjust the regular expressions and set up multi-page collection to get the effect you want. In the next article, we will use BeautifulSoup to match the target data and achieve precise extraction.
Finally, a brief recap of regular expressions: a regular expression uses a single string to describe and match a whole family of strings conforming to a syntactic rule. In many text editors, regular expressions are commonly used to find and replace text that matches a pattern.
Regular expressions can seem obscure to beginners, but you can learn them gradually. You do not have to memorize everything; you just need to know when to use them and which constructs to reach for.
After reading the above, have you mastered using Python regular expressions to process JD.com's commodity information? If you want to learn more, stay tuned; thank you for reading!