2025-04-01 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains how to use CSS selectors in Scrapy to collect target data from a web page. Many people run into questions about this in day-to-day work, so the editor consulted a range of materials and put together a simple, practical walkthrough. Hopefully it answers those questions. Let's get started.
/ CSS Foundation /
CSS selectors and XPath selectors serve the same purpose: both locate a specific element in the structure of a web page, and they differ only in syntax. XPath selectors can already extract the information we need, so why learn CSS selectors as well?
It comes down to preference, since readers with different backgrounds will reach for different tools. As the saying goes, any cat that catches mice is a good cat. Regular expressions, BeautifulSoup, XPath selectors, and CSS selectors are all good choices as long as they extract the information you need; they differ only in efficiency and difficulty. CSS selectors in particular come much more naturally to anyone with front-end experience.
CSS selectors are powerful. From a practical standpoint, the following are some of the more commonly used pieces of CSS selector syntax. They are relatively simple but very useful, and a firm grasp of them will make extracting web page information much more efficient later on.
With the above CSS foundation, let's put it into practice.
/ Practical Application /
We again use the website from the previous article as the example. The target data are the title, release date, topic tags, body text, and the numbers of likes, bookmarks, and comments.
1. For the title, we already analyzed the page with an XPath expression earlier and found a tag that locates it uniquely, so the analysis is not repeated here; see the figure below.
2. Still using the scrapy shell debugging mode, and applying the basic CSS syntax above, the CSS expression for the title is shown in the figure below.
Note that to get the text content of a tag with CSS, you append "::text" to the CSS expression. Keep in mind that there are two colons, and that this differs from XPath, which uses text(). The CSS expression looks a little more concise than the XPath one, so whenever you find the CSS form shorter or easier to read, prefer it; where the XPath form is clearer, use XPath. There is no hard rule, so choose whichever you prefer. You can also mix two or more kinds of selector in the same spider file.
3. Next comes the release date. Again inspect the page interactively to map the element on the web page to its source code. The identifier "entry-meta-hide-on-mobile" is unique across the page, so the element is easy to locate, as shown in the figure below.
4. Based on the structure of the web page, the CSS expression for the release date is easy to write. Test it in scrapy shell first, then write the selector expression into the spider file, as shown in the figure below.
5. As for the article's topic tags, the page structure shows that they sit just below the date, as shown in the figure below.
6. A small change to the CSS expression for the release date yields the topic tags, which sit inside a tags, as shown in the figure below.
After getting the full list, use the join function to concatenate its elements with commas into a new string called tags, then write the expression into the Scrapy spider file.
7. For the number of likes, the analysis is the same as before: the data can be located through the unique identifier "vote-post-up".
8. The number of likes sits under the h20 tag. Write the CSS expression according to the page structure; the debugging process is shown below.
The extracted number of likes is a string, so it needs to be converted to a number with int().
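One caveat with a bare int() call is that it raises ValueError on an empty string, which can happen when a post has no likes yet. The helper below is a hypothetical sketch (parse_count is not part of Scrapy) that tolerates padding, extra words, and missing values.

```python
import re


def parse_count(text, default=0):
    """Return the first run of digits in text as an int, or default."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else default


# In a spider, the input would be the string extracted for the like count,
# for example from a selector such as ".vote-post-up h20::text"
# (assuming "vote-post-up" is a class, which the article does not state).
likes = parse_count(" 2 ")
no_likes = parse_count("")
```

Falling back to a default keeps the spider running on posts where the counter is absent, at the cost of hiding genuinely malformed values.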
That concludes this look at how to use CSS selectors in Scrapy to collect target data from web pages. Hopefully it has cleared up your questions. Theory works best when paired with practice, so go and try it yourself!
© 2024 shulou.com SLNews company. All rights reserved.