2025-04-07 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains how to use the Scrapy crawler framework to collect the URLs of every article from a site's article-list pages. It should serve as a useful reference; interested readers are encouraged to follow along, and I hope you learn something from it.
/ concrete implementation /
1. First of all, the URL is no longer that of a single article but of the list of all articles. Put that link into start_urls, as shown in the figure below.
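As an illustration of step 1, start_urls can also be built up front when the listing is paginated. The domain and page count below are placeholders invented for this sketch, not the site from the original article, and the "page/N/" pattern is an assumption:

```python
# Build the start_urls list for a paginated article archive.
# BASE and PAGE_COUNT are placeholders, not the real site.
BASE = "http://blog.example.com/all-posts/"
PAGE_COUNT = 3  # assume 3 listing pages for the illustration

# Page 1 is the bare listing URL; later pages append "page/N/".
start_urls = [BASE] + [f"{BASE}page/{n}/" for n in range(2, PAGE_COUNT + 1)]

print(start_urls)
```

Generating the pages up front is optional; the spider below instead follows a "next page" link, which works even when the page count is unknown.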
2. Next, we need to modify the parse() function, in which we must accomplish two things.
The first is to extract the URL of every article on the current list page and parse each one to obtain the article's content; the second is to extract the URL of the next list page and hand it to Scrapy for download, after which the downloaded page is passed back to parse() again.
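The two responsibilities of parse() can be sketched without the framework. In a real spider, the tuples yielded below would instead be scrapy.Request objects (or response.follow calls) with the appropriate callback:

```python
def parse_listing(article_urls, next_page_url):
    """Framework-free sketch of Scrapy's parse():
    yield one request per article, then one for the next list page."""
    # Job 1: every article URL on this listing page goes out for download;
    # in Scrapy: yield scrapy.Request(url, callback=self.parse_article)
    for url in article_urls:
        yield ("article", url)
    # Job 2: the next listing page is fed back to this same function;
    # in Scrapy: yield scrapy.Request(next_page_url, callback=self.parse)
    if next_page_url is not None:
        yield ("next_page", next_page_url)

requests = list(parse_listing(["/post/1", "/post/2"], "/page/2/"))
print(requests)
```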
With the earlier basics of XPath and CSS selectors, extracting a link URL from a web page is relatively easy.
3. Analyze the structure of the page. Using the browser's developer tools, we can quickly see that each list page holds 20 articles, that is, 20 URLs, and that the article list sits under the tag with id="archive". From there we peel back the layers, like an onion, until we reach the URL links we want.
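The "peeling" described in step 3 — enter the div with id="archive", then each article block, then read the &lt;a&gt; tag — can be demonstrated with Python's standard-library HTML parser on a trimmed sample page. The markup below is invented to mirror the structure described, not copied from the real site:

```python
from html.parser import HTMLParser

SAMPLE = """
<div id="archive">
  <div class="post"><a href="/post/1">First</a></div>
  <div class="post"><a href="/post/2">Second</a></div>
</div>
<a href="/about">not an article</a>
"""

class ArchiveLinks(HTMLParser):
    """Collect hrefs of <a> tags that appear inside <div id="archive">."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while we are inside the archive div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth > 0 and tag == "div":
            self.depth += 1     # a div nested inside the archive
        elif tag == "div" and attrs.get("id") == "archive":
            self.depth = 1      # entered the archive div itself
        if self.depth > 0 and tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

parser = ArchiveLinks()
parser.feed(SAMPLE)
print(parser.links)  # → ['/post/1', '/post/2']
```

Note that the link outside id="archive" is correctly ignored; in Scrapy the same filtering is what the `#archive` prefix of the CSS selector achieves in one line.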
4. Click the drop-down triangle, and it is easy to see that the links to the article detail pages are not nested deeply, as circled in the figure below.
5. With the tag identified, we follow that clue, and with the right tool in hand, getting the URLs is as easy as reaching into a bag. Enter the shell command at the command prompt (the standard form is `scrapy shell <list-page-URL>`) to open the shell debugging window and get twice the result with half the effort. Note again that this URL must be the URL of the article list, not of an individual article; otherwise you can debug for a long time without getting any result.
6. Based on the structure analysis from step 3, we write the CSS expression in the shell and print its output, as shown in the figure below. The a::attr(href) part is the ingenious bit: it is a small trick for pulling an attribute value straight out of a tag, and it is well worth using whenever you extract information from web pages.
Thank you for reading this article carefully. I hope this walkthrough of how to use the Scrapy crawler framework to grab the URLs of all articles from list pages is helpful to everyone.