This article walks through the techniques used to crawl millions of records with a Python crawler. It is fairly detailed and should be a useful reference; interested readers are encouraged to read on.
1. Millions of records: choosing the target site and analyzing the pages
1. The choice of target website
This time I chose the famous Stack Overflow. Programmers have two shrines in their hearts: one is GitHub, full of good libraries and source code, and the other is Stack Overflow, where plenty of experts are ready to answer questions. We open Stack Overflow and search for questions related to Python:
2. Page analysis
It turns out there are more than 880,000 questions. Looking at how the listing pages are laid out, we can show at most 50 questions per page, which works out to 17,776 pages in total. That is a lot of data.
2. The crawler's strategy: page crawling and data storage
I had never crawled such a large amount of data before, so this time it seemed like a job for the go-to tool, Scrapy. Before reaching for it, I had actually written my own concurrent crawler with a thread pool; I measured it at roughly 500 records in about 6 seconds, which is comparable to Scrapy.
But considering stability and extensibility, Scrapy is the more convenient choice.
1. Page crawling
There are plenty of articles and resources on Scrapy, so I won't go over the basics in detail here. Scrapy's built-in functionality is very powerful; if you are serious about crawling, you have to learn to use it. Like Lego bricks, you just snap the pieces together. Let me walk through a few key points:
1) Building the page list
The URLs of the Python question listing pages on Stack Overflow follow a very regular pattern, so we can easily build a list of them, for example the first 1,000 pages with range(1000) and the next batch with range(1000, 2000), as in the sketch below:
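Here is a minimal sketch of how that page list might be built. The exact query parameters (the tag path, sort=newest, pagesize=50) are assumptions about Stack Overflow's listing URLs rather than something spelled out in the article:

def build_page_urls(start, end):
    # listing URL pattern for the python tag; the parameters are illustrative
    base = ("https://stackoverflow.com/questions/tagged/python"
            "?sort=newest&pagesize=50&page={}")
    return [base.format(page) for page in range(start, end)]

# first 1,000 pages, then the next batch
first_batch = build_page_urls(1, 1000)
second_batch = build_page_urls(1000, 2000)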
2) Crawling a single page
We use scrapy genspider to generate a spider file; this file does the actual crawling of the page content. Each question on the page has a very regular structure, so we can easily extract it with CSS selectors.
We extract seven dimensions in total: question title, question description, views, votes, answers, question time, and user name.
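As a rough illustration, the extraction inside the generated spider's parse() method could look something like the sketch below; the CSS selectors are assumptions about Stack Overflow's question-summary markup at the time and would need to be checked against the live page:

import scrapy

class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    # start_urls would be filled from the page list built above

    def parse(self, response):
        for q in response.css("div.question-summary"):
            title = q.css("a.question-hyperlink::text").get()
            desc = q.css("div.excerpt::text").get(default="").strip()
            view = q.css("div.views::attr(title)").get()
            vote = q.css("span.vote-count-post strong::text").get()
            answer = q.css("div.status strong::text").get()
            start = q.css("span.relativetime::attr(title)").get()
            user = q.css("div.user-details a::text").get()
            # the seven fields are packed into an item (see items.py below)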
2. Storage of data
We need to define a data structure class in items.py to hold these seven fields.
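A minimal items.py sketch with the seven fields (the field names simply mirror the assignment code shown below):

import scrapy

class StackoverflowItem(scrapy.Item):
    title = scrapy.Field()   # question title
    desc = scrapy.Field()    # question description
    view = scrapy.Field()    # view count
    answer = scrapy.Field()  # number of answers
    vote = scrapy.Field()    # vote count
    start = scrapy.Field()   # question time
    user = scrapy.Field()    # user name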
Then, back in the spider class above, we pack the parsed fields of each page into items one by one:
item = StackoverflowItem()
item['desc'] = desc
item['title'] = title
item['view'] = view
item['answer'] = answer
item['vote'] = vote
item['start'] = start
item['user'] = user
yield item
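The article does not show how the yielded items are persisted. One simple possibility, purely as an illustration rather than the author's original code, is a small item pipeline that appends each item to a CSV file:

import csv

class CsvWriterPipeline:
    def open_spider(self, spider):
        self.file = open("stackoverflow.csv", "a", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(
            self.file,
            fieldnames=["title", "desc", "view", "answer", "vote", "start", "user"],
        )
        if self.file.tell() == 0:
            self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        self.file.close()

The pipeline would then be enabled under ITEM_PIPELINES in settings.py. Alternatively, Scrapy's built-in feed exports (the -o option of scrapy crawl) can dump items to CSV or JSON without any pipeline code.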
3. Large-scale crawling
Everything seemed to be going well, but after we started running the crawler and collected about 12,000 records, our IP was blocked and lots of 429 status codes started showing up. Clearly there is an anti-crawling mechanism, and by this point our local IP had been banned. To keep crawling, there are two options:
The first option: use proxy IPs
1) Build a random proxy pool
There are many free proxy IP lists on the web. We can parse those pages locally to build a proxy pool, optionally storing it in a database, or we can pay for a stable proxy service.
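A minimal sketch of such a random proxy pool, assuming the proxies have already been collected into a local text file with one http://ip:port entry per line (whether they come from a parsed free-proxy page, a database, or a paid service is up to you):

import random

def load_proxy_pool(path="proxies.txt"):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def pick_random_proxy(pool):
    return random.choice(pool)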
2) Build a downloader middleware
The power of Scrapy is that it exposes a lot of hooks and is very extensible; nearly every part of the crawling pipeline has a built-in counterpart you can override. Like Django, it is one of those indispensable tools. A random-proxy downloader middleware only takes a few lines of code, as in the sketch below.
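As a hedged sketch (not necessarily the author's exact code), the RandomProxyMiddleware referenced in the settings below could look roughly like this in middlewares.py. It sets a random proxy on request.meta["proxy"], which Scrapy's built-in HttpProxyMiddleware honours; PROXY_LIST is an assumed custom setting holding the pool:

import random

class RandomProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed custom setting, e.g. a list of http://ip:port strings
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)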
Don't forget to enable it in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'stackoverflow.middlewares.RandomProxyMiddleware': 543,
}
But proxy IPs, especially free ones, are very unstable. Crawling through a proxy is also slower than crawling from the local machine, since an extra hop is added in the middle.
The second option: restart the modem and keep using the local IP
Restarting the home router or modem usually assigns a new local IP address. Since the local IP is the fastest to crawl from, we keep using it and simply slow the crawl down by adding some delay, for example with the settings below.
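A few illustrative settings.py knobs for throttling the crawl (the concrete values are assumptions, not taken from the article):

DOWNLOAD_DELAY = 2               # wait a couple of seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay a little
CONCURRENT_REQUESTS = 8          # fewer parallel requests
AUTOTHROTTLE_ENABLED = True      # let Scrapy back off automatically
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]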
Using the second approach, I crawled 999 pages, roughly 49,950 records, at the slower pace in about 20 minutes.
In total there are nearly 900,000 records to crawl; at this speed the whole job takes roughly 7 hours. Scrapy also provides solid exception handling and logging, so even if the crawl fails partway we still keep the data collected so far. And if you have the option, running the crawler on a server will be faster still.
4. A first look at the data
What does the data look like? Let's pull 5 records at random. The data is still fairly rough and has not been cleaned; the more valuable fields are the views and votes, along with the time and the title.
So far we only have close to 100,000 records, but let's take a quick look. The hottest question among them has been viewed by 998 people:
That's all for this article on Python crawler techniques for crawling millions of records. Thanks for reading, and I hope you found it helpful!