
What are the Python crawler techniques for million-scale data


This article introduces Python crawler techniques for million-scale data. It goes into a fair amount of detail and should be a useful reference; interested readers are encouraged to read on.

1. Million-scale data: choosing the target website and analyzing the pages

1. The choice of target website

This time I chose the famous Stackoverflow. Programmers hold two sites sacred: GitHub, home to many excellent libraries and source code, and Stackoverflow, where plenty of experts help answer questions. We open Stackoverflow and search for questions related to Python:

2. Page analysis

There are more than 880,000 such questions. Looking at the paging rule of the list, the site displays at most 50 questions per page, for a total of 17,776 pages. That is a lot of data.

2. The crawler strategy: page crawling and data storage

I had never crawled such a large amount of data before, so this time it seemed I had to bring out the heavy artillery: Scrapy. In fact, before turning to it, I had written a concurrent crawler with a thread pool; I measured its speed at roughly 500 records in about 6 seconds, which is comparable to Scrapy (a sketch of that approach follows).
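
As a rough illustration, here is a minimal sketch of that kind of thread-pool crawler. It is not the original code; the fetch function and the URL template are assumptions made for the sake of the example.

# A minimal thread-pool crawling sketch (assumed example, not the original code).
from concurrent.futures import ThreadPoolExecutor
import requests

BASE = "https://stackoverflow.com/questions/tagged/python?tab=newest&page={}&pagesize=50"

def fetch_page(page):
    # download one list page and report its size; parsing is omitted here
    resp = requests.get(BASE.format(page), timeout=10)
    return page, resp.status_code, len(resp.text)

with ThreadPoolExecutor(max_workers=20) as pool:
    for page, status, size in pool.map(fetch_page, range(1, 11)):
        print(page, status, size)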

But considering stability and extensibility, Scrapy is the more convenient choice.

1. Page crawling

There are plenty of articles and tutorials about Scrapy, so I won't explain the basics in detail here. Scrapy's built-in functionality is very powerful; if you are serious about crawling, you have to learn it. Like Lego bricks, you simply assemble the pieces you need. Let's go over a few key points:

1). Construction of the page list

The URLs of the Python question list pages on Stackoverflow follow a regular pattern, so we can easily build the page list, for example the first 1000 pages with range(1, 1000) and the next batch with range(1000, 2000):
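
For example, a minimal sketch of building such a list; the exact URL template is an assumption based on Stackoverflow's tag pages, not taken from the original article:

# Building the list of page URLs (the URL template is an assumed example).
BASE = "https://stackoverflow.com/questions/tagged/python?tab=newest&page={}&pagesize=50"

start_urls = [BASE.format(page) for page in range(1, 1000)]      # first batch of pages
start_urls += [BASE.format(page) for page in range(1000, 2000)]  # second batch of pages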

2). Crawling a single page

We use scrapy genspider to generate a spider file, which does the actual page crawling. The content of each question is laid out very regularly, so we can easily extract it with CSS selectors:

We mainly extract seven fields: question title, question description, views, votes, answers, question time and user name.
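
Below is a minimal sketch of what such a spider can look like. The CSS selectors are assumptions based on Stackoverflow's old list-page markup and will likely need adjusting; only the overall shape (a genspider-style spider extracting the seven fields with CSS) follows the article.

# Spider sketch: selectors are assumptions and may need adjusting to the live markup.
import scrapy

class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_urls = [
        "https://stackoverflow.com/questions/tagged/python?tab=newest&page={}&pagesize=50".format(p)
        for p in range(1, 1000)   # the page list built above
    ]

    def parse(self, response):
        for q in response.css("div.question-summary"):
            title = q.css("a.question-hyperlink::text").get()
            desc = q.css("div.excerpt::text").get(default="").strip()
            view = q.css("div.views::attr(title)").get()
            answer = q.css("div.status strong::text").get()
            vote = q.css("span.vote-count-post strong::text").get()
            start = q.css("span.relativetime::attr(title)").get()
            user = q.css("div.user-details a::text").get()
            # the seven fields are packed into a StackoverflowItem and yielded, as shown below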

2. Storage of data

We need to define a data structure class in items.py to hold these seven fields.
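
A minimal sketch of that class, with the field names taken from the item-assignment code below:

# items.py: item class holding the seven extracted fields
import scrapy

class StackoverflowItem(scrapy.Item):
    title = scrapy.Field()
    desc = scrapy.Field()
    view = scrapy.Field()
    answer = scrapy.Field()
    vote = scrapy.Field()
    start = scrapy.Field()
    user = scrapy.Field()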

Then, in the spider class above, we pack the parsed fields of each question into an item and yield it:

item = StackoverflowItem()
item['desc'] = desc
item['title'] = title
item['view'] = view
item['answer'] = answer
item['vote'] = vote
item['start'] = start
item['user'] = user
yield item

3. Large-scale crawling

Everything seemed to be going well, so we started running the crawler. After about 12,000 records, the IP got blocked and a lot of HTTP 429 errors appeared; clearly the site has an anti-crawling strategy, and our local IP had been banned. If we want to keep crawling at this point, there are two ways:

The first method: use proxy IPs

1). Build a random proxy pool

There are many free proxy IP lists on the Internet. We can scrape and parse those pages locally to build a proxy pool, which can be stored in a database, or we can use a paid, more stable proxy service.
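
As an illustration, here is a minimal sketch of turning a list of candidate proxies into a working pool; the candidate addresses and the test URL are placeholders, not from the original article.

# Filtering candidate proxies into a usable pool (candidates and test URL are placeholders).
import requests

CANDIDATES = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]  # would come from parsed free-proxy pages
TEST_URL = "https://httpbin.org/ip"

def build_proxy_pool(candidates, timeout=5):
    pool = []
    for proxy in candidates:
        try:
            requests.get(TEST_URL, proxies={"http": proxy, "https": proxy}, timeout=timeout)
            pool.append(proxy)   # the proxy responded, keep it
        except requests.RequestException:
            pass                 # dead or too slow, drop it
    return pool

PROXIES = build_proxy_pool(CANDIDATES)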

2). Build a download middleware

The power of Scrapy is that it exposes many interfaces and is very extensible; it has built-in support for almost every aspect of crawling. We only need a few lines of code to add a random-proxy download middleware.
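
A minimal sketch of such a middleware is shown below. The class name matches the setting that follows; the implementation itself is an assumed example, not the article's original code.

# middlewares.py: random-proxy download middleware (implementation assumed)
import random

PROXIES = [
    "http://1.2.3.4:8080",   # entries would come from the proxy pool built above
    "http://5.6.7.8:3128",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # attach a randomly chosen proxy to every outgoing request
        request.meta["proxy"] = random.choice(PROXIES)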

Don't forget to enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'stackoverflow.middlewares.RandomProxyMiddleware': 543,
}

But proxy IPs are very unstable, especially free ones. Crawling through a proxy is also slower than from the local machine; after all, an extra hop is added in the middle.

The second method: restart the modem and keep using the local IP

Restarting the home router or modem usually gets you a new local IP address. Since crawling from the local IP is the fastest, we keep using it, but slow the crawler down and add some delay.
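
For example, a few throttling settings in settings.py; the specific values are illustrative assumptions:

# settings.py: slow the crawl down (values are illustrative)
DOWNLOAD_DELAY = 1.0              # wait about a second between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay to look less mechanical
CONCURRENT_REQUESTS = 8           # fewer parallel requests
AUTOTHROTTLE_ENABLED = True       # let Scrapy adapt the delay to server responses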

I used the second method: with the slowed-down crawler, 999 pages (about 49,950 records) took roughly 20 minutes.

In total, there are nearly 900,000 records to crawl; at this speed, it would take about 7 hours to finish. Scrapy provides solid exception handling and logging, so even if the crawl fails partway, we still keep the data crawled so far. Of course, if you have the resources, you can run the crawler on a server, which will be faster.
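
One standard way to make such a long crawl recoverable is Scrapy's built-in job persistence and log file; a minimal settings.py sketch, with the paths being assumptions:

# settings.py: persist crawl state and logs so an interrupted run can be resumed and inspected
JOBDIR = "crawls/stackoverflow-run1"   # request queue and dedup state are saved here
LOG_FILE = "stackoverflow.log"         # keep the full crawl log for later analysis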

4. A first look at the data

What does the data look like? Let's pick 5 records at random. The data is still rough and has not been cleaned; the most valuable fields are the views and votes, along with the time and title.

At the moment there are only about 100,000 records, so let's take a quick look. The most popular question so far has been viewed by 998 people:
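
A quick way to do this kind of spot check, assuming the crawl was exported to a CSV file (the filename and the export format are assumptions):

# Spot-checking the crawled data with pandas (CSV export and filename assumed)
import pandas as pd

df = pd.read_csv("stackoverflow_python.csv")
print(df.sample(5))                                    # 5 random, uncleaned records
print(df.sort_values("view", ascending=False).head())  # most-viewed questions first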

That's all for the article "What are the Python crawler techniques for million-scale data". Thank you for reading! I hope the content is helpful; for more related knowledge, feel free to follow the Development channel.
