
How to use scrapy to crawl and optimize 80 million user data

2025-01-16 Update From: SLTechnology News & Howtos


Shulou (Shulou.com) 06/01 Report --

Many people who are new to scrapy are unsure how to approach crawling and optimizing 80 million user records, so this article walks through the problem, its causes, and the solutions. I hope it helps you solve the same problem.

Recently I have been meaning to brush up on data analysis, and since I have been listening to live broadcasts on Ximalaya (Himalaya FM), I had the idea of crawling all the anchors and reward (tipping) information on the platform, to find the more popular anchors and the big spenders and to see how these wealthy users throw their money around.

Analyzing the information to crawl

Open the Ximalaya anchors page and look at the popular anchors.

The first one is "Himalaya Good Voice", an official account that many Ximalaya accounts probably follow by default. Its follower count is over 80 million, and the actual number of Ximalaya users must be even higher, so for now we estimate roughly 100 million crawlable users. The anchor listing only shows 550 pages with 20 users per page (about 11,000 anchors), so my idea is to crawl the listed anchor information and then enter each anchor's home page.

Crawl the relevant information there, and then look at the fan information.

The fan list shows only 10 pages with 10 users per page. That does not look like much, but we can expand it: each fan is itself a user with a home page, so we can crawl that user's fans as well. Keep extending like this, and use deduplication to filter out users that have already been crawled.

The data we want to crawl: user name, profile, number of fans, number of followers, number of voices (tracks), and number of albums.
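To make the crawl flow concrete, here is a minimal sketch of a scrapy spider that yields the fields above and keeps following fan pages. The URL pattern, the CSS selectors, and the `xmla` spider name are placeholders for illustration, not Ximalaya's real pages or the article's final code.

```python
import scrapy


class XimalayaUserSpider(scrapy.Spider):
    """Sketch of the anchor -> fans -> fans-of-fans expansion.

    NOTE: the listing URL and selectors below are placeholders,
    not Ximalaya's real endpoints.
    """

    name = "xmla"
    start_urls = ["https://www.example.com/anchors?page=1"]  # hypothetical listing URL

    def parse(self, response):
        # Anchor listing page: follow each anchor's home page.
        for href in response.css("a.anchor::attr(href)").getall():
            yield response.follow(href, callback=self.parse_user)

    def parse_user(self, response):
        # Collect the fields mentioned in the article.
        yield {
            "nickname": response.css(".nickname::text").get(),
            "intro": response.css(".intro::text").get(),
            "fans": response.css(".fans-count::text").get(),
            "following": response.css(".following-count::text").get(),
            "voices": response.css(".voice-count::text").get(),
            "albums": response.css(".album-count::text").get(),
        }
        # Expand through the fan list (10 pages x 10 users in the UI);
        # every fan is itself a user page we can parse the same way.
        # Scrapy's dupefilter drops pages we have already requested.
        for href in response.css("a.fan::attr(href)").getall():
            yield response.follow(href, callback=self.parse_user)
```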

In addition, the reward (tipping) information has to be captured through the app, so let's grab the user information first.

Crawling technology selection

For a crawl of this size an excellent framework is essential, so we use the well-known scrapy framework as the base. Distributed crawling is also essential. I don't have that many machines, but I have thought it through: new users of cloud servers from Baidu Cloud, Aliyun, Tencent Cloud, Huawei Cloud and so on get free trials of several days. Doesn't that give you a cluster? Hey.

For the database we use MongoDB, because our data does not require strict schemas or consistency. Redis is a must as well, but since it is an in-memory database, memory size is something we have to watch: all of our deduplication filtering lives in redis, so that is exactly what we have to optimize. For the specific reasons, see below.
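For reference, here is a minimal sketch of the scrapy-redis settings plus a simple MongoDB pipeline. The scrapy-redis setting names are the library's own; the `xmla` project name, hosts, and collection names are assumptions, so adjust them for your own cluster.

```python
# settings.py -- minimal scrapy-redis configuration (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # share the request queue via redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # request fingerprints stored in redis
SCHEDULER_PERSIST = True                                     # keep queue/fingerprints between runs
REDIS_URL = "redis://127.0.0.1:6379"                         # point every node at the same redis
ITEM_PIPELINES = {"xmla.pipelines.MongoPipeline": 300}

# pipelines.py -- store items in MongoDB (sketch)
import pymongo


class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
        self.db = self.client["xmla"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db["users"].insert_one(dict(item))
        return item
```

With this configuration every crawler node pulls requests from, and reports fingerprints to, the same redis instance, while the crawled items end up in MongoDB.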

Why must redis storage be optimized?

I first grabbed some data on my own machine and checked the request queue and the deduplication set in redis.

From the amount of data piled up in the request queue you can tell that downloading is the bottleneck, which is exactly why we want distributed crawling. Then look at the deduplication data: about 750,000 fingerprints. That is not much data, but look at the memory footprint.

After running the flushall command to clear redis, check the memory usage again.
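If you want to reproduce this kind of check, a minimal sketch with redis-py looks like this, assuming a local redis instance, the `xmla` spider name, and scrapy-redis's default priority queue (a sorted set); if you configured the FIFO/LIFO list queue, use `llen` instead of `zcard`.

```python
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# How many requests are still queued, and how many fingerprints are stored.
print("pending requests:", r.zcard("xmla:requests"))     # default priority queue is a sorted set
print("seen fingerprints:", r.scard("xmla:dupefilter"))  # dedup fingerprints live in a plain set

# Overall memory footprint of the redis instance.
print("used memory:", r.info("memory")["used_memory_human"])

# Dangerous: wipes everything in redis, as done above.
# r.flushall()
```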

260 MB of memory is bearable on my Mac with 8 GB of RAM, but on my poor cloud server with only 1 GB of memory it is so stuck that I can barely even connect. And we are aiming at 80 million users: so far we have only crawled a bit over 200,000 valid records, with more than 700,000 deduplication fingerprints. If we scale to 100 million records, the server clearly will not hold up as things stand.

There is also an xmla:items structure that stores the crawled data, which I extract into MongoDB. xmla:requests holds the queue of requests still to be crawled; as we download, it gradually shrinks, or at least does not grow without bound. But xmla:dupefilter holds the deduplication fingerprints, and every request ever made is recorded there, so it only grows as the crawl goes on. That is the focus of our optimization.
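Extracting xmla:items into MongoDB can be done with a small helper like the sketch below. It assumes a local redis and MongoDB, the `xmla` key prefix, and JSON-serialized items (the default serialization of scrapy-redis's RedisPipeline); the database and collection names are placeholders.

```python
import json

import pymongo
import redis

r = redis.Redis(host="127.0.0.1", port=6379)
mongo = pymongo.MongoClient("mongodb://127.0.0.1:27017")
collection = mongo["xmla"]["users"]

# Drain the items list: scrapy-redis pushes JSON-encoded items onto "<spider>:items".
while True:
    raw = r.lpop("xmla:items")
    if raw is None:
        break  # nothing left to move
    collection.insert_one(json.loads(raw))
```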

Let's plan what to do next and follow the steps:

Installation and deployment of docker environment

Redis cluster configuration operation

Analysis of user data crawling process

Analysis of the process of grabbing user reward information

Use a BloomFilter to modify scrapy-redis and reduce the dupefilter memory footprint (a minimal sketch follows this list)

Anti-crawling processing: IP proxy pool, User-Agent pool

Deploy a distributed environment using Gerapy and docker

Cleaning, analysis, and planning of the grabbed data
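As a taste of the BloomFilter step above, here is a minimal sketch of a redis-backed Bloom filter that could stand in for the exact fingerprint set. The bitmap size, hash count, and key name are illustrative assumptions, not tuned values, and wiring it into scrapy-redis would still require a custom dupefilter subclass, which belongs to the later step.

```python
import hashlib

import redis


class RedisBloomFilter:
    """Minimal Bloom filter on a redis bitmap (sketch).

    Instead of storing every request fingerprint in a set, we flip a
    handful of bits in one fixed-size bitmap, trading a small
    false-positive rate for a much smaller memory footprint.
    """

    def __init__(self, server, key="xmla:bloomfilter", bit_size=1 << 30, hash_count=6):
        self.server = server          # redis client
        self.key = key                # single bitmap key
        self.bit_size = bit_size      # 2**30 bits is at most 128 MB
        self.hash_count = hash_count  # number of independent bit positions per value

    def _offsets(self, value: str):
        # Derive several bit positions from one value via salted hashes.
        for seed in range(self.hash_count):
            digest = hashlib.md5(f"{seed}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.bit_size

    def exists(self, value: str) -> bool:
        # All bits set -> probably seen before (may be a false positive).
        return all(self.server.getbit(self.key, off) for off in self._offsets(value))

    def insert(self, value: str) -> None:
        for off in self._offsets(value):
            self.server.setbit(self.key, off, 1)


if __name__ == "__main__":
    bf = RedisBloomFilter(redis.Redis())
    fp = "example-request-fingerprint"
    if not bf.exists(fp):
        bf.insert(fp)
```

The trade-off is that a Bloom filter can very occasionally report a page as already crawled when it is not, which is acceptable here: losing a handful of users out of 80 million matters far less than the memory saved.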

After reading the above, do you have a better idea of how to crawl and optimize 80 million user records with scrapy? If you want to learn more skills or dig deeper, you are welcome to follow the industry information channel. Thank you for reading!
