How to use Python to crawl millions of GitHub user data
This article explains in detail how to use Python to crawl millions of GitHub users' data. The editor finds it very practical and shares it here as a reference; I hope you get something out of it after reading.
Flow chart:
The code is implemented according to this flow.
Recursive implementation
Run commands
Seeing such a simple flow, my first idea was to simply write a recursive implementation and then optimize slowly if performance turned out to be poor, so the first version of the code was finished quickly (in the recursion directory). MongoDB is used for data storage, Redis is used to de-duplicate requests, and Celery is used to write data to MongoDB asynchronously, which requires the RabbitMQ service to be running. After settings.py is configured correctly, start it with the following steps (a rough sketch of the Redis de-duplication and Celery write path follows the steps):
Enter the github_spider directory
Execute the command celery -A github_spider.worker worker --loglevel=info to start the asynchronous worker
Execute the command python github_spider/recursion/main.py to start the crawler
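As a rough sketch only (not the project's actual code; the module, Redis key, and Mongo collection names are assumptions for illustration), the Redis de-duplication and the Celery task that writes to MongoDB could look like this:

```python
# Illustrative sketch: Redis-based request de-duplication plus a Celery
# task that writes user data into MongoDB asynchronously.
import redis
from celery import Celery
from pymongo import MongoClient

app = Celery("github_spider", broker="amqp://guest@localhost//")
rds = redis.StrictRedis(host="localhost", port=6379, db=0)
mongo = MongoClient("mongodb://localhost:27017")["github"]

def seen_before(url):
    """Return True if the URL was already requested; otherwise mark it as seen."""
    # SADD returns 0 when the member already exists in the set.
    return rds.sadd("visited_urls", url) == 0

@app.task
def save_user(user_doc):
    """Asynchronously upsert one GitHub user document into MongoDB."""
    mongo.users.update_one({"login": user_doc["login"]},
                           {"$set": user_doc}, upsert=True)
```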
Running result
Because each request has high latency, the crawler runs slowly; after several thousand requests it had only collected part of the data. Here are the Python projects sorted in descending order by number of views:
And here is the list of users sorted in descending order by number of followers.
Defects
As an aspiring programmer, I could not be satisfied with such a small achievement, so here is a summary of the flaws of the recursive implementation:
Because the traversal is depth-first, when the whole user graph is very large, single-machine recursion may overflow memory and crash the program, so it can only run on one machine for a short time.
The latency of a single request is too high, so data downloads are too slow.
Requests that fail are not retried after a while, so data may be lost.
Asynchronous optimization
There are only a few ways to solve this time-consuming problem: more concurrency, asynchronous requests, or both. For problem 2 above, my initial solution was to request the API asynchronously. Since this had been kept in mind when the code was first written, the calling method was already structured for it and was quickly changed to use grequests. This library is by the same author as requests, and the code is very simple: it wraps requests with gevent so that data can be requested without blocking.
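As a minimal illustration (the URLs below are only examples, not the project's actual request list), issuing a batch of requests concurrently with grequests looks roughly like this:

```python
# Illustrative sketch: send several GitHub API requests concurrently
# with grequests (gevent-backed requests).
import grequests

urls = [
    "https://api.github.com/users/octocat",
    "https://api.github.com/users/octocat/followers",
]
pending = (grequests.get(u, timeout=10) for u in urls)
# size limits how many requests run concurrently.
for resp in grequests.map(pending, size=20):
    if resp is not None and resp.status_code == 200:
        print(resp.json())
```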
But when I ran it, the program finished suspiciously quickly, and a check revealed that my public IP had been blocked by GitHub. At that moment countless expletives galloped through my mind, and there was nothing for it but to bring out the crawler's ultimate weapon: proxies. I also wrote an auxiliary script (proxy/extract.py) that crawls free HTTPS proxies from the Internet and stores them in Redis; each request goes through a proxy, and on error the request is retried with a different proxy while the bad proxy is removed. Unfortunately there are few free HTTPS proxies to begin with and many of them do not work, so with all the errors and retries the access speed ended up not faster but much slower than before.
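A rough sketch of that per-request proxy logic, assuming the proxies are kept in a Redis set (the set name https_proxies is an assumption for illustration, not the project's actual key):

```python
# Illustrative sketch: request through a proxy taken from Redis, retry
# with a new proxy on failure, and discard proxies that do not work.
import redis
import requests

rds = redis.StrictRedis(host="localhost", port=6379, db=0)

def fetch_with_proxy(url, max_retries=5):
    """Try the URL through random proxies, dropping any proxy that fails."""
    for _ in range(max_retries):
        proxy = rds.srandmember("https_proxies")
        if proxy is None:
            break  # the proxy pool is empty
        proxy = proxy.decode()
        try:
            return requests.get(url, timeout=10,
                                proxies={"https": proxy, "http": proxy})
        except requests.RequestException:
            rds.srem("https_proxies", proxy)  # remove the bad proxy
    return None
```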
Queue implementation
Implementation principle
With breadth-first traversal, the URLs to be visited are stored in a queue, and applying the producer-consumer model then makes multiple concurrent workers easy, which solves problem 2 above. If a request fails, putting it back into the queue after a while completely solves problem 3. Not only that, this approach also supports resuming after an interruption. The program flow chart is as follows:
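The flow chart image is not reproduced here. Conceptually, the producer-consumer loop with requeue-on-failure works like the sketch below (this uses only the standard library for illustration; the real project uses RabbitMQ, and a retry limit or backoff would be added in practice):

```python
# Illustrative sketch: breadth-first crawl with a shared queue, several
# consumer threads, and requeueing of failed requests.
import queue
import threading
import requests

url_queue = queue.Queue()
visited = set()          # in the real project Redis handles de-duplication
lock = threading.Lock()

def followers_of(user_url):
    """Fetch one user's followers and return their API URLs."""
    resp = requests.get(user_url + "/followers", timeout=10)
    resp.raise_for_status()
    return [u["url"] for u in resp.json()]

def worker():
    while True:
        url = url_queue.get()
        try:
            for new_url in followers_of(url):
                with lock:
                    fresh = new_url not in visited
                    visited.add(new_url)
                if fresh:
                    url_queue.put(new_url)      # producer side: new work
        except requests.RequestException:
            url_queue.put(url)                  # failed request: requeue it
        finally:
            url_queue.task_done()

for _ in range(10):                             # ten concurrent consumers
    threading.Thread(target=worker, daemon=True).start()

url_queue.put("https://api.github.com/users/octocat")
url_queue.join()
```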
Run the program
To allow deployment across multiple machines (although I only have one machine), RabbitMQ is used as the message queue. You need to create an exchange named github with type direct, and then create four queues named user, repo, follower, and following. The detailed binding relationship is shown in the following figure:
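The binding diagram is not reproduced here. As a sketch of that setup (assuming each queue's routing key equals its name, which the article does not spell out), declaring the exchange, queues, and bindings with pika could look like this:

```python
# Illustrative sketch: declare the "github" direct exchange and the four
# queues, binding each queue with a routing key equal to its name.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="github", exchange_type="direct", durable=True)
for name in ("user", "repo", "follower", "following"):
    channel.queue_declare(queue=name, durable=True)
    channel.queue_bind(queue=name, exchange="github", routing_key=name)

connection.close()
```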
The detailed startup steps are as follows:
Enter the github_spider directory
Execute the command celery -A github_spider.worker worker --loglevel=info to start the asynchronous worker
Execute the command python github_spider/proxy/extract.py to update the proxy pool
Execute the command python github_spider/queue/main.py to start the crawler
Queue status diagram:
This is the end of this article on "how to use Python to crawl millions of GitHub user data". I hope the above content is of some help and lets you learn something new. If you think the article is good, please share it so more people can see it.