
How to crawl GitHub user data in Python


How do you crawl GitHub user data in Python? This article walks through the problem in detail, with analysis and a working solution, in the hope of helping readers who want to solve it find a simpler approach.

Preface

The main goal is to crawl the follower data of a specified GitHub user and do a simple visual analysis of the crawled data. Let's get started.

Development tools

Python version: 3.6.4

Related modules:

Bs4 module

Requests module

Argparse module

Pyecharts module

As well as some modules that come with Python.

Environment building

Install Python, add it to your PATH environment variable, and use pip to install the required third-party modules.
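
For example, with a standard Python 3 setup, the third-party modules can be installed in one command (argparse already ships with the standard library, so it normally needs no separate install; lxml is included because the crawler below passes 'lxml' to BeautifulSoup as the parser):

pip install beautifulsoup4 requests pyecharts lxml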

Data crawling

It feels like I haven't used beautifulsoup for a long time, so let's use it to parse web pages and get the data we want. Take my own account as an example:

Let's first grab the usernames of all followers. On the followers page, each username sits inside a span tag with a link-gray class, similar to the elements matched in the code below:

You can easily extract them with beautifulsoup:

import time
import random
import requests
from bs4 import BeautifulSoup

'''get followers' usernames'''
def getfollowernames(self):
    print('[INFO]: getting all follower usernames for %s...' % self.target_username)
    page = 0
    follower_names = []
    headers = self.headers.copy()
    while True:
        page += 1
        followers_url = f'https://github.com/{self.target_username}?page={page}&tab=followers'
        try:
            response = requests.get(followers_url, headers=headers, timeout=15)
            html = response.text
            # GitHub shows a "You've reached the end" message on the last followers page
            if 've reached the end' in html:
                break
            soup = BeautifulSoup(html, 'lxml')
            for name in soup.find_all('span', class_='link-gray pl-1'):
                print(name)
                follower_names.append(name.text)
            for name in soup.find_all('span', class_='link-gray'):
                print(name)
                if name.text not in follower_names:
                    follower_names.append(name.text)
        except:
            pass
        # wait a random 0-3 seconds between requests to avoid hammering the site
        time.sleep(random.random() + random.randrange(0, 2))
        headers.update({'Referer': followers_url})
    print('[INFO]: successfully obtained %s follower usernames for %s...' % (len(follower_names), self.target_username))
    return follower_names
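
For context, getfollowernames is written as a method of a crawler class, but the original class definition is not shown in the article. Below is a minimal sketch of the scaffolding the method assumes; the class name, the User-Agent string, and the str2int helper are my own guesses based on how the methods use them, not the author's original code:

class GithubUserCrawler():
    def __init__(self, target_username):
        # the account whose followers we want to crawl
        self.target_username = target_username
        # browser-like headers so GitHub serves the normal HTML pages
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def str2int(self, text):
        '''convert counter strings such as "1,234" or "1.2k" into integers'''
        text = text.strip().replace(',', '')
        if text.lower().endswith('k'):
            return int(float(text[:-1]) * 1000)
        return int(text or 0)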

Then, using these usernames, we can visit each user's home page and grab their detailed data. Each home page link is constructed as follows:

https://github.com/ + username, for example: https://github.com/CharlesPikachu

The data we want to capture for each user includes: display name, location, number of repositories, number of stars, number of followers, number of following, and contributions over the last year.

Similarly, we use beautifulsoup to extract this information:

for idx, name in enumerate(follower_names):
    print('[INFO]: crawling details of user %s...' % name)
    user_url = f'https://github.com/{name}'
    try:
        response = requests.get(user_url, headers=self.headers, timeout=15)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        # --get the user's display name
        username = soup.find_all('span', class_='p-name vcard-fullname d-block overflow-hidden')
        if username:
            username = [name, username[0].text]
        else:
            username = [name, '']
        # --location
        position = soup.find_all('span', class_='p-label')
        if position:
            position = position[0].text
        else:
            position = ''
        # --number of repositories, stars, followers, following
        overview = soup.find_all('span', class_='Counter')
        num_repos = self.str2int(overview[0].text)
        num_stars = self.str2int(overview[2].text)
        num_followers = self.str2int(overview[3].text)
        num_followings = self.str2int(overview[4].text)
        # --contributions (last year)
        num_contributions = soup.find_all('h3', class_='f4 text-normal mb-2')
        num_contributions = self.str2int(num_contributions[0].text.replace('\n', '').replace(' ', '')
                                         .replace('contributioninthelastyear', '').replace('contributionsinthelastyear', ''))
        # --save data
        info = [username, position, num_repos, num_stars, num_followers, num_followings, num_contributions]
        print(info)
        follower_infos[str(idx)] = info
    except:
        pass
    time.sleep(random.random() + random.randrange(0, 2))

Data visualization

Here, let's take my own follower data as an example, about 1200 followers in total.

First, let's look at the distribution of the number of contributions they made over the past year:
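
The original charts are not reproduced here, but as a rough illustration, a distribution like this could be drawn with pyecharts along the following lines. This is only a sketch: the function name draw_contribution_distribution, the bucket boundaries, and the output file name are illustrative choices of mine rather than the author's code; it only assumes the follower_infos dict built by the crawler above, where index 6 of each record is the contribution count.

from pyecharts import options as opts
from pyecharts.charts import Bar

def draw_contribution_distribution(follower_infos, save_path='contributions.html'):
    # follower_infos: {idx: [username, position, num_repos, num_stars,
    #                        num_followers, num_followings, num_contributions]}
    buckets = ['0', '1-10', '11-100', '101-1000', '1000+']
    counts = [0] * len(buckets)
    for info in follower_infos.values():
        num = info[6]
        if num == 0:
            counts[0] += 1
        elif num <= 10:
            counts[1] += 1
        elif num <= 100:
            counts[2] += 1
        elif num <= 1000:
            counts[3] += 1
        else:
            counts[4] += 1
    # draw a simple bar chart and render it to an HTML file
    bar = (
        Bar()
        .add_xaxis(buckets)
        .add_yaxis('followers', counts)
        .set_global_opts(title_opts=opts.TitleOpts(title='Contributions in the last year'))
    )
    bar.render(save_path)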

The person with the most contributions is fengjixuchui, with 9437 contributions in the past year. That averages more than 20 a day, which is impressively diligent.

Next, let's look at the distribution of the number of repositories each person owns:

I expected a monotonically decreasing curve, but it seems I underestimated everyone.

Next, let's look at the distribution of how many stars each person has given to others:

Not bad; at least not everyone is just freeloading without ever starring anything.

Special praise goes to the user named lifa123, who has given out 18700 stars to others. Impressive.

Finally, let's look at the distribution of follower counts among these 1000+ users:

This is the answer to the question of how to crawl GitHub user data in Python. I hope the above content is of some help to you; if you still have questions, you can follow the industry information channel for more related knowledge.
