How do you crawl Github user data with Python? This article walks through the problem and a working solution in detail, in the hope of giving readers who want to do the same a simple and straightforward approach.
Preface
The main goal is to crawl the follower data of a specified Github user and do a simple visual analysis of the crawled data. Let's get started happily.
Development tools
Python version: 3.6.4
Related modules:
bs4 module
requests module
argparse module
pyecharts module
And some modules that come with Python.
Environment building
Install Python, add it to your PATH environment variable, and use pip to install the required modules.
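If the third-party modules are not installed yet, one pip command should cover them; lxml is included here because the scraping code below parses pages with BeautifulSoup(html, 'lxml'), and argparse already ships with Python 3:

pip install beautifulsoup4 requests lxml pyecharts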
Data crawling
It feels like I haven't used beautifulsoup for a long time, so let's use it to parse web pages and get the data we want. Take my own account as an example:
Let's first grab the usernames of all the followers. On the followers page they sit in <span> tags with classes such as link-gray pl-1 and link-gray:
You can easily extract them with beautifulsoup:
'''get the usernames of all followers'''
def getfollowernames(self):
    print('[INFO]: getting all follower usernames of %s...' % self.target_username)
    page = 0
    follower_names = []
    headers = self.headers.copy()
    while True:
        page += 1
        followers_url = f'https://github.com/{self.target_username}?page={page}&tab=followers'
        try:
            response = requests.get(followers_url, headers=headers, timeout=15)
            html = response.text
            # Github shows a "...ve reached the end" message on the last page of the followers list
            if "ve reached the end" in html:
                break
            soup = BeautifulSoup(html, 'lxml')
            for name in soup.find_all('span', class_='link-gray pl-1'):
                print(name)
                follower_names.append(name.text)
            for name in soup.find_all('span', class_='link-gray'):
                print(name)
                if name.text not in follower_names:
                    follower_names.append(name.text)
        except:
            pass
        # random pause between requests to avoid hammering the site
        time.sleep(random.random() + random.randrange(0, 2))
        headers.update({'Referer': followers_url})
    print("[INFO]: successfully obtained %s's %s follower usernames..." % (self.target_username, len(follower_names)))
    return follower_names
Then, using these usernames, we can visit each user's homepage and scrape their details. Each homepage URL is constructed as:
https://github.com/ + username, for example: https://github.com/CharlesPikachu
The data we want to capture includes the display name, location, number of repositories, number of starred repositories, follower count, following count, and contributions over the past year.
Similarly, we use beautifulsoup to extract this information:
follower_infos = {}
for idx, name in enumerate(follower_names):
    print('[INFO]: crawling details of user %s...' % name)
    user_url = f'https://github.com/{name}'
    try:
        response = requests.get(user_url, headers=self.headers, timeout=15)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        # --display name
        username = soup.find_all('span', class_='p-name vcard-fullname d-block overflow-hidden')
        if username:
            username = [name, username[0].text]
        else:
            username = [name, '']
        # --location
        position = soup.find_all('span', class_='p-label')
        if position:
            position = position[0].text
        else:
            position = ''
        # --number of repositories, stars, followers, following
        overview = soup.find_all('span', class_='Counter')
        num_repos = self.str2int(overview[0].text)
        num_stars = self.str2int(overview[2].text)
        num_followers = self.str2int(overview[3].text)
        num_followings = self.str2int(overview[4].text)
        # --contributions in the last year
        num_contributions = soup.find_all('h3', class_='f4 text-normal mb-2')
        num_contributions = self.str2int(num_contributions[0].text.replace('\n', '').replace(' ', '')
                                         .replace('contributioninthelastyear', '')
                                         .replace('contributionsinthelastyear', ''))
        # --save the data
        info = [username, position, num_repos, num_stars, num_followers, num_followings, num_contributions]
        print(info)
        follower_infos[str(idx)] = info
    except:
        pass
    time.sleep(random.random() + random.randrange(0, 2))
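Both snippets above live inside a crawler class: they rely on self.headers, self.target_username and a self.str2int helper that turns counter strings like '1.2k' into integers. The original article does not show that class, so here is a minimal sketch of what the skeleton plus an argparse entry point might look like; every name in it is an assumption, not the author's exact code:

import argparse

class GithubUserCrawler():
    '''hypothetical skeleton; the two methods shown above would be attached here'''
    def __init__(self, target_username):
        self.target_username = target_username
        # a desktop User-Agent so Github serves the regular HTML pages
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    def str2int(self, text):
        # convert counter strings such as '1,234' or '1.2k' into integers
        text = text.strip().replace(',', '')
        if not text:
            return 0
        if text.endswith('k'):
            return int(float(text[:-1]) * 1000)
        return int(float(text))

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Github follower data crawler')
    parser.add_argument('--username', required=True, help='target Github username')
    args = parser.parse_args()
    crawler = GithubUserCrawler(args.username)
    # crawler.getfollowernames() could then be called here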
Data visualization
Here, we take my own follower data as an example, about 1200 followers in total.
Let's first look at the distribution of the number of contributions they made over the past year:
The person with the most contributions is fengjixuchui, with 9437 contributions in the past year. That works out to more than 20 contributions a day on average, which is seriously diligent.
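The original charts are not reproduced here, but a distribution bar chart like this can be drawn with pyecharts. Below is a minimal sketch, assuming pyecharts 1.x and the follower_infos dictionary built above; the helper name and the bucket edges are my own choices, not the author's:

from collections import Counter

from pyecharts import options as opts
from pyecharts.charts import Bar

def plot_contribution_distribution(follower_infos, savepath='contributions.html'):
    # follower_infos: {idx: [username, position, num_repos, num_stars,
    #                        num_followers, num_followings, num_contributions]}
    def bucket(value):
        # place a yearly contribution count into a coarse bucket (edges chosen arbitrarily)
        if value == 0: return '0'
        if value < 100: return '1-99'
        if value < 500: return '100-499'
        if value < 1000: return '500-999'
        return '1000+'
    order = ['0', '1-99', '100-499', '500-999', '1000+']
    counter = Counter(bucket(info[-1]) for info in follower_infos.values())
    bar = (
        Bar()
        .add_xaxis(order)
        .add_yaxis('followers', [counter.get(b, 0) for b in order])
        .set_global_opts(title_opts=opts.TitleOpts(title='Contributions in the last year'))
    )
    bar.render(savepath)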
Next, let's look at the distribution of the number of repositories each person owns:
I thought it would be a monotonically decreasing curve, but it seems I underestimated everyone.
Next, let's look at the distribution of how many stars each person has given to others:
Not bad, at least not everyone is just freeloading without starring anything.
A special shout-out to the user lifa123, who has handed out 18700 stars to others. Impressive.
Finally, let's look at the distribution of follower counts among these 1000-plus users:
That is how to crawl Github user data with Python. I hope the content above has been helpful.