2025-04-06 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
How do you crawl Zhihu with Python and analyze the data? Many beginners are at a loss here. This article works through one example, and I hope it helps you solve the problem.
Recently I used a Python crawler to collect the personal information of Zhihu users (public information only). After de-duplication there were more than 3 million records. Collecting the data accidentally crashed one of my servers… though admittedly its specs were very low.
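The article does not show how the crawled records were de-duplicated. As a minimal sketch, assuming each record is a dict carrying a unique user identifier (the field name `url_token` is a hypothetical placeholder, not taken from the article), the step could look like this:

```python
def dedupe_users(records, key="url_token"):
    """Keep the first record seen for each user identifier.

    `key` is a hypothetical field name; use whatever unique ID the crawler stores.
    """
    seen = set()
    unique = []
    for record in records:
        user_id = record.get(key)
        # Skip records with no identifier and records already seen.
        if user_id is None or user_id in seen:
            continue
        seen.add(user_id)
        unique.append(record)
    return unique


# Illustrative records only, not real crawl output.
crawled = [
    {"url_token": "user-a", "name": "A"},
    {"url_token": "user-b", "name": "B"},
    {"url_token": "user-a", "name": "A"},
]
print(len(dedupe_users(crawled)))
```

With a distributed crawler, the `seen` set would have to live in storage shared by all workers rather than in one process's memory.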
With the data in hand I couldn't just sit on it, hence this report. It contains only simple data analysis, done mainly for practice. Treat it as entertainment; experts, please don't laugh.
Number of users in the sample: 3,289,329.
Data collection tool: a distributed Python crawler
Analysis tools: Elasticsearch + Kibana
Angles of analysis: geographic location, gender ratio, various rankings, universities, activity level, and so on.
Note:
All of the following results are based on the personal information of the roughly 3 million users I crawled. This is not an authoritative analysis and is for reference only.
The data was crawled in July 2017. User data changes over time, so the report reflects that point in time.
User profiles are often incomplete, because users may choose to fill in only part of their information. Each analysis below therefore excludes users for whom the relevant field is empty.
Let's look at some interesting phenomena in the distribution of users.
What is the gender ratio on Zhihu?
First, let's look at the gender ratio of Zhihu users. In the current sample it is close to 1:1, with slightly more men. (A large number of users have no gender listed; I have left them out.)
Blue is male and red is female. The specific figures are:
Male: 1,202,234, or 51.55 per cent.
Female: 1,129,874, or 48.45 per cent.
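As a quick check on the arithmetic, the two percentages can be recomputed from the raw counts above. The function name `gender_shares` is only for illustration:

```python
def gender_shares(male, female):
    """Return (male %, female %) among users with a known gender, rounded to 2 places."""
    total = male + female
    return round(100 * male / total, 2), round(100 * female / total, 2)


male_pct, female_pct = gender_shares(1_202_234, 1_129_874)
print(male_pct, female_pct)  # 51.55 48.45
```

The two counts sum to 2,332,108. If the gender chart draws on the full sample of 3,289,329 users, that would leave about 957,000 users with no gender listed.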
Where do Zhihu users live?
Let's see where in the country (the world?) people are using Zhihu:
As the figure above shows, users in first-tier cities account for a large share: Beijing, Shanghai, Guangzhou and Shenzhen all sit at the center of the word cloud (the larger the text, the greater the share). Let's look at the specific ranking (top ten):
The top ten places where users live are Beijing, Shanghai, Hangzhou, Chengdu, Nanjing, Wuhan, Guangzhou, Shenzhen, Xi'an and Chongqing.
You may notice that the per-city user counts on the Y axis are not large. This is because about 2.6 million people did not fill in the "residence" field. The same thing happens in the analyses below whenever users leave a field blank; I exclude those users so that the charts stay accurate.
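The article does not show the queries behind its charts. One common way to get a top-ten ranking like this from Elasticsearch is a `terms` aggregation, combined with an `exists` query to skip users who left the field blank. The sketch below only builds the request body as a Python dict. The field name `location` is an assumption, and the field is assumed to be mapped as a keyword so that `terms` can count it:

```python
import json


def top_values_query(field, size=10):
    """Build a search body that counts the most common values of `field`."""
    return {
        "size": 0,  # return aggregation results only, no individual documents
        "query": {"exists": {"field": field}},  # skip users with no value for the field
        "aggs": {
            "top_values": {
                "terms": {"field": field, "size": size}
            }
        },
    }


# "location" is a hypothetical field name for the residence column.
print(json.dumps(top_values_query("location"), indent=2))
```

Note that `exists` only filters out missing or null values. If the crawler stored blank fields as empty strings, those would need to be excluded separately.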
What is the distribution of occupations?
The following shows the main occupations, based on the occupation users entered in their profiles:
As the figure above shows, students make up the largest group of Zhihu users, and product managers, programmers, operations staff and HR are also numerous. Let's look at the specific ranking (top ten):
As the chart above shows, "student" has the highest share among Zhihu users. Let's remove "student" to see the ranking of actual occupations on Zhihu:
With students removed, the main occupations in descending order are (top ten): product manager, freelancer, programmer, engineer, designer, Tencent, teacher, human resources (HR), operations, lawyer. Besides the usual positions at Internet companies, teachers and lawyers also make up a sizeable share of Zhihu users.
Below we break down Zhihu's main occupations by gender and by residence.
Gender distribution of Zhihu's main occupations:
The inner ring of the pie chart above shows the share of each of the top ten occupations, and the outer ring shows the share of men and women within that occupation, blue for men and red for women. Let's show the same data as a bar chart:
Again blue represents men and red represents women, and the occupations are ordered from left to right by decreasing number of users. Most of the main occupations are dominated by men: eight of the top ten have more men than women. Programmers show the largest gender gap (-_-|||), while designers show the smallest, so the gender ratio among designers looks fairly balanced. Other occupations, such as product manager, freelancer and lawyer, also have more men than women. The remaining two occupations in the top ten, teacher and human resources (HR), have more women than men, and the gap is larger in HR. Among teachers the ratio is less extreme, but women still clearly outnumber men (perhaps because male teachers use Zhihu less).
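A breakdown like the one above is a cross-tabulation of occupation by gender. As a minimal sketch in plain Python, assuming each user record is a dict with hypothetical `occupation` and `gender` fields (the article does not give its actual field names), it could be computed like this:

```python
from collections import Counter, defaultdict


def gender_split_by_occupation(users):
    """Count men and women per occupation, skipping incomplete profiles."""
    counts = defaultdict(Counter)
    for user in users:
        occupation = user.get("occupation")
        gender = user.get("gender")
        # Exclude users who left either field empty or have an unknown gender.
        if not occupation or gender not in ("male", "female"):
            continue
        counts[occupation][gender] += 1
    return {occupation: dict(split) for occupation, split in counts.items()}


# Illustrative records only, not data from the crawl.
sample = [
    {"occupation": "programmer", "gender": "male"},
    {"occupation": "programmer", "gender": "male"},
    {"occupation": "programmer", "gender": "female"},
    {"occupation": "teacher", "gender": "female"},
    {"occupation": "", "gender": "male"},
    {"occupation": "lawyer", "gender": None},
]
print(gender_split_by_occupation(sample))
```

In Elasticsearch the same shape comes from nesting one `terms` aggregation on gender inside another on occupation.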
Having looked at the gender distribution of each occupation, we now use a heat map to see how the main occupations (top five) are distributed across regions. The darker the color, the more people in that occupation in that region:
Here I removed product managers to make the chart easier to read. All you need to know is that product managers are the most numerous everywhere… I don't know why there are so many; perhaps they are there to promote their products?
As the figure above shows, most of the main occupations are concentrated in Beijing and Shanghai, and more precisely in Beijing. HR is the exception, being more concentrated in Shanghai. Turning to the other occupations, the cities with the most programmers are Beijing, Shanghai, Guangzhou, Hangzhou and Xiamen. Beijing's share is so large that its cell is a green so dark it is almost black; it seems Beijing is a programmer's paradise. Shenzhen is not on the list, which surprised me. The cities with the most designers are Beijing, Shanghai, Hangzhou, Shenzhen and Wuhan. Designers are spread fairly evenly across regions, with a certain number in each city.
Zhihu's university users
Since students account for a large share of Zhihu users, let's see which schools they come from! The larger the font in the word cloud, the greater the share.
Let's list the detailed rankings:
The results above may not be accurate, since many student users probably did not fill in their school. Going only by the figure above, the universities with the most users, in descending order, are: Zhejiang University, Wuhan University, Huazhong University of Science and Technology, Sun Yat-sen University, Peking University, Shanghai Jiao Tong University, Fudan University, Nanjing University, Sichuan University and Tsinghua University.
Since the analysis has reached the schools, let's also look at the gender ratio at each university, hehe.
I found something interesting: at most universities, it is mainly the men who use Zhihu...
Let's look at which universities have received the most likes:
First is Tongji University (Civil Engineering); hmm, some big name must be propping it up. Second is South China University of Technology (Software Engineering); this one I know, Brother Wheel (vczh) is from SCUT. Third is "Chongqing No. 1 Engineering Corpse Training Base", huh?? What on earth is that (black question mark)? Reading on, eh…?? "Stay-at-Home University"?! And there is even Lanzhou University, majoring in Beef Noodle Technology??? WHAT??!!
Are Zhihu's big names all this mischievous…
This chart doesn't seem very accurate; feel free to ignore it...
Let's look at which universities in each region are the heaviest Zhihu users. The darker the color, the more Zhihu users at that school:
The universities with the most Zhihu users in Beijing are Peking University, Beijing University of Posts and Telecommunications, Communication University of China, Renmin University of China and Tsinghua University.
The universities with the most Zhihu users in Shanghai are Shanghai Jiao Tong University, Fudan University, Tongji University, Shanghai University and Shanghai University of Finance and Economics.
The entries with the most Zhihu users in Hangzhou are Zhejiang University, Zhejiang University of Technology, Hangzhou Dianzi University, Zhejiang University (Computer Science) and Zhejiang University (Software Engineering). Zhejiang University really is a heavy user…
The entries with the most Zhihu users in Chengdu are University of Electronic Science and Technology of China (UESTC), Sichuan University, Southwest Jiaotong University, UESTC (Software Engineering) and Sichuan Normal University.
The universities with the most Zhihu users in Guangzhou are Sun Yat-sen University (SYSU), South China University of Technology (SCUT), South China Agricultural University (SCAU), Guangdong University of Foreign Studies and Guangdong University of Technology.
Let's take a look at how active users are at each university, ranked by the total number of questions answered by users at each school:
The ranking is Wuhan University, Zhejiang University, Sun Yat-sen University, South China University of Technology, Peking University, Huazhong University of Science and Technology, Fudan University, Shanghai Jiao Tong University and Northwest A&F University.
Well, the university analysis is over, let's take a look at the various rankings of users.
100 Big V's with the Most Likes
The larger a name appears in the word cloud below, the more likes that user has received:
Let's look at a bar chart as well:
Zhang Jiawei takes first place by a wide margin, with more than 3.6 million likes, a frightening number. He is followed by pawn, Tang Que, vczh, fat cat, Zhu Xuan, Seasee Youl, ze ran, Gui Mu Zhi and beans. Two of the five most-liked users (Zhang Jiawei and Tang Que) are writers. It seems writers' answers are very popular; the ability to express yourself well clearly matters for getting your views recognized.
100 Big V's with the Most Followers
The larger a name appears in the word cloud below, the more followers that user has. See if there is a big V you recognize.
Again, let's look at a bar chart as well:
The ten big V's with the most followers are Zhang Jiawei, Li Kaifu, Huang Jixin, Zhou Yuan, Zhang Liang, Zhang Xiaobei, Li Miao, Zhu Xuan, Ge Jin and Tian Jishun. These are true big V's with enormous follower counts. Zhang Jiawei, who has the most, had 1.37 million followers at the time of the crawl, and the number is still rising: it is now 1.38 million. Even Tian Jishun has at least 570,000 followers. Brother Wheel (vczh) has slightly fewer and ranks 11th.
100 Big V's Who Answered the Most Questions
These big V's are very active… The larger a name appears in the word cloud, the more questions that user has answered.
The specific rankings are:
The ten big V's with the most answers are: vczh, Li Dong, Zhao Gang, another sock, within the four seas, M3 small mushroom, kun yu, white cat turning wind, yskin, anal pull out a chainsaw. Work at Microsoft must be pretty relaxed, seeing how Brother Wheel (vczh) browses Zhihu all day…
Let's add the number of likes these users have received in Zhihu to see if there is any connection between "number of questions answered" and "number of likes obtained":
From the chart above we can roughly conclude that the number of questions answered has little to do with the number of likes received. Only for kun yu and vczh do the answer count and the like count roughly match. The other users in the top ten by answers have received plenty of likes, but their like counts are not on the same scale as their rank by answers. This indirectly makes a point: the quality of the answers matters more. High-quality answers attract likes more easily.
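The conclusion above is read off a chart. One way to put a number on it is a Pearson correlation coefficient between each user's answer count and like count. A value near 0 would support "little to do with", and a value near 1 would contradict it. The sketch below is only an illustration, and the figures in it are made-up placeholders, not values from the crawl:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder numbers for illustration only (answers, likes per user).
answer_counts = [12000, 9000, 7500, 6000, 5000]
like_counts = [800000, 30000, 45000, 900000, 20000]
print(round(pearson(answer_counts, like_counts), 3))
```

With only ten users in the chart, any such coefficient would be a rough indication at best.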
100 Big V's Who Took Part in the Most Zhihu Lives
Let's look at an interesting statistic: the 100 users who have taken part in the most Zhihu Lives, and how many Lives each has joined. (Live is a Q&A format similar to live streaming, launched by Zhihu. A big V opens a Live to share knowledge in their field, and users buy tickets to take part; it is a way of monetizing knowledge.)
Let's see how many Lives they have taken part in:
The top big V has actually taken part in 1,600+ Lives. That takes a lot of energy and money, haha.
After reading the above, do you have a better idea of how to crawl Zhihu with Python and analyze the data? If you want to learn more skills or read related content, you are welcome to follow the industry information channel. Thank you for reading!