This article is about how to use Python to crawl the "Everyone Is a Product Manager" website. The editor thinks it is very practical, so it is shared here as a reference; follow along and have a look.
1.1. Why choose "Everyone Is a Product Manager"
Everyone Is a Product Manager (woshipm.com) is a learning, communication, and sharing platform centered on product managers and operations people, combining media, training, recruitment, and community to serve product and operations professionals. Over eight years it has held more than 500 online lectures, 300 offline sharing sessions, and 20 product manager and operations conferences across 15 cities including Beijing, Shanghai, Guangzhou, Shenzhen, Hangzhou, and Chengdu, and it has high influence and popularity in the industry. The platform has gathered a large number of product directors and operations directors from well-known Internet companies such as BAT, Meituan, JD.com, Didi, 360, Xiaomi, and NetEase, so this community is fairly representative.
1.2. Analysis content
Analyze the basic situation of the 6,574 articles under the product manager column, including the number of collections, comments, likes, etc.
Discover the most popular articles and authors
Analyze the relationship between title length and article popularity
Show what product managers are reading
1.3. Analysis tool
Python 3.6
Matplotlib
WordCloud
Jieba
2. Data capture
A crawler written in Python grabbed all the articles under the product manager section of the Everyone Is a Product Manager community and saved them in CSV format: 6,574 articles in total, from June 2012 to January 21, 2019. Ten fields were crawled for each article: title, author, author profile, post time, page views, collections, likes, comment count, body text, and article link.
2.1. Target website analysis
This is the web page to be crawled. You can see that it loads directly, without AJAX, so it is not difficult to crawl.
Looking more closely at the pages to be crawled, we can see that the page links follow a regular pattern: the parameter after page in the link is the page number. So when writing the crawler we can construct all the page links directly with a for loop. The code is as follows:
import requests
from bs4 import BeautifulSoup
import csv

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Connection': 'keep-alive',
    'Host': 'www.woshipm.com',
    'Cookie': '...',  # session cookie omitted
}

for page_number in range(1, 549):
    page_url = "http://www.woshipm.com/category/pmd/page/{}".format(page_number)
    print('grabbing page ' + str(page_number) + ' >>>')
    response = requests.get(url=page_url, headers=headers)
After the page links are constructed, we can start crawling the article detail pages and extracting the required information. The parsing library used here is BeautifulSoup. The whole crawler is very simple; the complete code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
import csv

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Connection': 'keep-alive',
    'Host': 'www.woshipm.com',
    'Cookie': '...',  # session cookie omitted
}

with open('data.csv', 'w', encoding='utf-8', newline='') as csvfile:
    fieldnames = ['title', 'author', 'author_des', 'date', 'views', 'loves', 'zans', 'comment_num', 'art', 'url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for page_number in range(1, 549):
        page_url = "http://www.woshipm.com/category/pmd/page/{}".format(page_number)
        print('capturing page ' + str(page_number) + ' >>>')
        response = requests.get(url=page_url, headers=headers)
        if response.status_code == 200:
            page_data = response.text
            if page_data:
                soup = BeautifulSoup(page_data, 'lxml')
                article_urls = soup.find_all("h3", class_="post-title")
                for item in article_urls:
                    url = item.find('a').get('href')
                    # article page parsing: title, author, author profile, date, page views,
                    # collections, likes, comments, body text, article link
                    response = requests.get(url=url, headers=headers)
                    # time.sleep(3)
                    print('crawling: ' + url)
                    # print(response.status_code)
                    if response.status_code == 200:
                        article = response.text
                        # print(article)
                        if article:
                            try:
                                soup = BeautifulSoup(article, 'lxml')
                                # article title
                                title = soup.find(class_='article-title').get_text().strip()
                                # author
                                author = soup.find(class_='post-meta-items').find_previous_siblings()[1].find('a').get_text().strip()
                                # author profile
                                author_des = soup.find(class_='post-meta-items').find_previous_siblings()[0].get_text().strip()
                                # date
                                date = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[0].get_text().strip()
                                # page views
                                views = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[1].get_text().strip()
                                # collections
                                loves = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[2].get_text().strip()
                                # likes
                                zans = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[3].get_text().strip()
                                # comments
                                comment = soup.find('ol', class_="comment-list").find_all('li')
                                comment_num = len(comment)
                                # body text
                                art = soup.find(class_="grap").get_text().strip()

                                writer.writerow({'title': title, 'author': author, 'author_des': author_des, 'date': date, 'views': views, 'loves': int(loves), 'zans': int(zans), 'comment_num': int(comment_num), 'art': art, 'url': url})
                                print({'title': title, 'author': author, 'author_des': author_des, 'date': date, 'views': views, 'loves': loves, 'zans': zans, 'comment_num': comment_num})
                            except:
                                print('crawl failed')
print("crawl complete!")
One point worth mentioning: to get the number of comments, look at the article detail page and you will find that no comment count is displayed, so I calculate it directly. The comments are nested inside an ol element, so grab all the li elements and count them. The code is as follows:
# comments
comment = soup.find('ol', class_="comment-list").find_all('li')
comment_num = len(comment)
In this way, we can run the crawler and successfully crawl all of the listing pages. I grabbed a total of 6,574 results; the whole crawl took about as long as a couple of rounds of PUBG.
That completes the data acquisition. With the data in hand we can start the analysis, but before that we still need to do some simple cleaning and processing.
3. Data cleaning and processing
First, we need to convert the csv file to a DataFrame.
import pandas as pd

# convert the csv data to a DataFrame
csv_file = "data.csv"
csv_data = pd.read_csv(csv_file, low_memory=False)  # prevent a dtype warning
csv_df = pd.DataFrame(csv_data)
print(csv_df)
Let's take a look at the overall shape of the data. We can see that it has 6,574 rows × 10 columns; the views column needs to be converted to a numeric format and the date column to a date format.
print(csv_df.shape)   # view the number of rows and columns
print(csv_df.info())  # view the overall situation
print(csv_df.head())  # output the first five rows
The result:
(6574, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 10 columns):
title          6574 non-null object
author         6574 non-null object
author_des     6135 non-null object
date           6574 non-null object
views          6574 non-null object
loves          6574 non-null int64
zans           6574 non-null int64
comment_num    6574 non-null int64
art            6574 non-null object
url            6574 non-null object
dtypes: int64(3), object(7)
memory usage: 513.7+ KB
None
                                               title  ...                                      url
0  This is how I spent the second year of my prod...  ...  http://www.woshipm.com/pmd/1863343.html
1  A product trilogy extracted from "What is Page...  ...  http://www.woshipm.com/pmd/1860832.html
2  "Excavation, filling": those things about the ...  ...  http://www.woshipm.com/pmd/1859168.html
3  How to become a product manager trusted by the...  ...  http://www.woshipm.com/pmd/1857656.html
4  How to get programmers to put down their knive...  ...  http://www.woshipm.com/pmd/1858879.html

[5 rows x 10 columns]
Converting the date column to dates is very simple; the code is as follows:
# convert the date column to datetime format
csv_df['date'] = pd.to_datetime(csv_df['date'])
The idea for processing the views column is to add a new column called views_num. Observing the views column, some values are plain integers while others are strings such as "1.7万" (17,000), so the "万" (ten thousand) suffix has to be converted. The code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import jieba
import os
from PIL import Image
from os import path
from decimal import *

# views column processing
def views_to_num(item):
    m = re.search('.*?(万)', item['views'])
    if m:
        ns = item['views'][:-1]
        nss = Decimal(ns) * 10000
    else:
        nss = item['views']
    return int(nss)

# data cleaning and processing
def parse_woshipm():
    # convert the csv data to a DataFrame
    csv_file = "data.csv"
    csv_data = pd.read_csv(csv_file, low_memory=False)  # prevent a dtype warning
    csv_df = pd.DataFrame(csv_data)
    # print(csv_df.shape)   # view the number of rows and columns
    # print(csv_df.info())  # view the overall situation
    # print(csv_df.head())  # output the first five rows

    # convert the date column to datetime format
    csv_df['date'] = pd.to_datetime(csv_df['date'])
    # convert the views string to a number and add a views_num column
    csv_df['views_num'] = csv_df.apply(views_to_num, axis=1)
    print(csv_df.info())


if __name__ == '__main__':
    parse_woshipm()
Let's take another look at the column data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 11 columns):
title          6574 non-null object
author         6574 non-null object
author_des     6135 non-null object
date           6574 non-null datetime64[ns]
views          6574 non-null object
loves          6574 non-null int64
zans           6574 non-null int64
comment_num    6574 non-null int64
art            6574 non-null object
url            6574 non-null object
views_num      6574 non-null int64
dtypes: datetime64[ns](1), int64(4), object(6)
memory usage: 565.0+ KB
None
You can see that the data types are now what we want. Next, let's check whether the data contains any duplicates; if it does, they need to be deleted.
# check whether any whole row is duplicated; True means duplicate values exist
print(any(csv_df.duplicated()))
# True is displayed, indicating duplicates, so count them
data_duplicated = csv_df.duplicated().value_counts()
print(data_duplicated)
# run result:
# False    6562
# True       12
# dtype: int64
# delete the duplicate rows
data = csv_df.drop_duplicates(keep='first')
# deleting rows leaves gaps in the index, so reset it
data = data.reset_index(drop=True)
Then we add two more columns of data: one for the length of the article title and one for the year, which will be convenient for later analysis.
# add a title length column and a year column
data['title_length'] = data['title'].apply(len)
data['year'] = data['date'].dt.year
Above, the basic data cleaning process is completed, and the data can be analyzed.
4. Descriptive data analysis
In general, data analysis falls into four categories: descriptive, diagnostic, predictive, and prescriptive. Descriptive analysis is a statistical method used to summarize the overall situation of things and the correlations and relationships between them, and it is the most common of the four. Through statistical processing, a few statistics can concisely represent the central tendency of a group of data (mean, median, mode, etc.) and its dispersion (reflecting the volatility of the data, such as variance and standard deviation).
Here, we mainly carry out descriptive analysis, and the data are mainly numerical data (including discrete variables and continuous variables) and text data.
4.1. Overall situation
First, let's look at the overall picture, using the data.describe() method to produce summary statistics for the numeric variables.
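A minimal sketch of that step, assuming the cleaned DataFrame is named data as in the cleaning code above:

# summary statistics for the numeric columns (views_num, loves, zans, comment_num, title_length)
print(data.describe())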
mean is the average and std is the standard deviation; from the output we can briefly draw the following conclusions:
Product managers love to learn and save good articles when they see them: 75% of the articles have more than 100 collections, and 50% of the articles have more than 100 views.
Product managers are people of few words and seldom comment on other people's articles; the articles have very few comments.
Product managers are reluctant to admit that others are better than themselves: most articles get only 10 or 20 likes, so programmers should not brag about how good their technology is in front of a product manager, because the product manager will not admit that you are good.
For the non-numeric variables (author, date), the describe() method produces a different kind of summary statistics.
print(data['author'].describe())
print(data['date'].describe())
The result:
count                    6562
unique                   1531
top                     Nairo
freq                      315
Name: author, dtype: object
count                    6562
unique                   1827
top       2015-01-29 00:00:00
freq                       16
first     2012-11-25 00:00:00
last      2019-01-21 00:00:00
Name: date, dtype: object
unique is the number of distinct values, top is the most frequent value, and freq is the number of times it occurs. From this we can briefly conclude:
A total of 1,531 authors have contributed articles to the product manager section of the community; the largest contributor is Nairo, with 315 articles.
The largest number of articles published in the column on a single day was 16, on January 29, 2015. The first article in the column was published on November 25, 2012.
4.2. Changes in the number of articles published in different periods
As can be seen from the chart, the number of articles posted on the site increased year by year from 2012 to 2015, which may be related to the growth of the site's popularity, and has been relatively stable since the second quarter of 2015. The analysis code for the later sections will not be posted step by step; a download link for the code is left at the end of the article, and a sketch of this step is shown below.
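A minimal sketch of how the quarterly counts could be produced; the column names follow the cleaned data frame above, while the figure size and color are assumptions:

import matplotlib.pyplot as plt

# count articles per quarter using the datetime 'date' column as the index
posts_per_quarter = data.set_index('date').resample('Q')['title'].count()

posts_per_quarter.plot(kind='bar', figsize=(12, 5), color='steelblue')
plt.title('Articles published per quarter')
plt.xlabel('Quarter')
plt.ylabel('Number of articles')
plt.tight_layout()
plt.show()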
4.3. Top 10 articles by views
Next comes a question we care more about: among these thousands of articles, which ones are better or more popular?
Here popularity is measured by the number of views. In first place is "A Xiaobai (novice) product manager looks at products: what is an Internet product?", whose view count is far ahead of second place and close to one million. It seems that many people in the community are product newcomers. Looking at the titles of these articles, most of them introduce what a product manager is and what a product manager does, which suggests the community has many junior product people.
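A sketch of how this ranking could be computed from the cleaned data frame (the displayed columns are an assumption):

# ten most-viewed articles, ranked by the numeric views_num column
top10_views = data.nlargest(10, 'views_num')[['title', 'author', 'views_num']]
print(top10_views)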
4.4. Top 3 most-collected articles of each year
Having looked at the overall ranking of articles, let's look at the ranking over the years. Here the three most-collected articles of each year are selected.
As can be seen from the chart, the article from 2015 has the largest number of collections, reaching 2,000, and its topic is back-end product design. It seems this article is full of practical information.
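A sketch of how the yearly top 3 could be selected, grouping on the year column added during cleaning:

# three most-collected ('loves') articles for each year
top3_per_year = (data.sort_values('loves', ascending=False)
                     .groupby('year')
                     .head(3)
                     .sort_values(['year', 'loves'], ascending=[True, False]))
print(top3_per_year[['year', 'title', 'loves']])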
4.4.1. Top 20 most prolific authors
Above we analyzed the collection metric; below, let's look at the authors who published the articles. As mentioned earlier, the author with the most articles is Nairo, with 315 contributions. Here let's look at the more prolific authors in general.
You can see that the author in first place is far ahead of the rest; these high-output, high-quality authors are worth following.
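A sketch of the underlying count, a simple value_counts on the author column:

# authors ranked by number of published articles
top20_authors = data['author'].value_counts().head(20)
print(top20_authors)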
4.4.2. Top 10 authors by average collections per article
We pay attention to an author not only for the volume of their articles but also for their quality. Here we use the metric "average collections per article" (total collections / number of articles) to see which authors write high-level articles. To avoid cases where an author has written only one article with a high collection count, which would not represent their true level, the selection is limited to authors who have published at least 5 articles.
Comparing this chart with the earlier ranking by number of posts, none of these authors appear on that list; quality may be more important than quantity.
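A sketch of how this ranking could be built; the 5-article threshold comes from the text, while the rounding is an assumption:

# average collections per article, limited to authors with at least 5 articles
author_stats = data.groupby('author')['loves'].agg(['count', 'mean'])
qualified = author_stats[author_stats['count'] >= 5]
print(qualified.sort_values('mean', ascending=False).head(10).round(1))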
4.5. Top 10 articles by number of comments
Having finished with collections, let's now look at the articles with the largest number of comments.
We can see that most of them are again related to entry-level product topics, and many of them have both a lot of comments and a lot of collections, so let's explore the relationship between the two further.
We find that for most articles both the comment count and the collection count are very small.
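A sketch of the comment ranking and the comments-versus-collections scatter (the plot styling is an assumption):

import matplotlib.pyplot as plt

# ten most-commented articles
print(data.nlargest(10, 'comment_num')[['title', 'comment_num', 'loves']])

# relationship between comment count and collection count
plt.scatter(data['comment_num'], data['loves'], s=10, alpha=0.5)
plt.xlabel('Number of comments')
plt.ylabel('Number of collections')
plt.title('Comments vs. collections')
plt.show()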
4.6. Article title length
Next, let's see whether there is any relationship between the length of the title and the number of views.
We can see that views are generally higher when the title is around 20 characters long.
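A sketch of the corresponding scatter plot, using the title_length column added during cleaning (styling assumed):

import matplotlib.pyplot as plt

plt.scatter(data['title_length'], data['views_num'], s=10, alpha=0.5)
plt.xlabel('Title length (characters)')
plt.ylabel('Views')
plt.title('Title length vs. views')
plt.show()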
4.7. Text analysis
Finally, let's look at what product managers are reading, based on the body text of these articles.
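A minimal sketch of such a word cloud with jieba and WordCloud; the font path is an example and must point to a font that supports Chinese characters:

import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# concatenate all article bodies and segment the Chinese text with jieba
all_text = ' '.join(data['art'].astype(str))
segmented = ' '.join(jieba.cut(all_text))

wc = WordCloud(font_path='simhei.ttf', width=800, height=600,
               background_color='white', max_words=200)
wc.generate(segmented)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()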
Thank you for reading! This is the end of this article on how to use Python to crawl the "Everyone Is a Product Manager" website. I hope the above content has been helpful and that you have learned something new. If you think the article is good, please share it so more people can see it!