
How to use Python to crawl articles from the "Everyone Is a Product Manager" website


This article shows how to use Python to crawl articles from the "Everyone Is a Product Manager" (woshipm.com) community. The editor finds it very practical and shares it here as a reference; follow along to have a look.

1.1. Why choose "everyone is a product manager"

Everyone Is a Product Manager is a learning, communication and sharing platform centered on product managers and operations people, combining media, training, recruitment and community to serve product and operations professionals. Over eight years it has held 500+ online lectures, 300 offline sharing sessions and 20 product manager and operations conferences, covering 15 cities including Beijing, Shanghai, Guangzhou, Shenzhen, Hangzhou and Chengdu, and it has high influence and popularity in the industry. The platform has gathered a large number of product directors and operations directors from well-known Internet companies such as BAT, Meituan, JD.com, Didi, 360, Xiaomi and NetEase, so this community is fairly representative.

1.2. Analysis content

Analyze the basic situation of 6574 articles under the product manager column, including the number of collections, comments, likes, etc.

Discover the most popular articles and authors

Analysis of the relationship between title length and popularity of articles

Show what product managers are reading

1.3. Analysis tool

Python 3.6

Matplotlib

WordCloud

Jieba

2. Data capture

A crawler written in Python grabbed all the articles under the "Product Manager" section of the Everyone Is a Product Manager community and saved them in csv format: 6,574 articles in total, spanning June 2012 to January 21, 2019. Ten fields were crawled for each article: title, author, author profile, post time, page views, collections, likes, comments, body text, and article link.

2.1. Target website analysis

This is the web interface to be crawled. You can see that the page is loaded directly, without AJAX, so it is not difficult to crawl.

Looking more closely at the pages to be crawled, we can see that the page links follow a regular pattern: the parameter after page in the link is the page number. So when writing the crawler we can simply use a for loop to construct all the page links. The code is as follows:

import requests
from bs4 import BeautifulSoup
import csv

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Connection': 'keep-alive',
    'Host': 'www.woshipm.com',
    'Cookie': '...',  # cookie value omitted; the original string is garbled in the source
}

for page_number in range(1, 549):
    page_url = "http://www.woshipm.com/category/pmd/page/{}".format(page_number)
    print('Grabbing page ' + str(page_number) + ' >>>')
    response = requests.get(url=page_url, headers=headers)

After the page links are constructed, we can start crawling the article detail pages and extracting the required information. The parsing library used here is BeautifulSoup. The whole crawler is very simple; the complete code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
import csv

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Connection': 'keep-alive',
    'Host': 'www.woshipm.com',
    'Cookie': '...',  # cookie value omitted; the original string is garbled in the source
}

with open('data.csv', 'w', encoding='utf-8', newline='') as csvfile:
    fieldnames = ['title', 'author', 'author_des', 'date', 'views', 'loves', 'zans', 'comment_num', 'art', 'url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for page_number in range(1, 549):
        page_url = "http://www.woshipm.com/category/pmd/page/{}".format(page_number)
        print('Grabbing page ' + str(page_number) + ' >>>')
        response = requests.get(url=page_url, headers=headers)
        if response.status_code == 200:
            page_data = response.text
            if page_data:
                soup = BeautifulSoup(page_data, 'lxml')
                article_urls = soup.find_all("h3", class_="post-title")
                for item in article_urls:
                    url = item.find('a').get('href')
                    # Parse the article page: title, author, author profile, date,
                    # page views, collections, likes, comments, body text, article link
                    response = requests.get(url=url, headers=headers)
                    # time.sleep(3)
                    print('Crawling: ' + url)
                    # print(response.status_code)
                    if response.status_code == 200:
                        article = response.text
                        # print(article)
                        if article:
                            try:
                                soup = BeautifulSoup(article, 'lxml')
                                # Article title
                                title = soup.find(class_='article-title').get_text().strip()
                                # Author
                                author = soup.find(class_='post-meta-items').find_previous_siblings()[1].find('a').get_text().strip()
                                # Author profile
                                author_des = soup.find(class_='post-meta-items').find_previous_siblings()[0].get_text().strip()
                                # Date
                                date = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[0].get_text().strip()
                                # Page views
                                views = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[1].get_text().strip()
                                # Collections
                                loves = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[2].get_text().strip()
                                # Likes
                                zans = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[3].get_text().strip()
                                # Comments
                                comment = soup.find('ol', class_="comment-list").find_all('li')
                                comment_num = len(comment)
                                # Body text
                                art = soup.find(class_="grap").get_text().strip()

                                writer.writerow({'title': title, 'author': author, 'author_des': author_des,
                                                 'date': date, 'views': views, 'loves': int(loves),
                                                 'zans': int(zans), 'comment_num': int(comment_num),
                                                 'art': art, 'url': url})
                                print({'title': title, 'author': author, 'author_des': author_des,
                                       'date': date, 'views': views, 'loves': loves,
                                       'zans': zans, 'comment_num': comment_num})
                            except:
                                print('Crawl failed')
print("Crawl complete!")

One point worth mentioning is the comment count. If you look at the article detail page, you will find that no comment count is displayed, so I calculate it directly: the comments are nested inside an ol element, so grab all the li elements and count them. The code is as follows:

# Comments
comment = soup.find('ol', class_="comment-list").find_all('li')
comment_num = len(comment)

With this, we can run the crawler and successfully fetch the results from all 594 pages. I grabbed 6,574 articles in total; the run takes roughly as long as a couple of rounds of PUBG.

With that, data acquisition is complete. Now that we have the data we can start the analysis, but before that we still need to do some simple cleaning and processing.

3. Data cleaning and processing

First, we need to convert the csv file to a DataFrame.

"convert csv data to dataframe

2csv_file = "data.csv"

3csv_data = pd.read_csv (csv_file, low_memory=False) # prevent pop-up warning

4csv_df = pd.DataFrame (csv_data)

5print (csv_df)

Let's take a look at the overall situation of the data. We can see that it has 6,574 rows × 10 columns. We need to convert the views column to a numeric format and the date column to a date format.

print(csv_df.shape)   # number of rows and columns
print(csv_df.info())  # overall info
print(csv_df.head())  # first five rows

Run result:

(6574, 10)

RangeIndex: 6574 entries, 0 to 6573
Data columns (total 10 columns):
title          6574 non-null object
author         6574 non-null object
author_des     6135 non-null object
date           6574 non-null object
views          6574 non-null object
loves          6574 non-null int64
zans           6574 non-null int64
comment_num    6574 non-null int64
art            6574 non-null object
url            6574 non-null object
dtypes: int64(3), object(7)
memory usage: 513.7+ KB
None

                                               title  ...                                      url
0  This is how I spent the second year of my product career in 2018            ...  http://www.woshipm.com/pmd/1863343.html
1  A product trilogy extracted from "What is Page?"                             ...  http://www.woshipm.com/pmd/1860832.html
2  "Excavation, filling": those things about the project (phase 6: test and acceptance)  ...  http://www.woshipm.com/pmd/1859168.html
3  How to become a product manager trusted by the CEO?                          ...  http://www.woshipm.com/pmd/1857656.html
4  How to get programmers to put down their knives?                             ...  http://www.woshipm.com/pmd/1858879.html

[5 rows x 10 columns]

Changing the date column to a date is very simple, and the code is as follows:

# Convert the date column to datetime format
csv_df['date'] = pd.to_datetime(csv_df['date'])

The idea for processing the views column is to add a new column named views_num. Some values in the views column are plain integers, while others use the Chinese unit 万 (ten thousand), e.g. "1.7万" for 17,000, so they need to be converted. The code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import jieba
import os
from PIL import Image
from os import path
from decimal import *

# Process the views column
def views_to_num(item):
    m = re.search('.*?(万)', item['views'])  # '万' means ten thousand
    if m:
        ns = item['views'][:-1]
        nss = Decimal(ns) * 10000
    else:
        nss = item['views']
    return int(nss)

# Data cleaning and processing
def parse_woshipm():
    # Convert the csv data to a DataFrame
    csv_file = "data.csv"
    csv_data = pd.read_csv(csv_file, low_memory=False)  # suppress a dtype warning
    csv_df = pd.DataFrame(csv_data)
    # print(csv_df.shape)   # number of rows and columns
    # print(csv_df.info())  # overall info
    # print(csv_df.head())  # first five rows

    # Convert the date column to datetime format
    csv_df['date'] = pd.to_datetime(csv_df['date'])
    # Convert the views string to a number and add a views_num column
    csv_df['views_num'] = csv_df.apply(views_to_num, axis=1)
    print(csv_df.info())


if __name__ == '__main__':
    parse_woshipm()

Let's take another look at the column data types:

RangeIndex: 6574 entries, 0 to 6573
Data columns (total 11 columns):
title          6574 non-null object
author         6574 non-null object
author_des     6135 non-null object
date           6574 non-null datetime64[ns]
views          6574 non-null object
loves          6574 non-null int64
zans           6574 non-null int64
comment_num    6574 non-null int64
art            6574 non-null object
url            6574 non-null object
views_num      6574 non-null int64
dtypes: datetime64[ns](1), int64(4), object(6)
memory usage: 565.0+ KB
None

You can see that the data types are now what we want. Next, let's check whether the data contains duplicate rows; if so, they need to be deleted.

# Check whether any whole row is duplicated; if the result is True, duplicates exist
print(any(csv_df.duplicated()))
# True is displayed, so there are duplicates; count them
data_duplicated = csv_df.duplicated().value_counts()
print(data_duplicated)
# Run result:
# False    6562
# True       12
# dtype: int64

# Delete the duplicate rows
data = csv_df.drop_duplicates(keep='first')
# After dropping rows the index has gaps, so reset it
data = data.reset_index(drop=True)

Then we add two more columns: the length of the article title and the year, which will be convenient for later analysis.

"increase the title length column and year column

2data ['title_length'] = data [' title'] .apply (len)

3data ['year'] = data [' date'] .dt.year

Above, the basic data cleaning process is completed, and the data can be analyzed.

4. Descriptive data analysis

In general, data analysis falls into four categories: descriptive analysis, diagnostic analysis, predictive analysis and prescriptive analysis. Descriptive analysis is a statistical method used to summarize and express the overall situation of things and the correlations and generic relationships between them; it is the most common of the four. Through statistical processing we can concisely represent a group of data with a few statistics, describing its central tendency (such as the mean, median and mode) and its dispersion (reflecting the volatility of the data, such as the variance and standard deviation).

Here, we mainly carry out descriptive analysis, and the data are mainly numerical data (including discrete variables and continuous variables) and text data.

4.1. Overall situation

First, let's look at the overall picture, using the data.describe() method to compute summary statistics for the numeric variables.
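The describe() call itself is not shown in the original; a minimal sketch of it, assuming data is the cleaned DataFrame built above with the views_num, loves, zans, comment_num and title_length columns, could be:

# Summary statistics for the numeric columns (assumes the column names used earlier in this article)
print(data[['views_num', 'loves', 'zans', 'comment_num', 'title_length']].describe())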

Mean is the average and std the standard deviation. From these statistics we can briefly draw the following conclusions:

Product managers love to learn and save good articles when they see them: 75% of the articles have more than 100 collections, and 50% of the articles received more than 100 views.

Product people are not talkative and seldom comment on other people's articles; comment counts are very low.

Product people are reluctant to admit that others are better than they are: most articles get only 10 or 20 likes, so programmers should not brag about their technical skills in front of a product manager, because the product manager will not admit that you are good.

For the non-numeric variables (author, date), the describe() method produces a different kind of summary statistics.

print(data['author'].describe())
print(data['date'].describe())
# Run result:
count       6562
unique      1531
top        Nairo
freq         315
Name: author, dtype: object
count                    6562
unique                   1827
top       2015-01-29 00:00:00
freq                       16
first     2012-11-25 00:00:00
last      2019-01-21 00:00:00
Name: date, dtype: object

unique is the number of distinct values, top is the most frequent value, and freq is how many times that value occurs, so we can briefly conclude:

A total of 1,531 authors have contributed articles to the product manager section of the community. The most prolific is Nairo, with 315 articles.

The day with the most articles published in the column was January 29, 2015, with 16 articles. The first article in the column was published on November 25, 2012.

4.2. Changes in the number of articles published in different periods

As can be seen from the chart, the number of articles posted on the site increased year by year from 2012 to 2015, which is probably related to the site's growing popularity, and has been relatively stable since the second quarter of 2015. The analysis code for the later sections will not be posted piece by piece; a download link for the code is left at the end of the article.
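The plotting code is not shown in the original; a rough sketch of how such a trend chart could be produced, assuming the cleaned DataFrame data from the previous section, might look like this:

import matplotlib.pyplot as plt

# Count articles per quarter and plot the trend
# (assumes `data` has a datetime 'date' column, as built during cleaning)
posts_per_quarter = data.set_index('date').resample('Q')['title'].count()
posts_per_quarter.plot(kind='bar', figsize=(12, 5), title='Articles published per quarter')
plt.tight_layout()
plt.show()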

4.3. The number of views of the article TOP10

Next we come to a question we care more about: among these thousands of articles, which ones are better, or more popular?

Here popularity is measured by page views. First place goes to "Xiaobai product manager looks at the product: what is an Internet product?", whose view count, close to one million, is far ahead of second place. It seems many community members are product rookies. Judging by the titles, these top articles all introduce what a product manager is and what a product manager does, which suggests there are many beginner product people in the community.
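A possible way to produce this top-10 list, assuming the views_num column added during cleaning, is simply:

# Ten most-viewed articles (a sketch; column names as defined earlier)
top10_views = data.nlargest(10, 'views_num')[['title', 'author', 'views_num']]
print(top10_views)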

4.4. The number of articles collected over the years TOP3

After understanding the overall ranking of articles, let's take a look at the ranking of articles over the years. Here, the three articles with the largest collection are selected each year.

As can be seen from the chart, the most-collected article of 2015 reached about 2,000 collections; its topic is back-end product design. It seems this article is full of practical advice.
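One way to pick the three most-collected articles of each year, assuming the year and loves columns created earlier, is a sort followed by a grouped head():

# Three most-collected articles per year (sketch)
top3_per_year = (data.sort_values('loves', ascending=False)
                     .groupby('year')
                     .head(3)
                     .sort_values(['year', 'loves'], ascending=[True, False]))
print(top3_per_year[['year', 'title', 'loves']])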

4.4.1. The most prolific author TOP20

Above we analyzed the collection metric; now let's look at the authors who publish the articles. As mentioned earlier, the author with the most articles is Nairo, with 315 contributions. Here, let's look at the other productive authors.

You can see that the author in first place is far ahead of the rest; these high-quality authors are worth following.
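This ranking can be obtained with a single value_counts() call on the cleaned data (a sketch):

# Twenty authors with the most published articles
print(data['author'].value_counts().head(20))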

4.4.2. Authors with the highest average collections per article TOP10

We follow an author not only because they are prolific but also because their articles are high quality. Here we use the metric "average collections per article" (total collections / number of articles) to see which authors write high-level articles. To avoid the case where an author has written only one highly collected article, which would not represent their true level, we limit the selection to authors who have published at least 5 articles.
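A sketch of that calculation, assuming the author and loves columns and applying the 5-article threshold as a filter:

# Average collections per author, limited to authors with at least 5 articles
author_stats = data.groupby('author')['loves'].agg(['count', 'mean'])
author_stats = author_stats[author_stats['count'] >= 5]
print(author_stats.sort_values('mean', ascending=False).head(10))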

Comparing this chart with the earlier ranking by number of posts, we find that the authors here are not on that list; quality may matter more than quantity.

4.5. Articles with the most comments TOP10

Having finished with collections, let's now look at the articles with the most comments.

We can see that most of these are beginner-level product topics, and that articles with many comments also tend to have many collections, so we explore the relationship between the two further.

We find that for most articles, both the number of comments and the number of collections are very small.
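The scatter plot behind that observation could be drawn roughly like this (a sketch using seaborn, which is already imported in the cleaning script):

import seaborn as sns
import matplotlib.pyplot as plt

# Relationship between comment count and collection count
sns.scatterplot(x='comment_num', y='loves', data=data)
plt.xlabel('comments')
plt.ylabel('collections')
plt.show()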

4.6. Article title length

Next, let's see if there is any relationship between the length of the title and the amount of reading.

We can see that view counts are generally higher when the title length is around 20 characters.
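A sketch of this comparison, using the title_length column added during cleaning and the numeric view count views_num:

import matplotlib.pyplot as plt

# Title length versus view count
plt.scatter(data['title_length'], data['views_num'], alpha=0.3)
plt.xlabel('title length')
plt.ylabel('views')
plt.show()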

4.7. Text analysis

Finally, let's look at what product managers are reading, based on the body text of these articles.
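The word-segmentation and word-cloud code is not shown in the original; a minimal sketch with jieba and WordCloud, where the font path is a placeholder you would need to replace with a font that supports Chinese, could be:

import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join all article bodies, segment them with jieba, and draw a word cloud
text = ' '.join(jieba.cut(' '.join(data['art'].astype(str))))
wc = WordCloud(font_path='simhei.ttf',  # placeholder: path to a Chinese-capable font
               width=800, height=400, background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()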

Thank you for reading! This concludes the article on how to use Python to crawl articles from the Everyone Is a Product Manager website. I hope the content above is helpful and that you learn something from it. If you think the article is good, please share it so more people can see it!
