
How to Crawl 4,027 Maimai Posts with Python


This article shows how to crawl 4,027 Maimai posts with Python. The approach is quite practical, so it is shared here in the hope that you will get something out of it. Without further ado, let's take a look.

Maimai is a real-name workplace social platform. About 4,027 posts were crawled from Maimai's "Gossip" section before writing this up. This article describes the crawling process in detail, and the visual analysis covers the post content only.

Crawler

Crawl target: only the text content of each post; images are not considered.

Press F12 in the browser to open the developer tools, scroll down the page, and you will see many JSON requests whose names start with gossip (refresh the page if they do not show up).

Right-click one of them and open it in a new tab: each entry is one record, and the text field holds the post content.

The information we are interested in is the following:

Look at the URL of each request: it ends with a page= number, so the crawler simply loops over that page number, incrementing it one page at a time.

https://maimai.cn/sdk/web/gossip_list?u=206793936&channel=www&version=4.0.0&_csrf=coAlLvgS-UogpI75vEgHk4O1OQivF2ofLce4&access_token=1.9ff1c9df8547b2b2c62bf58b28e84b97&uid=%22MRlTFjf812rF62rOeDhC6vAirs3A3wL6ApgZu%2Fo1crA%3D%22&token=%22rE8q1xp6fZlxvwygWJn1UFDjrmMXDrSE2tc6uDKNIDZtRErng0FRwvduckWMwYzn8CKuzcDfAvoCmBm7%2BjVysA%3D%3D%22&page=1&jsononly=1

At the beginning of the JSON there are two fields, total and remain, which give the total number of visible posts and how many are left; remain can be used as the stop condition of the loop.
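Before writing the full loop, a quick sanity check of those two fields can look like this (a minimal sketch; the URL is a placeholder for the full gossip_list address, with your own tokens, copied from the developer tools):

import requests
import json

# placeholder: paste the full gossip_list URL from the developer tools,
# keeping only the trailing page number as a template
comment_api = 'https://maimai.cn/sdk/web/gossip_list?...&page={}&jsononly=1'

resp = requests.get(comment_api.format(1))
data = json.loads(resp.text)
print(data['total'], data['remain'])   # total visible posts and how many remain to be fetched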

The snag is that not all posts are visible at once and the feed keeps refreshing, so after crawling one page and moving on to the next, or after too many requests, the API starts returning a prompt:

That prompt is convenient when you browse the site yourself, but it is not crawler-friendly, so an if check is needed to skip it.

Having stepped on most of the pitfalls: even when everything goes smoothly you can only fetch a few hundred posts per run, and to collect more you have to wait a while until the feed has largely refreshed and crawl again. The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Fri Oct 19 18:50:03 2018
"""
import urllib
import requests
from fake_useragent import UserAgent
import json
import pandas as pd
import time
import datetime

# comment_api = 'https://maimai.cn/sdk/web/gossip_list?u=206793936&channel=www&version=4.0.0&_csrf=7ZRpwOSi-JHa7JrTECXLA8njznQZVbi7d4Uo&access_token=1.b7e3acc5ef86e51a78f3410f99aa642a&uid=%22MRlTFjf812rF62rOeDhC6vAirs3A3wL6ApgZu%2Fo1crA%3D%22&token=%22xoNo1TZ8k28e0JTNFqyxlxg%2BdL%2BY6jtoUjKZwE3ke2IZ919o%2FAUeOvcX2yA03CAx8CKuzcDfAvoCmBm7%2BjVysA%3D%3D%22&page={}&jsononly=1'

# send get request
comment_api = 'https://maimai.cn/sdk/web/gossip_list?u=206793936&channel=www&version=4.0.0&_csrf=FfHZIyBb-H4LEs35NcyhyoAvRM7OkMRB0Jpo&access_token=1.0d4c87c687410a15810ee6304e1cd53b&uid=%22MRlTFjf812rF62rOeDhC6vAirs3A3wL6ApgZu%2Fo1crA%3D%22&token=%22G7rGLEqmm1wY0HP4q%2BxpPFCDj%2BHqGJFm0mSa%2BxpqPg47egJdXL%2FriMlMlHuQj%2BgM8CKuzcDfAvoCmBm7%2BjVysA%3D%3D%22&page={}&jsononly=1'

"""
author:        poster
text:          post content
cmts:          number of comments
circles_views: number of views
spreads:       number of reposts
likes:         number of likes
time:          post time
"""

headers = {"User-Agent": UserAgent(verify_ssl=False).random}

j = 0   # number of records saved so far
k = 0   # current page number
response_comment = requests.get(comment_api.format(0), headers=headers)
json_comment = response_comment.text
json_comment = json.loads(json_comment)

num = json_comment['total']
cols = ['author', 'text', 'cmts', 'likes', 'circles_views', 'spreads', 'time']
dataall = pd.DataFrame(index=range(num), columns=cols)

remain = json_comment['remain']
print(remain)
while remain != 0:   # remain is the stop condition of the loop
    n = json_comment['count']
    for i in range(n):
        # skip the "already seen, click here to refresh" prompt (shown here in translated form)
        # and start over from page 0
        if json_comment['data'][i]['text'] != 'I have seen the following, click here to refresh':
            dataall.loc[j, 'author'] = json_comment['data'][i]['author']
            dataall.loc[j, 'text'] = json_comment['data'][i]['text']
            dataall.loc[j, 'cmts'] = json_comment['data'][i]['cmts']
            dataall.loc[j, 'likes'] = json_comment['data'][i]['likes']
            dataall.loc[j, 'circles_views'] = json_comment['data'][i]['circles_views']
            dataall.loc[j, 'spreads'] = json_comment['data'][i]['spreads']
            dataall.loc[j, 'time'] = json_comment['data'][i]['time']
            j += 1
        else:
            k = -1
            break
    k += 1
    comment_api1 = comment_api.format(k)
    response_comment = requests.get(comment_api1, headers=headers)
    json_comment = response_comment.text
    json_comment = json.loads(json_comment)
    remain = json_comment['remain']
    print('completed {}%!'.format(round(j / num * 100, 2)))
    time.sleep(3)

dataall = dataall.dropna()
dataall = dataall.drop_duplicates()
dataall.to_csv('data_20181216_part3.csv', index=False)

Data visualization

After crawling intermittently and deduplicating the resulting pile of files, we end up with 4,027 records in the following format:
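For reference, merging the separately crawled part files and deduplicating them might look like this (a sketch; the file-name pattern is an assumption based on the data_20181216_part3.csv name used above):

import glob
import pandas as pd

# merge all part files produced by the repeated crawling sessions
parts = [pd.read_csv(f) for f in glob.glob('data_*_part*.csv')]
dataall = pd.concat(parts, ignore_index=True)

# the same post may appear in several runs, so deduplicate on author and text
dataall = dataall.drop_duplicates(subset=['author', 'text'])
dataall.to_csv('data_all.csv', index=False)
print(len(dataall))   # around 4,027 records here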

Next, some simple analysis of the crawled data. Since this is only a small sample rather than the full set of posts, the results are necessarily biased; however, the crawl times were spread quite randomly over more than two weeks, so the sample is reasonably random and representative.

There are two types of posters on Maimai: completely anonymous users with system-generated nicknames, and users displayed as "employee of XX company". We count both the number of users of each type and the number of posts they contributed in the sample. The 4,027 posts were made by about 1,100 distinct posters.

More than 70% of posters are anonymous; few are willing to speak under their real identity, since the risk of being doxxed by their company or school is still quite high.

The post counts are no surprise either: anonymous posters contributed more than 85% of the posts.
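For reference, a rough way to produce these two counts from the crawled data (a sketch; the assumption that real-name authors contain '员工', i.e. "employee of ...", in the author field is illustrative and not stated in the article):

# split posters into anonymous vs. real-name accounts
is_named = dataall['author'].str.contains('员工', na=False)

posters = dataall.groupby(is_named)['author'].nunique()   # distinct posters per type
posts = dataall.groupby(is_named).size()                  # posts per type
print(posters)
print(posts)
print('anonymous share of posts: {:.1%}'.format(posts.get(False, 0) / len(dataall)))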

Anonymous posters yield no further details, but for the non-anonymous ones we can extract the company they belong to and aggregate post counts by company, which gives a rough overall estimate of how much the major companies post. The statistics account for inconsistent company names: Ant Financial and Alipay are merged into Alibaba, JD Finance into JD.com, and Toutiao and Douyin into ByteDance; the top 20 companies by post count are shown.
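A sketch of that normalization and ranking, assuming a hypothetical company column parsed from the real-name author strings:

# map inconsistent company names onto one canonical name each
alias = {
    '蚂蚁金服': '阿里巴巴', '支付宝': '阿里巴巴',   # Ant Financial / Alipay -> Alibaba
    '京东金融': '京东',                             # JD Finance -> JD.com
    '今日头条': '字节跳动', '抖音': '字节跳动',     # Toutiao / Douyin -> ByteDance
}
named = dataall[dataall['company'].notna()].copy()   # 'company' is a hypothetical column
named['company'] = named['company'].replace(alias)

top20 = named['company'].value_counts().head(20)   # post count per company, top 20
print(top20)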

Most posters come from Internet companies, while finance, real estate and other industries appear far less often.

Text analysis

Reposts, comments and likes are hard to compare directly because the crawl times differ. Instead, here are the five posts with the most comments, showing which topics people are most willing to join in on.

1. Summarize your 2018 in one word. (1659 comments)

2. [Job-hunting post] I am a Zhihu programmer with 3 years of experience who has just been "optimized" (laid off). I would rather go to BAT or another big company; I hope HR with verified company accounts will reply with real headcount openings. Wishing everyone luck finding new jobs.

3. Summarize your current job in two words.

4. NetEase raised salaries by 50% this year. Is the company that rich? (458 comments)

5. Summarize your work in two words. (415 comments)

Questions 1, 3 and 5 are all interesting. Let's crawl their comments and build word clouds to see what everyone is talking about.

Summarize your 2018 in one word

The crawling process is basically the same as above: find the JSON request. The difference is that here all the comments can be fetched.

# -*- coding: utf-8 -*-
"""
Created on Fri Oct 19 18:50:03 2018
"""
import urllib
import requests
from fake_useragent import UserAgent
import json
import pandas as pd
import time

comment_api = 'https://maimai.cn/sdk/web/gossip/getcmts?gid=18606987&page={}&count=50&hotcmts_limit_count=1&u=206793936&channel=www&version=4.0.0&_csrf=38244DlN-X0iNIk6A4seLXFx6hz3Ds6wfQ0Y&access_token=1.9ff1c9df8547b2b2c62bf58b28e84b97&uid=%22MRlTFjf812rF62rOeDhC6vAirs3A3wL6ApgZu%2Fo1crA%3D%22&token=%22rE8q1xp6fZlxvwygWJn1UFDjrmMXDrSE2tc6uDKNIDZtRErng0FRwvduckWMwYzn8CKuzcDfAvoCmBm7%2BjVysA%3D%3D%22'

"""
author: commenter
text:   comment content
"""

# headers = {"User-Agent": UserAgent(verify_ssl=False).random, 'Cookie': cookie}
headers = {"User-Agent": UserAgent(verify_ssl=False).random}

j = 0   # number of comments saved so far
k = 0   # current page number
response_comment = requests.get(comment_api.format(0), headers=headers)
json_comment = response_comment.text
json_comment = json.loads(json_comment)

num = json_comment['total']
cols = ['author', 'text']
dataall = pd.DataFrame(index=range(num), columns=cols)

while j < num:
    n = json_comment['count']
    for i in range(n):
        dataall.loc[j, 'author'] = json_comment['comments'][i]['name']
        dataall.loc[j, 'text'] = json_comment['comments'][i]['text']
        j += 1   # advance the record index, otherwise the while loop never ends
    k += 1
    comment_api1 = comment_api.format(k)
    response_comment = requests.get(comment_api1, headers=headers)
    json_comment = response_comment.text
    json_comment = json.loads(json_comment)
    print('completed {}%!'.format(round(j / num * 100, 2)))
    time.sleep(3)

dataall.to_excel('summarize your 2018 in one word.xlsx')

After crawling, comments longer than one character are removed, font size is determined by word frequency, and the resulting word cloud looks like this:
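The word-frequency word cloud step could be reproduced roughly as follows (a sketch using the wordcloud package; the font path and output file name are assumptions, and dataall here is the comment table produced by the code above):

from collections import Counter
from wordcloud import WordCloud

# keep only single-character answers ("summarize your 2018 in one word")
answers = dataall['text'].dropna().astype(str).str.strip()
answers = answers[answers.str.len() == 1]

freq = Counter(answers)   # word frequency determines the font size
wc = WordCloud(font_path='simhei.ttf',   # a CJK font is needed for Chinese characters
               background_color='white', width=800, height=600)
wc.generate_from_frequencies(freq).to_file('wordcloud_2018.png')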

Questions 3 and 5 ask the same thing, so their comments are crawled, merged and analysed together. The code is not repeated here; in fact, with the code above you only need to find the JSON address and substitute it, and all comments under any post can be crawled. Comments that are not exactly two characters are removed and a chart is drawn by word frequency:

SnowNLP was used for sentiment analysis of the posts. Of the 4,027 posts, 2,196 were classified as positive and 1,831 as negative.

Positive:

Negative:

The model judges the sentiment of most posts fairly accurately, with a small number misclassified.
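That sentiment split can be reproduced roughly as follows (a sketch; SnowNLP's sentiments score is a positive-probability between 0 and 1, and the 0.5 threshold is an assumption rather than something stated in the article):

from snownlp import SnowNLP

def sentiment(text):
    # SnowNLP gives the probability that a piece of Chinese text is positive (0 to 1)
    return SnowNLP(str(text)).sentiments

dataall['score'] = dataall['text'].apply(sentiment)
positive = int((dataall['score'] >= 0.5).sum())
negative = int((dataall['score'] < 0.5).sum())
print(positive, negative)   # the article reports 2,196 positive and 1,831 negative posts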

Finally, keywords are extracted from all the posts to make a closing word cloud:
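A minimal sketch of that step, assuming jieba's TF-IDF keyword extraction (the article does not say which extraction method was used):

import jieba.analyse
from wordcloud import WordCloud

all_text = ' '.join(dataall['text'].dropna().astype(str))

# extract weighted keywords from the full corpus (TF-IDF based)
keywords = jieba.analyse.extract_tags(all_text, topK=200, withWeight=True)
freq = dict(keywords)

WordCloud(font_path='simhei.ttf', background_color='white',
          width=800, height=600).generate_from_frequencies(freq).to_file('wordcloud_all.png')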

That is how 4,027 Maimai posts were crawled and analysed with Python. Some of the techniques covered here are things you may well run into in everyday work; hopefully this article gives you a bit more to take away.
