
How to use Python to scrape 4,027 Maimai posts


Today I will talk about how to use Python to scrape 4,027 posts from Maimai, something many people may not know much about. To help you understand it better, I have summarized the following content; I hope you get something out of this article.

Maimai is a real-name workplace social platform. A while ago I scraped about 4,027 posts from Maimai's workplace-gossip section. This article describes the scraping process in detail and then does a simple visual analysis of the posts. There used to be many articles on this topic, but most of them now return 404.

The crawler

As usual, the crawler is written in Python. If you are not interested in crawlers, you can skip this section without losing the thread.

The website is https://maimai.cn/gossip_list.

You need to log in before you can see the content. The crawl target:

Only the text of each post; images are not considered.

Press F12 in the browser to open the developer tools, scroll down the page, and in the Network tab you will see a series of JSON requests whose names start with gossip (refresh the page if they do not appear).

Right-click one and open it in a new tab: inside is one page of records, and each record's text field holds the post body.

The fields we are interested in are the following:

Look at the URL of each request: it ends with page= followed by a number, so the crawler can simply loop over page numbers, starting from 1 and counting up.

https://maimai.cn/sdk/web/gossip_list?u=206793936&channel=www&version=4.0.0&_csrf=coAlLvgS-UogpI75vEgHk4O1OQivF2ofLce4&access_token=1.9ff1c9df8547b2b2c62bf58b28e84b97&uid=%22MRlTFjf812rF62rOeDhC6vAirs3A3wL6ApgZu%2Fo1crA%3D%22&token=%22rE8q1xp6fZlxvwygWJn1UFDjrmMXDrSE2tc6uDKNIDZtRErng0FRwvduckWMwYzn8CKuzcDfAvoCmBm7%2BjVysA%3D%3D%22&page=1&jsononly=1

At the start of each JSON response there are two fields, total and remain, which give the total number of visible posts and the number remaining; these can be used as the stop condition for the loop.
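As a minimal probe of those two fields (assuming the URL above has been turned into a template string comment_api, with '{}' substituted for the page number and your own session tokens filled in):

import json
import requests

# assumes comment_api is the gossip_list URL template above,
# with '{}' in place of the page number and your own tokens
resp = requests.get(comment_api.format(1))
payload = json.loads(resp.text)
print(payload['total'], payload['remain'])  # loop pages until remain reaches 0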

The snag is that you cannot see all the posts at once, and the feed is constantly refreshed, so after looping through a number of pages (or retrying too many times) the API starts responding with a prompt:

That hint is fine when you are browsing the feed yourself, but it is not crawler-friendly, so you need an if check to detect it.

In addition, requesting too fast causes errors, so remember to add a time.sleep between requests.

Having stepped in just about every pitfall there is: even when everything goes well, you can only fetch a few hundred posts per run. To collect more, wait a while until the feed has largely refreshed and run the crawler again. The code is as follows.

# -*- coding: utf-8 -*-
"""Created on Fri Oct 19 18:50:03 2018"""
import json
import time

import pandas as pd
import requests
from fake_useragent import UserAgent

# gossip_list endpoint; the _csrf / access_token / token values are
# session-specific and must be replaced with your own after logging in
comment_api = 'https://maimai.cn/sdk/web/gossip_list?u=206793936&channel=www&version=4.0.0&_csrf=FfHZIyBb-H4LEs35NcyhyoAvRM7OkMRB0Jpo&access_token=1.0d4c87c687410a15810ee6304e1cd53b&uid=%22MRlTFjf812rF62rOeDhC6vAirs3A3wL6ApgZu%2Fo1crA%3D%22&token=%22G7rGLEqmm1wY0HP4q%2BxpPFCDj%2BHqGJFm0mSa%2BxpqPg47egJdXL%2FriMlMlHuQj%2BgM8CKuzcDfAvoCmBm7%2BjVysA%3D%3D%22&page={}&jsononly=1'

# fields: author: poster  text: post body  cmts: number of comments
# circles_views: views  spreads: retweets  likes: likes  time: timestamp
headers = {"User-Agent": UserAgent(verify_ssl=False).random}

j = 0  # row index into the output DataFrame
k = 0  # page number

# send GET request for the first page
response_comment = requests.get(comment_api.format(0), headers=headers)
json_comment = json.loads(response_comment.text)

num = json_comment['total']
cols = ['author', 'text', 'cmts', 'likes', 'circles_views', 'spreads', 'time']
dataall = pd.DataFrame(index=range(num), columns=cols)
remain = json_comment['remain']
print(remain)

while remain != 0:  # total/remain from the response drive the stop condition
    n = json_comment['count']
    for i in range(n):
        # skip the "already seen, click here to refresh" placeholder
        # (match the exact prompt text the API actually returns)
        if json_comment['data'][i]['text'] != '下面内容已经看过了，点此刷新':
            dataall.loc[j, 'author'] = json_comment['data'][i]['author']
            dataall.loc[j, 'text'] = json_comment['data'][i]['text']
            dataall.loc[j, 'cmts'] = json_comment['data'][i]['cmts']
            dataall.loc[j, 'likes'] = json_comment['data'][i]['likes']
            dataall.loc[j, 'circles_views'] = json_comment['data'][i]['circles_views']
            dataall.loc[j, 'spreads'] = json_comment['data'][i]['spreads']
            dataall.loc[j, 'time'] = json_comment['data'][i]['time']
            j += 1
        else:
            k = -1  # prompt hit: restart pagination on the next request
            break
    k += 1
    comment_api1 = comment_api.format(k)
    response_comment = requests.get(comment_api1, headers=headers)
    json_comment = json.loads(response_comment.text)
    remain = json_comment['remain']
    print('已完成 {}% !'.format(round(j / num * 100, 2)))
    time.sleep(3)  # throttle: requesting too fast triggers errors

dataall = dataall.dropna()
dataall = dataall.drop_duplicates()
dataall.to_csv('data_20181216_part3.csv', index=False)
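One detail worth pointing out: when the refresh prompt is detected, k is set to -1 so that the following k += 1 restarts pagination from page 0, letting the crawler pick up whatever new posts have appeared in the meantime; dropna and drop_duplicates then clean out the unfilled rows and the reposts collected across restarts.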

Data visualization

After running the crawler on and off and merging the resulting files, we ended up with 4,027 records in the following format:

Next comes some simple analysis of the scraped data. Since this is not the full set of posts but only a small sample, the results are necessarily biased; however, the scraping times were scattered randomly over more than two weeks, so the sample is reasonably random and representative.

Posters on Maimai come in two types: fully anonymous ones using system-generated nicknames, and ones displayed as "employee of XX". We counted both the number of users of each type and the number of posts in the sample. Among the 4,027 posts, there were 1,100 distinct posters.

More than 70% of posters are anonymous; few are willing to speak under their real identity. After all, the risk of being doxxed by your company or school is still very high.

The post counts are no surprise either: anonymous posters contributed more than 85% of the posts.

For anonymous posters no further details are available, but for the non-anonymous ones we can extract their company and aggregate post counts by company, which serves as a rough overall estimate of how much each major company's employees post. The statistics account for inconsistent company names: Ant Financial and Alipay are merged into Alibaba, JD Finance into JD.com, and Jinri Toutiao and Douyin into ByteDance. The top 20 companies by post count are shown below.
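A minimal sketch of that normalization, assuming a hypothetical company column extracted from the non-anonymous display names (the alias map mirrors the merges described above):

# hypothetical 'company' column extracted from non-anonymous display names
aliases = {'蚂蚁金服': '阿里巴巴',   # Ant Financial -> Alibaba
           '支付宝': '阿里巴巴',     # Alipay -> Alibaba
           '京东金融': '京东',       # JD Finance -> JD.com
           '今日头条': '字节跳动',   # Jinri Toutiao -> ByteDance
           '抖音': '字节跳动'}       # Douyin -> ByteDance

dataall['company'] = dataall['company'].replace(aliases)
top20 = dataall['company'].value_counts().head(20)
print(top20)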

Most of the posters come from Internet companies; finance, real estate and other industries are comparatively rare.

Text analysis

Retweets, comments and likes are hard to compare directly because the posts were scraped at different times. Instead, here are the five posts with the most comments, showing which topics people are most willing to join in on (a sketch of how they were selected follows the list).


Summarize your 2018 in one word. (1659 comments)

[Re-employment help post] I am a programmer with 3 years' experience who was just "optimized out" of Zhihu. I would rather go to BAT or another big company; I hope HRs with verified company accounts will reply with real open headcount, and I wish my fellow brothers luck finding new jobs. (610 comments)

Summarize your current job in two words. (477 comments)

NetEase raised salaries by 50% this year. Is the company that rich? (458 comments)

Summarize your work in two words. (415 comments)
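A minimal sketch of how the top five can be selected, assuming the dataall DataFrame produced by the crawler above:

import pandas as pd

# 'cmts' was filled into an object column, so coerce it before sorting
dataall['cmts'] = pd.to_numeric(dataall['cmts'])
top5 = dataall.sort_values('cmts', ascending=False).head(5)
print(top5[['text', 'cmts']])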

These are interesting questions. Let's scrape all the comments under posts 1, 3 and 5 and build word clouds to see what everyone is talking about.

Summarize your 2018 in one word

The crawler works much as before, find the JSON endpoint, except that here all the comments are reachable.

# -*- coding: utf-8 -*-
"""Created on Fri Oct 19 18:50:03 2018"""
import json
import time

import pandas as pd
import requests
from fake_useragent import UserAgent

# per-post comment endpoint (gid selects the post); tokens are session-specific
comment_api = 'https://maimai.cn/sdk/web/gossip/getcmts?gid=18606987&page={}&count=50&hotcmts_limit_count=1&u=206793936&channel=www&version=4.0.0&_csrf=38244DlN-X0iNIk6A4seLXFx6hz3Ds6wfQ0Y&access_token=1.9ff1c9df8547b2b2c62bf58b28e84b97&uid=%22MRlTFjf812rF62rOeDhC6vAirs3A3wL6ApgZu%2Fo1crA%3D%22&token=%22rE8q1xp6fZlxvwygWJn1UFDjrmMXDrSE2tc6uDKNIDZtRErng0FRwvduckWMwYzn8CKuzcDfAvoCmBm7%2BjVysA%3D%3D%22'

# fields: author: commenter  text: comment body
headers = {"User-Agent": UserAgent(verify_ssl=False).random}

j = 0  # row index
k = 0  # page number

# send GET request for the first page
response_comment = requests.get(comment_api.format(0), headers=headers)
json_comment = json.loads(response_comment.text)
num = json_comment['total']
cols = ['author', 'text']
dataall = pd.DataFrame(index=range(num), columns=cols)

while j < num:
    n = json_comment['count']
    for i in range(n):
        dataall.loc[j, 'author'] = json_comment['comments'][i]['name']
        dataall.loc[j, 'text'] = json_comment['comments'][i]['text']
        j += 1
    k += 1
    comment_api1 = comment_api.format(k)
    response_comment = requests.get(comment_api1, headers=headers)
    json_comment = json.loads(response_comment.text)
    print('已完成 {}% !'.format(round(j / num * 100, 2)))
    time.sleep(3)

dataall.to_excel('用一个字概括你的2018年.xlsx')

After scraping, comments longer than one character were dropped, word sizes were set by word frequency, and the word cloud came out as follows.
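A minimal sketch of that word-cloud step with the wordcloud package, assuming the comment DataFrame from the crawler above (the font path and output file name are placeholders; point font_path at any local Chinese font):

from collections import Counter

from wordcloud import WordCloud

# keep only the one-character answers, then size words by frequency
answers = [t for t in dataall['text'].dropna().astype(str) if len(t) == 1]
freq = Counter(answers)

wc = WordCloud(font_path='simhei.ttf',  # placeholder: any local Chinese font
               background_color='white', width=800, height=600)
wc.generate_from_frequencies(freq)
wc.to_file('wordcloud_2018.png')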

Summarize your current job in two words | Summarize your work in two words

Questions 3 and 5 are handled the same way: scrape both sets of comments and merge them before analysis. The code is not repeated; just find the new JSON address and substitute it into the code above. The comments under any topic can be scraped this way. Comments that are not exactly two characters are removed, and the word cloud is again sized by word frequency.

SnowNLP was used for sentiment analysis of the posts. Of the 4,027 posts, 2,196 were classified as positive and 1,831 as negative.
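A minimal sketch of that step, assuming the dataall DataFrame from the crawler; SnowNLP's sentiments returns a probability in [0, 1] of the text being positive, and the 0.5 cut-off is an assumption, not stated in the original:

from snownlp import SnowNLP

# sentiment score in [0, 1]; > 0.5 treated as positive (assumed threshold)
dataall['sentiment'] = dataall['text'].astype(str).apply(
    lambda t: SnowNLP(t).sentiments)
dataall['label'] = dataall['sentiment'].apply(
    lambda s: 'positive' if s > 0.5 else 'negative')
print(dataall['label'].value_counts())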

Positive examples:

Negative examples:

The model judges the sentiment of most posts correctly, with only a small number of mistakes.

Finally, extract keywords from all the posts and make a closing word cloud.
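A minimal sketch of keyword extraction using jieba's TF-IDF interface (jieba is an assumption; the original does not name the extractor):

import jieba.analyse

# top 100 keywords by TF-IDF weight across all post text
all_text = ' '.join(dataall['text'].dropna().astype(str))
keywords = jieba.analyse.extract_tags(all_text, topK=100, withWeight=True)
freq = dict(keywords)
# feed `freq` to WordCloud.generate_from_frequencies as in the sketch above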

After reading the above, do you have a better understanding of how to scrape 4,027 Maimai posts with Python? If you want to learn more, please follow the industry information channel. Thank you for your support.
