
How to use Python to analyze 440000 pieces of data

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article shows how to use Python to analyze 440,000 pieces of data. I find it very practical, so I am sharing it with you, and I hope you get something out of it.

There is a joke that "a ten-year veteran copywriter is no match for the NetEase comment section, where NetEase writers are everywhere and every comment roasts single people." The comment section of NetEase Cloud Music has always been a gathering place for copywriting talents of all kinds.

So how exactly does an ordinary user become a hot commenter on NetEase Cloud Music?

Let me analyze it.

Get data

The logic is not complicated:

Crawl all the playlist URLs from the playlist pages.

Enter each playlist and crawl all the song URLs, then deduplicate.

Enter each song's page, crawl the hot comments, and aggregate them.

The playlist list is as follows:

Turn the pages and observe how the URL changes: the offset parameter at the end increases by 35 with each page.

Use requests + pyquery to crawl.
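Before writing the crawler itself, the paging scheme above can be sketched in a few lines. This is a self-contained illustration, assuming the page count (38) and the base URL given in the article:

```python
# The trailing `offset` grows by 35 per page (35 playlists per page).
BASE = ('https://music.163.com/discover/playlist/'
        '?order=hot&cat=%E5%8D%8E%E8%AF%AD&limit=35&offset=')

def playlist_page_urls(pages=38, per_page=35):
    """Build the URL of every playlist-list page."""
    return [BASE + str(i * per_page) for i in range(pages)]

urls = playlist_page_urls()  # urls[0] ends with offset=0, urls[1] with offset=35
```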

```python
import random
import time

import requests
from pyquery import PyQuery as pq

def get_list():
    list1 = []
    for i in range(0, 1295, 35):  # page offsets; the article reports 38 pages in total
        url = ('https://music.163.com/discover/playlist/'
               '?order=hot&cat=%E5%8D%8E%E8%AF%AD&limit=35&offset=' + str(i))
        print('successfully collected playlist page %i\n' % (i // 35 + 1))
        data = []
        html = requests.get(url).text  # the fetch helper's name is garbled in the source
        doc = pq(html)
        for j in range(1, 36):  # 35 playlists per page
            a = doc('#m-pl-container > li:nth-child(' + str(j) + ') > div > a').attr('href')
            a1 = 'https://music.163.com/api' + a.replace('?', '/detail?')
            data.append(a1)
        list1.extend(data)
        time.sleep(5 + random.random())  # be polite between pages
    return list1
```

In this way, we get 38 pages with 35 playlists per page, more than 1,300 playlists in total.

Next we need to enter each playlist and crawl all the song URLs, paying attention to the final deduplication: different playlists may contain the same song.

Click on a playlist and pay attention to the id circled in red.

The information we need, circled in red at the bottom of each playlist page, can be requested by combining the playlist id we just crawled with NetEase Cloud Music's API (detailed in the next article):

Since the raw response is not convenient to read, let's parse the JSON.

```python
import json

import jsonpath

def get_playlist(url):
    data = []
    doc = get_json(url)
    obj = json.loads(doc)
    jobs = obj['result']['tracks']
    for job in jobs:
        dic = {}
        dic['name'] = jsonpath.jsonpath(job, '$..name')[0]  # song name
        dic['id'] = jsonpath.jsonpath(job, '$..id')[0]      # song ID
        data.append(dic)
    return data
```
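The playlist parser above relies on a `get_json` fetch helper (never shown in the article) and the `jsonpath` library. As a self-contained sketch of the same field extraction, using plain dict access on an invented sample that follows the response shape the article describes:

```python
import json

# Invented, trimmed sample of the playlist-detail response; real responses
# carry many more fields per track.
sample = json.dumps({
    "result": {"tracks": [
        {"name": "Song A", "id": 111},
        {"name": "Song B", "id": 222},
    ]}
})

def parse_tracks(doc):
    """Equivalent of get_playlist's loop, with plain dict access."""
    obj = json.loads(doc)
    return [{"name": t["name"], "id": t["id"]} for t in obj["result"]["tracks"]]

tracks = parse_tracks(sample)
```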

In this way we get all the songs in the playlists; remember to deduplicate them.

```python
# deduplicate (note: with inplace=True, drop_duplicates returns None,
# so do not assign its result back; assign the non-inplace form instead)
data = data.drop_duplicates(subset=None, keep='first')
```
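A toy demo of this step (pandas assumed, as in the rest of the article): the same song collected from two playlists collapses to one row.

```python
import pandas as pd

# Two playlists both contained Song A; keep only the first occurrence.
data = pd.DataFrame({'name': ['Song A', 'Song B', 'Song A'],
                     'id': [111, 222, 111]})
deduped = data.drop_duplicates(subset=None, keep='first')
```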

What remains is to get the hot comments for each song. As with the songs, the request can be constructed from the API and is easy to find.

```python
def get_comments(url, k):
    data = []
    doc = get_json(url)
    obj = json.loads(doc)
    jobs = obj['hotComments']
    for job in jobs:
        dic = {}
        dic['content'] = jsonpath.jsonpath(job, '$..content')[0]
        dic['time'] = stampToTime(jsonpath.jsonpath(job, '$..time')[0])
        dic['userId'] = jsonpath.jsonpath(job['user'], '$..userId')[0]      # user ID
        dic['nickname'] = jsonpath.jsonpath(job['user'], '$..nickname')[0]  # username
        dic['likedCount'] = jsonpath.jsonpath(job, '$..likedCount')[0]
        dic['name'] = k
        data.append(dic)
    return data
```
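`get_comments` calls a `stampToTime` helper that the article never shows. A minimal sketch, assuming the comment timestamps are Unix epoch milliseconds (NetEase returns them as 13-digit numbers):

```python
from datetime import datetime

def stampToTime(stamp):
    """Convert an epoch-milliseconds timestamp to a readable local-time string."""
    return datetime.fromtimestamp(stamp / 1000).strftime('%Y-%m-%d %H:%M:%S')

example = stampToTime(1500000000000)  # a date in July 2017, local time
```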

After aggregating everything, we end up with 440,000 hot-comment records.

Data analysis

First, clean the data and fill in missing values.

```python
def data_cleaning(data):
    cols = data.columns
    for col in cols:
        if data[col].dtype == 'object':
            data[col].fillna('missing data', inplace=True)
        else:
            data.fillna(0, inplace=True)
    return data
```

Sort by the number of likes.

```python
# sort by likes
df1['likedCount'] = df1['likedCount'].astype('int')
df_2 = df1.sort_values(by='likedCount', ascending=False)
df_2.head()
```

Let's take a look at which hot comments get copied, pasted, and reposted the most.

```python
# count duplicate comments
df_line = df.groupby(['content']).count().reset_index().sort_values(by='name', ascending=False)
df_line.head()
```

The first and third entries differ only in the final punctuation, so they can be counted as the same comment. In that case, the most-repeated sentence appears as many as 412 times.

Which user has the most hot comments, and what can we learn from them?

```python
# count hot comments per user
df_user = df.groupby(['userId']).count().reset_index().sort_values(by='name', ascending=False)
df_user.head()
```

Group by userId, count, and sort.

We successfully "capture" a champion whose hot-comment count is as high as 347. Let's see what this master has been commenting.

```python
# note: the comparison needs ==, not =
df_user_max = df.loc[df['userId'] == 101]
df_user_max.head()
```

This "Mr. Chen with insomnia" seems adept at all kinds of love talk. Let's take him as an example and see how to become a hot commenter on NetEase Cloud Music.

Data visualization

Let's look at the like-count distribution of these 347 comments.

```python
# distribution of likes
import matplotlib.pyplot as plt

data = df_user_max['likedCount']
# data.to_csv("df_user_max.csv", index_label="index_label", encoding='utf-8-sig')
plt.hist(data, 100, density=True, facecolor='g', alpha=0.9)  # `normed` was removed in newer matplotlib
plt.show()
```

Clearly the like counts are not high: most comments have fewer than 500 likes, yet a few hundred likes is enough to make the hot-comment list. This suggests these songs are relatively niche, and that he casts a wide net in the new-release section.

We use len() to get the string length of each comment, then plot the distribution.
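A toy sketch of the measurement (the column name follows the crawler above; the real frame has 440,000 rows):

```python
import pandas as pd

# Invented stand-in comments; len() counts characters per comment.
df = pd.DataFrame({'content': ['short one',
                               'a somewhat longer hot comment here']})
df['length'] = df['content'].apply(len)
```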

The comment lengths concentrate between 18 and 30 characters, which means you should watch your word count when leaving a message: the safe practice is not so long that people won't read it, and not so short that it can't be quotable.

Make a word cloud.
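A word cloud needs word frequencies first. The article does not show this step; for Chinese comments one would segment with a tokenizer such as jieba before counting (an assumption, not shown in the source). A stdlib-only sketch of the counting stage, on invented English stand-ins:

```python
from collections import Counter

# Invented stand-in comments; real data would be segmented first (e.g. jieba),
# whereas here a simple whitespace split suffices.
comments = ['she let go', 'regret and sadness', 'let go of her']
freq = Counter(word for c in comments for word in c.split())
```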

It can be seen that his comments typically start from the "feeling" a song evokes; the subject is usually a "girl he likes," often referred to as "she"; the emotions he leans on are "regret" and "sadness," and they end with "letting go."

The above is how to use Python to analyze 440,000 pieces of data. There are knowledge points here that you may see or use in daily work; I hope you can learn more from this article.
