
How to crawl Douyin comment data with Python

Updated 2025-01-15, from SLTechnology News&Howtos (shulou)



This article shows how to crawl Douyin comment data with Python and then analyze it. I hope you find it useful.

1. Crawl the data

Douyin now has a web version, which makes crawling data much easier.

Focus on the comments

Scroll down to the comment section of the web page, filter the browser's network requests for those containing "comment", and keep loading more comments until the comment interface shows up in the request list.

Once you have the interface, you can write a Python program that simulates the request and retrieves the comment data.

Space the requests out at a fixed interval so that they do not overload the service and affect other users.

There are two points to watch when crawling comment data:

Sometimes the interface returns empty data, so you need to retry several times. In general, once the slider verification has been completed manually, the interface becomes usable again.

Data on different pages may overlap, so requests need to skip ahead across pages.
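
The points above (a request interval, retries on empty responses, and overlapping pages) can be sketched as follows. This is only a minimal sketch under assumptions: `fetch_page(cursor)` is a hypothetical callable wrapping whatever comment request you captured in the browser, and the `cursor` paging parameter and the `cid` id field are illustrative names, not the real interface. Deduplicating by id is one simple way to cope with overlapping pages; it is not necessarily what the author did.

```python
import time
from typing import Callable, Dict, List


def crawl_comments(
    fetch_page: Callable[[int], List[Dict]],  # hypothetical wrapper around the captured request
    page_size: int = 20,
    max_pages: int = 5,
    delay: float = 1.0,
    retries: int = 3,
) -> List[Dict]:
    """Fetch pages politely, retry empty responses, and drop duplicate comments."""
    seen_ids = set()
    comments = []
    for page in range(max_pages):
        cursor = page * page_size  # assumed offset-style paging
        batch = []
        for _ in range(retries):
            batch = fetch_page(cursor)
            if batch:
                break
            time.sleep(delay)  # empty response: wait, then try again
        for item in batch:
            cid = item["cid"]  # assumed unique comment id
            if cid not in seen_ids:
                seen_ids.add(cid)
                comments.append(item)
        time.sleep(delay)  # interval between pages to avoid excessive requests
    return comments
```

If every retry for a page comes back empty, the sketch simply moves on to the next page; at that point you may need to complete the slider verification manually before running it again.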

2. EDA

The video posted on November 17 had about 120,000 comments, but I only crawled a little over 10,000 of them.

The text column holds the comment content.

First, do some exploratory analysis of the data. I have previously introduced several EDA tools that automatically output basic statistics and charts.

This time I use ProfileReport.

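    # EDA
    # Assumes the ydata-profiling package (formerly pandas-profiling)
    # and a DataFrame df that holds the crawled comments.
    from ydata_profiling import ProfileReport

    profile = ProfileReport(df, title='Zhang Douyin comment data', explorative=True)
    profile  # in a notebook, this renders the report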

Comment time distribution

The video was released on the 17th, so the 17th and 18th saw the largest number of comments. Even so, new comments kept arriving as late as December 9, which shows that the video was very popular.

Length distribution of comments

Most comments are under 20 characters and almost none exceed 40, so they are all short texts.

Commenter identity

99.8% of the commenters were unverified accounts, which shows that the vast majority were ordinary users.
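
The three statistics above come from the profiling report, but they can also be computed by hand. The following is a minimal sketch, assuming each comment is a dict with hypothetical fields `create_time` (Unix seconds), `text`, and `is_verified`; the real field names depend on the data you crawled. The 20 and 40 character cut-offs follow the text.

```python
from collections import Counter
from datetime import datetime, timezone


def summarize_comments(comments):
    """Return per-day counts, length buckets, and the share of unverified commenters."""
    per_day = Counter()
    length_buckets = Counter()
    unverified = 0
    for c in comments:
        # Day boundaries are taken in UTC here; adjust the timezone as needed.
        day = datetime.fromtimestamp(c["create_time"], tz=timezone.utc).date()
        per_day[day.isoformat()] += 1
        n = len(c["text"])
        bucket = "<20" if n < 20 else "20-39" if n < 40 else ">=40"
        length_buckets[bucket] += 1
        if not c["is_verified"]:
            unverified += 1
    share = unverified / len(comments) if comments else 0.0
    return per_day, length_buckets, share
```

With a pandas DataFrame, the length part alone could be obtained with `df['text'].str.len()`.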

3. LDA

The statistics above are still too coarse. To find out what people are actually interested in, though, reading all 12,000 comments one by one is not feasible.

So we first need to classify the comments, which amounts to raising the data to a higher level of abstraction. Only by abstracting the data and understanding the meaning and share of each dimension can we grasp it from a global point of view.

Here I use the LDA algorithm to cluster the text; comments that are grouped together can be regarded as belonging to the same topic.

The core idea of the LDA algorithm comes down to two points:

Texts with a certain similarity are grouped together to form a topic. Each topic contains the words needed to generate it, together with the probability distribution of those words. From this, a person can infer what the topic is about.

Each document has a probability distribution over all topics, from which we can infer which topic the document belongs to.

For example, suppose that after clustering with LDA, words such as "war" and "military expenditure" appear in one topic with high probability; we can then label that topic "military". If a document has a high probability of belonging to the military topic, we can classify it as military.
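
The military example can be made concrete with a toy illustration. All numbers below are made up, and only a few words per topic are shown; the point is just how the two distributions are read: the most probable words suggest a label for each topic, and the most probable topic decides which category a document falls into.

```python
# Hypothetical output of a 2-topic model: word probabilities per topic (partial).
topic_words = {
    0: {"war": 0.30, "military": 0.25, "expenditure": 0.20, "team": 0.05},
    1: {"match": 0.35, "team": 0.30, "goal": 0.20, "war": 0.02},
}
# Labels are assigned by a person after looking at the top words.
topic_labels = {0: "military", 1: "sports"}


def top_words(word_probs, n=3):
    """Return the n most probable words of one topic."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def assign_topic(doc_topic_probs):
    """Return the index of the most probable topic for one document."""
    return max(range(len(doc_topic_probs)), key=lambda i: doc_topic_probs[i])


# Hypothetical topic distribution of a single document.
doc_topic = [0.85, 0.15]
label = topic_labels[assign_topic(doc_topic)]
```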

After a brief introduction to the theory of LDA, let's do some practical work.

    # Word segmentation
    import jieba

    # Emoji names to filter out. The entries below are English renderings;
    # in practice the set must hold the names exactly as they appear in the
    # comment text, otherwise nothing will match.
    emoji = {
        'pitiful', 'dazed', 'dizzy', 'brainstorming', 'high-five',
        'send off the heart', 'burst into tears', 'yawn', 'lick the screen',
        'snicker', 'happy', 'goodbye', '666', 'Xiong Ji', 'laugh',
        'tongue sticking out', 'lip-tilting', 'look', 'green hat',
        'cover your face', 'stupid innocent', 'strong', 'shocked', 'sinister',
        'never', 'give force', 'hit face', 'coffee', 'bad', 'cheer together',
        'cool drag', 'tears', 'black face', 'love', 'laugh and cry', 'witty',
        'sleepy', 'smile kangaroo', 'strong', 'shut up', 'come and see me',
        'Color', 'smirk', 'polite smile', 'Red face', 'nose picking',
        'naughty', "crape myrtle don't go", 'like', 'Bixin', 'leisurely',
        'Rose', 'hold fist', 'Little applause', 'handshake', 'smirk', 'shy',
        'crying soon', 'Shh', 'surprise', "pig's head", 'vomit',
        'secret observation', 'no look', 'beer', 'bared teeth', 'anger',
        'desperate gaze', 'laugh', 'spit blood', 'bad smile', 'gaze', 'cute',
        'hug', 'wipe sweat', 'applause', 'victory', 'thank you', 'think',
        'smile', 'question',
    }

    # One stop word per line
    with open('stop_words.txt', encoding='UTF-8') as f:
        stopwords = {line.strip() for line in f}

    def fen_ci(text):
        """Segment a comment and drop stop words, emoji names, and brackets."""
        res = []
        for word in jieba.cut(text):
            if word in stopwords or word in emoji or word in ['[', ']']:
                continue
            res.append(word)
        return ' '.join(res)

    df['text_wd'] = df['text'].apply(fen_ci)

Because the comments contain many emoji, I extracted the text names of all the emoji and collected them into an emoji set that is used to filter out emoji words.
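
The article does not show how the emoji names were extracted. One possible approach, sketched below, assumes that emoji appear in the comment text as bracketed names such as `[name]`; this is suggested by the segmentation code also dropping the `[` and `]` tokens, but it remains an assumption. The function name `collect_emoji_names` is made up for this example.

```python
import re

# Matches a bracketed name such as "[laugh]" and captures the name itself.
EMOJI_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def collect_emoji_names(texts):
    """Collect the distinct bracketed emoji names found in the given texts."""
    names = set()
    for text in texts:
        names.update(EMOJI_PATTERN.findall(text))
    return names
```

It is worth reviewing the resulting set by hand, since users can also type ordinary text inside square brackets.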

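    # Run LDA
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    import numpy as np

    def run_lda(corpus, k):
        """Vectorize the segmented comments and fit an LDA model with k topics."""
        # The token_pattern value was garbled in the source; r'\w+' (which
        # keeps single-character tokens) is assumed here.
        cntvec = CountVectorizer(min_df=2, token_pattern=r'\w+')
        cnttf = cntvec.fit_transform(corpus)      # document-term count matrix
        lda = LatentDirichletAllocation(n_components=k)
        docres = lda.fit_transform(cnttf)         # per-document topic distribution
        return cntvec, cnttf, docres, lda

    cntvec, cnttf, docres, lda = run_lda(df['text_wd'].values, 8)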

After several experiments, splitting the data into 8 topics gave the best results.

Select the 20 most probable words under each topic:

Word distribution of the topic

From the probability distributions of these words, I summarized a category for each topic. Topics 0 to 7 are: unexpectedly, knowing where the key is, rural life, feeding the dog, shooting techniques, locking the door, putting a lot of salt on the eggs, and putting socks under the pillow.
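
Picking the most probable words per topic, as described above, can be sketched as follows. The helper name `top_words_per_topic` is made up for this example; it works on any sequence of per-topic weight rows, so it does not depend on a particular library.

```python
def top_words_per_topic(components, vocab, n=20):
    """Return the n highest-weighted words for each topic.

    components: one row of word weights per topic
    vocab: the words, in the same order as the columns of components
    """
    result = []
    for weights in components:
        # Column indices sorted from the highest weight to the lowest
        order = sorted(range(len(vocab)), key=lambda j: weights[j], reverse=True)
        result.append([vocab[j] for j in order[:n]])
    return result
```

With the objects returned by the LDA step, a call might look like `top_words_per_topic(lda.components_, cntvec.get_feature_names_out(), n=20)`, where `get_feature_names_out` requires scikit-learn 1.0 or later. The rows of `lda.components_` are unnormalized weights, but ranking words by weight within a topic gives the same order as ranking them by probability.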

Compute the share of each topic:

Topic proportions

The red slice is topic 3 (feeding the dog), which has the largest share. Many people commented that they expected him to be cooking for himself and did not expect the food to be for the dog. I thought the same when I watched it.

The shares of the other topics are fairly even.
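
The article does not show how the topic shares were computed. One common approach, sketched here as an assumption, is to assign each comment to its most probable topic and then count the comments per topic. The helper name `topic_shares` is made up for this example.

```python
from collections import Counter


def topic_shares(doc_topic, k):
    """Assign each document to its most probable topic and return each topic's share."""
    counts = Counter(
        max(range(k), key=lambda t: row[t])  # index of the largest probability in the row
        for row in doc_topic
    )
    total = len(doc_topic)
    return [counts[t] / total if total else 0.0 for t in range(k)]
```

With the document-topic matrix from the LDA step, a call might look like `topic_shares(docres, 8)`.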

With the topics classified, we can see that it is not only Zhang's rural life that attracted attention, but also the many unusual shots in the video.

Finally, a tree diagram is used to show each topic and its corresponding comments.

After reading this article, you should have a basic understanding of how to crawl Douyin comment data with Python. Thank you for reading!
