This article focuses on how to use Python to scrape the relevant data; interested readers may want to take a look. The method introduced here is simple, fast, and practical, so let's walk through how to scrape the data with Python.
Data crawling
Even the cleverest cook cannot make a meal without rice: the most important step before any data analysis is data acquisition. Therefore, I am going to use Python to crawl the short comments on Douban, together with the comment timestamps and the star ratings.
The data crawling mainly involves the following points:
1) About the page-turning operation
Page 1: https://movie.douban.com/subject/26413293/comments?status=P
Page 2: https://movie.douban.com/subject/26413293/comments?start=20&limit=20&status=P&sort=new_score
Page 3: https://movie.douban.com/subject/26413293/comments?start=40&limit=20&status=P&sort=new_score
Above are the links for pages 1-3. The main thing is to observe the pattern: start is the offset of the first comment to fetch, and limit is how many comments to fetch per page. The only difference between the three links is the value of start, so when we turn pages later we only need to modify the start parameter.
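As a minimal sketch of this page-turning logic (the subject id 26413293 and the page size of 20 come from the links above; the page_url helper itself is just illustrative), the paged URLs can be generated like this:

BASE = "https://movie.douban.com/subject/26413293/comments"

def page_url(page, limit=20):
    # Return the short-comment URL for a given page; `start` is the comment offset.
    start = page * limit
    return f"{BASE}?start={start}&limit={limit}&status=P&sort=new_score"

# The first three pages, matching the links shown above.
for p in range(3):
    print(page_url(p))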
2) Notes on anti-crawling measures
For Douban, finding the real short-comment link is actually very easy. But one thing must be said here: you can crawl without logging in, but only for a while; after some time Douban will detect that you are a crawler. Therefore, you should log in and carry your cookie when crawling the data. If you are not sure what to put in the request headers, just add everything and sort out what each field means later when you have time.
headers = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Host": "movie.douban.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "Cookie": "here is your own cookie",
}
Some people may not know where to find the cookie, so let me say a little more: it sits in the request headers together with many other parameters, which you can see in the browser's developer tools. If you want to learn crawlers well, you should know what each of these parameters means.
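To show how these headers might be used, here is a hedged sketch of fetching and parsing one page of comments. It assumes the requests and BeautifulSoup (bs4) libraries, and the CSS class names comment-item, short, comment-time and rating are assumptions about Douban's comment markup at the time, so they may need adjusting:

import requests
from bs4 import BeautifulSoup

def fetch_comments(start=0, limit=20):
    # Fetch one page of short comments and return (time, rating class, text) tuples.
    url = ("https://movie.douban.com/subject/26413293/comments"
           f"?start={start}&limit={limit}&status=P&sort=new_score")
    resp = requests.get(url, headers=headers, timeout=10)  # `headers` is the dict defined above
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    rows = []
    for item in soup.select("div.comment-item"):
        text = item.select_one("span.short")          # comment body (assumed class name)
        when = item.select_one("span.comment-time")   # comment timestamp (assumed class name)
        star = item.select_one("span.rating")         # e.g. class "allstar30 rating" (assumed)
        rows.append((
            when.get_text(strip=True) if when else "",
            star["class"][0] if star else "",
            text.get_text(strip=True) if text else "",
        ))
    return rows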
Finally, one more point: I originally intended to crawl all the short comments of "Da Qin Fu" on Douban as the material for analysis. However, after many twists and turns I did not manage to get them all and ended up with only 500. I believe this is another of Douban's anti-crawling measures: the maximum number of visible short comments is 500, and it simply will not show you any more. (If any expert knows a way around this, feel free to investigate.)
Data processing
No matter how regular the crawled data looks, there is always a gap between it and data that is ready for analysis. Therefore, a certain amount of data cleaning is necessary before the analysis. Before we clean it, let's take a quick look at what the data looks like.
import pandas as pd

df = pd.read_csv("final_all_comment.csv", index_col=0)
df.head(10)
The results are as follows:
In fact, the data is already quite clean, but we still need to do the following:
1) Eliminate duplicate values
If the 'comment time' and the 'comment content' are exactly the same, we treat the records as the same comment, and the duplicates need to be deleted.
print("Records before deletion:", df.shape)
df.drop_duplicates(subset=['comment time', 'comment content'], inplace=True, keep='first')
print("Records after deletion:", df.shape)

2) Comment time processing
Since "Da Qin Fu" started airing on December 1, 2020, and it is now the evening of December 16, all the comments must fall within December 2020. Therefore, for the date we only keep the useful "day" part (which day of the month), and for the time of day we only keep the "hour" part.
df["comment days"] = df["comment time"].str[8:-9].astype(int)
df["hours"] = df["comment time"].str[11:-6].astype(int)

3) Comment star ratings
Looking at the comment stars on the original page, we can see that the ratings are not displayed as numbers but are rendered as stars on the front end; the page source, however, does contain the numeric values behind those stars.
Let's look at the corresponding part of the page source to see what it looks like.
As you can see, a 3-star rating appears as the number 30, and so on: 1 star is 10 and 2 stars is 20. I didn't like that, so when crawling the data I already divided all of these numbers by 10.
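As a hedged sketch of that conversion (the allstar30-style class string is how the rating appears in the page source; the helper itself is illustrative rather than the exact code used while crawling):

import re

def stars_from_class(rating_class):
    # Map a class string such as "allstar30" to a star count: 30 -> 3, 10 -> 1.
    match = re.search(r"allstar(\d+)", rating_class)
    return int(match.group(1)) // 10 if match else 0

print(stars_from_class("allstar30"))  # 3
print(stars_from_class("allstar10"))  # 1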
4) Mechanical compression and deduplication of comment content
In a comment, some people may mistype or pad out the word count, so a single character or word may be repeated many times in a row. Therefore, before word segmentation, we need to perform a "mechanical compression deduplication" step. Below is a piece of code I wrote a long time ago; you can find a detailed explanation of it on my CSDN blog.
def func(st):
    # Mechanically compress consecutively repeated substrings, e.g. "好看好看好看" -> "好看".
    for i in range(1, int(len(st) / 2) + 1):
        for j in range(len(st)):
            if st[j:j + i] == st[j + i:j + 2 * i]:
                k = j + i
                while st[k:k + i] == st[k + i:k + 2 * i] and k < len(st):
                    k = k + i
                st = st[:j] + st[k:]
    return st
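A quick usage example on made-up padded strings:

print(func("好看好看好看好看"))        # -> 好看
print(func("very very very good"))    # -> very good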