This article explains how to scrape Douban movie ranking information with Python. The content is simple, clear, and easy to learn; follow along step by step.
Basic development environment
Python 3.6
PyCharm
Related modules used

requests

parsel

csv
Install Python, add it to the PATH environment variable, and use pip to install the required modules.
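For example (csv ships with the standard library, so only the two third-party modules need installing):

pip install requests parsel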
Basic crawler workflow
Request the URL with a GET request, adding a headers request header to simulate a browser; the web page returns a response object.
# simulate a browser sending a request
import requests

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
print(response)
The 200 in the output is the status code; it indicates that the request succeeded. Status codes fall into the following classes:

2xx (success)

3xx (redirect)

4xx (client error)

5xx (server error)

Common status codes:
200: OK. The server successfully returned the web page; the client request succeeded.

302: Found (temporary move). The server is currently responding to the request from a different location, but the requester should keep using the original URL for future requests.

304: Not Modified (a redirection-class code). The requested page has not changed since the last request; the server returns this response without resending the page content.

401: Unauthorized. The request requires authentication; the server may return this for pages that need a login.

404: Not Found. The server could not find the requested page.

503: Service Unavailable. The server is temporarily unable to handle the request (due to overload or maintenance downtime); this is usually a temporary state.
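A minimal sketch of branching on the status code before doing anything with the response (same URL and headers as the snippet above):

import requests

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
if response.status_code == 200:
    print('request succeeded')
else:
    print('request failed with status code', response.status_code)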
3. Get the data

import requests

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
print(response.text)
4. Parse the data
Common ways to parse data: regular expressions, CSS selectors, XPath.

Commonly used parsing modules: bs4, parsel, lxml...
As in the previous article, and throughout the rest of this crawler series, I will use parsel as the parsing library; I simply find it more pleasant to work with than bs4.
parsel is a third-party module; it can be installed with pip install parsel.
parsel supports CSS selector, XPath, and regular-expression extraction.
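A minimal sketch of all three styles, using a toy HTML string (the markup here is illustrative, not the actual Douban page):

import parsel

html = '<div class="hd"><a href="#"><span>The Shawshank Redemption</span></a></div>'
selector = parsel.Selector(html)
print(selector.css('.hd a span::text').get())                    # CSS selector
print(selector.xpath('//div[@class="hd"]/a/span/text()').get())  # XPath
print(selector.css('span::text').re_first(r'\w+'))               # regular expression on the extracted text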
Each movie's information is contained in an li tag.
# convert the response.text text data into a Selector object
selector = parsel.Selector(response.text)
# get all the li tags
lis = selector.css('.grid_view li')
# traverse the content of each li tag
for li in lis:
    # movie title: the text of the first span tag under the a tag inside the hd class
    # get() returns the first match as a string; getall() returns all matches as a list
    title = li.css('.hd a span:nth-child(1)::text').get()
    movie_list = li.css('.bd p:nth-child(1)::text').getall()
    star = movie_list[0].strip().replace('\xa0\xa0\xa0', '').replace('/...', '')
    movie_info = movie_list[1].strip().split('\xa0/\xa0')  # e.g. ['1994', 'USA', 'crime drama']
    movie_time = movie_info[0]     # release year
    movie_country = movie_info[1]  # country
    movie_type = movie_info[2]     # genre
    rating_num = li.css('.rating_num::text').get()          # movie rating
    people = li.css('.star span:nth-child(4)::text').get()  # number of ratings
    summary = li.css('.inq::text').get()                    # one-line summary
    dit = {
        'movie name': title,
        'cast': star,
        'release time': movie_time,
        'shooting country': movie_country,
        'film type': movie_type,
        'film rating': rating_num,
        'number of reviews': people,
        'movie overview': summary,
    }
    # pprint is a formatted-output module
    pprint.pprint(dit)
The code above uses the following knowledge points:

methods of the parsel parsing module

for loops

CSS selectors

dictionary creation

list indexing

string splitting, replacement, and other string methods

the pprint formatted-output module
So a solid grasp of the basics is necessary; otherwise you won't even understand why the code is written this way.
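As a quick illustration of the splitting step above, here is how split() breaks up the year/country/genre string (the sample value matches the comment in the code):

info = '1994\xa0/\xa0USA\xa0/\xa0crime drama'
movie_info = info.strip().split('\xa0/\xa0')
print(movie_info)  # ['1994', 'USA', 'crime drama']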
5. Save data (data persistence)
The built-in open() function is the commonly used way to save data to a file.
Tabular data like this Douban movie information is better saved as a spreadsheet-style table that Excel can open.

So we use the csv module.
import csv

# use the csv module to save the data to a CSV file that Excel can open
f = open('Douban movie data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'movie name', 'cast', 'release time', 'shooting country',
    'film type', 'film rating', 'number of reviews', 'movie overview',
])
csv_writer.writeheader()  # write the header row
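After the header is written, each dict produced by the parsing loop can be appended with writerow(); the keys must match the fieldnames. A quick sketch with dummy values, just to show the shape:

csv_writer.writerow({
    'movie name': 'The Shawshank Redemption',  # dummy values for illustration
    'cast': 'Starring: Tim Robbins',
    'release time': '1994',
    'shooting country': 'USA',
    'film type': 'crime drama',
    'film rating': '9.7',
    'number of reviews': '2000000',
    'movie overview': '...',
})
f.close()  # close the file once all rows are written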
That crawls the data and saves it locally, but only for a single page. A crawler certainly shouldn't stop at one page, so to crawl multiple pages you need to analyze how the URL changes from page to page.
You can clearly see that the start parameter in the URL increases by 25 per page, so page turning can be implemented with a for loop.
for page in range(0, 251, 25):
    url = f'https://movie.douban.com/top250?start={page}&filter='

Complete implementation code:

import pprint
import requests
import parsel
import csv

'''
1. Clarify the requirements:
   crawl the Douban Top 250 movie ranking information
   movie name
   director, starring actors
   year, country, genre
   rating, number of reviews
   movie summary
'''

# use the csv module to save the data to a CSV file that Excel can open
f = open('Douban movie data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'movie name', 'cast', 'release time', 'shooting country',
    'film type', 'film rating', 'number of reviews', 'movie overview',
])
csv_writer.writeheader()  # write the header row

# simulate a browser sending requests
for page in range(0, 251, 25):
    url = f'https://movie.douban.com/top250?start={page}&filter='
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    # convert the response.text text data into a Selector object
    selector = parsel.Selector(response.text)
    # get all the li tags
    lis = selector.css('.grid_view li')
    # traverse the content of each li tag
    for li in lis:
        # movie title: the text of the first span tag under the a tag inside the hd class
        # get() returns the first match as a string; getall() returns all matches as a list
        title = li.css('.hd a span:nth-child(1)::text').get()
        movie_list = li.css('.bd p:nth-child(1)::text').getall()
        star = movie_list[0].strip().replace('\xa0\xa0\xa0', '').replace('/...', '')
        movie_info = movie_list[1].strip().split('\xa0/\xa0')  # e.g. ['1994', 'USA', 'crime drama']
        movie_time = movie_info[0]     # release year
        movie_country = movie_info[1]  # country
        movie_type = movie_info[2]     # genre
        rating_num = li.css('.rating_num::text').get()          # movie rating
        people = li.css('.star span:nth-child(4)::text').get()  # number of ratings
        summary = li.css('.inq::text').get()                    # one-line summary
        dit = {
            'movie name': title,
            'cast': star,
            'release time': movie_time,
            'shooting country': movie_country,
            'film type': movie_type,
            'film rating': rating_num,
            'number of reviews': people,
            'movie overview': summary,
        }
        pprint.pprint(dit)
        csv_writer.writerow(dit)

Implementation effect
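To verify the saved file, a minimal sketch that reads it back with the standard library's csv.DictReader (assuming the script above has already run):

import csv

with open('Douban movie data.csv', encoding='utf-8', newline='') as f:
    for row in csv.DictReader(f):
        print(row['movie name'], row['film rating'])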
Thank you for reading. The above covers how to scrape Douban movie ranking information with Python. After studying this article, I believe you have a deeper understanding of the topic, though the specifics still need to be verified in practice. The editor will keep pushing more articles on related knowledge points for you; welcome to follow!