How do you use Scrapy to crawl the Douban TOP250? Many inexperienced readers are not sure where to start, so this article walks through the problem and its solution. I hope it helps you solve it.
The best way to learn is to take something in and then produce something from it, so I am sharing a small case study of the scrapy framework that lets you quickly and easily master its basic usage.
I originally wanted to write a Scrapy crawling tutorial from scratch, but the official documentation already provides one, so instead I will try to share things that are not so easy to find on the Internet. (I have been in closed-door training recently and am moving at a snail's pace; apologies.)
Introduction to Scrapy
Scrapy is an application framework written to crawl websites and extract structured data. It can be used in a range of programs, including data mining, information processing, and storing historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or to build general-purpose web crawlers.
If you have not worked with scrapy before, please check out the official tutorial links below.
Architecture Overview: https://docs.pythontab.com/scrapy/scrapy0.24/topics/architecture.html
Getting started with Scrapy: https://docs.pythontab.com/scrapy/scrapy0.24/intro/tutorial.html
Crawler tutorial
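If you want to follow along, first create a project scaffold. The commands below are a minimal sketch; the project name douban250 is my own example, and beautifulsoup4 is installed because the spider later in this article uses it:
pip install scrapy beautifulsoup4
scrapy startproject douban250
The Item declaration that follows goes into the generated items.py, and the spider into a file under the spiders/ directory.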
Let's first take a look at the Douban TOP250 page: for each movie we can extract the name, ranking, rating, number of reviews, director, year, region, genre, and a short description.
The Item object is a simple container that holds the scraped data. It provides a dictionary-like API and a simple syntax for declaring the available fields, so you can declare an Item in the following form.
import scrapy

class DoubanItem(scrapy.Item):
    # ranking
    ranking = scrapy.Field()
    # movie title
    title = scrapy.Field()
    # score
    score = scrapy.Field()
    # number of reviews
    people_num = scrapy.Field()
    # director
    director = scrapy.Field()
    # year
    year = scrapy.Field()
    # region
    area = scrapy.Field()
    # genre
    clazz = scrapy.Field()
    # movie description
    description = scrapy.Field()
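As a quick illustration of the dictionary-like API mentioned above, here is a minimal sketch (the field values are made up):
item = DoubanItem(title='The Shawshank Redemption')
item['ranking'] = '1'      # set a field exactly like a dict key
print(item['title'])       # read it back the same way
# assigning to a field that was not declared raises KeyError,
# which catches typos in field names early:
# item['rating'] = '9.7'   # KeyError: DoubanItem does not support field: rating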
After we crawl a page, we need to extract the information we want from it. We could use XPath syntax, but I use the BeautifulSoup parser here: once the page has been parsed by BeautifulSoup, we can use CSS selectors directly to filter out the information we need. Some explanations are written into the code comments, so I won't repeat them.
Chrome can also copy a selector or an XPath directly, as shown in the following figure.
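For comparison, the same fields can be pulled with Scrapy's built-in selectors instead of BeautifulSoup. This is only a sketch of the equivalent calls inside parse, assuming a recent Scrapy version where .get() is available; it is not the code this article uses:
# inside parse(self, response), using Scrapy's own selectors
for movie in response.css('#content div div.article ol li'):
    title = movie.css('.title::text').get()       # CSS with the ::text pseudo-element
    ranking = movie.xpath('.//em/text()').get()   # the equivalent XPath form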
from bs4 import BeautifulSoup
from scrapy import Spider, Request

from ..items import DoubanItem  # adjust the import to your project layout

class DoubanSpider(Spider):
    count = 1
    # the name used to start the crawler: scrapy crawl douban
    name = 'douban'
    # header information, so the request does not look like a crawler
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
    }

    # the starting link of the crawler
    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        yield Request(url, headers=self.headers)

    # process the crawled data
    def parse(self, response):
        print('Page No.', self.count)
        self.count += 1
        soup = BeautifulSoup(response.text, 'html.parser')
        # select the list of movies
        movies = soup.select('#content div div.article ol li')
        for movie in movies:
            item = DoubanItem()
            item['title'] = movie.select('.title')[0].text
            item['ranking'] = movie.select('em')[0].text
            item['score'] = movie.select('.rating_num')[0].text
            item['people_num'] = movie.select('.star span')[3].text
            # one block of text contains the director, year, region and genre
            info = movie.select('.bd p')[0].text
            # the first line holds the director; Douban separates the fields
            # with non-breaking spaces (\xa0)
            director = info.strip().split('\n')[0].split('\xa0')
            yac = info.strip().split('\n')[1].strip().split('/')
            item['director'] = director[0].split(':')[1]
            item['year'] = yac[0]
            item['area'] = yac[1]
            item['clazz'] = yac[2]
            # the movie description can be empty, so check before indexing
            if len(movie.select('.inq')) != 0:
                item['description'] = movie.select('.inq')[0].text
            else:
                item['description'] = 'None'
            yield item
        # next page: either
        # 1. find the address of the next page on the page itself, or
        # 2. construct the address from the url pattern.
        # The first method is used here: take the relative href from the
        # pagination block and join it to the base url.
        next_link = soup.select('.paginator .next a')
        if next_link:
            next_url = 'https://movie.douban.com/top250' + next_link[0]['href']
            yield Request(next_url, headers=self.headers)
Then open a cmd window in the project folder and run scrapy crawl douban -o movies.csv. You will find that the extracted information is written to the specified file. Here is the crawling result, which works well.
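One practical note: Douban throttles clients that request too quickly, so it can help to slow the crawler down in the project's settings.py. DOWNLOAD_DELAY and USER_AGENT are standard Scrapy settings; the values below are my own suggestion, not from the original article:
# settings.py (excerpt)
# pause between requests so the site is not hammered
DOWNLOAD_DELAY = 1
# identify as a regular browser at the project level as well
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'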
After reading the above, have you mastered how to crawl the Douban TOP250 with Scrapy? Thank you for reading!