How do you use Scrapy to crawl the Douban TOP250? Many inexperienced readers are not sure where to start, so this article walks through the problem and its solution. I hope it helps you solve it.
The best way to learn is to take something in and then produce something from it, so I am sharing a small case study of the scrapy framework that lets you quickly and easily master its basic usage.
I originally wanted to write a Scrapy crawling tutorial from scratch, but the official documentation already provides one, so instead I will try to share things that are not so easy to find on the Internet. (I have been in closed-door training recently and am moving at a snail's pace; apologies.)
Introduction to Scrapy
Scrapy is an application framework written to crawl websites and extract structured data. It can be used in a range of programs, including data mining, information processing, and storing historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or to build general-purpose web crawlers.
If you have not worked with scrapy before, please check out the official tutorial links below.
Architecture Overview: https://docs.pythontab.com/scrapy/scrapy0.24/topics/architecture.html
Getting started with Scrapy: https://docs.pythontab.com/scrapy/scrapy0.24/intro/tutorial.html
Crawler tutorial
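If you want to follow along, first create a project scaffold. The commands below are a minimal sketch; the project name douban250 is my own example, and beautifulsoup4 is installed because the spider later in this article uses it:
pip install scrapy beautifulsoup4
scrapy startproject douban250
The Item declaration that follows goes into the generated items.py, and the spider into a file under the spiders/ directory.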
Let's first take a look at the Douban TOP250 page: for each movie we can extract the name, ranking, rating, number of reviews, director, year, region, genre, and a short description.
The Item object is a simple container that holds the scraped data. It provides a dictionary-like API and a simple syntax for declaring the available fields, so you can declare an Item in the following form.
import scrapy

class DoubanItem(scrapy.Item):
    # ranking
    ranking = scrapy.Field()
    # movie title
    title = scrapy.Field()
    # score
    score = scrapy.Field()
    # number of reviews
    people_num = scrapy.Field()
    # director
    director = scrapy.Field()
    # year
    year = scrapy.Field()
    # region
    area = scrapy.Field()
    # genre
    clazz = scrapy.Field()
    # movie description
    description = scrapy.Field()
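As a quick illustration of the dictionary-like API mentioned above, here is a minimal sketch (the field values are made up):
item = DoubanItem(title='The Shawshank Redemption')
item['ranking'] = '1'      # set a field exactly like a dict key
print(item['title'])       # read it back the same way
# assigning to a field that was not declared raises KeyError,
# which catches typos in field names early:
# item['rating'] = '9.7'   # KeyError: DoubanItem does not support field: rating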
After we crawl a page, we need to extract the information we want from it. We could use XPath syntax, but I use the BeautifulSoup parser here: once the page has been parsed by BeautifulSoup, we can use CSS selectors directly to filter out the information we need. Some explanations are written into the code comments, so I won't repeat them.
Chrome can also copy a selector or an XPath directly, as shown in the following figure.
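For comparison, the same fields can be pulled with Scrapy's built-in selectors instead of BeautifulSoup. This is only a sketch of the equivalent calls inside parse, assuming a recent Scrapy version where .get() is available; it is not the code this article uses:
# inside parse(self, response), using Scrapy's own selectors
for movie in response.css('#content div div.article ol li'):
    title = movie.css('.title::text').get()       # CSS with the ::text pseudo-element
    ranking = movie.xpath('.//em/text()').get()   # the equivalent XPath form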
from bs4 import BeautifulSoup
from scrapy import Spider, Request

from ..items import DoubanItem  # adjust the import to your project layout

class DoubanSpider(Spider):
    count = 1
    # the name used to start the crawler: scrapy crawl douban
    name = 'douban'
    # header information, so the request does not look like a crawler
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
    }

    # the starting link of the crawler
    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        yield Request(url, headers=self.headers)

    # process the crawled data
    def parse(self, response):
        print('Page No.', self.count)
        self.count += 1
        soup = BeautifulSoup(response.text, 'html.parser')
        # select the list of movies
        movies = soup.select('#content div div.article ol li')
        for movie in movies:
            item = DoubanItem()
            item['title'] = movie.select('.title')[0].text
            item['ranking'] = movie.select('em')[0].text
            item['score'] = movie.select('.rating_num')[0].text
            item['people_num'] = movie.select('.star span')[3].text
            # one block of text contains the director, year, region and genre
            info = movie.select('.bd p')[0].text
            # the first line holds the director; Douban separates the fields
            # with non-breaking spaces (\xa0)
            director = info.strip().split('\n')[0].split('\xa0')
            yac = info.strip().split('\n')[1].strip().split('/')
            item['director'] = director[0].split(':')[1]
            item['year'] = yac[0]
            item['area'] = yac[1]
            item['clazz'] = yac[2]
            # the movie description can be empty, so check before indexing
            if len(movie.select('.inq')) != 0:
                item['description'] = movie.select('.inq')[0].text
            else:
                item['description'] = 'None'
            yield item
        # next page: either
        # 1. find the address of the next page on the page itself, or
        # 2. construct the address from the url pattern.
        # The first method is used here: take the relative href from the
        # pagination block and join it to the base url.
        next_link = soup.select('.paginator .next a')
        if next_link:
            next_url = 'https://movie.douban.com/top250' + next_link[0]['href']
            yield Request(next_url, headers=self.headers)
Then open a cmd window in the project folder and run scrapy crawl douban -o movies.csv. You will find that the extracted information is written to the specified file. Here is the crawling result, which works well.
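One practical note: Douban throttles clients that request too quickly, so it can help to slow the crawler down in the project's settings.py. DOWNLOAD_DELAY and USER_AGENT are standard Scrapy settings; the values below are my own suggestion, not from the original article:
# settings.py (excerpt)
# pause between requests so the site is not hammered
DOWNLOAD_DELAY = 1
# identify as a regular browser at the project level as well
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'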
After reading the above, have you mastered how to crawl the Douban TOP250 with Scrapy? Thank you for reading!