This article shows how to crawl the information of movies currently playing on Douban with Python, extracting the data through xpath attributes. I find the approach quite practical, so I'm sharing it here in the hope that you get something out of it. Without further ado, let's take a look.
Preface
For the record: this article is for research and learning purposes only.
GitHub repository address: github project repository
Page analysis
The page to crawl is: https://movie.douban.com/cinema/nowplaying/nanjing/
The city at the end of the URL (nanjing here) can be changed as needed; I won't dwell on it. One thing to note: the page only shows 15 movies until you click to expand the full list, so when we use selenium we need to add click logic after opening the page.
Open the developer tools with F12 to inspect the source, and use the xpath helper tool to verify the xpath copied via right-click.
To keep the xpath from breaking when the page layout changes, I rewrote it to select elements by class name.
Then look at the information carried by each movie entry.
The idea is to use the nowplaying div as the root node, select the nodes below it whose class is list-item, and read their attributes, which hold exactly the data we want.
That checks out, so let's follow this line of thinking and start writing the project code.
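Before wiring everything into Scrapy, here is a minimal sketch of that xpath applied offline with Scrapy's Selector. It assumes you have saved the fully expanded page as page.html (a hypothetical file name); the xpath and the data-title attribute are the same ones used in the spider later on.

# Minimal sketch: verify the xpath against a saved copy of the page.
# page.html is a hypothetical file name for the fully expanded page source.
from scrapy.selector import Selector

html = open('page.html', encoding='utf-8').read()
titles = Selector(text=html).xpath(
    "//*[@id='nowplaying']/div[@class='mod-bd']//*[@class='list-item']/@data-title"
).extract()
print(titles)  # one title per movie once the list is fully expanded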
Implementation process
Create a project
Create a project named douban_playing using the scrapy command.
scrapy startproject douban_playing
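For reference, startproject generates the standard Scrapy skeleton; the files we edit in the rest of this article are items.py, middlewares.py, pipelines.py and settings.py, plus a new spider file under spiders/.

douban_playing/
    scrapy.cfg
    douban_playing/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py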
Item definition
Define movie information entities.
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanPlayingItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Movie name
    title = scrapy.Field()
    # Movie score
    score = scrapy.Field()
    # Movie release year
    release = scrapy.Field()
    # Movie duration
    duration = scrapy.Field()
    # Region
    region = scrapy.Field()
    # Movie director
    director = scrapy.Field()
    # Movie starring actors
    actors = scrapy.Field()
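A quick illustration, separate from the project code: scrapy items behave like dicts, which is how the spider below fills them in.

# Sketch only: scrapy.Item supports dict-style access and conversion.
from douban_playing.items import DoubanPlayingItem

item = DoubanPlayingItem()
item['title'] = 'Example movie'  # hypothetical value
print(dict(item))  # {'title': 'Example movie'}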
Middleware operation definition
The main task here is to click the button that expands the full movie list; we add a small piece of code for that in the downloader middleware's process_request.
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import time

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException


class DoubanPlayingSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class DoubanPlayingDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        #
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        try:
            spider.browser.get(request.url)
            spider.browser.maximize_window()
            time.sleep(2)
            # click the "more" button so that the full movie list is expanded
            spider.browser.find_element_by_xpath("//*[@id='nowplaying']/div[@class='more']").click()
            # ActionChains(spider.browser).click(searchButtonElement)
            time.sleep(5)
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)
        except TimeoutException as e:
            print('timeout exception: {}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            spider.browser.close()

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
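One compatibility note: find_element_by_xpath was removed in Selenium 4, so on a recent Selenium the click in process_request would be written roughly like this.

# Selenium 4 replacement for find_element_by_xpath (sketch):
from selenium.webdriver.common.by import By

spider.browser.find_element(By.XPATH, "//*[@id='nowplaying']/div[@class='more']").click()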
Crawler definition
We pull out all the movie information by attribute name; pay attention to how the attributes are extracted (the /@{} at the end of the xpath selects an attribute rather than a node).
#!/usr/bin/env python
# coding=utf-8
"""
@project: douban_playing
@author: huyi
@file: douban_playing.py
@ide: PyCharm
@time: 2021-11-10 16:31:23
"""

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from douban_playing.items import DoubanPlayingItem


class DoubanPlayingSpider(scrapy.Spider):
    name = 'dbp'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://movie.douban.com/cinema/nowplaying/nanjing/']
    nowplaying = "//*[@id='nowplaying']/div[@class='mod-bd']//*[@class='list-item']/@{}"
    properties = ['data-title', 'data-score', 'data-release', 'data-duration',
                  'data-region', 'data-director', 'data-actors']

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # use headless Chrome mode
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)

    def parse(self, response, **kwargs):
        titles = response.xpath(self.nowplaying.format(self.properties[0])).extract()
        scores = response.xpath(self.nowplaying.format(self.properties[1])).extract()
        releases = response.xpath(self.nowplaying.format(self.properties[2])).extract()
        durations = response.xpath(self.nowplaying.format(self.properties[3])).extract()
        regions = response.xpath(self.nowplaying.format(self.properties[4])).extract()
        directors = response.xpath(self.nowplaying.format(self.properties[5])).extract()
        actors = response.xpath(self.nowplaying.format(self.properties[6])).extract()
        for x in range(len(titles)):
            item = DoubanPlayingItem()
            item['title'] = titles[x]
            item['score'] = scores[x]
            item['release'] = releases[x]
            item['duration'] = durations[x]
            item['region'] = regions[x]
            item['director'] = directors[x]
            item['actors'] = actors[x]
            yield item
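Another compatibility hedge: Selenium 4 deprecates the chrome_options= and executable_path= arguments in favor of options= and service=. A sketch of the equivalent construction, assuming the same chromedriver path as above:

# Selenium 4 style construction (sketch; same chromedriver path as above).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
browser = webdriver.Chrome(options=chrome_options,
                           service=Service("E:\\chromedriver_win32\\chromedriver.exe"))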
Data pipeline definition
As usual, the extracted movie data is written to a text file in a fixed format.
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class DoubanPlayingPipeline:
    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write("Movie: {}\t score: {}\t release year: {}\t Movie duration: {}\t region: {}\t Movie Director: {}\n starring: {}\n".format(
            item['title'], item['score'], item['release'], item['duration'],
            item['region'], item['director'], item['actors']))
        return item

    def close_spider(self, spider):
        self.file.close()
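Note that the file is opened with mode 'w', so each run overwrites result.txt. For illustration only (placeholder values, not real output), each movie produces two lines shaped like this:

Movie: <title>	 score: <score>	 release year: <release>	 Movie duration: <duration>	 region: <region>	 Movie Director: <director>
 starring: <actors>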
Configuration settings
This is all routine; we just change a few of the defaults: disable robots.txt compliance and cookies, set a browser User-Agent, and enable our middlewares and pipeline.
# Scrapy settings for douban_playing project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban_playing'

SPIDER_MODULES = ['douban_playing.spiders']
NEWSPIDER_MODULE = 'douban_playing.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban_playing (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'douban_playing.middlewares.DoubanPlayingSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'douban_playing.middlewares.DoubanPlayingDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban_playing.pipelines.DoubanPlayingPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Run and verify
As usual, instead of invoking the scrapy command directly, we create a small .py launcher that executes the command for us. Mind where this file lives: it normally sits at the project root, next to scrapy.cfg.
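The launcher itself isn't shown in the article, so here is a minimal sketch of what such a file usually looks like (the spider name dbp comes from the spider defined above):

#!/usr/bin/env python
# coding=utf-8
# Minimal launcher sketch: runs the spider as if "scrapy crawl dbp"
# had been typed on the command line.
from scrapy import cmdline

cmdline.execute('scrapy crawl dbp'.split())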
Run it and take a look at the results in result.txt.
Perfect!
That is how Python crawls Douban's now-playing movie information through xpath attributes. I believe some of these techniques will come in handy in everyday work; I hope you learned something from this article.