
How Python crawls Baidu COVID-19 epidemic data through the Scrapy framework


This article mainly introduces how Python crawls Baidu COVID-19 epidemic data through the Scrapy framework. It has some reference value, and interested friends can refer to it. I hope you learn a lot from reading it.

Environment deployment

Only a brief recommendation here.

Plug-in recommendation

Here we first recommend a Google Chrome extension, XPath Helper, which can be used to verify that your XPath syntax is correct.
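For example, once the province table has been expanded, you can paste the row XPath used later in this article into XPath Helper and confirm that it highlights the table cells. The same check can also be done offline with Scrapy's Selector; the snippet below is only an illustration and assumes you have saved the expanded page to a hypothetical local file page.html:

from scrapy.selector import Selector

# page.html is a hypothetical local copy of the expanded epidemic page
with open("page.html", encoding="utf-8") as f:
    sel = Selector(text=f.read())

# first data row of the province table (the same XPath pattern used later)
print(sel.xpath("//*[@id='nationTable']/table/tbody/tr[1]/td/text()").getall())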

Crawler target

Page to be crawled: the real-time COVID-19 epidemic map on Baidu (https://voice.baidu.com/act/newpneumonia/newpneumonia#tab0).

The main targets are the national data and the data for each province.

Project creation

Create a project using the scrapy command

scrapy startproject yqsj
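If the command succeeds, the generated skeleton should look roughly like Scrapy's standard template below; only the spider file (baidu_yq.py, written later) is added by hand:

yqsj/
    scrapy.cfg
    yqsj/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py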

Webdriver deployment

I will not repeat the webdriver deployment here; you can refer to the deployment steps in my earlier article "Python explains in detail the process of crawling the hot words of the CSDN hot list titles through the Scrapy framework".

Project code

Before diving into the code, let's take a look at the provincial data on the Baidu epidemic page.

On this page the province table is collapsed by default, and all rows only appear after clicking an "expand all" button (a span element). So when extracting the page source, we need to simulate a browser opening the page and clicking that button; a standalone sketch of the idea follows, and the rest of the article wires it into Scrapy step by step.
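As a rough standalone illustration (not part of the final project; it assumes chromedriver is on your PATH and uses the same Selenium 3 style calls and button XPath as the middleware later):

import time
from selenium import webdriver

# open the page, click the "expand all" button, then grab the full page source
browser = webdriver.Chrome()
browser.get("https://voice.baidu.com/act/newpneumonia/newpneumonia#tab0")
time.sleep(2)
browser.find_element_by_xpath("//*[@id='nationTable']/div/span").click()
time.sleep(5)
html = browser.page_source
browser.quit()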

Item definition

Define two classes, YqsjProvinceItem and YqsjChinaItem, which hold the per-province data and the nationwide data respectively.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class YqsjProvinceItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    location = scrapy.Field()
    new = scrapy.Field()
    exist = scrapy.Field()
    total = scrapy.Field()
    cure = scrapy.Field()
    dead = scrapy.Field()


class YqsjChinaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # existing confirmed
    exist_diagnosis = scrapy.Field()
    # asymptomatic
    asymptomatic = scrapy.Field()
    # existing suspected
    exist_suspecte = scrapy.Field()
    # existing severe
    exist_severe = scrapy.Field()
    # cumulative diagnosis
    cumulative_diagnosis = scrapy.Field()
    # overseas input
    overseas_input = scrapy.Field()
    # cumulative cure
    cumulative_cure = scrapy.Field()
    # cumulative death
    cumulative_dead = scrapy.Field()

Middleware definition

In the downloader middleware we need to open the page in the browser and click the "expand all" button.

Complete code

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
from selenium.webdriver import ActionChains
import time


class YqsjSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class YqsjDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        #
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        # return None
        try:
            spider.browser.get(request.url)
            spider.browser.maximize_window()
            time.sleep(2)
            # click the "expand all" button of the province table
            spider.browser.find_element_by_xpath("//*[@id='nationTable']/div/span").click()
            # ActionChains(spider.browser).click(searchButtonElement)
            time.sleep(5)
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8", request=request)
        except TimeoutException as e:
            print('timeout exception: {}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            spider.browser.close()

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
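A compatibility note: the middleware above uses the Selenium 3 style call find_element_by_xpath, which is no longer available in current Selenium 4 releases. If you run a newer Selenium, a hedged equivalent for the click would be:

from selenium.webdriver.common.by import By

# Selenium 4 style: locate the "expand all" button by XPath and click it
spider.browser.find_element(By.XPATH, "//*[@id='nationTable']/div/span").click()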

Crawler definition

The spider extracts the nationwide epidemic data and the per-province epidemic data respectively. Complete code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time   : 2021-11-7 22
# @Author : supreme treasure
# @Site   :
# @File   : baidu_yq.py

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from yqsj.items import YqsjChinaItem, YqsjProvinceItem


class YqsjSpider(scrapy.Spider):
    name = 'yqsj'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://voice.baidu.com/act/newpneumonia/newpneumonia#tab0']
    china_xpath = "//div[contains(@class, 'VirusSummarySix_1-1-317room2ZJJBJBJ')]/text()"
    province_xpath = "//*[@id='nationTable']/table/tbody/tr[{}]/td/text()"
    province_xpath_1 = "//*[@id='nationTable']/table/tbody/tr[{}]/td/div/span/text()"

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # use headless Chrome mode
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)

    def parse(self, response, **kwargs):
        # nationwide figures
        country_info = response.xpath(self.china_xpath)
        yq_china = YqsjChinaItem()
        yq_china['exist_diagnosis'] = country_info[0].get()
        yq_china['asymptomatic'] = country_info[1].get()
        yq_china['exist_suspecte'] = country_info[2].get()
        yq_china['exist_severe'] = country_info[3].get()
        yq_china['cumulative_diagnosis'] = country_info[4].get()
        yq_china['overseas_input'] = country_info[5].get()
        yq_china['cumulative_cure'] = country_info[6].get()
        yq_china['cumulative_dead'] = country_info[7].get()
        yield yq_china

        # traverse the provincial-level rows of the table (tr[1] to tr[34])
        for x in range(1, 35):
            path = self.province_xpath.format(x)
            path2 = self.province_xpath_1.format(x)
            province_info = response.xpath(path)
            province_name = response.xpath(path2)
            yq_province = YqsjProvinceItem()
            yq_province['location'] = province_name.get()
            yq_province['new'] = province_info[0].get()
            yq_province['exist'] = province_info[1].get()
            yq_province['total'] = province_info[2].get()
            yq_province['cure'] = province_info[3].get()
            yq_province['dead'] = province_info[4].get()
            yield yq_province
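Likewise, webdriver.Chrome(chrome_options=..., executable_path=...) is the Selenium 3 calling convention; recent Selenium 4 releases expect an options keyword and a Service object for the driver path. A rough drop-in for the __init__ body, assuming the same local chromedriver location, would be:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
# Selenium 4 style: pass the driver path through a Service object
service = Service(executable_path="E:\\chromedriver_win32\\chromedriver.exe")
self.browser = webdriver.Chrome(service=service, options=chrome_options)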

Pipeline output of the result text

Output the results in a simple text format. Complete code:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

from yqsj.items import YqsjChinaItem, YqsjProvinceItem


class YqsjPipeline:
    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if isinstance(item, YqsjChinaItem):
            self.file.write(
                "domestic epidemic\nexisting confirmed\t{}\nasymptomatic\t{}\n"
                "existing suspected\t{}\nexisting severe\t{}\ncumulative diagnosis\t{}\n"
                "overseas input\t{}\ncumulative cure\t{}\ncumulative death\t{}\n".format(
                    item['exist_diagnosis'], item['asymptomatic'], item['exist_suspecte'],
                    item['exist_severe'], item['cumulative_diagnosis'], item['overseas_input'],
                    item['cumulative_cure'], item['cumulative_dead']))
        if isinstance(item, YqsjProvinceItem):
            self.file.write(
                "Province: {}\tNew: {}\tExisting: {}\tCumulative: {}\tCure: {}\tDead: {}\n".format(
                    item['location'], item['new'], item['exist'],
                    item['total'], item['cure'], item['dead']))
        return item

    def close_spider(self, spider):
        self.file.close()
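As a side note, if plain text output is not a hard requirement, Scrapy's built-in feed exports can replace this custom pipeline entirely; for example, the following command (the file name is arbitrary) writes every yielded item to JSON:

scrapy crawl yqsj -o result.json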

Configuration file changes

The settings below can be used as a direct reference; adjust them to your own environment:

# Scrapy settings for yqsj project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'yqsj'

SPIDER_MODULES = ['yqsj.spiders']
NEWSPIDER_MODULE = 'yqsj.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'yqsj (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'yqsj.middlewares.YqsjSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'yqsj.middlewares.YqsjDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'yqsj.pipelines.YqsjPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Verification of results

Look at the result file.
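The file is produced by running the spider from the project root directory:

scrapy crawl yqsj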

Thank you for reading this article carefully. I hope this article on how Python crawls Baidu COVID-19 epidemic data through the Scrapy framework is helpful to everyone. At the same time, I hope you will support us and follow the industry information channel; more related knowledge is waiting for you to learn!
