What are the core components of Scrapy

This article explains what the core components of Scrapy are. The explanation is simple and clear and easy to follow; let's walk through Scrapy's core components step by step.

Spiders

Let's pick up where the previous article left off. Last time, while tracing how Scrapy runs, we ended up at the crawl method of Crawler. Let's look at this method:

@defer.inlineCallbacks
def crawl(self, *args, **kwargs):
    assert not self.crawling, "Crawling already taking place"
    self.crawling = True
    try:
        # Find the spider via SpiderLoader and instantiate it
        self.spider = self._create_spider(*args, **kwargs)
        # Create the engine
        self.engine = self._create_engine()
        # Call the spider's start_requests method to get the seed URL list
        start_requests = iter(self.spider.start_requests())
        # Call the engine's open_spider, passing in the spider instance
        # and the initial requests
        yield self.engine.open_spider(self.spider, start_requests)
        yield defer.maybeDeferred(self.engine.start)
    except Exception:
        if six.PY2:
            exc_info = sys.exc_info()
        self.crawling = False
        if self.engine is not None:
            yield self.engine.close()
        if six.PY2:
            six.reraise(*exc_info)
        raise

Here we can see that the spider instance is created first, then the engine, and finally the spider is handed over to the engine to run.

As mentioned in the previous article, when Crawler is instantiated it creates a SpiderLoader, which locates our spiders based on the settings.py configuration file, which points to the package where all the spider code we write lives.

SpiderLoader then scans these code files for classes whose parent is scrapy.Spider and builds a {spider_name: spider_cls} dictionary keyed by each spider class's name attribute (which is required when writing a spider). Finally, it looks up the spider class we wrote by the spider_name given to the scrapy crawl command and instantiates it. At this point the _create_spider method is called:

def _create_spider(self, *args, **kwargs):
    # Call the class method from_crawler to instantiate the spider
    return self.spidercls.from_crawler(self, *args, **kwargs)

Instantiating a spider is interesting: instead of calling an ordinary constructor directly, it goes through the class method from_crawler. Let's find it in the scrapy.Spider class:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = cls(*args, **kwargs)
    spider._set_crawler(crawler)
    return spider

def _set_crawler(self, crawler):
    self.crawler = crawler
    # Assign the settings object to the spider instance
    self.settings = crawler.settings
    crawler.signals.connect(self.close, signals.spider_closed)

Here we can see that this class method ultimately calls the ordinary constructor to create the instance and also attaches the settings object to the spider. Now let's see what the constructor does:

class Spider(object_ref):
    name = None
    custom_settings = None

    def __init__(self, name=None, **kwargs):
        # name is required
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        # If start_urls is not set, it defaults to []
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

Does this look familiar? These are some of the attributes we use most often when writing spiders: name, start_urls and custom_settings (a minimal spider using all three is sketched after this list):

name: used to locate the spider we wrote when running the crawler

start_urls: the crawl entry points, also known as seed URLs

custom_settings: spider-specific configuration that overrides items in the configuration file
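
To make these attributes concrete, here is a minimal, hypothetical spider; the class name, URL and setting values are illustrative and not taken from the source being analyzed.

import scrapy

class QuotesSpider(scrapy.Spider):
    # name: how `scrapy crawl quotes` locates this class through SpiderLoader
    name = 'quotes'
    # start_urls: the seed URLs that become the initial requests
    start_urls = ['http://quotes.toscrape.com/']
    # custom_settings: per-spider overrides of the project configuration
    custom_settings = {'DOWNLOAD_DELAY': 1}

    def parse(self, response):
        # Trivial callback: yield the page title of every crawled page
        yield {'title': response.css('title::text').extract_first()}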

Engine

Having analyzed the initialization of the spider class, let's go back to the crawl method of Crawler: the engine object is created next, via the _create_engine method. Let's see what happens during its initialization:

class ExecutionEngine(object):
    """Engine"""
    def __init__(self, crawler, spider_closed_callback):
        self.crawler = crawler
        # The settings configuration is also saved on the engine
        self.settings = crawler.settings
        # Signals
        self.signals = crawler.signals
        # Log formatter
        self.logformatter = crawler.logformatter
        self.slot = None
        self.spider = None
        self.running = False
        self.paused = False
        # Find the Scheduler class from settings
        self.scheduler_cls = load_object(self.settings['SCHEDULER'])
        # Likewise, find the Downloader class
        downloader_cls = load_object(self.settings['DOWNLOADER'])
        # Instantiate the Downloader
        self.downloader = downloader_cls(crawler)
        # Instantiate the Scraper, the bridge between the engine and the spiders
        self.scraper = Scraper(crawler)
        self._spider_closed_callback = spider_closed_callback

Here we can see that this mainly defines and initializes the other core components: the Scheduler, the Downloader and the Scraper. Of these, only the Scheduler's class is loaded at this point; it is not instantiated yet.

In other words, the engine is the brain of Scrapy: it is responsible for managing and scheduling these components so that they work together smoothly.

Let's look at how these core components are initialized in turn.

Scheduler

The scheduler is actually instantiated in the engine's open_spider method, but let's look at its initialization ahead of time.

class Scheduler(object):
    """Scheduler"""
    def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
                 logunser=False, stats=None, pqclass=None):
        # Request fingerprint filter
        self.df = dupefilter
        # Task queue folder
        self.dqdir = self._dqdir(jobdir)
        # Priority task queue class
        self.pqclass = pqclass
        # Disk task queue class
        self.dqclass = dqclass
        # Memory task queue class
        self.mqclass = mqclass
        # Whether to log unserializable requests
        self.logunser = logunser
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        # Get the fingerprint filter class
        dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
        # Instantiate the fingerprint filter
        dupefilter = dupefilter_cls.from_settings(settings)
        # Get the priority queue, disk queue and memory queue classes
        pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])
        dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])
        mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])
        # Switch for logging unserializable requests
        logunser = settings.getbool('LOG_UNSERIALIZABLE_REQUESTS',
                                    settings.getbool('SCHEDULER_DEBUG'))
        return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
                   stats=crawler.stats, pqclass=pqclass, dqclass=dqclass,
                   mqclass=mqclass)

As you can see, the initialization of the scheduler does two main things:

Instantiating the request fingerprint filter, which is mainly used to filter duplicate requests

Defining the different task queue classes: the priority queue, the disk-based queue and the memory-based queue

What is the request fingerprint filter?

In the configuration file, we can see that the default fingerprint filter defined is RFPDupeFilter:

class RFPDupeFilter(BaseDupeFilter):
    """Request fingerprint filter"""
    def __init__(self, path=None, debug=False):
        self.file = None
        # The fingerprint collection is an in-memory set
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        # Request fingerprints can also be persisted to disk
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

When the request fingerprint filter is initialized, it defines a fingerprint collection implemented as an in-memory set, and the path argument controls whether these fingerprints are also stored on disk for reuse on the next run.

In other words, the main responsibility of the fingerprint filter is to filter duplicate requests, and filter rules can be customized.
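
As a hedged sketch of that customizability (assuming the Scrapy version discussed here, where RFPDupeFilter exposes a request_fingerprint method), the filter below treats URLs that differ only in a hypothetical utm_source query parameter as duplicates; the class name, parameter and settings path are assumptions for illustration.

from scrapy.dupefilters import RFPDupeFilter
from w3lib.url import url_query_cleaner

class IgnoreUtmDupeFilter(RFPDupeFilter):
    """Illustrative filter: URLs differing only in 'utm_source' count as duplicates."""

    def request_fingerprint(self, request):
        # Strip the hypothetical tracking parameter before computing the
        # default fingerprint, so such URLs collapse to one fingerprint
        cleaned = url_query_cleaner(request.url, ('utm_source',), remove=True)
        return super(IgnoreUtmDupeFilter, self).request_fingerprint(
            request.replace(url=cleaned))

# settings.py (the module path is a placeholder):
# DUPEFILTER_CLASS = 'myproject.dupefilters.IgnoreUtmDupeFilter'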

In the next article we will see by what rules each request's fingerprint is generated and how the duplicate filtering logic is implemented; for now it is enough to know its role.

Let's take a look at the role of the task queues defined by the scheduler.

The scheduler defines two queue types by default:

Disk-based task queue: its storage path can be set in the configuration file, and queued tasks are persisted to disk across runs

Memory-based task queue: lives in memory for the duration of a run and is gone by the next start

The defaults in the configuration file are as follows:

# Disk-based task queue (LIFO)
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# Memory-based task queue (LIFO)
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
# Priority queue
SCHEDULER_PRIORITY_QUEUE = 'queuelib.PriorityQueue'

If we define the JOBDIR configuration item, the task queue is saved to disk while the crawler runs, and the next time the crawler starts it can be reloaded so the job continues where it left off.

If this configuration item is not defined, the memory queue is used by default.
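
For example, enabling disk persistence only takes one setting; the spider name and directory below are placeholders.

# settings.py -- persist the scheduler's disk queue and the fingerprint set
# between runs; the directory name is a placeholder
JOBDIR = 'crawls/quotes-run-1'

# The same thing can be done per run from the command line, e.g.:
#   scrapy crawl quotes -s JOBDIR=crawls/quotes-run-1
# Stopping the crawl and re-running the same command resumes the job.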

If you look carefully, you may notice that the default queue classes are all last-in, first-out (LIFO). What does that mean?

It means that while the crawler is running, newly generated crawl tasks pushed onto the queue are the first ones taken off on the next fetch, so they are executed with priority.

What does this imply? It means that Scrapy's default crawl order is depth-first!

How can this be changed to breadth-first crawling? Take a look at the scrapy.squeues module, where a number of queues are defined:

# FIFO disk queue (pickle serialization)
PickleFifoDiskQueue = _serializable_queue(queue.FifoDiskQueue,
                                          _pickle_serialize, pickle.loads)
# LIFO disk queue (pickle serialization)
PickleLifoDiskQueue = _serializable_queue(queue.LifoDiskQueue,
                                          _pickle_serialize, pickle.loads)
# FIFO disk queue (marshal serialization)
MarshalFifoDiskQueue = _serializable_queue(queue.FifoDiskQueue,
                                           marshal.dumps, marshal.loads)
# LIFO disk queue (marshal serialization)
MarshalLifoDiskQueue = _serializable_queue(queue.LifoDiskQueue,
                                           marshal.dumps, marshal.loads)
# FIFO memory queue
FifoMemoryQueue = queue.FifoMemoryQueue
# LIFO memory queue
LifoMemoryQueue = queue.LifoMemoryQueue

If we want crawling to be breadth-first, we only need to switch the queue classes to their FIFO versions in the configuration file, as sketched below. This also shows how loosely coupled Scrapy's components are: every module is customizable.
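
A minimal sketch of that settings.py override, using the FIFO queue classes from the listing above (DEPTH_PRIORITY is an extra knob recommended in Scrapy's documentation for broad crawls, not something covered in this article):

# settings.py -- switch the scheduler to FIFO queues for breadth-first crawling
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
# Recommended alongside the queue swap for broad crawls
DEPTH_PRIORITY = 1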

If you want to explore how these queues are implemented, refer to the scrapy/queuelib project on GitHub, maintained by the Scrapy authors, which contains the concrete implementations of these queues.

Downloader

Back to where the engine is initialized, let's take a look at how the downloader is initialized.

In the default configuration file default_settings.py, the downloader is configured as follows:

DOWNLOADER = 'scrapy.core.downloader.Downloader'

Let's look at the initialization of the Downloader class:

class Downloader(object):
    """Downloader"""
    def __init__(self, crawler):
        # Likewise, get the settings object
        self.settings = crawler.settings
        self.signals = crawler.signals
        self.slots = {}
        self.active = set()
        # Initialize DownloadHandlers
        self.handlers = DownloadHandlers(crawler)
        # Get the total concurrency from the configuration
        self.total_concurrency = self.settings.getint('CONCURRENT_REQUESTS')
        # Concurrency per domain
        self.domain_concurrency = self.settings.getint('CONCURRENT_REQUESTS_PER_DOMAIN')
        # Concurrency per IP
        self.ip_concurrency = self.settings.getint('CONCURRENT_REQUESTS_PER_IP')
        # Whether to randomize the download delay
        self.randomize_delay = self.settings.getbool('RANDOMIZE_DOWNLOAD_DELAY')
        # Initialize the downloader middleware
        self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
        self._slot_gc_loop = task.LoopingCall(self._slot_gc)
        self._slot_gc_loop.start(60)

In this process, the download handlers, the downloader middleware manager and the parameters that throttle request fetching are initialized from the configuration file.
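
These throttling parameters come from ordinary settings; here is a short sketch of tuning them in settings.py (the values are illustrative, not an endorsement of particular defaults).

# settings.py -- concurrency and delay knobs read by the downloader
CONCURRENT_REQUESTS = 32             # total concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per domain
CONCURRENT_REQUESTS_PER_IP = 0       # 0 disables the per-IP cap
DOWNLOAD_DELAY = 0.5                 # base delay between requests to a slot
RANDOMIZE_DOWNLOAD_DELAY = True      # add jitter to DOWNLOAD_DELAY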

So what do the download handlers do? And what is the downloader middleware responsible for?

Let's first take a look at DownloadHandlers:

class DownloadHandlers(object):
    """Download handlers"""
    def __init__(self, crawler):
        self._crawler = crawler
        # Maps each scheme to the class path used to instantiate its handler
        self._schemes = {}
        # Maps each scheme to its instantiated download handler
        self._handlers = {}
        self._notconfigured = {}
        # Look up DOWNLOAD_HANDLERS_BASE in the configuration to build the handlers
        # Note: getwithbase is called here to merge the XXXX and XXXX_BASE settings
        handlers = without_none_values(
            crawler.settings.getwithbase('DOWNLOAD_HANDLERS'))
        # Store the class path for each scheme, to be instantiated later
        for scheme, clspath in six.iteritems(handlers):
            self._schemes[scheme] = clspath
        crawler.signals.connect(self._close, signals.engine_stopped)

The download handlers are configured in the default configuration file as follows:

# User-customizable download handlers
DOWNLOAD_HANDLERS = {}
# Default download handlers
DOWNLOAD_HANDLERS_BASE = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}

From this you can see that the download handlers choose the appropriate handler for a resource according to its URL scheme; the ones we use most are http and https.

Note, however, that the handlers are not instantiated here; each one is initialized only once a network request for its scheme is actually issued, as described in a later article.
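
Because the mapping is built from the merged DOWNLOAD_HANDLERS settings, swapping or disabling handlers is just configuration. The sketch below disables the ftp handler and registers a hypothetical handler class for a made-up scheme; the class path is an assumption for illustration.

# settings.py -- per-scheme overrides, merged with DOWNLOAD_HANDLERS_BASE via
# getwithbase() as shown above
DOWNLOAD_HANDLERS = {
    'ftp': None,  # mapping a scheme to None disables its default handler
    # Hypothetical handler for a custom scheme; the class path is illustrative
    'myscheme': 'myproject.handlers.MySchemeDownloadHandler',
}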

Now let's look at the initialization of the downloader middleware manager, DownloaderMiddlewareManager. It is likewise initialized through the class method from_crawler, and it inherits from MiddlewareManager. Let's see what that base class does during initialization:

class MiddlewareManager(object):
    """Parent class of all middleware managers, providing the common middleware methods"""
    component_name = 'foo middleware'

    @classmethod
    def from_crawler(cls, crawler):
        # Call from_settings
        return cls.from_settings(crawler.settings, crawler)

    @classmethod
    def from_settings(cls, settings, crawler=None):
        # Call the subclass's _get_mwlist_from_settings to get the class paths
        # of all middlewares
        mwlist = cls._get_mwlist_from_settings(settings)
        middlewares = []
        enabled = []
        # Instantiate them one by one
        for clspath in mwlist:
            try:
                # Load the middleware class
                mwcls = load_object(clspath)
                # If the middleware class defines from_crawler, instantiate via that method
                if crawler and hasattr(mwcls, 'from_crawler'):
                    mw = mwcls.from_crawler(crawler)
                # If the middleware class defines from_settings, instantiate via that method
                elif hasattr(mwcls, 'from_settings'):
                    mw = mwcls.from_settings(settings)
                # Neither of the above: call the constructor directly
                else:
                    mw = mwcls()
                middlewares.append(mw)
                enabled.append(clspath)
            except NotConfigured as e:
                if e.args:
                    clsname = clspath.split('.')[-1]
                    logger.warning("Disabled %(clsname)s: %(eargs)s",
                                   {'clsname': clsname, 'eargs': e.args[0]},
                                   extra={'crawler': crawler})
        logger.info("Enabled %(componentname)ss:\n%(enabledlist)s",
                    {'componentname': cls.component_name,
                     'enabledlist': pprint.pformat(enabled)},
                    extra={'crawler': crawler})
        # Call the constructor
        return cls(*middlewares)

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Which middleware classes to load is defined by the subclass
        raise NotImplementedError

    def __init__(self, *middlewares):
        self.middlewares = middlewares
        # Register the middleware methods
        self.methods = defaultdict(list)
        for mw in middlewares:
            self._add_middleware(mw)

    def _add_middleware(self, mw):
        # Default behaviour, which subclasses may override
        # If the middleware class defines open_spider, add it to methods
        if hasattr(mw, 'open_spider'):
            self.methods['open_spider'].append(mw.open_spider)
        # If the middleware class defines close_spider, add it to methods
        # methods is a chain of middleware methods that will be called later
        if hasattr(mw, 'close_spider'):
            self.methods['close_spider'].insert(0, mw.close_spider)

DownloaderMiddlewareManager instantiation process:

class DownloaderMiddlewareManager(MiddlewareManager):
    """Downloader middleware manager"""
    component_name = 'downloader middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Get all downloader middlewares from DOWNLOADER_MIDDLEWARES_BASE
        # and DOWNLOADER_MIDDLEWARES in the configuration files
        return build_component_list(settings.getwithbase('DOWNLOADER_MIDDLEWARES'))

    def _add_middleware(self, mw):
        # Register the chains of request, response and exception methods
        # of the downloader middleware
        if hasattr(mw, 'process_request'):
            self.methods['process_request'].append(mw.process_request)
        if hasattr(mw, 'process_response'):
            self.methods['process_response'].insert(0, mw.process_response)
        if hasattr(mw, 'process_exception'):
            self.methods['process_exception'].insert(0, mw.process_exception)

The downloader middleware manager inherits from MiddlewareManager and overrides _add_middleware to register the default hooks that run before a download, after a download, and when an exception occurs.

It is worth pausing to think about the benefits of structuring middleware this way.

From here we can roughly see that when data flows from one component to another, it passes through a series of middlewares, each defining its own processing logic. It works like a pipeline: a request can be processed on its way in before being handed to the next component, and once that component has finished its work the result passes back through the same series of middlewares, which can process the response before it is finally output.
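
To make the pipeline idea concrete, here is a hedged sketch of a custom downloader middleware; the class name, header and priority number are assumptions for illustration, not part of Scrapy itself.

import logging
import time

logger = logging.getLogger(__name__)

class TimingDownloaderMiddleware(object):
    """Illustrative downloader middleware: tag requests on the way out and
    log slow responses on the way back."""

    def process_request(self, request, spider):
        # Runs before the request reaches the downloader; returning None
        # lets it continue through the remaining middlewares
        request.meta['ts_start'] = time.time()
        request.headers.setdefault('X-Crawled-By', spider.name)
        return None

    def process_response(self, request, response, spider):
        # Runs on the way back from the downloader towards the engine
        elapsed = time.time() - request.meta.get('ts_start', time.time())
        if elapsed > 5:
            logger.warning("Slow response (%.1fs): %s", elapsed, response.url)
        return response

# settings.py (the priority number 543 is illustrative):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.TimingDownloaderMiddleware': 543,
# }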

Scraper

After the downloader is instantiated, go back to the engine's initialization method, where the Scraper is instantiated next. As mentioned in "Scrapy source code analysis (1): Architecture Overview", this class does not appear in the architecture diagram, yet it sits between the Engine, the Spiders and the Pipelines, acting as a bridge that connects these three components.

Let's take a look at its initialization process:

class Scraper(object):
    def __init__(self, crawler):
        self.slot = None
        # Instantiate the spider middleware manager
        self.spidermw = SpiderMiddlewareManager.from_crawler(crawler)
        # Load the Pipeline processor class
        itemproc_cls = load_object(crawler.settings['ITEM_PROCESSOR'])
        # Instantiate the Pipeline processor
        self.itemproc = itemproc_cls.from_crawler(crawler)
        # Get the number of items processed concurrently from the configuration file
        self.concurrent_items = crawler.settings.getint('CONCURRENT_ITEMS')
        self.crawler = crawler
        self.signals = crawler.signals
        self.logformatter = crawler.logformatter

The Scraper creates a SpiderMiddlewareManager; its initialization process is as follows:

class SpiderMiddlewareManager(MiddlewareManager):
    """Spider middleware manager"""
    component_name = 'spider middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Get the default spider middleware classes from SPIDER_MIDDLEWARES_BASE
        # and SPIDER_MIDDLEWARES in the configuration files
        return build_component_list(settings.getwithbase('SPIDER_MIDDLEWARES'))

    def _add_middleware(self, mw):
        super(SpiderMiddlewareManager, self)._add_middleware(mw)
        # Register the spider middleware processing methods
        if hasattr(mw, 'process_spider_input'):
            self.methods['process_spider_input'].append(mw.process_spider_input)
        if hasattr(mw, 'process_spider_output'):
            self.methods['process_spider_output'].insert(0, mw.process_spider_output)
        if hasattr(mw, 'process_spider_exception'):
            self.methods['process_spider_exception'].insert(0, mw.process_spider_exception)
        if hasattr(mw, 'process_start_requests'):
            self.methods['process_start_requests'].insert(0, mw.process_start_requests)

The initialization of the spider middleware manager is similar to that of the downloader middleware manager: it first loads the default spider middleware classes from the configuration file and then registers their processing methods in turn. The default spider middleware classes defined in the configuration file are as follows:

# Default spider middleware classes
SPIDER_MIDDLEWARES_BASE = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}

The responsibilities of these default spider middlewares are:

HttpErrorMiddleware: handles responses with failed (non-200) HTTP status codes

OffsiteMiddleware: if allowed_domains is defined in the Spider, requests to other domains are automatically filtered out

RefererMiddleware: append Referer header information

UrlLengthMiddleware: filter requests whose URL length exceeds the limit

DepthMiddleware: filter crawl requests that exceed the specified depth

Of course, you can also define your own spider middleware here to handle whatever logic you need, as sketched below.
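
A hedged sketch of such a custom spider middleware, which filters the spider's output; the class name, the 'title' field and the priority number are assumptions for illustration.

class RequireTitleSpiderMiddleware(object):
    """Illustrative spider middleware: drop dict items without a 'title'
    field from the spider's output before they reach the engine."""

    def process_spider_output(self, response, result, spider):
        for element in result:
            # Requests pass through untouched; dict items must carry a title
            if isinstance(element, dict) and not element.get('title'):
                continue
            yield element

# settings.py (the priority number 545 is illustrative):
# SPIDER_MIDDLEWARES = {
#     'myproject.middlewares.RequireTitleSpiderMiddleware': 545,
# }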

After the spider middleware manager is initialized, the Pipeline component is initialized. The default Pipeline manager is ItemPipelineManager:

class ItemPipelineManager(MiddlewareManager):
    component_name = 'item pipeline'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Load the pipeline classes from ITEM_PIPELINES_BASE and ITEM_PIPELINES
        # in the configuration files
        return build_component_list(settings.getwithbase('ITEM_PIPELINES'))

    def _add_middleware(self, pipe):
        super(ItemPipelineManager, self)._add_middleware(pipe)
        # Register the default pipeline processing logic
        if hasattr(pipe, 'process_item'):
            self.methods['process_item'].append(pipe.process_item)

    def process_item(self, item, spider):
        # Call the process_item method of every enabled pipeline in turn
        return self._process_chain('process_item', item, spider)

We can see that ItemPipelineManager is also a subclass of MiddlewareManager, since its behaviour is very similar to middleware; but because its function is independent, it is counted as one of the core components in its own right.
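
For completeness, here is a hedged sketch of a pipeline whose process_item would end up registered in that chain; the class name, the 'title' field and the priority number are assumptions for illustration.

from scrapy.exceptions import DropItem

class CleanTitlePipeline(object):
    """Illustrative pipeline: normalize a 'title' field and drop items
    that do not have one."""

    def process_item(self, item, spider):
        # Each enabled pipeline's process_item is called in priority order
        title = item.get('title')
        if not title:
            raise DropItem("missing title in %r" % item)
        item['title'] = title.strip()
        return item

# settings.py (the priority number 300 is illustrative):
# ITEM_PIPELINES = {'myproject.pipelines.CleanTitlePipeline': 300}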

From the Scraper's initialization process we can see that it manages the data interaction between the Spiders and the Pipelines.

Thank you for reading. That concludes "What are the core components of Scrapy". After studying this article, you should have a deeper understanding of Scrapy's core components; how they are used in practice still needs to be verified through hands-on work.
