A preliminary study of Scrapy architecture

2025-04-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Scrapy data flow

The data flow in Scrapy is controlled by the execution engine. The steps below are excerpted from the official Scrapy documentation. After each step I have added comments, based on my own guesses, to point out directions for the further development of the GooSeeker open source crawler:

The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.

Who prepares the URLs? It seems that the Spider prepares them itself, so my guess is that the Scrapy core (everything except the Spider) mainly handles event scheduling and does not concern itself with URL storage. This resembles the crawler compass in the GooSeeker member center, where a batch of URLs for the target website is prepared and loaded into the compass, which then schedules the crawl. So the next goal of this open source project is to move URL management into a centralized scheduling library.
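To make this first step concrete, here is a minimal sketch in plain Python rather than Scrapy's own code. The `Request` class, `ToySpider`, `seed_scheduler`, and the example URLs are illustrative assumptions. Only the `start_urls` attribute name follows the convention that Scrapy spiders use for their seed URLs.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Toy stand-in for a crawl request; not Scrapy's own class."""
    url: str


class ToySpider:
    # Hypothetical seed URLs: they belong to the spider, not to the core.
    start_urls = ["https://example.com/a", "https://example.com/b"]

    def start_requests(self):
        """Wrap every seed URL in a Request object."""
        for url in self.start_urls:
            yield Request(url)


def seed_scheduler(spider, queue):
    """The engine pulls the first Requests from the spider and queues them."""
    for request in spider.start_requests():
        queue.append(request)
    return queue


queue = seed_scheduler(ToySpider(), deque())
print([request.url for request in queue])
```

The core only sees Request objects arriving in a queue. Where the URLs come from is entirely the spider's business, which is the separation guessed at above.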

The Engine asks the Scheduler for the next URLs to crawl.

This step is hard to understand on its own; you have to read some other documents first. Continuing from point 1: after the Engine gets the URLs from the Spider, it wraps each one in a Request and hands it to the event loop, where the Scheduler collects it for scheduling management. For now I understand this as queuing the Requests. The Engine then asks the Scheduler for the address of the next web page to download.
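The "queue of Requests" reading can be sketched as follows. This is a simplification and not Scrapy's scheduler: the `ToyScheduler` class, its method names, the first-in-first-out order, and the duplicate check are all assumptions made for illustration.

```python
from collections import deque


class ToyScheduler:
    """Queue of pending URLs; a simplification, not Scrapy's scheduler."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def enqueue(self, url):
        """Queue a URL unless it was seen before; return True if it was queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def next_request(self):
        """Return the next URL to download, or None when nothing is pending."""
        return self._pending.popleft() if self._pending else None


scheduler = ToyScheduler()
for url in ["https://example.com/1", "https://example.com/2", "https://example.com/1"]:
    scheduler.enqueue(url)

first = scheduler.next_request()
second = scheduler.next_request()
third = scheduler.next_request()  # the duplicate was dropped, so the queue is empty
print(first, second, third)
```

Returning `None` when the queue is empty gives the engine a simple signal that there is nothing left to download.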

The Scheduler returns the next URLs to crawl to the Engine and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).

The Engine requests tasks from the Scheduler and hands them to the Downloader. Between the Engine and the Downloader sits the downloader middleware, an essential feature of a development framework: it is where developers can plug in their own customized extensions.
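The idea of a middleware layer in the request direction can be sketched as a chain of functions that each get a chance to modify the request before the downloader sees it. Everything here is hypothetical: requests are plain dicts, `fake_download` performs no network access, and the two middlewares are invented examples, not part of Scrapy.

```python
def add_user_agent(request):
    """Hypothetical middleware: set a default User-Agent header."""
    request["headers"].setdefault("User-Agent", "toy-crawler/0.1")
    return request


def add_timeout(request):
    """Hypothetical middleware: attach a default timeout in seconds."""
    request.setdefault("timeout", 30)
    return request


def fake_download(request):
    """Fake downloader: no network access, it only echoes what it was given."""
    return {
        "url": request["url"],
        "status": 200,
        "sent_headers": dict(request["headers"]),
    }


def send_request(request, middlewares, downloader):
    """Pass the request through each middleware in order, then download it."""
    for middleware in middlewares:
        request = middleware(request)
    return downloader(request)


request = {"url": "https://example.com/", "headers": {}}
response = send_request(request, [add_user_agent, add_timeout], fake_download)
print(response["sent_headers"])
```

Because each middleware only takes and returns a request, a developer can add, remove, or reorder extensions without touching the engine or the downloader.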

Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).

When the download is complete, a Response is generated and handed to the Engine through the downloader middleware. Note that Response, like the earlier Request, is capitalized. Although I haven't looked at other Scrapy documents, I guess these are event objects within the Scrapy framework. I also speculate that the Engine is asynchronous and event-driven, just like the three-level event loop of the DS counter, which is necessary for a high-performance, low-overhead engine.
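Why an event-driven engine suits a crawler can be shown with a small sketch. Scrapy itself is built on Twisted; the code below uses Python's standard `asyncio` purely as a stand-in, and the URLs, delays, and function names are assumptions. Downloads are simulated with sleeps, so no network access takes place.

```python
import asyncio


async def fake_download(url, delay):
    """Simulate a download with a sleep instead of real network I/O."""
    await asyncio.sleep(delay)
    return {"url": url, "status": 200}


async def crawl(jobs):
    """Start all downloads at once and record responses as they arrive."""
    finished = []

    async def worker(url, delay):
        response = await fake_download(url, delay)
        finished.append(response["url"])  # the "Response arrived" event

    await asyncio.gather(*(worker(url, delay) for url, delay in jobs))
    return finished


jobs = [("https://example.com/slow", 0.1), ("https://example.com/fast", 0.01)]
order = asyncio.run(crawl(jobs))
print(order)
```

The fast page is handled first even though it was requested second: while one download is waiting, the event loop is free to process others, which is where the low overhead comes from.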

The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).

Here again there is a middleware layer, which gives developers plenty of room for customization.

The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine.

Each Spider handles web pages one at a time: when it finishes processing one page, it constructs another Request event to start crawling the next page.
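A spider callback that returns both scraped items and new Requests can be sketched as a generator. This is an illustration under stated assumptions: the response is a plain dict, the `Request` class is a toy, and the regular expressions are a crude substitute for a real HTML parser.

```python
import re
from dataclasses import dataclass


@dataclass
class Request:
    """Toy stand-in for a request to follow; not Scrapy's own class."""
    url: str


def parse(response):
    """Yield one item for the page, then one Request per link found in it."""
    title = re.search(r"<title>(.*?)</title>", response["body"])
    yield {"url": response["url"], "title": title.group(1) if title else None}
    for href in re.findall(r'href="([^"]+)"', response["body"]):
        yield Request(href)


response = {
    "url": "https://example.com",
    "body": '<title>Home</title><a href="https://example.com/next">next</a>',
}
results = list(parse(response))
print(results)
```

Mixing items and Requests in one stream keeps the callback simple; it is then up to the engine to tell the two kinds of result apart.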

The Engine passes scraped items and new Requests returned by a spider through Spider Middleware (output direction), and then sends processed items to Item Pipelines and processed Requests to the Scheduler.

Here the Engine acts as the event dispatcher.
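The dispatching role can be sketched as a function that routes whatever the spider returned: Requests go back to the queue, and everything else is treated as an item and sent through the pipeline stages. The names `dispatch` and `strip_title`, and the dict-based items, are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Toy stand-in for a request to follow; not Scrapy's own class."""
    url: str


def strip_title(item):
    """Hypothetical pipeline stage: trim whitespace around the title."""
    return {**item, "title": item["title"].strip()}


def dispatch(spider_output, pipelines, queue):
    """Queue every Request; run every other object through the pipeline stages."""
    stored = []
    for obj in spider_output:
        if isinstance(obj, Request):
            queue.append(obj)
        else:
            for stage in pipelines:
                obj = stage(obj)
            stored.append(obj)
    return stored


queue = deque()
output = [{"title": "  Home "}, Request("https://example.com/next")]
stored = dispatch(output, [strip_title], queue)
print(stored, list(queue))
```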

The process repeats (from step 1) until there are no more requests from the Scheduler.

In other words, the loop runs continuously until the Scheduler has no Requests left.
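Putting the steps together, the whole loop can be sketched in a few lines. This is a toy model, not how Scrapy is implemented: the `SITE` dictionary is a hypothetical in-memory website, requests are bare URL strings, and everything runs synchronously.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page title, outgoing links).
SITE = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("Page A", ["/b"]),
    "/b": ("Page B", ["/"]),
}


def download(url):
    """Fake downloader: look the page up instead of fetching it."""
    title, links = SITE[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Spider callback: yield one item, then each link as a new request."""
    yield {"url": response["url"], "title": response["title"]}
    for link in response["links"]:
        yield link  # a new request, represented here by a bare URL string


def run(start_urls):
    """Repeat the cycle until the queue of pending requests is empty."""
    queue, seen, items = deque(start_urls), set(start_urls), []
    while queue:
        response = download(queue.popleft())
        for result in parse(response):
            if isinstance(result, dict):
                items.append(result)   # items go to the pipeline
            elif result not in seen:
                seen.add(result)
                queue.append(result)   # unseen requests go back to the queue
    return items


items = run(["/"])
print([item["url"] for item in items])
```

The `seen` set matters: the pages link to each other in a cycle, so without it the queue would never drain and the loop would never end.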
