How to analyze the knowledge of Scrapy framework

2025-04-10 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article introduces the main components of the Scrapy framework and how they work together. The content is fairly detailed, and I hope it is a useful reference for interested readers.

Today I will write about what I have learned about the framework.

Spiders (crawler): responsible for processing all Responses, parsing them and extracting the data needed for the Item fields, and submitting any URLs that need to be followed up to the engine, which passes them to the Scheduler again.

Engine (engine): responsible for communication, signals and data transfer among the Spider, Item Pipeline, Downloader and Scheduler.

Scheduler (scheduler): responsible for receiving the Requests sent by the engine, ordering them in a certain way, adding them to the queue, and returning them to the engine when the engine asks for them.
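To make "ordering them in a certain way and adding them to the queue" more concrete, here is a toy model of a scheduler written with the standard library only. It is not Scrapy's actual scheduler; the class name ToyScheduler, the priority rule and the duplicate check are assumptions made for the illustration.

```python
import heapq
import itertools


class ToyScheduler:
    """Toy request queue: higher priority first, FIFO within one priority,
    and URLs that were already queued are skipped."""

    def __init__(self):
        self._heap = []                    # entries: (-priority, sequence, url)
        self._seen = set()                 # URLs that were already enqueued
        self._counter = itertools.count()  # tie-breaker that keeps FIFO order

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it was seen before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_request(self):
        """Return the next URL for the engine, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("https://example.com/a")
scheduler.enqueue("https://example.com/b", priority=10)
scheduler.enqueue("https://example.com/a")  # duplicate, ignored
first = scheduler.next_request()
second = scheduler.next_request()
third = scheduler.next_request()
print(first, second, third)
```

The higher-priority URL comes out first, the duplicate is never returned, and None signals that nothing is left to hand to the engine.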

Downloader (downloader): responsible for downloading all Requests sent by the Scrapy Engine and returning the resulting Responses to the Scrapy Engine, which hands them over to the Spider for processing.

Item Pipeline (pipeline): responsible for processing the Items obtained by the Spider and performing post-processing on them (detailed analysis, filtering, storage, etc.).
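A pipeline is an ordinary Python class with a process_item method. The sketch below cleans string fields and stores every item as one JSON line; the class name, the file path and the demo item are assumptions for the example, and the method signatures follow the classic (item, spider) form.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Strip whitespace from string fields and write one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path  # output file; the default name is an assumption
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Post-processing: clean the fields, then store the item.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }
        self.file.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
        return cleaned  # returned so that later pipelines can use it


# Demo without running a crawl: call the hooks by hand.
path = os.path.join(tempfile.gettempdir(), "scrapy_pipeline_demo.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
result = pipeline.process_item({"text": "  hello  ", "author": "Ann"}, spider=None)
pipeline.close_spider(spider=None)
print(result)
```

In a project, a pipeline like this is switched on through the ITEM_PIPELINES setting, which maps the class path to an order number.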

Downloader Middlewares (download middleware): customizable components that extend the download functionality.
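The workflow in this article names User-Agent and proxy settings as the main uses of downloader middleware. A minimal sketch of such a middleware could look like this. The class name, the User-Agent strings and the PROXY attribute are made up for the example, and the demo uses a stand-in object instead of a real Scrapy Request.

```python
import random
from types import SimpleNamespace


class RandomUserAgentMiddleware:
    # Hypothetical values; replace them with real ones in a project.
    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
    ]
    PROXY = None  # for example "http://127.0.0.1:8080"

    def process_request(self, request, spider):
        # Runs for each request on its way from the engine to the downloader.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        if self.PROXY:
            request.meta["proxy"] = self.PROXY
        return None  # None means: keep processing this request as usual


# Demo with a stand-in object that has the same two attributes as a Request.
fake_request = SimpleNamespace(headers={}, meta={})
returned = RandomUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

A middleware like this is enabled through the DOWNLOADER_MIDDLEWARES setting.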

Spider Middlewares (Spider middleware): customizable components that extend and operate on the communication between the engine and the Spider (such as Responses going into the Spider and Requests coming out of the Spider).
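As a sketch of what a spider middleware can do with what comes out of the Spider, the class below drops items whose fields are all empty and passes everything else through. The class name and the filtering rule are assumptions for the example. It uses the classic synchronous process_spider_output(response, result, spider) hook, and the demo calls it by hand with plain dictionaries.

```python
class DropEmptyItemsMiddleware:
    """Filter the spider's output before it reaches the engine."""

    def process_spider_output(self, response, result, spider):
        for element in result:
            # Drop dict items in which every value is empty or None.
            if isinstance(element, dict) and not any(element.values()):
                continue
            yield element  # requests and non-empty items pass through


# Demo: the "result" stands in for what a spider callback would yield.
spider_output = [{"text": "hello"}, {"text": ""}, {"text": None}, {"text": "bye"}]
kept = list(
    DropEmptyItemsMiddleware().process_spider_output(
        response=None, result=spider_output, spider=None
    )
)
print(kept)
```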

Scrapy uses Twisted ['twɪstɪd] (its main rival is Tornado), an asynchronous networking framework, to handle network communication. This speeds up downloads, and we do not have to implement an asynchronous framework ourselves. Scrapy also provides a variety of middleware interfaces, so it can flexibly meet all kinds of requirements.
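To show why asynchronous network handling speeds up downloads, here is a small sketch that uses Python's built-in asyncio in place of Twisted (a deliberate substitution to keep the example free of dependencies). The downloads are simulated with a short sleep, and the URLs are placeholders.

```python
import asyncio
import time


async def fake_download(url, delay):
    # The sleep stands in for waiting on the network.
    await asyncio.sleep(delay)
    return f"response from {url}"


async def download_all(urls, delay=0.1):
    # All downloads wait at the same time instead of one after another.
    return await asyncio.gather(*(fake_download(url, delay) for url in urls))


urls = [f"https://example.com/page/{n}" for n in range(5)]
start = time.perf_counter()
responses = asyncio.run(download_all(urls))
elapsed = time.perf_counter() - start
print(len(responses), "responses in", round(elapsed, 2), "seconds")
```

Five sequential downloads of 0.1 seconds each would take about 0.5 seconds; run concurrently, the total stays close to the time of a single download.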

The workflow of Scrapy, which I drew myself (diagram not shown), goes as follows:

1. First, the Spiders (crawler) hand the URLs to be requested (Requests) to the Scheduler (scheduler) via the Scrapy Engine (engine).

2. After the Scheduler has sorted the requests and added them to the queue, they are handed to the Downloader via the Scrapy Engine and the Downloader Middlewares (optional; mainly used for the User-Agent and proxy settings).

3. The Downloader sends the request to the Internet and receives the response (Response). The response is sent to the Spiders via the Scrapy Engine and the Spider Middlewares (optional).

4. The Spiders process the response, extract the data and send it to the Item Pipeline via the Scrapy Engine for storage (either in a local file or in a database).

5. The Spiders also extract new URLs and pass them to the Scheduler via the Scrapy Engine for the next cycle. The program stops when there are no more URL requests.

This is the basic principle.

URL --> packaged as a Request --> engine --> Scheduler --> requests are sorted and queued --> engine --> Downloader --> requests the web page and gets a Response --> Spiders (parse the response data) --> Item Pipeline (save the data)

--> new URL, new Requests (the loop continues until there are no URLs left)
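The loop above can be imitated in a few lines of plain Python. This is a toy model of the data flow only, not how Scrapy is implemented: the fake web pages, the function names and the single-threaded loop are all assumptions for the illustration.

```python
from collections import deque

# Fake "Internet": URL -> (data on the page, links found on the page).
FAKE_WEB = {
    "page1": ("data from page 1", ["page2", "page3"]),
    "page2": ("data from page 2", ["page3"]),
    "page3": ("data from page 3", []),
}


def download(url):
    """Downloader: fetch the response for a request."""
    return FAKE_WEB[url]


def parse(response):
    """Spider: extract one item and the follow-up URLs from a response."""
    data, links = response
    return {"data": data}, links


def crawl(start_urls):
    """Engine: move requests, responses and items between the components."""
    scheduler = deque(start_urls)  # scheduler queue
    seen = set(start_urls)         # URLs that were already queued
    stored = []                    # stands in for the pipeline's storage
    while scheduler:               # stop when there are no URLs left
        url = scheduler.popleft()
        response = download(url)
        item, new_urls = parse(response)
        stored.append(item)        # item goes to the pipeline
        for new_url in new_urls:   # new URLs go back to the scheduler
            if new_url not in seen:
                seen.add(new_url)
                scheduler.append(new_url)
    return stored


items = crawl(["page1"])
print([item["data"] for item in items])
# ['data from page 1', 'data from page 2', 'data from page 3']
```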

This concludes the overview of the Scrapy framework. I hope the content above is helpful and that you learn something from it. If you think the article is good, feel free to share it so that more people can see it.

