How do you understand the Scrapy framework in Python? This article analyzes the question in detail, hoping to help readers who want an answer find a simple and workable one.
Why is Scrapy a framework rather than a library?
A library is code that your program calls; a framework inverts that control: Scrapy drives the crawl loop itself and calls your code (spiders, pipelines, middlewares) at well-defined points. That inversion of control is why Scrapy is a framework.
How does Scrapy work?
Project structure
Before you can start crawling, you must create a new Scrapy project. Enter the directory where you want to store the code and run the following command:
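Assuming the project is named quotes (to match the module shown below):

    scrapy startproject quotes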
Note: when you create a project, the crawler project's directory is created under the current directory.
These files are:
scrapy.cfg: the configuration file for the project.
quotes/: the Python module of the project. You will add your code here.
quotes/items.py: the item definitions of the project.
quotes/middlewares.py: the spider middleware and downloader middleware (processing requests and responses).
quotes/pipelines.py: the pipelines of the project.
quotes/settings.py: the settings file for the project.
quotes/spiders/: the directory where the spider code is placed.
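Put together, a freshly generated project (again assuming the name quotes) looks like this:

    quotes/
        scrapy.cfg
        quotes/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py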
Scrapy schematic diagram
Overview of each component
Engine. The engine handles the data flow of the whole system and triggers events; it is the core of the framework.
Item. The Item defines the data structure of the crawled result; crawled data is assigned to Item objects.
Scheduler. The scheduler accepts requests from the engine, queues them, and supplies a request back whenever the engine asks for the next one.
Downloader. The downloader fetches web content and returns it to the spiders.
Spiders. Spiders define the crawling logic and the parsing rules for web pages; they are mainly responsible for parsing responses and producing results and new requests (a minimal Item and Spider sketch follows this list).
Item Pipeline. The item pipeline handles the items extracted from web pages by the spiders; its main tasks are cleaning, validating, and storing the data.
Downloader Middlewares. Downloader middleware is a hook framework between the engine and the downloader; it mainly processes the requests and responses passing between them.
Spider Middlewares. Spider middleware is a hook framework between the engine and the spiders; it mainly processes the spiders' input (responses) and output (results and new requests).
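To make Item and Spider concrete, here is a minimal sketch. The QuoteItem fields and the quotes.toscrape.com URL are illustrative assumptions, not taken from the article:

    import scrapy

    class QuoteItem(scrapy.Item):
        # The Item defines the data structure of a crawled result.
        # These two fields are illustrative assumptions.
        text = scrapy.Field()
        author = scrapy.Field()

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        # quotes.toscrape.com is a public practice site, used here only as an example.
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Parsing rules: extract data from the response and assign it to Items.
            for quote in response.css("div.quote"):
                item = QuoteItem()
                item["text"] = quote.css("span.text::text").get()
                item["author"] = quote.css("small.author::text").get()
                yield item
            # New requests: follow-up URLs go back to the engine and re-enter the Scheduler.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Note that parse yields both Items (which flow to the Item Pipeline) and Requests (which flow to the Scheduler); the engine routes each object to the right component.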
The flow of data
Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: receives Request objects from the engine, arranges and enqueues them in a defined order, and returns them to the engine when the engine needs them.
Downloader: downloads all Requests sent by the Scrapy Engine and returns the obtained Responses to the engine, which hands them to the Spider for processing.
Spider: processes all Responses, parses and extracts data from them to fill the Item fields, and submits the follow-up URLs to the engine, where they enter the Scheduler again.
Item Pipeline: processes the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, etc.); a pipeline sketch follows this list.
Downloader Middlewares: a component you can customize to extend the download functionality.
Spider Middlewares: a customizable component that extends the communication between the engine and the Spider (Responses going into the Spider and Requests coming out of it).
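A minimal Item Pipeline sketch for the clean/validate/store step. The file name quotes.jsonl and the validation rule are arbitrary choices for illustration:

    import json
    from scrapy.exceptions import DropItem

    class QuotesPipeline:
        def open_spider(self, spider):
            # Open the output file once, when the spider starts.
            self.file = open("quotes.jsonl", "w", encoding="utf-8")

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # Validate: discard items that have no text.
            if not item.get("text"):
                raise DropItem("missing text")
            # Store: write one JSON object per line.
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item

A pipeline only runs once it is enabled in settings.py; the number is a priority between 0 and 1000 (300 here is arbitrary):

    ITEM_PIPELINES = {
        "quotes.pipelines.QuotesPipeline": 300,
    }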
This is the answer to the question of how to understand the structure of the Scrapy framework in Python. I hope the above content has been of some help to you.