Many newcomers are unsure how to analyze the Scrapy framework, so this article summarizes how the framework works and how to use it. I hope it helps you answer that question.
Scrapy framework
Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it has a wide range of uses. Because the framework is so powerful, a user only needs to customize a few modules to easily build a crawler that grabs web content and all kinds of images, which is very convenient. Scrapy uses the Twisted asynchronous networking framework (its main rival is Tornado) to handle network communication, which speeds up downloads without requiring you to implement an asynchronous framework yourself, and it includes a variety of middleware interfaces that can flexibly meet all kinds of needs.
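To show how little custom code is needed, here is a minimal, self-contained spider. This is only a sketch: the demo site quotes.toscrape.com, the file name, and the CSS selectors are illustrative assumptions rather than anything from this article. It can be run without creating a full project:

# quotes_spider.py -- a minimal, self-contained Scrapy spider (sketch).
# Run it without a project:  scrapy runspider quotes_spider.py -o quotes.json
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract structured data from each quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link; Scrapy schedules the new request
        # through its Twisted-based engine, so downloads run asynchronously.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)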
Scrapy architecture diagram (the green line is the data flow):
1. Scrapy Engine: responsible for communication, signalling, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
2. Scheduler: receives the Requests sent by the engine, arranges them in a certain order, enqueues them, and hands them back to the engine when needed.
3. Downloader: downloads all the Requests sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them to the Spider for processing.
4. Spider: processes all Responses, analyzes and extracts data from them to fill the Item fields, and submits any URLs that need follow-up to the engine, where they re-enter the Scheduler.
5. Item Pipeline: processes the Items obtained by the Spider and performs post-processing (detailed analysis, filtering, storage, and so on).
6. Downloader Middlewares: a component you can customize to extend the download functionality.
7. Spider Middlewares: a component you can customize to extend and operate on the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).
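To show where these pluggable components are attached, here is a sketch of the relevant entries in a project's settings.py; the class paths are hypothetical placeholders, and the priority numbers simply follow Scrapy's usual conventions:

# settings.py (sketch) -- wiring up the components listed above.
BOT_NAME = "myproject"

# Item Pipelines run in ascending order of their value (0-1000).
ITEM_PIPELINES = {
    "myproject.pipelines.MyprojectPipeline": 300,
}

# Downloader Middlewares: lower numbers sit closer to the engine,
# higher numbers closer to the downloader.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.MyDownloaderMiddleware": 543,
}

# Spider Middlewares: lower numbers sit closer to the engine,
# higher numbers closer to the spider.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.MySpiderMiddleware": 543,
}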
The operation process of Scrapy
Once the code is written, the program runs roughly as follows.
1. Engine: Hi! Spider, which website do you want to deal with?
2. Spider: The boss wants me to deal with xxxx.com.
3. Engine: Give me the first URL that needs to be processed.
4. Spider: Here you are, the first URL is xxxxxxx.com.
5. Engine: Hi! Scheduler, I have a request here; please sort it into your queue for me.
6. Scheduler: OK, I'm working on it. Wait a moment.
7. Engine: Hi! Scheduler, give me a request that you have processed.
8. Scheduler: Here you are, this is a request I have taken care of.
9. Engine: Hi! Downloader, please download this request for me according to the boss's downloader middleware settings.
10. Downloader: All right! Here you are, here is what was downloaded. (If it fails: Sorry, this request failed to download. The engine then tells the scheduler that the request failed, so record it and we will download it again later.)
11. Engine: Hi! Spider, this is the downloaded content, and it has already been handled according to the boss's downloader middleware; deal with it yourself. (Note: by default, the responses here are passed to the function def parse().)
12. Spider: (after processing the data, for URLs that need follow-up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the Item data I obtained.
13. Engine: Hi! Pipeline, I have an Item here; take care of it for me! Scheduler! This is a follow-up URL; take care of it for me. Then the cycle starts again from step 4, until the boss has all the information needed.
14. Pipeline and Scheduler: OK, doing it now!
Note: the whole program stops only when the scheduler has no more requests to process. (URLs that failed to download will also be re-downloaded by Scrapy.)
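The re-downloading of failed requests mentioned above is handled by Scrapy's built-in RetryMiddleware; the settings below control it (the values shown are just an illustrative sketch):

# settings.py (sketch) -- retry behaviour for failed downloads, handled by
# scrapy.downloadermiddlewares.retry.RetryMiddleware.
RETRY_ENABLED = True       # retry failed requests
RETRY_TIMES = 2            # number of extra attempts per failed request
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]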
It takes four steps to make a Scrapy crawler (a minimal sketch of each step follows this list):
New project (scrapy startproject xxx): create a new crawler project.
Clear goals (write items.py): define the data you want to capture.
Make a spider (spiders/xxspider.py): write the crawler and start crawling pages.
Store content (pipelines.py): design a pipeline to store the crawled content.
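Below is a sketch of what those four steps might produce. The project name "myproject", the spider name "myspider", the site example.com, and the selectors are hypothetical placeholders, and in a real project the item, spider, and pipeline would live in the separate files named in the comments.

# Step 1 -- create the project (run in a shell):
#   scrapy startproject myproject
#   cd myproject

# Step 2 -- items.py: define the fields you want to capture.
import json
import scrapy

class MyprojectItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

# Step 3 -- spiders/myspider.py: the crawler itself.
class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Responses arrive here by default; yield Items for the pipeline
        # and Requests for the scheduler.
        for link in response.css("a"):
            item = MyprojectItem()
            item["title"] = link.css("::text").get()
            item["link"] = link.css("::attr(href)").get()
            yield item

# Step 4 -- pipelines.py: post-process and store every Item the spider yields.
class MyprojectPipeline:
    def open_spider(self, spider):
        self.file = open("output.jsonl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()

# Run the spider (after enabling the pipeline in settings.py via ITEM_PIPELINES):
#   scrapy crawl myspider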
After reading the above, have you mastered how to analyze the Scrapy framework? If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!