2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
What does an entry-level Scrapy tutorial cover? Many newcomers are unclear about this, so the following walks through what Scrapy is, what each of its components does, and how a crawl runs from start to finish.
Scrapy is an application framework written in Python for crawling websites and extracting structured data.
Scrapy is used in a range of applications, including data mining, information processing, and archiving historical data.
With the Scrapy framework, a crawler that captures the content or images of a specified website can usually be implemented quite simply.
Scrapy Engine (engine): responsible for communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler (scheduler): receives the Requests sent by the engine, orders them, puts them in a queue, and returns them to the engine when the engine asks for them.
Downloader (downloader): downloads every Request sent by the engine and returns the resulting Responses to the engine, which hands them to the Spider for processing.
Spider (crawler): processes all Responses, parses them to extract the data needed for the Item fields, and submits the URLs that need to be followed up to the engine, which passes them to the Scheduler again.
Item Pipeline (pipeline): processes the Items produced by the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).
Downloader Middlewares (download middleware): components you can customize to extend the download functionality.
Spider Middlewares (spider middleware): components you can customize to extend and act on the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of the Spider).
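As an illustration of the kind of customization a download middleware allows, here is a minimal sketch. The class name FallbackUserAgentMiddleware and the user-agent string are hypothetical; the sketch assumes the classic hook signature process_request(self, request, spider), where returning None lets processing continue. A small stand-in object replaces a real Scrapy request so the example runs without Scrapy installed.

```python
from types import SimpleNamespace


class FallbackUserAgentMiddleware:
    """Download middleware sketch: give requests a User-Agent if they lack one."""

    def __init__(self, user_agent="example-crawler/0.1"):  # hypothetical value
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Hook called for each request on its way to the Downloader.
        # Only fill in the header when the request does not already carry one.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None means: continue handling this request normally.
        return None


# Stand-in for a real request object (assumption: headers behave like a dict).
fake_request = SimpleNamespace(headers={})
FallbackUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers)
```

In a real project, a class like this would be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py.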
How Scrapy runs
Once the code is written and the program starts, the components talk to each other roughly as follows.
1. Engine: Hi, Spider! Which website do you want to deal with?
2. Spider: The boss asked me to deal with xxxx.com.
3. Engine: Give me the first URL you need to deal with.
4. Spider: Here you are. The first URL is xxxxxxx.com.
5. Engine: Hi, Scheduler! I have a request here; please put it in the queue for me.
6. Scheduler: OK, I'm working on it. Wait a minute.
7. Engine: Hi, Scheduler! Give me the request you have processed.
8. Scheduler: Here you are, this is the request I took care of.
9. Engine: Hi, Downloader! Please download this request for me according to the boss's download middleware settings.
10. Downloader: All right! Here is what was downloaded. (If it fails: sorry, this request failed to download. The engine then tells the Scheduler that the request failed, so that it is recorded and downloaded again later.)
11. Engine: Hi, Spider! This is the downloaded response, already handled according to the boss's download middleware. Take care of it yourself. (Note: by default, responses are handed to the parse() function.)
12. Spider: (after processing the data, for the URLs that need to be followed up) Hi, Engine! I have two results here: these are the URLs I need to follow up, and this is the Item data I extracted.
13. Engine: Hi, Pipeline! I have an Item here; take care of it for me. Scheduler! These are follow-up URLs; take care of them for me. Then the cycle repeats from step 4 until all the information the boss needs has been collected.
14. Pipeline and Scheduler: OK, doing it now!
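The conversation above can be sketched as a toy loop in plain Python. This is not Scrapy's actual implementation (which is asynchronous and far more capable); it is a simplified model under stated assumptions: FAKE_SITE, download, parse, and run are all hypothetical names, the "website" is an in-memory dictionary, items are dicts, and follow-up requests are plain URL strings.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, links found on the page).
FAKE_SITE = {
    "http://example.test/": ("home", ["http://example.test/a", "http://example.test/b"]),
    "http://example.test/a": ("page a", ["http://example.test/b"]),
    "http://example.test/b": ("page b", []),
}


def download(url):
    """Downloader stand-in: look the page up instead of using the network."""
    text, links = FAKE_SITE[url]
    return {"url": url, "body": text, "links": links}


def parse(response):
    """Spider stand-in: yield one item (a dict), then follow-up URLs (strings)."""
    yield {"url": response["url"], "text": response["body"]}
    yield from response["links"]


def run(start_url):
    scheduler = deque([start_url])  # queue of pending requests
    seen = {start_url}              # simplification: never schedule a URL twice
    stored = []                     # what the pipeline ends up holding
    while scheduler:                # the crawl stops when the queue is empty
        url = scheduler.popleft()   # engine asks the scheduler for a request
        response = download(url)    # engine hands the request to the downloader
        for result in parse(response):  # engine hands the response to the spider
            if isinstance(result, dict):
                stored.append(result)      # items go to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)   # follow-up URLs go back to the scheduler
    return stored


items = run("http://example.test/")
print([item["url"] for item in items])
```

The while loop mirrors the stop condition of the real framework: work continues as long as the scheduler still holds requests.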
Note: the entire program stops only when there are no requests left in the Scheduler. (That is, Scrapy will also download again the URLs that failed to download.)
It takes 4 steps to make a Scrapy crawler:
New project (scrapy startproject xxx): create a new crawler project
Clear goals (write items.py): define the data you want to capture
Make a crawler (spiders/xxspider.py): write the spider and start crawling web pages
Store the content (pipelines.py): design a pipeline to store the crawled content
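For the last step, a pipeline in pipelines.py might look like the following minimal sketch. The class name JsonLinesPipeline and the file name items.jl are hypothetical, and the sketch assumes the classic hook signatures open_spider, process_item(self, item, spider), and close_spider. An item pipeline is an ordinary Python class, so the example is driven by hand here and runs without Scrapy installed.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Item pipeline sketch: write each item as one JSON object per line."""

    def __init__(self, path="items.jl"):  # hypothetical output path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields. Returning the item
        # lets any later pipeline keep processing it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.file.close()


# Drive the pipeline by hand, the way the framework would during a crawl.
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "hello"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

In a real project, the class would be switched on through the ITEM_PIPELINES setting in settings.py.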