
Example Analysis of Scrapy Web Crawler Framework

2025-02-28 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/03 Report--

This article explains the Scrapy web crawler framework through a worked example. The editor finds it very practical and shares it here for your reference; I hope you gain something from reading it.

1. Scrapy crawler framework

Scrapy is a crawler framework written in the Python programming language. Anyone can modify it to suit their own needs, and it is very convenient to use. It can be applied to data acquisition, data mining, detection of abnormal network users, data storage, and so on.

Scrapy uses the Twisted asynchronous networking library to handle network traffic. The overall architecture is roughly shown in the figure below.

2. As can be seen from the figure above, the Scrapy crawler framework consists mainly of five parts: the Scrapy Engine, the Scheduler, the Downloader, the Spiders, and the Item Pipeline. During crawling, the Scrapy engine issues a request, the scheduler hands the initial URL to the downloader, and the downloader sends a request to the web server. Once a response is received, the downloaded page content is handed to the spider for processing, and the spider parses the page in detail. The spider's parsing produces two kinds of results: one is a new URL, which is sent back to the scheduler to start a new round of crawling, repeating the process above; the other is the required data, which is passed on to the item pipeline for further processing. The item pipeline is responsible for post-processing the data (cleaning, validation, filtering, de-duplication, and storage); finally the pipeline outputs it to a file or stores it in a database.
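
To make the two parsing outcomes concrete, here is a minimal spider sketch; the spider name, start URL, and field names are assumptions for illustration and do not come from the original article. It yields new Requests, which go back to the scheduler, and items, which flow into the item pipeline.

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    """Minimal sketch: one parse callback produces both new requests and items."""
    name = "example"                       # hypothetical spider name
    start_urls = ["https://example.com"]   # hypothetical start URL

    def parse(self, response):
        # Outcome 1: new URLs are wrapped as Requests and handed back to the scheduler.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

        # Outcome 2: the required data is yielded as an item and passed to the item pipeline.
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
        }
```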

3. The functions of these five components and the middleware are as follows:

1) Scrapy Engine: controls the data-processing flow of the whole system, triggers transactions, and is responsible for connecting the other modules.

2) Scheduler: maintains the queue of URLs to be crawled. When it receives a request from the engine, it takes the next URL from the queue and returns it to the engine.

3) Downloader: sends requests to the web server to download web pages, and hands the downloaded page content over to the spider for processing.

4) Spiders: define the addresses of the websites to be crawled, select the required data, and define domain filtering rules, page parsing rules, and so on.

5) Item Pipeline: processes the data extracted from web pages by the spiders; its main tasks are cleaning, validating, filtering, de-duplicating, and storing the data (a minimal pipeline sketch follows this list).

6) Middlewares: components that sit between the Scrapy engine and the Scheduler, Downloader, and Spiders, and that mainly handle the requests and responses passing between them.
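
As a rough illustration of the item pipeline's cleaning and de-duplication role, here is a minimal sketch; the class name and the "url"/"title" field names are assumptions for illustration, not taken from the original project.

```python
from scrapy.exceptions import DropItem


class CleanAndDedupPipeline:
    """Minimal sketch of an item pipeline: clean, validate, and de-duplicate items."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")                          # assumed field name
        if not url:
            raise DropItem("missing url")              # validation / filtering
        if url in self.seen_urls:
            raise DropItem(f"duplicate item: {url}")   # de-duplication
        self.seen_urls.add(url)
        item["title"] = (item.get("title") or "").strip()  # simple cleaning
        return item
```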

The Scrapy crawler framework makes online data collection very convenient; it is simple, lightweight, and easy to use.

4. Design and implementation of a web crawler based on Scrapy

Building on the principles and architecture of the Scrapy crawler described above, this section briefly introduces the data acquisition process with the Scrapy crawler framework.

4.1 Create a crawler project file

Based on the Scrapy crawler framework, simply enter the command "scrapy startproject article" on the command line, and a crawler project named article is created automatically. Then enter the article folder with "cd article" and view the directory with "dir"; you can also print the tree structure of the file directory with "tree /f". The files generated by the Scrapy creation command can then be seen clearly, as in the layout below.
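
As a rough substitute for the figure, the layout generated by "scrapy startproject article" typically looks like the following sketch (exact contents may vary slightly with the Scrapy version):

```
article/
├── scrapy.cfg            # configuration file for the whole project
└── article/              # module with the same name as the project; all project code goes here
    ├── __init__.py
    ├── items.py          # defines the storage objects and which items to crawl
    ├── middlewares.py    # middleware components
    ├── pipelines.py      # how crawled data is processed and stored
    ├── settings.py       # project settings
    └── spiders/          # crawler body files
        └── __init__.py
```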

The top-level article folder is the project name. The second layer contains a folder with the same name as the project, article, and a file scrapy.cfg; the folder with the same name as the project is a module, all project code is added in this module, and the scrapy.cfg file is the configuration file for the entire Scrapy project. The third layer contains five files and a folder: __init__.py is an empty file whose role is to turn its parent directory into a module; items.py defines the storage objects and decides which items to crawl; middlewares.py contains the middleware, generally does not need to be modified, and is mainly responsible for the requests and responses between the related components; pipelines.py is the pipeline file that determines how the crawled data is processed and stored; settings.py is the project's settings file, which configures how the item pipeline processes the data, the crawl frequency, table names, and so on; the spiders folder holds the crawler body files (which implement the crawler logic) and another empty __init__.py file.

4.2 After that, we analyze the web page structure and data, modify the items.py file, write the hangyunSpider.py file, modify the pipelines.py file, and modify the settings.py file. The specific operations for these steps will be covered in detail later, so I will not repeat them here; a minimal sketch of what the items.py and settings.py edits might look like is given below.
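
Since the article does not show the project's actual code, the field names and the pipeline class path below are assumptions for illustration only:

```python
# items.py: define the fields to be crawled (field names are assumed for illustration)
import scrapy


class ArticleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    publish_date = scrapy.Field()


# settings.py: register the pipeline so that crawled items are passed to it
# (this class path assumes the CleanAndDedupPipeline sketch above lives in article/pipelines.py)
ITEM_PIPELINES = {
    "article.pipelines.CleanAndDedupPipeline": 300,
}
```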

4.3 Execute the crawler

After modifying the above four files, open a Windows command (cmd) window, change to the path where the crawler project is located, and execute the command "scrapy crawl article". This runs the crawler and finally saves the data to the local disk.
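
Alternatively, the same spider can be launched from a Python script using Scrapy's CrawlerProcess; the spider name "article" below follows the crawl command above, and the script is assumed to be run from the project root:

```python
# run_crawler.py: minimal sketch for launching the spider programmatically
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl("article")                          # spider name from "scrapy crawl article"
process.start()                                   # block until crawling finishes
```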

This is the end of the article on "Example Analysis of the Scrapy Web Crawler Framework". I hope the above content is helpful to you and helps you learn more. If you think the article is good, please share it so that more people can see it.
