Python crawler - Scrapy: many inexperienced readers are not sure how to get started with it, so this article summarizes the framework's components, installation, and basic usage. I hope it helps you solve the problem.
Scrapy module
Scrapy Engine (engine): the core of the Scrapy framework, responsible for communication and data transfer between the Spider, Item Pipeline, Downloader, Scheduler, and the other components.
Spider (crawler): sends the links that need to be crawled to the engine; the engine eventually hands the data fetched by the other modules back to the spider, which parses out the data it wants. This part is written by the developer, because it is the programmer who decides which links to crawl and which data on the page is needed.
Scheduler (scheduler): responsible for receiving the requests sent by the engine, queuing them in a certain way, and scheduling the order in which requests are made.
Downloader (downloader): responsible for receiving download requests from the engine, fetching the corresponding data from the network, and returning it to the engine.
Item Pipeline (pipeline): responsible for saving the data passed on by the Spider (crawler). Exactly where it is stored depends on the developer's needs.
Downloader Middlewares (downloader middleware): middleware that extends the communication between the downloader and the engine.
Spider Middlewares (spider middleware): middleware that extends the communication between the engine and the spider.
Installation environment
# macOS environment
You first need to install the C compiler toolchain:
xcode-select --install
Install Scrapy
pip3 install scrapy
Create a project
scrapy startproject xxx (project name)
scrapy startproject firstProject
New Scrapy project 'firstProject', using template directory '/usr/local/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/baxiang/Documents/Python/Scrapy/firstProject
You can start your first spider with:
    cd firstProject
    scrapy genspider example example.com
Project structure:
.
├── firstProject
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
items.py: the models used to hold the crawled data.
middlewares.py: the file used to store the various middleware.
pipelines.py: used to save the items models, for example to local disk.
settings.py: configuration for the crawler (such as request headers, how often to send requests, the IP proxy pool, and so on).
scrapy.cfg: the project's configuration file.
spiders package: all spider code is stored in this package.
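For example, the news.py spider discussed later in this article can be generated inside the project with the genspider command (the domain here is just a placeholder):
cd firstProject
scrapy genspider news news.example.com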
# items.py
Defines the data that needs to be crawled and post-processed, as in the sketch below.
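A minimal sketch of what items.py could look like, assuming a hypothetical NewsItem with title, author, and content fields:

import scrapy

class NewsItem(scrapy.Item):
    # Each Field declares one piece of data the spider will extract.
    title = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()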
# settings.py
The settings file for Scrapy; two settings in particular are recommended.
ROBOTSTXT_OBEY: set it to False. The default is True, which means obeying the robots protocol: before crawling, Scrapy first fetches the site's robots.txt file and skips any request that the file disallows.
DEFAULT_REQUEST_HEADERS: add a User-Agent. This tells the server that the request comes from a normal browser rather than a crawler. Both settings are shown in the snippet below.
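In settings.py the two recommended changes might look roughly like this (the User-Agent string is only an example and can be replaced with any browser's):

# Do not check robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Send browser-like headers with every request.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
}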
# pipelines.py
Contains the code used to post-process and store the data, for example as sketched below.
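A minimal sketch of a pipeline that writes every item to a local JSON-lines file (the file name news.json is arbitrary):

import json

class FirstProjectPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts.
        self.fp = open('news.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item handed over by the spider.
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.fp.close()

Note that a pipeline only runs if it is enabled in the ITEM_PIPELINES setting in settings.py.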
The generated news.py spider file contains the following attributes and method:
name: the name of this spider; the name must be unique.
allowed_domains: the allowed domain names. The spider will only crawl pages under these domains; pages outside them are automatically ignored.
start_urls: the spider starts from the URLs in this variable; the first downloads are made from these URLs.
parse: the engine hands the data downloaded by the downloader to the spider, and the spider receives it in its parse method. This is a fixed convention. The method is typically used for two things: first, extracting the desired data; second, generating the URLs for the next requests. A complete example is sketched below.
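Putting it all together, a minimal news.py could look like the sketch below; the domain, start URL, and CSS selectors are hypothetical and must be adapted to the real site:

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'                              # must be unique within the project
    allowed_domains = ['news.example.com']     # pages outside this domain are ignored
    start_urls = ['http://news.example.com/']  # crawling starts from these URLs

    def parse(self, response):
        # 1. Extract the desired data from the downloaded page.
        for article in response.css('div.article'):
            yield {
                'title': article.css('h2::text').get(),
                'content': article.css('p::text').get(),
            }
        # 2. Generate the URL for the next request.
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)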
After reading the above, have you got a grasp of the Scrapy crawler framework? If you want to learn more, keep practicing with it. Thank you for reading!