
What is the basic knowledge of Scrapy


Many newcomers are not clear about the basics of Scrapy. To help with that, this article explains them in detail; anyone who needs it can follow along, and I hope you get something out of it.

Here we will cover the basic knowledge of Scrapy.

Introduction to the Architecture

Below is the architecture of Scrapy: its components and an overview of the data flow that moves through the system (shown by the red arrows in the architecture diagram). Each component is briefly introduced afterwards, and the data flow is briefly described as well.

That is the architecture in outline. The flow is similar to the mini-architecture I introduced in the second article, but it is far more extensible.
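One concrete sign of that extensibility is that downloader middlewares, spider middlewares and item pipelines are all plugged into the engine through the project's settings.py. The snippet below is only a sketch with made-up class paths (the myproject.* modules are hypothetical, not part of any generated project); the numbers control the order in which components of the same kind run:

    # settings.py (sketch; the class paths below are hypothetical examples)
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomDownloaderMiddleware': 543,
    }
    SPIDER_MIDDLEWARES = {
        'myproject.middlewares.CustomSpiderMiddleware': 543,
    }
    ITEM_PIPELINES = {
        'myproject.pipelines.CustomPipeline': 300,
    }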

Create a project

scrapy startproject tutorial

This command will create a tutorial directory that contains the following:

tutorial/
    scrapy.cfg            # configuration file for the project
    tutorial/             # the project's Python module; you will add your code here
        __init__.py
        items.py          # item file of the project
        pipelines.py      # pipelines file of the project
        settings.py       # project settings file
        spiders/          # the directory where the spider code is placed
            __init__.py
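For example, items.py is where you can declare the fields you plan to scrape. The class below is only an illustrative sketch: it is not generated by the command and is not required for the rest of this article (the spiders below simply yield dicts).

    # tutorial/items.py (illustrative sketch)
    import scrapy

    class QuoteItem(scrapy.Item):
        # declare one Field per attribute you intend to scrape
        text = scrapy.Field()
        author = scrapy.Field()
        tags = scrapy.Field()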

Write the first crawler

A Spider is a class you write to crawl data from a site (or a group of sites). It defines the initial URLs to download, how to follow links in the pages, and how to parse the page content.

The following is our first Spider, saved as quotes_spider.py in the tutorial/spiders directory:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
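A few notes on this code: name identifies the spider inside the project and must be unique, because it is what you pass to the scrapy crawl command later; start_requests() must return an iterable of Request objects; and parse() is the callback that handles the downloaded Response for each request.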

Run our crawler

Go to the root directory of the project and execute the following command to start spider:

scrapy crawl quotes

This command starts the spider that crawls quotes.toscrape.com, and you will see output similar to the following:

2017-05-10 20:36:17 [scrapy.core.engine] INFO: Spider opened
2017-05-10 20:36:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-10 20:36:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-10 20:36:17 [scrapy.core.engine] DEBUG: Crawled (referer: None)
2017-05-10 20:36:17 [scrapy.core.engine] DEBUG: Crawled (referer: None)
2017-05-10 20:36:17 [scrapy.core.engine] DEBUG: Crawled (referer: None)
2017-05-10 20:36:17 [quotes] DEBUG: Saved file quotes-1.html
2017-05-10 20:36:17 [quotes] DEBUG: Saved file quotes-2.html
2017-05-10 20:36:17 [scrapy.core.engine] INFO: Closing spider (finished)

Extract data

So far we have only saved the HTML pages without extracting any data. Now let's upgrade the code and add the extraction. (How to analyze a web page with the browser's developer tools was covered earlier.)
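If you want to try selectors interactively before putting them into the spider, Scrapy ships an interactive shell; a quick sketch against the first page looks like this (the >>> lines are typed at the shell prompt):

    scrapy shell 'http://quotes.toscrape.com/page/1/'
    >>> quote = response.css('div.quote')[0]
    >>> quote.css('span.text::text').extract_first()
    >>> quote.css('small.author::text').extract_first()
    >>> quote.css('div.tags a.tag::text').extract()

With the selectors worked out, the upgraded spider looks like this: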

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Run the crawler again and you will see the extracted data in the log:

2017-05-10 20:38:33 [scrapy.core.scraper] DEBUG: Scraped from
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '"It is better to be hated for what you are than to be loved for what you are not."'}
2017-05-10 20:38:33 [scrapy.core.scraper] DEBUG: Scraped from
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': '"I have not failed. I\'ve just found 10000 ways that won\'t work."'}

Save crawled data

The easiest way to store crawled data is to use Feed exports:

scrapy crawl quotes -o quotes.json

This command serializes the crawled data in JSON format to generate a quotes.json file.
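The -o option infers the serialization format from the file extension, so CSV or JSON Lines output works the same way. Note that in Scrapy versions of this era, running the JSON export a second time appends to the existing quotes.json and leaves it as invalid JSON, so delete the file between runs or prefer the .jl format:

    scrapy crawl quotes -o quotes.csv
    scrapy crawl quotes -o quotes.jl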

If you need to perform more complex operations on the crawled items, you can write an Item Pipeline; tutorial/pipelines.py was created automatically when the project was generated.
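As a rough sketch of what such a pipeline can look like (the drop-empty-text rule below is just an illustration, not part of the generated file), remember that it only takes effect once it is listed under ITEM_PIPELINES in settings.py:

    # tutorial/pipelines.py (illustrative sketch)
    from scrapy.exceptions import DropItem

    class TutorialPipeline(object):
        def process_item(self, item, spider):
            # discard quotes that somehow have no text; pass everything else through unchanged
            if not item.get('text'):
                raise DropItem('Missing text in %s' % item)
            return item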

Did reading the above content help you? If you want to learn more about this topic or read more related articles, please follow the industry information channel. Thank you for your support.
