How to use Python scrapy crawler


This article mainly explains "how to use the Python Scrapy crawler". The content is simple and clear and easy to learn and understand; please follow the editor's train of thought to study "how to use the Python Scrapy crawler".

Project requirements

Crawl famous quotes from a website specially designed for crawler beginners to practice on.

Create a project

Before you can start crawling, you must create a new Scrapy project. Go to the directory where you want to store the code and run the following command:

(base) λ scrapy startproject quotes
New Scrapy project 'quotes', using template directory 'd:\anaconda3\lib\site-packages\scrapy\templates\project', created in:
    d:\course-crawler course\02 framework crawler\lesson preparation code-frame crawler\quotes

You can start your first spider with:
    cd quotes
    scrapy genspider example example.com

First change to the newly created crawler project directory, that is, the quotes/ directory, then execute the command that creates the crawler:

D:\course-master course\02 frame crawler\lesson preparation code-frame crawler (master)
(base) λ cd quotes

D:\course-master course\02 frame crawler\lesson preparation code-frame crawler\quotes (master)
(base) λ scrapy genspider quote quotes.com
Created spider 'quote' using template 'basic' in module:
  quotes.spiders.quote

These commands create a quotes directory that contains the following:

quotes
│  items.py
│  middlewares.py
│  pipelines.py
│  settings.py
│  __init__.py
│
├─ spiders
│     quote.py
│     __init__.py

robots.txt

The robots protocol, also known as robots.txt (always lowercase), is an ASCII-encoded text file stored in the root directory of a website. It tells the web crawlers of search engines which content on the site should not be accessed by crawlers and which content may be accessed.

The robots protocol is not a specification, but a convention.
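For reference, a robots.txt file is just plain text placed at the site root. A minimal hypothetical example (not the actual file of any site mentioned here):

User-agent: *
Disallow: /admin/

Crawlers that honour the convention read this file before fetching other pages; Scrapy does so when ROBOTSTXT_OBEY is enabled, as shown below.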

# filename: settings.py

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Analyze the page

Before writing the crawler program, we first need to analyze the page to be crawled. Mainstream browsers have tools or plug-ins for analyzing pages; here we use the Chrome developer tools (Tools → Developer tools) to analyze the page.

Data information

Open the page http://quotes.toscrape.com in the Chrome browser and choose Inspect to view its HTML code.

You can see that each quote is wrapped in a <div class="quote"> element, and inside it the text, the author, and the tags sit in child elements with the classes .text, .author, and .tags.
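If you want to check the selectors before writing any code, one option is the interactive Scrapy shell. A minimal sketch, assuming the class names seen in the inspector above:

(base) λ scrapy shell http://quotes.toscrape.com
>>> response.css('.quote')                               # one selector per quote block
>>> response.css('.quote .text::text').extract_first()   # text of the first quote
>>> response.css('.quote .author::text').extract()       # list of author names
>>> exit()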

Write spider

After analyzing the page, write the crawler. In Scrapy, a crawler is written as a subclass of scrapy.Spider. Spider is a class that the user writes to crawl data from a single website (or a group of websites).

It contains the initial URLs to download, as well as the logic for following links in the pages, analyzing page content, and extracting data to generate items.

To create a Spider, you must inherit the scrapy.Spider class and define the following three properties:

name: used to distinguish Spiders. The name must be unique; you cannot set the same name for different Spiders.

start_urls: a list of URLs that the Spider crawls at startup. The first pages retrieved will be these; subsequent URLs are extracted from the data obtained from the initial URLs.

parse(): a method of the spider. When called, the Response object generated after each initial URL finishes downloading is passed to it as the only argument. The method is responsible for parsing the returned data (response data), extracting the data (generating items), and generating Request objects for URLs that need further processing.

import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

The following is a brief description of how quote.py is implemented.

Key points:

name is the name of the crawler and is specified when running genspider.

allowed_domains is the list of domains the crawler is allowed to crawl; the crawler will only crawl pages under these domains. It can be left out.

start_urls is the list of URLs that Scrapy starts crawling from, and it is iterable. If there are multiple pages, multiple URLs can be written into the list, or the list can be built with a list comprehension, as in the sketch below.
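For example, a sketch of building start_urls with a list comprehension (the /page/N/ URL pattern is an assumption about how the target site paginates):

start_urls = [f'http://quotes.toscrape.com/page/{n}/' for n in range(1, 11)]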

parse is called as a callback function; the response in this method is the response to the requests made for the start_urls URLs. Of course, you can also specify other functions to receive the responses. A page parsing function usually needs to complete the following two tasks:

Extract the data from the page (with re, XPath, or CSS selectors).

Extract the links in the page and generate download requests for the linked pages.

The page parsing function is usually implemented as a generator function: every item of data extracted from the page and every download request for a linked page is submitted to the Scrapy engine by a yield statement, as in the sketch below.
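As an illustration of the second task, the parse method can also yield a request for the next page. A minimal sketch, assuming the site exposes its "Next" button as li.next a (this selector is an assumption; check it in the developer tools first):

    def parse(self, response):
        # ... extract and yield the quote data as shown in the next section ...
        # then follow the "Next" link, if there is one, and parse it with this same method
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Note that for such follow-up requests to pass Scrapy's offsite filter, allowed_domains would need to include quotes.toscrape.com (or be left out, as mentioned above).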

Parsing data

import scrapy
...

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            text = quote.css('.text::text').extract_first()
            auth = quote.css('.author::text').extract_first()
            tages = quote.css('.tags .tag::text').extract()
            yield dict(text=text, auth=auth, tages=tages)

Key points:

response.css() uses CSS selector syntax directly to extract data from the response.

Multiple URLs can be written in start_urls, separated as items of the list.

extract() extracts the data from the selector object returned by css() and gives a list; without it you still have a selector object. extract_first() extracts only the first matching item.
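A small illustration of the difference, assuming the same page and selectors as above:

quote = response.css('.quote')[0]            # a Selector object, not a string
quote.css('.text::text')                     # still a SelectorList
quote.css('.text::text').extract()           # a list of strings
quote.css('.text::text').extract_first()     # the first string, or None if nothing matches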

Run the crawler

Run scrapy crawl quote in the quotes/ directory to run the crawler project.

What happens after running the crawler?

Scrapy creates a scrapy.Request object for each URL in the start_urls attribute of Spider and assigns the parse method to Request as a callback function (callback).

The Request objects are scheduled and executed, generating scrapy.http.Response objects that are sent back to the spider's parse() method for processing.

After completing the code, run the crawler to crawl the data: execute the scrapy crawl command in the shell to run the crawler 'quote' and store the crawled data in a csv file:

(base) λ scrapy crawl quote -o quotes.csv
2020-01-08 20:48:44 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: quotes)
....

After waiting for the crawler to finish running, a quotes.csv file will be generated in the current directory, and the data in it will be stored in csv format.

-o supports saving to multiple formats, and the way to save is also very simple: just give the appropriate file suffix (csv, json, pickle, etc.).
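For example, only the suffix needs to change (the file names here are just examples):

scrapy crawl quote -o quotes.json      # JSON
scrapy crawl quote -o quotes.csv       # CSV
scrapy crawl quote -o quotes.pickle    # Pickle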

Thank you for reading. The above is the content of "how to use Python scrapy crawler". After studying this article, I believe you have a deeper understanding of how to use the Python Scrapy crawler, though the specific usage still needs to be verified in practice. The editor will push more related articles for you; welcome to follow!
