This article explains in detail how to use the Scrapy web crawler framework. It is very practical and is shared here as a reference; I hope you get something out of it after reading.
Scrapy introduction
Standard introduction
Scrapy is an application framework written to crawl website data and extract structured data. It is well known and very powerful. A framework is essentially a highly generic project template with many capabilities already integrated (high-performance asynchronous downloading, queuing, distribution, parsing, persistence, and so on). When learning a framework, the key is to learn its characteristics and how to use each of its features.
In plain terms
If you are writing a crawler, you can build it on Scrapy: it integrates a number of excellent tools, crawls with high performance, and leaves plenty of hooks for easy extension, so it is an excellent choice for crawler work.
Install Scrapy under Windows
Command
pip install scrapy
A plain pip install scrapy may fail by default. If you have not switched your pip source, try installing with a temporary mirror; the Tsinghua mirror is used here. For common installation problems, see the article "Windows installation Scrapy method and common installation problems summary - Scrapy installation tutorial".
Command
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
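To confirm that the installation worked, you can check the installed version (a quick sanity check, not a step from the original article):
Command
scrapy version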
Create a crawler project with Scrapy
Command
scrapy startproject <project name>
Example: create a Qiushibaike crawler project (remember to cd into a clean directory first)
scrapy startproject qiushibaike
Note: at this point we have created a crawler project, but so far it is just a folder.
Enter the crawler project
To work inside the project you need to cd into its directory: first cd into the project, then create the spider.
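For the project created above, that is simply:
Command
cd qiushibaike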
Analysis of project directory structure
At this point we have entered the project. Its structure is as follows: a folder with the same name as the project, plus a scrapy.cfg file.
scrapy.cfg          # scrapy configuration
qiushibaike/        # folder with the same name as the project
    items.py        # data storage template; customize the fields to be saved
    middlewares.py  # crawler middleware
    pipelines.py    # write data persistence code here
    settings.py     # configuration file, e.g. crawl speed, concurrency, etc.
    __init__.py
    spiders/        # spider directory; spider files with the data-parsing code
        __init__.py
You may not understand the purpose of these directories yet, but don't worry; it will become clear once you start using them.
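For illustration only: items.py usually just declares the fields you intend to save. A minimal sketch (the field names here are hypothetical, and this tutorial simply prints results instead of using items):
Code
import scrapy


class QiushibaikeItem(scrapy.Item):
    # hypothetical fields for illustration; define whatever you plan to persist
    title = scrapy.Field()
    content = scrapy.Field()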
Create spiders
Assuming the steps above went well, you have successfully installed Scrapy and entered the project you created.
So let's create a spider to crawl the jokes on Qiushibaike.
Create Spider command
scrapy genspider <spider name> <start url>
Example: create a spider for the Qiushibaike jokes
scrapy genspider duanzi ww.com
Note: the start URL can be written arbitrarily and changed later, but it must be provided.
A new duanzi.py file will appear under the spiders folder.
The generated code is explained below.
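The exact contents depend on your Scrapy version, but the generated duanzi.py skeleton typically looks roughly like this:
Code
import scrapy


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'                   # spider name used by "scrapy crawl duanzi"
    allowed_domains = ['ww.com']      # the placeholder domain passed to genspider
    start_urls = ['http://ww.com/']   # start URL; change it to the real list page later

    def parse(self, response):
        # data-parsing logic goes here
        pass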
Prepare before crawling data
After creating a spider, you cannot crawl right away; with the default settings nothing can be crawled, so a little configuration is needed first.
Open the settings.py file and find the ROBOTSTXT_OBEY and USER_AGENT variables
ROBOTSTXT_OBEY configuration
Setting it to False means the spider does not obey the robots protocol. By default only search-engine sites such as Baidu or Bing are allowed to crawl; a personal crawler needs to ignore this, otherwise nothing can be crawled.
USER_AGENT configuration
User-Agent is a header that basically every request must carry. If this value looks abnormal, the request will certainly be blocked.
User-Agent
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
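Putting the two settings together, the relevant lines in settings.py look roughly like this (the User-Agent string is the one shown above; any realistic browser User-Agent will do):
Code
# settings.py
ROBOTSTXT_OBEY = False  # do not obey robots.txt

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36')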
Get the links of the Qiushibaike jokes
When the preparatory work is done, let's get started!
Some basic XPath syntax is needed here. It is actually quite simple; if you have no background, a quick search will cover it, and even without that you can probably understand by following along.
Functionality to implement
Use XPath to get the link of the a tag under each joke.
Note: inspect the elements in the browser's developer tools and press Ctrl+F to search the page while writing the XPath; the details are not repeated here.
Analyze page rules
Using the inspector, we can see that the div tags whose class contains article are the individual jokes, so you might think the XPath could be written like this.
Xpath code
//div[@class='article']
But you will find that it matches nothing, because article is only part of the class value, so you need the contains() function.
We need to write it like this.
Xpath code
//div[contains(@class, "article")]
However, this still matches too many elements, not only the div of each joke, so we add one more contains condition; then it matches exactly the div of each joke.
//div[contains(@class, "article") and contains(@class, "block")]
With the XPath above, each joke has been located. On this basis, locate the a tag under each joke.
Inspecting the elements shows that, under each joke, the a tag with class="contentHerf" links to that joke's details page.
On the details page, we can confirm that the href of the located a tag is indeed the URL of the details page.
Xpath code
//div[contains(@class, "article") and contains(@class, "block")]//a[@class="contentHerf"]
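If you want to check the expression before putting it into the spider, one option (not shown in the original article) is scrapy shell; the URL here is just the site's base address used later in the code:
Command
scrapy shell "https://www.qiushibaike.com"
>>> response.xpath('//div[contains(@class, "article") and contains(@class, "block")]//a[@class="contentHerf"]/@href').extract()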
This locates the a tags one by one, and it works at least in the console, so now let's do the same thing in Python code.
Code
def parse(self, response):
    a_href_list = response.xpath(
        '//div[contains(@class, "article") and contains(@class, "block")]'
        '//a[@class="contentHerf"]/@href'
    ).extract()
    print(a_href_list)
Start the spider command
scrapy crawl <spider name> [--nolog]
Note: the --nolog parameter suppresses the log output and is generally used for debugging; with this parameter, only print output is shown.
Example: start the duanzi spider
scrapy crawl duanzi --nolog
Successfully got every link.
Get the details page content
Above, we successfully obtained the link to each joke, but some jokes on the list page are truncated; we need to open the details page to see the full text, so let's handle that in the crawler as well.
First we need to locate the title and the content.
By inspecting the element, the XPath for the title is:
//h2[@class="article-title"]
The xpath of the content is:
//div[@class="content"]
After determining the xpath location of the title and content, let's implement it in the python code.
Note: one issue needs to be solved first. The details page requires a second request, so the code needs to issue that second request and hand the response to a callback.
Code
# details page
def detail(self, response):
    title = response.xpath('//h2[@class="article-title"]/text()').extract()
    content = response.xpath('//div[@class="content"]//text()').extract()
    print("title:")
    print(title)
    print("content")
    print(content)

def parse(self, response):
    a_href_list = response.xpath(
        '//div[contains(@class, "article") and contains(@class, "block")]'
        '//a[@class="contentHerf"]/@href'
    ).extract()
    print(a_href_list)
    base_url = "https://www.qiushibaike.com"
    for a_href in a_href_list:
        url = f"{base_url}{a_href}"
        yield scrapy.Request(url=url, callback=self.detail)
Result
But each result comes back as a list, which is not quite what we want. Let's modify the code slightly so that we get plain text instead.
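The original figure is not reproduced here, but one way to make that modification is to join the extracted text fragments into plain strings, roughly like this (a sketch, not necessarily the author's exact code):
Code
# details page, returning plain strings instead of lists
def detail(self, response):
    title = response.xpath('//h2[@class="article-title"]/text()').extract_first(default='').strip()
    content = ''.join(response.xpath('//div[@class="content"]//text()').extract()).strip()
    print("title:", title)
    print("content:", content)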
Summary of the commands above
Create a crawler project
scrapy startproject <project name>
Create spiders
scrapy genspider <spider name> <start url>
Start the crawler. The --nolog parameter suppresses the log output and is generally used for debugging; with it, only print output is shown.
scrapy crawl <spider name> [--nolog]
That's all for "how to use the Scrapy web crawler framework". I hope the content above is of some help and that you have learned something from it. If you think the article is good, please share it so more people can see it.