A Brief Introduction to the Python Scrapy Framework, with Example Usage


This article gives a brief introduction to the Python Scrapy framework and demonstrates its usage with an example. Working through a real case often raises practical difficulties, so this walkthrough shows how to handle them step by step. I hope you read carefully and learn something useful!

Contents

A Brief Introduction to the Scrapy Framework

Creating a Scrapy Project

Creating a Spider

Extracting Data with the Spider

Defining Fields in items.py

Extracting Data in fiction.py

Saving Data in pipelines.py

Starting the Crawler from settings.py

Results

A Brief Introduction to the Scrapy Framework

Scrapy is an asynchronous crawling framework built on Twisted and implemented in pure Python. It is an application framework designed for extracting structured data, with a clear architecture, loose coupling between modules, and strong extensibility. Only a small amount of code is needed to start scraping data quickly.

Its architecture is shown in the figure below:

The Scrapy Engine is the core of the framework; in practice, the modules we actually write code in are the Items, Item Pipeline, and Spiders modules.

Creating a Scrapy Project

First we create the Scrapy project with the following command:

scrapy startproject Fiction

The operation results are shown in the following figure:

As you can see from the figure, we have created a new Scrapy project named Fiction on drive C, and the output also prompts us to create our first spider with the following commands:

cd Fiction                            # enter the project directory
scrapy genspider example example.com  # create a spider

Here example is the spider's name, and example.com is the scope the spider is allowed to crawl, i.e. the website's domain name.

The contents of the Fiction folder are shown below:
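Your folder should look roughly like this (the exact files can vary slightly between Scrapy versions):

Fiction/
    scrapy.cfg
    Fiction/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py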

Creating a Spider

In the steps above we successfully created a Scrapy project and learned how to create a spider. Next we create a spider named fiction whose domain is www.17k.com, using the following command:

scrapy genspider fiction www.17k.com

After running this, the spiders folder contains the fiction.py file we just created; this is our spider.

As shown below:

Don't panic at the sight of so many .py files. In general, we only write code in the spider file we created, items.py, and pipelines.py, where:

fiction.py: defines the crawling logic, parses responses, and generates extracted results and new requests;

items.py: defines the fields of the scraped data in advance, which helps avoid spelling mistakes and field-definition errors; of course, we can also skip this and define them directly in fiction.py;

pipelines.py: contains data cleaning, validation, and storage code. When storing data as csv, xml, pickle, marshal, or json files, there is no need to write any code in pipelines.py; just run the following command, which uses Scrapy's built-in feed exports:

scrapy crawl fiction -o filename.suffix
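For example, either of the following saves the items scraped by the fiction spider to a file (the filenames are just examples):

scrapy crawl fiction -o fiction.json
scrapy crawl fiction -o fiction.csv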

When the data needs to be stored in a MongoDB database, write something like the following:

from pymongo import MongoClient

client = MongoClient()
collection = client["Fiction"]["fiction"]


class Test1Pipeline:
    def process_item(self, item, spider):
        # insert_one() replaces the deprecated insert(); dict() converts the Scrapy Item
        collection.insert_one(dict(item))
        return item

Extracting Data with the Spider

Before extracting any data, we first visit the novel page we want to crawl and open the browser's developer tools, as shown in the following figure:

From the figure we can see where all of the novel's chapter names are stored, and clicking a chapter jumps to the corresponding chapter page. We can therefore point an XPath expression at this div and loop over it with a for loop to obtain each chapter's name and URL link; a quick way to check the expression is sketched below.
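If you want to verify the XPath before writing the spider, a Scrapy shell session works well. The selectors below mirror the ones used later in fiction.py (the dl element with class Volume) and assume the page structure at the time of writing:

scrapy shell "https://www.17k.com/list/2536069.html"

>>> volume = response.xpath('//dl[@class="Volume"]')
>>> books = volume.xpath('./dd/a')
>>> books[0].xpath('./span/text()').get()   # chapter name
>>> books[0].xpath('./@href').get()         # relative chapter URL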

After jumping to the chapter content page, open the developer tools, as shown below:

As the figure shows, the novel's text is stored in that element, and we can again use XPath to select it and a for loop to traverse all of the chapter's paragraphs.

Defining Fields in items.py

Careful readers will have noticed that the fields we need are the chapter name, the chapter URL link, and the chapter content, of which the chapter name and chapter content are the ones we need to save. So we first define those field names in the items.py file, as follows:

import scrapy


class FictionItem(scrapy.Item):
    # define the fields for your item here, like:
    name = scrapy.Field()   # chapter name
    text = scrapy.Field()   # chapter content

Defining a field is very simple: field_name = scrapy.Field().

One of the main benefits of defining fields in items.py is that we can use different Item classes to store different kinds of data, and when the data is handed to the pipeline we can check which Item a piece of data belongs to with isinstance(item, FictionItem) and process it accordingly.

After defining the fields, we can write code in the pipelines.py file to distinguish between different kinds of item data, for example:

from Fiction.items import FictionItem


class FictionPipeline:
    def process_item(self, item, spider):
        if isinstance(item, FictionItem):
            print(item)

Of course, in this project we only need a single Item class; the code above is just to show how to tell which Item the data belongs to.

Extracting Data in fiction.py

The fiction.py file is the spider we created. Open it and you will see the following:

import scrapy


class FictionSpider(scrapy.Spider):
    name = 'fiction'
    allowed_domains = ['www.17k.com']
    start_urls = ['http://www.17k.com/']

    def parse(self, response):
        pass

Where:

name is a string giving the spider's name; it must be unique within the project and is used to distinguish different spiders. We start the crawler with scrapy crawl followed by this name;

allowed_domains lists the domains the spider is allowed to crawl, preventing it from wandering off to other websites;

start_urls is the list of URLs the spider crawls first;

the parse() method is responsible for parsing the returned response, extracting data, and generating further requests to process. Note that this method's name cannot be changed.

Now that we understand each part of the file, we can start extracting the chapter names and chapter URL links from the first page. The specific code is as follows:

import scrapy
from Fiction.items import FictionItem


class FictionSpider(scrapy.Spider):
    name = 'fiction'
    allowed_domains = ['www.17k.com']
    start_urls = ['https://www.17k.com/list/2536069.html']

    def parse(self, response):
        html = response.xpath('//dl[@class="Volume"]')
        books = html.xpath('./dd/a')
        for book in books:
            item = FictionItem()
            item['name'] = []
            name = book.xpath('./span/text()').extract()
            for i in name:
                item['name'].append(i.replace('\n', '').replace('\t', ''))
            href = book.xpath('./@href').extract_first()
            href = 'https://www.17k.com' + href
            yield scrapy.Request(url=href, callback=self.parse_detail,
                                 meta={'item': item})

First we import FictionItem, then change start_urls to the URL we actually want to crawl. In the parse() method we use XPath to get the chapter names and chapter URL links, create a FictionItem() inside the for loop, and store each chapter name in the item.

Finally we yield a scrapy.Request() from the generator, where:

url=href: the URL to be crawled next;

callback: specifies that the parse_detail function will handle parsing of the response;

meta: passes data between different parsing functions.

In the previous step we designated parse_detail as the parsing callback. Next we write the parse_detail function to obtain the chapter content. The specific code is as follows:

def parse_detail(self, response):
    string = ""
    item = response.meta['item']
    content = response.xpath('//*[@id="readArea"]/div[1]/div[2]//p/text()').extract()
    for i in content:
        string = string + i + '\n'
    item['text'] = string
    yield item

First we define an empty string variable, then receive the item passed from the previous step via response.meta['item'] (thanks to the meta={'item': item} parameter). We then extract the chapter content, store it in item['text'], and finally return the data to the engine by yielding the item.

Saving Data in pipelines.py

The chapter names and chapter content have now all been obtained. Next we save the data as txt files. The specific code is as follows:

from Fiction.items import FictionItem
import time


class FictionPipeline:
    def open_spider(self, spider):
        print(time.time())   # crawl start time

    def process_item(self, item, spider):
        if isinstance(item, FictionItem):
            title = item['name']
            content = item['text']
            # write each chapter to its own txt file in the novel/ folder
            with open(f'novel/{title[0]}.txt', 'w', encoding='utf-8') as f:
                f.write(content)

    def close_spider(self, spider):
        print(time.time())   # crawl end time

First we import FictionItem and time, and in the open_spider() and close_spider() methods we call time.time() to record when the crawl starts and ends. Then, in the process_item() method, we store the item['name'] and item['text'] returned by the engine in title and content respectively, open a txt file with open(), and call write() to write the chapter content into it.
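One caveat not covered above: open() raises FileNotFoundError if the output folder (novel/ in this sketch) does not already exist, so it is convenient to create it when the spider opens. A minimal adjustment to the pipeline, assuming that folder name:

import os
import time

class FictionPipeline:
    def open_spider(self, spider):
        os.makedirs('novel', exist_ok=True)  # create the output folder if it is missing
        print(time.time())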

Starting the Crawler from settings.py

Before starting the crawler, we first need to enable the pipeline in the settings.py file. This is very simple: just find the code shown in the figure below and uncomment it:
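The commented-out block in question is ITEM_PIPELINES; the FictionPipeline defined above will not run unless it is enabled there. Uncommented, it looks roughly like this (the number is the pipeline's priority; lower values run first):

ITEM_PIPELINES = {
    'Fiction.pipelines.FictionPipeline': 300,
}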

One might ask: where is the User-Agent set? We can set it in the settings.py file, along the following lines:
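For example (the exact User-Agent string is illustrative; any real browser string will do):

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/120.0 Safari/537.36')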

Okay, all the code is now written. Next we start the crawler with the following command:

scrapy crawl fiction

After starting the crawler, you will notice a lot of log output in the console. We can suppress these logs by adding the following line to settings.py:

LOG_LEVEL = "WARNING"

Results

The crawled chapters are saved as separate txt files, one per chapter of the novel.

"Python Scrapy framework simple introduction and example usage" content is introduced here, thank you for reading. If you want to know more about industry-related knowledge, you can pay attention to the website. Xiaobian will output more high-quality practical articles for everyone!
