This article gives a brief introduction to the Scrapy framework. Many people have questions about how Scrapy works in day-to-day use, so the editor has consulted various materials and put together simple, practical steps. I hope it helps answer your doubts about the Scrapy framework; follow along and study it!
1. Five basic components of Scrapy
The Scrapy framework is made up of five main components: the scheduler (Scheduler), the downloader (Downloader), the spider (Spider), the item pipeline (Item Pipeline), and the Scrapy engine (Scrapy Engine).
Scheduler: can be thought of as a priority queue of URLs. It decides which URL to crawl next and removes duplicate URLs.
Downloader: the most heavily loaded component; it downloads resources from the network at high speed.
Spider: the part users care about most. Users write their own spiders to extract the information they need from specific web pages, and a spider can also extract links from a page so that Scrapy crawls the next one.
Item pipeline: processes the items (entities) extracted by the spider. Its job is to persist items, validate them, and drop unwanted information (a minimal sketch follows after this list).
Scrapy engine: the core of the whole framework. It controls the scheduler, the downloader, and the spiders; the engine is roughly the CPU of the framework and drives the whole workflow.
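As a rough illustration of where the item pipeline fits, here is a minimal sketch (the class name and the "title" field are illustrative assumptions, not taken from the article): it validates each item and drops entries without a title before they are persisted.

# A minimal item pipeline sketch; names are illustrative only.
from scrapy.exceptions import DropItem

class ValidateTitlePipeline:
    def process_item(self, item, spider):
        # Scrapy calls process_item() for every item the spider yields.
        # Validate the item: drop anything that has no "title" field.
        if not item.get('title'):
            raise DropItem('missing title')
        # Clean up the value before it is persisted further down the line.
        item['title'] = item['title'].strip()
        return item

A pipeline like this would be enabled by listing it under ITEM_PIPELINES in the project's settings.py.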
2. Crawling web page data with the scrapy framework
Step 1: before you can use the Scrapy framework you need to install it. You can install it with pip. Note that on Windows, installing directly from the pip command line may report errors; in that case you need to install several dependency libraries by hand, such as wheel, lxml, Twisted and pywin32. The error message will tell you which library is missing.
A note on installing Twisted: the download address is https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted. On that page, find Twisted and download the wheel that matches your Python version (the cp tag in the file name indicates the Python version). After the download completes, open a terminal and run pip install Twisted-19.2.0-cp37-cp37m-win_amd64.whl, substituting the file name of the version you downloaded. Once that is installed, run pip install scrapy to install the Scrapy framework.
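Once the installation finishes, a quick sanity check (just a sketch, assuming Scrapy installed into the active Python environment) is to import the package and print its version:

# Quick sanity check after installation: import Scrapy and print its version.
import scrapy
print(scrapy.__version__)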
Step 2: create a crawler project. First create a folder to hold the project, for example scrapy_python, then cd into that path in the command-line tool and create a new project with the command scrapy startproject name (replace name with your own project name).
With that, we have successfully created a Scrapy project; opening it in PyCharm (screenshot not reproduced here), the generated layout looks roughly like the sketch below.
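A project created with scrapy startproject typically has the following structure (here name stands for whatever project name you passed to the command; details can vary slightly between Scrapy versions):

name/
    scrapy.cfg            # deployment and configuration entry point
    name/
        __init__.py
        items.py          # item (entity) definitions
        middlewares.py    # downloader and spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings, including ROBOTSTXT_OBEY
        spiders/          # spider files created with genspider go here
            __init__.py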
Step 3: create a spider file in the spiders folder of the project you just created; this file does the actual crawling of web page data. Let's try to crawl the CSDN website. The command for creating the new spider is scrapy genspider csdn www.csdn.net, where csdn is the file name of the spider and www.csdn.net is the domain of the target site; use the domain of whatever site you want to crawl. The generated skeleton is shown below.
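For reference, the skeleton that genspider generates usually looks roughly like this (the exact template can vary slightly between Scrapy versions):

import scrapy

class CsdnSpider(scrapy.Spider):
    name = 'csdn'                          # what "scrapy crawl csdn" refers to
    allowed_domains = ['www.csdn.net']     # off-site requests are filtered out by default
    start_urls = ['http://www.csdn.net/']  # the first URLs the spider requests

    def parse(self, response):
        # parse() receives each downloaded response; extraction code goes here
        pass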
Step 4: to start the spider file we just created, use the command scrapy crawl csdn, where csdn is the value of name inside the spider file.
Step 5: to test whether the data can be crawled successfully, we can create a test file in the project root, such as start_spider.py, and then run or debug the project through it to print the web page data we want to crawl:
from scrapy.cmdline import execute

# Run "scrapy crawl csdn" from inside Python so the spider can be started (and debugged) from an IDE
execute(["scrapy", "crawl", "csdn"])
Step 6: when crawling data you need to pay attention to the robots protocol, which limits the range of content a crawler is allowed to fetch. In the settings.py file of a Scrapy project the default is ROBOTSTXT_OBEY = True, meaning the protocol is obeyed. If the content we want to crawl is not allowed by the protocol, we can set ROBOTSTXT_OBEY = False to indicate that we do not comply with it, as in the snippet below.
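The switch is a single line in settings.py (shown here as a sketch):

# settings.py of the Scrapy project
ROBOTSTXT_OBEY = True    # the default: respect the site's robots.txt
# Set it to False only when the content you need is disallowed by the protocol:
# ROBOTSTXT_OBEY = False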
Step 7: now we can start using the XPath selector or the CSS selector to parse the page data we want to crawl.
3. Introduction to the XPath selector
The full name of XPath is XML Path Language. It is a language for locating information in structured documents. XPath uses path expressions to select nodes or node sets in an XML document; nodes are selected by following a path or a series of steps.
A predicate is used to find a specific node, or a node that contains a specified value. Predicates are written inside square brackets. For example, //body//a[1] selects the first a element under body, and //a[@href] selects all a elements that have an attribute named href.
Besides indexes and attributes, XPath also provides convenience functions that make locating elements more precise. For example, contains(s1, s2) returns true if s1 contains s2 and false otherwise, text() returns the text content of a node, and starts-with() matches a string from its starting position.
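As a quick, self-contained illustration of these functions, here is a sketch that runs them with Scrapy's Selector on a made-up HTML fragment (the fragment and the printed values are purely illustrative, not from the article):

from scrapy.selector import Selector

# A made-up HTML fragment purely to demonstrate the functions above.
html = '<div><a href="/news/1">Scrapy news</a><a href="/docs">Scrapy docs</a></div>'
sel = Selector(text=html)

# contains(): <a> elements whose href contains "news"
print(sel.xpath('//a[contains(@href, "news")]/text()').extract())      # ['Scrapy news']

# starts-with(): <a> elements whose href starts with "/docs"
print(sel.xpath('//a[starts-with(@href, "/docs")]/text()').extract())  # ['Scrapy docs']

# text(): the text content of every <a> node
print(sel.xpath('//a/text()').extract())                                # ['Scrapy news', 'Scrapy docs']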
The common syntax for selecting nodes with XPath is listed below (expression : meaning):

* : selects any element node
/ : selects from the root node
// : selects matching nodes in the document from the current node, no matter where they are
. : selects the current node
.. : selects the parent of the current node
/bookstore/book[1] : selects the first book element that is a child of bookstore
/bookstore/book[last()] : selects the last book element that is a child of bookstore
/bookstore/book[last()-1] : selects the last but one book element that is a child of bookstore
//title[@lang] : selects all title elements that have an attribute named lang
//title[@lang='eng'] : selects all title elements that have a lang attribute with the value eng
/bookstore/book[price>35.00] : selects all book elements of bookstore whose price element has a value greater than 35.00
/bookstore/book[price>35.00]/title : selects all title elements of the book elements of bookstore whose price is greater than 35.00
//* : selects all elements in the document
//title[@*] : selects all title elements that have at least one attribute
//book/title | //book/price : selects all title and price elements of all book elements
//title | //price : selects all title and price elements in the document
child::book : selects all book nodes that are children of the current node
child::text() : selects all text node children of the current node
/bookstore/book/title : selects all title nodes under bookstore/book
/bookstore/book/price/text() : selects the text of all price nodes under bookstore/book
Now let's use the XPath selector to extract the information we want from a site. As an example (the original article showed a screenshot of the page here), let's crawl the headlines in the "Today's recommendation" section of the CSDN home page.
import scrapy

class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['http://www.csdn.net/']

    def parse(self, response):
        # Select the text of the <a> element under every h4 element with class="company_name"
        result = response.xpath('//h4[@class="company_name"]/a/text()').extract()
        # Loop over the resulting list of text strings and print each one
        for i in result:
            print(i)
Run the spider and look at the printed output to check that it is the result we want.
At this point, the study of "a brief introduction to the scrapy framework" is over. I hope it has helped resolve your doubts; combining theory with practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow the site for more practical articles.