How to use the Scrapy framework in Python crawlers

This article mainly introduces how to use the Scrapy framework in Python crawlers. It is quite detailed and should be a useful reference. If you are interested, read on!

I. Why use the Scrapy framework?

The previous two articles introduced the concept and practice of multithreading:

Multithreaded web page crawling

Multi-thread crawling web page project

After the previous lessons, we have basically mastered analyzing pages, analyzing dynamic requests, and crawling content, and we have also learned to use multiple threads to crawl web pages concurrently to improve efficiency. These skills are enough for us to write all kinds of crawlers that meet our needs.

But we still have one unsolved problem: engineering. Engineering frees the process of writing code from "thinking and writing one piece at a time" and makes it orderly and consistent in style, without rewriting the same things over and over.

And Scrapy is the leading crawler framework. It handles many of the routine steps and corner cases for us in advance, so users can concentrate on the core work of parsing pages and analyzing requests.

II. The concept of the Scrapy framework

Scrapy is a popular web crawler framework implemented in pure Python. It uses a number of advanced features to simplify crawling web pages and can make our crawlers more standardized and efficient.

It can be divided into the following parts:

Scrapy Engine: the Scrapy engine, responsible for controlling the data flow of the whole system and triggering events.
Scheduler: the scheduler, which receives requests from the Scrapy engine and queues them until the engine needs them later.
Downloader: the downloader, which crawls web page content and returns the crawled data to the Spiders.
Spiders: the crawler. This part is the core code, used to parse and extract the required data.
Item Pipeline: the data pipeline, which processes the extracted data; mainly data cleaning, validation, and storage.
Downloader middlewares: downloader middleware, which processes requests and responses passing between the Scrapy engine and the downloader.
Spider middlewares: crawler middleware, which processes the crawler's input (responses) and output (results or new requests).
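To make the Item Pipeline component above more concrete, here is a minimal sketch of a pipeline that cleans the crawled data before storage. It is only an illustration under assumed names (TitleCleanPipeline and the title field are not part of any project in this article); in a full Scrapy project it would live in pipelines.py and be enabled through the ITEM_PIPELINES setting.

# pipelines.py (hypothetical example)
class TitleCleanPipeline:
    def process_item(self, item, spider):
        # data cleaning: strip surrounding whitespace from the title, if present
        if item.get('title'):
            item['title'] = item['title'].strip()
        return item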

The process of data flow in Scrapy is as follows

1. The engine opens a website, finds the crawler that handles that website, and gets the first page address to crawl from the crawler.
2. The engine takes the first page address from the crawler and puts it into the scheduler as a request to be scheduled.
3. The engine asks the scheduler for the address of the next page to crawl.
4. The scheduler returns the address of the next page to the Scrapy engine, and the Scrapy engine passes it through the downloader middleware to the downloader to crawl the data.
5. When the downloader has finished crawling, it passes the crawled data back to the Scrapy engine through the downloader middleware.
6. The Scrapy engine passes the crawled data through the crawler middleware to the crawler for parsing and extraction.
7. When the crawler has finished processing the data, it passes the extracted data and any new requests back to the Scrapy engine.
8. The Scrapy engine sends the extracted data to the data pipeline for cleaning and other operations, and at the same time passes the new requests to the scheduler to prepare for crawling the next page.
9. Repeat steps 2-8 until there are no new requests left in the scheduler; the data crawl then ends.

III. Scrapy installation

Press Win + R to open the Run dialog, type cmd, and click OK to open the command line.

Then type the following on the command line:

pip install scrapy -i https://pypi.doubanio.com/simple/

The -i https://pypi.doubanio.com/simple/ part means using the Douban mirror as the package source, which makes the installation faster.

Then press Enter and wait for the installation to complete.

After that, if typing scrapy on the command line displays Scrapy's help message, the installation was successful!
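If you want to see the exact version you installed, Scrapy also has a version subcommand:

scrapy version
# prints something like "Scrapy 2.x.x"; the exact number depends on your installation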

Note! Next, type the explorer . command on the command line (on Mac it is open . and mind the space before the dot) and press Enter. This opens the directory the command line is currently in. We will write our code in this directory.

IV. Practical Application of Scrapy

The website we will try to crawl with Scrapy this time is the niche software site: https://www.appinn.com/category/windows/

Before crawling a web page, we need to create a code file and then use the Scrapy command to execute it.

Above, we used the explorer . command to open the directory. In that directory, create a file named spider.py ↓

Method: create a text file and rename it.

Then put the crawler code in. For now, just copy and paste the code below and give it a try; I'll explain what the code means afterwards!

Crawler code

import scrapy

# define a class called TitleSpider that inherits from scrapy.Spider
class TitleSpider(scrapy.Spider):
    name = 'title-spider'
    # set the page to start crawling from
    start_urls = ['https://www.appinn.com/category/windows/']

    def parse(self, response):
        # find all article tags
        for article in response.css('article'):
            # parse the link and title under the article
            a = article.css('h3.title a')
            if a:
                result = {
                    'title': a.attrib['title'],
                    'url': a.attrib['href'],
                }
                # hand the result to Scrapy
                yield result
        # parse the link to the next page
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            # start crawling the next page, parsing it with the parse method
            yield response.follow(next_page, self.parse)

Then execute Scrapy's runspider command on the command line:

scrapy runspider spider.py -t csv -o apps.csv

# spider.py is the file name of the crawler code we just wrote
# -t specifies the output file format; we use csv, which can be opened with Excel and similar tools
# -o specifies the output file name, so after execution there will be an apps.csv file

After typing the above command, wait a moment and you should see a lot of output.

Results of page crawling

There will also be an extra apps.csv file in the directory. If you have Excel, you can open apps.csv with Excel, or open it directly with Notepad or another editor.

After opening it, you can see the titles and links of more than 400 software recommendation articles from the niche software site.
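As a small aside that is not in the original tutorial: the -t flag can usually be omitted, because Scrapy infers the export format from the extension of the output file, so a JSON export would look like this:

scrapy runspider spider.py -o apps.json
# the .json extension is enough for Scrapy to choose the JSON exporter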

But our code does not use requests, BeautifulSoup, concurrency, or any file-related libraries at all, so how did it complete a fast crawl and write the results to a file? Don't worry, let me explain it step by step!

What does this string of code do?

The crawler code used above

import scrapy

# define a class called TitleSpider that inherits from scrapy.Spider
class TitleSpider(scrapy.Spider):
    name = 'title-spider'
    # set the page to start crawling from
    start_urls = ['https://www.appinn.com/category/windows/']

    def parse(self, response):
        # find all article tags
        for article in response.css('article'):
            # parse the link and title under the article
            a = article.css('h3.title a')
            if a:
                result = {
                    'title': a.attrib['title'],
                    'url': a.attrib['href'],
                }
                # hand the result to Scrapy
                yield result
        # parse the link to the next page
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            # start crawling the next page, parsing it with the parse method
            yield response.follow(next_page, self.parse)

When you run scrapy runspider spider.py -t csv -o apps.csv, Scrapy executes the crawler we wrote in spider.py, i.e. the complete code above.

1. First, Scrapy reads the start page list start_urls that we set, requests this page, and gets a response.

start_urls = ['https://www.appinn.com/category/windows/']

2. After that, Scrapy hands the response to the default parsing method, parse. The response is the first parameter of parse (after self).

def parse(self, response):

3. The parse method we wrote ourselves has two parts: one parses the article tags in the page and takes the title and link as the crawling result; the other finds the next-page button, gets the link to the next page, requests it as well, and then parses it again with the parse method.

# tell Scrapy we got a result
yield result

# tell Scrapy to start crawling the next page, parsing it with the parse method
yield response.follow(next_page, self.parse)

yield is a somewhat advanced Python feature. Here we only need to know that we tell Scrapy two things through yield: "we got a result, go deal with it" and "we got the next link to crawl, go crawl it".
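For readers who have not met yield before, here is a tiny standalone sketch (plain Python, not part of the original article) showing that a function containing yield hands back values one at a time; Scrapy consumes the values our parse method yields in exactly the same way:

def numbers():
    # each yield hands one value back to whoever is iterating over us
    yield 1
    yield 2

for n in numbers():
    print(n)  # prints 1, then 2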

Flow chart

Yes, Scrapy does everything for you except parsing the data you want. This is Scrapy's biggest advantage:

Where's requests? Not needed. Just give the link to Scrapy and the request is completed automatically.

Where's the concurrency code? Not needed. Scrapy automatically makes all requests concurrently.

How are the results written to the file? Without writing any file-handling code, the file is written automatically once you hand the result to Scrapy with yield.

How do we continue crawling the next page? Use yield to tell Scrapy the link to the next page and the method that should handle it.

Where's BeautifulSoup? You don't need it. Scrapy provides handy CSS selectors.

Parsing the data is the part we have to care about ourselves, and Scrapy does not force us to use any particular tool for it, so we could continue to use BeautifulSoup: just pass response.text to BeautifulSoup in the parse() method for parsing and extraction.
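As a minimal sketch of that idea (assuming the bs4 package is installed; this variant is not in the original article), the spider above could be rewritten to do its parsing with BeautifulSoup while still letting Scrapy handle the requests:

import scrapy
from bs4 import BeautifulSoup

class SoupTitleSpider(scrapy.Spider):
    name = 'soup-title-spider'
    start_urls = ['https://www.appinn.com/category/windows/']

    def parse(self, response):
        # hand the raw HTML to BeautifulSoup and extract with its selectors
        soup = BeautifulSoup(response.text, 'html.parser')
        for a in soup.select('article h3.title a'):
            yield {'title': a.get('title'), 'url': a.get('href')}
        # next-page handling still goes through Scrapy
        next_link = soup.select_one('a.next')
        if next_link is not None and next_link.get('href'):
            yield response.follow(next_link['href'], self.parse)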

But Scrapy provides a useful tool of its own: the CSS selector. We briefly introduced CSS selectors when covering BeautifulSoup. Do you still remember?

For those of you who have forgotten about BeautifulSoup, you can take a look at an article I wrote earlier: requests Library and BeautifulSoup Library.

The syntax of the CSS selector in Scrapy is similar to that in BeautifulSoup, but the CSS selector in Scrapy is more powerful.

# parse all article tags
response.css('article')

# from an article, parse the a tags under the h3 tag whose class is title
article.css('h3.title a')

# take the value of the href attribute
a.attrib['href']

# from the response, parse the href attribute of the a tag whose class is next, and take its value
response.css('a.next::attr(href)').get()

Scrapy's CSS selectors can replace what BeautifulSoup does, and we can use them directly to parse and extract the data we fetch. Having read this far, look back at the complete code above and try to understand it again together with the flow chart; it should make good sense now.

V. Scrapy's CSS selector tutorial

Let's open the website we crawled before, the niche software site: https://www.appinn.com/category/windows/

Use the browser's developer tools to select the "next page" button ↓

Notice the framed part of the screenshot: the browser has already shown us the CSS selector for this button. It tells us that the next-page button can be selected with a.next.page-numbers.

Before starting the CSS selector tutorial, I suggest you use the interactive tool provided by Scrapy to get a feel for CSS selectors. To do so, enter the following command on the command line and press Enter:

scrapy shell "https://www.appinn.com/category/windows/"

At this point Scrapy has visited the link and recorded the result, and you enter an interactive environment in which we can write code and execute it line by line. Type response and press Enter, and you will see output similar to the following, which is the result of fetching the page above.

As mentioned before, the 200 in the output is actually the response status code, which means the request was successful!
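Roughly speaking, the interaction looks like the following (the prompt style depends on whether IPython is installed, so treat the formatting as approximate):

>>> response
<200 https://www.appinn.com/category/windows/>
>>> response.status
200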

Previous articles: requests Library and BeautifulSoup Library

Now let's learn the CSS selectors, taking the "next page" button in the figure below as an example.

Select by tag name

The next-page button on the niche software site is selected with a.next.page-numbers, where a is the tag name. Try typing response.css('a') in the interactive environment and you will see all the a elements on the page. The same goes for other elements: write response.css('ul') to select all ul elements, and response.css('div') to select div elements.

Select by class

.next and .page-numbers in a.next.page-numbers are class names. When we want to select a div element whose class contains container, we can write response.css('div.container').

In the selector above, the tag name comes first, the dot (.) stands for class, and the class name follows it. Note that they are written right next to each other, with no spaces in between!

When the element to be selected has more than one class, such as the following element:

<a class="next page-numbers" href="...">next page</a>

This a element has two classes, next and page-numbers, and you can write more than one in the selector: response.css('a.next.page-numbers'). This means selecting a elements whose class contains both next and page-numbers; here too they must be written right next to each other, with no spaces in between.

Select by id

In addition to the class selector there is also an id selector; in Scrapy, # is likewise used for id. For example, for the menu button on the page, we can see that its id is pull and its class is toggle-mobile-menu. So we can write response.css('a#pull'), which means we want to select the a element whose id is pull.

Of course, you can also combine them: response.css('a#pull.toggle-mobile-menu'). This means we want to select the a element whose id is pull and whose class contains toggle-mobile-menu.

Select by hierarchical relationship

Back on this niche software page: if we want to use a CSS selector to select the a element in a title and inspect it with Chrome, we find that this a element has neither an id nor a class, and the browser only gives us the bare hint "a".

At this point we need to look for clues on the element's parents. For example, we find that this a element sits below an h3 element, and that h3 element does have classes: title and post-title. So what we want to select is the a element below an h3 element whose classes are title and post-title, written as a CSS selector:

response.css('h3.title.post-title a')

As you can see, the h3 element with title and post-title is written as h3.title.post-title, all joined together, while the a element that follows has a space before it. Do you remember the rule we mentioned before? The rule is: conditions on the same element are joined together, and hierarchical relationships are separated by spaces.

.title and .post-title come right after h3; both are filters on the h3. The a after the space represents all a elements among the descendants of any element that meets the h3.title.post-title condition.

As we said before, a space matches all descendant elements, no matter how many levels deep they are. If you only want direct children, separate the parts with > instead. The a element here happens to be a direct child, so h3.title.post-title a and h3.title.post-title > a have the same effect.
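A quick way to convince yourself of this in the scrapy shell (an illustrative check, not from the original article) is to compare how many elements each selector matches; on this page the two counts should be equal:

# a is a direct child of the h3 here, so both selectors match the same elements
len(response.css('h3.title.post-title a'))
len(response.css('h3.title.post-title > a'))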

Take the text in the element

We have already got the a element at the title position; if we want the text content inside it, we need to add ::text. The code is as follows:

response.css('h3.title.post-title a::text')

If you execute it in the interactive environment, you will find that you can get the text content, but it is not yet the plain text we want. To get the plain text you need to use the get() or getall() method, as follows:

# take the first piece of data that meets the condition
response.css('h3.title.post-title a::text').get()

# take all the data that meets the condition
response.css('h3.title.post-title a::text').getall()

Take the attribute of the element

Let's take this a element as an example again. If we want to get its href attribute, we need to use the element's attrib attribute. Execute the following two lines of code in the interactive environment:

# get the first a tag that meets the selector criteria
a = response.css('h3.title.post-title a')
a.attrib['href']

The attrib attribute is actually a dictionary that stores all the HTML attributes of the element. If you replace the second line with a.attrib, you can see all the attributes of this a element. Similarly, typing a.attrib['title'] gets its title attribute.

Now let's try to print the href attribute of every tag that matches h3.title.post-title a, like this:

for a in response.css('h3.title.post-title a'):
    print(a.attrib['href'])

Another way to write it is to get the href attribute by adding ::attr(href):

for href in response.css('h3.title.post-title a::attr(href)').getall():
    print(href)

The above is all the content of this article, "How to use the Scrapy framework in Python crawlers". Thank you for reading! I hope the shared content helps you. For more related knowledge, feel free to follow the industry information channel!
