This article explains in detail how to use the Scrapy web crawler framework. It is very practical and is shared here as a reference; I hope you get something out of it after reading.
Scrapy introduction
Standard introduction
Scrapy is an application framework written to crawl website data and extract structured data. It is well known and very powerful. A framework is essentially a highly generic project template with many capabilities already integrated (high-performance asynchronous downloading, queuing, distribution, parsing, persistence, and so on). When learning a framework, the key is to learn its characteristics and how to use each of its features.
In plain terms
If you are writing a crawler, you can build it on Scrapy: it integrates a number of excellent tools, crawls with high performance, and leaves plenty of hooks for easy extension, so it is an excellent choice for crawler work.
Install Scrapy under Windows
Command
pip install scrapy
A plain pip install scrapy may fail by default. If you have not switched your pip source, try installing with a temporary mirror; the Tsinghua mirror is used here. For common installation problems, see the article "Windows installation Scrapy method and common installation problems summary - Scrapy installation tutorial".
Command
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
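To confirm that the installation worked, you can check the installed version (a quick sanity check, not a step from the original article):
Command
scrapy version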
Create a crawler project with Scrapy
Command
scrapy startproject <project name>
Example: create a Qiushibaike crawler project (remember to cd into a clean directory first)
scrapy startproject qiushibaike
Note: at this point we have created a crawler project, but so far it is just a folder.
Enter the crawler project
To work inside the project you need to cd into its directory: first cd into the project, then create the spider.
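For the project created above, that is simply:
Command
cd qiushibaike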
Analysis of project directory structure
At this point we have entered the project. Its structure is as follows: a folder with the same name as the project, plus a scrapy.cfg file.
scrapy.cfg          # scrapy configuration
qiushibaike/        # folder with the same name as the project
    items.py        # data storage template; customize the fields to be saved
    middlewares.py  # crawler middleware
    pipelines.py    # write data persistence code here
    settings.py     # configuration file, e.g. crawl speed, concurrency, etc.
    __init__.py
    spiders/        # spider directory; spider files with the data-parsing code
        __init__.py
You may not understand the purpose of these directories yet, but don't worry; it will become clear once you start using them.
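For illustration only: items.py usually just declares the fields you intend to save. A minimal sketch (the field names here are hypothetical, and this tutorial simply prints results instead of using items):
Code
import scrapy


class QiushibaikeItem(scrapy.Item):
    # hypothetical fields for illustration; define whatever you plan to persist
    title = scrapy.Field()
    content = scrapy.Field()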
Create spiders
Assuming the steps above went well, you have successfully installed Scrapy and entered the project you created.
So let's create a spider to crawl the jokes on Qiushibaike.
Create Spider command
scrapy genspider <spider name> <start url>
Example: create a spider for the Qiushibaike jokes
scrapy genspider duanzi ww.com
Note: the start URL can be written arbitrarily and changed later, but it must be provided.
A new duanzi.py file will appear under the spiders folder.
The generated code is explained below.
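The exact contents depend on your Scrapy version, but the generated duanzi.py skeleton typically looks roughly like this:
Code
import scrapy


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'                   # spider name used by "scrapy crawl duanzi"
    allowed_domains = ['ww.com']      # the placeholder domain passed to genspider
    start_urls = ['http://ww.com/']   # start URL; change it to the real list page later

    def parse(self, response):
        # data-parsing logic goes here
        pass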
Prepare before crawling data
After creating a spider, you cannot crawl right away; with the default settings nothing can be crawled, so a little configuration is needed first.
Open the settings.py file and find the ROBOTSTXT_OBEY and USER_AGENT variables
ROBOTSTXT_OBEY configuration
Setting it to False means the spider does not obey the robots protocol. By default only search-engine sites such as Baidu or Bing are allowed to crawl; a personal crawler needs to ignore this, otherwise nothing can be crawled.
USER_AGENT configuration
User-Agent is a header that basically every request must carry. If this value looks abnormal, the request will certainly be blocked.
User-Agent
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
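Putting the two settings together, the relevant lines in settings.py look roughly like this (the User-Agent string is the one shown above; any realistic browser User-Agent will do):
Code
# settings.py
ROBOTSTXT_OBEY = False  # do not obey robots.txt

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36')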
Get the links of the Qiushibaike jokes
When the preparatory work is done, let's get started!
Some basic XPath syntax is needed here. It is actually quite simple; if you have no background, a quick search will cover it, and even without that you can probably understand by following along.
Functionality to implement
Use XPath to get the link of the a tag under each joke.
Note: inspect the elements in the browser's developer tools and press Ctrl+F to search the page while writing the XPath; the details are not repeated here.
Analyze page rules
Using the inspector, we can see that the div tags whose class contains article are the individual jokes, so you might think the XPath could be written like this.
Xpath code
//div[@class='article']
But you will find that it matches nothing, because article is only part of the class value, so you need the contains() function.
We need to write it like this.
Xpath code
//div[contains(@class, "article")]
However, this still matches too many elements, not only the div of each joke, so we add one more contains condition; then it matches exactly the div of each joke.
//div[contains(@class, "article") and contains(@class, "block")]
With the XPath above, each joke has been located. On this basis, locate the a tag under each joke.
Inspecting the elements shows that, under each joke, the a tag with class="contentHerf" links to that joke's details page.
On the details page, we can confirm that the href of the located a tag is indeed the URL of the details page.
Xpath code
//div[contains(@class, "article") and contains(@class, "block")]//a[@class="contentHerf"]
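If you want to check the expression before putting it into the spider, one option (not shown in the original article) is scrapy shell; the URL here is just the site's base address used later in the code:
Command
scrapy shell "https://www.qiushibaike.com"
>>> response.xpath('//div[contains(@class, "article") and contains(@class, "block")]//a[@class="contentHerf"]/@href').extract()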
This locates the a tags one by one, and it works at least in the console, so now let's do the same thing in Python code.
Code
def parse(self, response):
    a_href_list = response.xpath(
        '//div[contains(@class, "article") and contains(@class, "block")]'
        '//a[@class="contentHerf"]/@href'
    ).extract()
    print(a_href_list)
Start the spider command
scrapy crawl <spider name> [--nolog]
Note: the --nolog parameter suppresses the log output and is generally used for debugging; with this parameter, only print output is shown.
Example: start the duanzi spider
scrapy crawl duanzi --nolog
Successfully got every link.
Get the details page content
Above, we successfully obtained the link to each joke, but some jokes on the list page are truncated; we need to open the details page to see the full text, so let's handle that in the crawler as well.
First we need to locate the title and the content.
By inspecting the element, the XPath for the title is:
//h2[@class="article-title"]
The xpath of the content is:
//div[@class="content"]
After determining the xpath location of the title and content, let's implement it in the python code.
Note: one issue needs to be solved first. The details page requires a second request, so the code needs to issue that second request and hand the response to a callback.
Code
# details page
def detail(self, response):
    title = response.xpath('//h2[@class="article-title"]/text()').extract()
    content = response.xpath('//div[@class="content"]//text()').extract()
    print("title:")
    print(title)
    print("content")
    print(content)

def parse(self, response):
    a_href_list = response.xpath(
        '//div[contains(@class, "article") and contains(@class, "block")]'
        '//a[@class="contentHerf"]/@href'
    ).extract()
    print(a_href_list)
    base_url = "https://www.qiushibaike.com"
    for a_href in a_href_list:
        url = f"{base_url}{a_href}"
        yield scrapy.Request(url=url, callback=self.detail)
Result
But each result comes back as a list, which is not quite what we want. Let's modify the code slightly so that we get plain text instead.
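The original figure is not reproduced here, but one way to make that modification is to join the extracted text fragments into plain strings, roughly like this (a sketch, not necessarily the author's exact code):
Code
# details page, returning plain strings instead of lists
def detail(self, response):
    title = response.xpath('//h2[@class="article-title"]/text()').extract_first(default='').strip()
    content = ''.join(response.xpath('//div[@class="content"]//text()').extract()).strip()
    print("title:", title)
    print("content:", content)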
Summary of the commands above
Create a crawler project
scrapy startproject <project name>
Create spiders
scrapy genspider <spider name> <start url>
Start the crawler. The --nolog parameter suppresses the log output and is generally used for debugging; with it, only print output is shown.
scrapy crawl <spider name> [--nolog]
That's all for "how to use the Scrapy web crawler framework". I hope the content above is of some help and that you have learned something from it. If you think the article is good, please share it so more people can see it.