
How to use Scrapy to crawl web pages

2025-01-17 Update, from SLTechnology News & Howtos

This article explains in detail how to crawl web pages with Scrapy. It is shared here for reference, and I hope you will have a solid understanding of the relevant concepts after reading it.

Scrapy is a fast, high-level web crawling and web scraping framework for crawling websites and extracting structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

As usual, install it with pip install scrapy before using it. If you hit an error during installation, it is usually "error: Microsoft Visual C++ 14.0 is required". In that case, visit https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and download Twisted-19.2.1-cp37-cp37m-win_amd64.whl. Note that cp37 corresponds to my local Python version (3.7) and amd64 corresponds to my 64-bit operating system.

Install the wheel with pip install Twisted-19.2.1-cp37-cp37m-win_amd64.whl, then run pip install scrapy again and it will install successfully. Once it is installed, we can use the scrapy command to create a crawler project.
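To recap, the install sequence looks like this (the exact wheel filename here matches my Python 3.7 on 64-bit Windows; pick the one matching your own environment):

pip install Twisted-19.2.1-cp37-cp37m-win_amd64.whl
pip install scrapy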

Next, open cmd on the desktop and create a project with scrapy startproject webtutorial:

A webtutorial folder will be generated on the desktop. Let's look at the directory structure below:
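Since the original screenshot is not reproduced here, this is a sketch of the standard layout that scrapy startproject generates (minor details vary by Scrapy version):

webtutorial/
    scrapy.cfg            # deploy configuration file
    webtutorial/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # folder where the spiders live
            __init__.py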

Then we create a new quotes_spider.py in the spiders folder and write a spider that crawls http://quotes.toscrape.com (a practice site that lists famous quotes) and saves each page as an HTML file.

The code is as follows:

import scrapy

# Define the spider class
class QuotesSpider(scrapy.Spider):
    # The spider name, used later with "scrapy crawl"
    name = "quotes"

    # Generate the initial requests
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # parse() writes the returned content to an html file
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
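As an aside, Scrapy also supports a shorter equivalent form: give the spider a start_urls list and omit start_requests() entirely; Scrapy then builds the initial requests itself and routes each response to parse() by default. A minimal sketch of the same spider in that style:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Scrapy generates the initial requests from start_urls
    # and sends each response to parse() automatically
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)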

quotes_spider.py now sits in the spiders folder of the project.

Then we switch to the webtutorial folder on the command line and execute scrapy crawl quotes (quotes is the spider name we just specified):

This fails with the error No module named 'win32api', so we need to install the win32api module.

Install it with pip install pypiwin32, then run scrapy crawl quotes again:

This time the crawl task executes successfully, and two HTML files, quotes-1.html and quotes-2.html, are generated in the webtutorial folder.
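Saving raw HTML is only a first step; as noted at the start, Scrapy's main strength is extracting structured data with selectors. As a hedged sketch (the CSS classes below are assumptions based on the markup of quotes.toscrape.com), parse() could instead yield one item per quote:

    def parse(self, response):
        # Each quote on the page lives in a div with class "quote"
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

Running scrapy crawl quotes -o quotes.json would then write the extracted items to a JSON file instead of saving the raw pages.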

That is all on how to crawl web pages with Scrapy. I hope the content above is helpful and encourages you to learn more. If you found the article useful, please share it so that more people can see it.
