
How to build a web crawler with Scrapy


This article explains how to use Scrapy to build a simple web crawler. The material is simple and clear, and easy to learn and follow.

Let's take a look at how Scrapy handles this. First, prepare the Scrapy environment: install Python (v2.7 in this article) and pip, then use pip to install lxml and scrapy. I strongly recommend using virtualenv to set up the environment so that different projects do not conflict with each other; I won't go into those steps in detail here. Mac users should note that installing lxml with pip may fail with an error similar to the following:

Error: #include "xml/xmlversion.h" not found

To solve this problem, first install Xcode's command line tools by running the following command:

$ xcode-select --install
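Putting it together, a typical environment setup with virtualenv might look something like this (the environment name "scrapy-env" is just an example):

$ pip install virtualenv
$ virtualenv scrapy-env
$ source scrapy-env/bin/activate
$ pip install lxml scrapy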

After the environment is installed, let's use Scrapy to implement a simple crawler that grabs the article titles, URLs, and summaries from this blog site.
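The code listing for this step did not survive in this copy of the article; it presumably created the Scrapy project. A minimal sketch, assuming the project name "my_crawler" that the rest of the article uses:

$ scrapy startproject my_crawler

This generates the standard Scrapy project layout:

my_crawler/
    scrapy.cfg
    my_crawler/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py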


First, set up the fields for the content to be crawled: in this case, the article's title, URL, and summary.

Modify the "items.py" file and add the following field definitions to the "MyCrawlerItem" class:

Python

# -*- coding: utf-8 -*-
import scrapy

class MyCrawlerItem(scrapy.Item):
    title = scrapy.Field()    # article title
    url = scrapy.Field()      # article URL
    summary = scrapy.Field()  # article summary
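Next comes the spider itself, which uses XPath expressions to pick these fields out of each page. The spider code is missing from this copy of the article, so here is a minimal sketch of what such a spider could look like; the file name, spider name, start URL, and XPath expressions below are illustrative assumptions, not the original author's code:

Python

# -*- coding: utf-8 -*-
# my_crawler/spiders/article_spider.py (hypothetical file name)
import scrapy

from my_crawler.items import MyCrawlerItem

class ArticleSpider(scrapy.Spider):
    name = "my_crawler"                       # hypothetical spider name
    allowed_domains = ["example.com"]         # assumption: replace with the target site
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # assumption: each article is wrapped in a <div class="article"> block
        for article in response.xpath('//div[@class="article"]'):
            item = MyCrawlerItem()
            item['title'] = article.xpath('.//h2/a/text()').extract_first()
            item['url'] = article.xpath('.//h2/a/@href').extract_first()
            item['summary'] = article.xpath('.//p[@class="summary"]/text()').extract_first()
            yield item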

For readers who are not familiar with XPath, you can get an element's XPath through Chrome's developer tools (right-click the element, choose Inspect, then Copy > Copy XPath).


Save the results to the database.

Here we use MongoDB; you first need to install pymongo, the Python MongoDB driver. Edit the "pipelines.py" file in the "my_crawler" directory and add the following code to the "MyCrawlerPipeline" class:

Python

# -*- coding: utf-8 -*-
import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem

class MyCrawlerPipeline(object):
    def __init__(self):
        # set up the MongoDB connection
        connection = pymongo.Connection(
            settings['MONGO_SERVER'],
            settings['MONGO_PORT']
        )
        db = connection[settings['MONGO_DB']]
        self.collection = db[settings['MONGO_COLLECTION']]

    # handle each crawled MyCrawlerItem
    def process_item(self, item, spider):
        valid = True
        for data in item:
            if not item[data]:  # filter out items with empty fields
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            # you could also use self.collection.insert(dict(item));
            # update with upsert=True prevents duplicate records
            self.collection.update({'url': item['url']}, dict(item), upsert=True)
        return item

Then open the "settings.py" file in the "my_crawler" directory and add the pipeline settings at the end of the file:

Python

ITEM_PIPELINES = {
    # there can be multiple pipelines; the value is the execution priority
    'my_crawler.pipelines.MyCrawlerPipeline': 300,
}

# MongoDB connection information
MONGO_SERVER = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'bjhee'
MONGO_COLLECTION = 'articles'

DOWNLOAD_DELAY = 2  # if the network is slow, you can add some delay (in seconds)
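With the item, spider, pipeline, and settings in place, the crawler can be run from the project directory. Assuming the hypothetical spider name "my_crawler" from the sketch above:

$ scrapy crawl my_crawler

Afterwards, the crawled articles should appear in the "articles" collection of the "bjhee" database, which you can verify in the mongo shell with db.articles.find().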
