This article explains how to use Scrapy to build a simple web crawler. The content is straightforward and easy to follow.
Let's take a look at how Scrapy implements these functions. First, prepare the Scrapy environment: install Python (v2.7 in this article) and pip, then use pip to install lxml and scrapy. I strongly recommend using virtualenv so that the environments of different projects do not conflict with each other; I won't go into the detailed steps here. Mac users should note that when installing lxml with pip, an error similar to the following may occur:
Error: #include "xml/xmlversion.h" not found
To solve this problem, install Xcode's command line tools first by running the following command:
$ xcode-select --install
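Once the command line tools are in place, the rest of the setup is just a few commands. A minimal sketch, assuming a virtualenv named "venv" (the name is arbitrary):
$ virtualenv venv
$ source venv/bin/activate
$ pip install lxml scrapy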
After the environment is installed, let's use Scrapy to implement a simple crawler that grabs the article titles, addresses and summaries from this blog site.
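The steps below refer to a project directory named "my_crawler", which can be generated with Scrapy's startproject command:
$ scrapy startproject my_crawler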
Set the fields of the content to be crawled: in this case, the article's title, address and summary.
Modify the "items.py" file and add the following code to the "MyCrawlerItem" class:
Python

# -*- coding: utf-8 -*-
import scrapy

class MyCrawlerItem(scrapy.Item):
    title = scrapy.Field()    # article title
    url = scrapy.Field()      # article address
    summary = scrapy.Field()  # article summary
If you are not familiar with XPath, you can get an element's XPath through Chrome's developer tools.
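The spider code itself does not appear in this excerpt. As a rough sketch only (not the author's actual spider), a spider placed in "my_crawler/spiders" that fills MyCrawlerItem via XPath might look like the following; the spider name, start URL and XPath expressions are placeholders:

Python

# -*- coding: utf-8 -*-
# Sketch only: replace the start URL and XPath expressions with the real
# values for the target site.
import scrapy

from my_crawler.items import MyCrawlerItem

class MyCrawlerSpider(scrapy.Spider):
    name = 'my_crawler'                       # assumed spider name
    start_urls = ['http://www.example.com/']  # placeholder start URL

    def parse(self, response):
        # each article block on the listing page (placeholder XPath)
        for article in response.xpath('//article'):
            item = MyCrawlerItem()
            # extract_first() requires Scrapy >= 1.0; older versions can use extract()[0]
            item['title'] = article.xpath('.//h2/a/text()').extract_first()
            item['url'] = article.xpath('.//h2/a/@href').extract_first()
            item['summary'] = article.xpath('.//p[@class="summary"]/text()').extract_first()
            yield item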
Save the results to the database
Here we use MongoDB; you need to install Python's MongoDB library "pymongo" first (for example, with "pip install pymongo"). Then edit the "pipelines.py" file in the "my_crawler" directory and add the following code to the "MyCrawlerPipeline" class:
Python

# -*- coding: utf-8 -*-
import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem

class MyCrawlerPipeline(object):
    def __init__(self):
        # set up the MongoDB connection
        # (note: pymongo.Connection was removed in pymongo 3.x; use pymongo.MongoClient there)
        connection = pymongo.Connection(
            settings['MONGO_SERVER'],
            settings['MONGO_PORT']
        )
        db = connection[settings['MONGO_DB']]
        self.collection = db[settings['MONGO_COLLECTION']]

    # handle each crawled MyCrawlerItem
    def process_item(self, item, spider):
        valid = True
        for field in item:
            if not item[field]:  # filter out items with empty fields
                valid = False
                raise DropItem("Missing {0}!".format(field))
        if valid:
            # you can also use self.collection.insert(dict(item)); upsert here prevents duplicates
            self.collection.update({'url': item['url']}, dict(item), upsert=True)
        return item
Then open the "settings.py" file in the "my_crawler" directory and add the pipeline settings at the end of the file:
Python

ITEM_PIPELINES = {
    # enable the pipeline; multiple pipelines may be listed, and the value is the execution priority
    'my_crawler.pipelines.MyCrawlerPipeline': 300,
}

# MongoDB connection information
MONGO_SERVER = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'bjhee'
MONGO_COLLECTION = 'articles'

DOWNLOAD_DELAY = 2  # if the network is slow, you can add some delay (in seconds)
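With the pipeline configured, the crawler can be run from the project directory with Scrapy's crawl command (the spider name "my_crawler" from the sketch above is assumed here):
$ scrapy crawl my_crawler
The crawled articles should then appear in the "articles" collection of the "bjhee" database.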