This article explains the main techniques of Python crawler data operations. The content is simple and clear, and easy to follow; read along with the author's train of thought to learn how crawler data can be stored and updated.
Requirement
Crawl a website's item list pages, extract each item's url, title and other information, and store them as seed tasks for the later crawling of the detail-page urls.
Code
# -*- coding: utf-8 -*-
# @Time: 2019-11-08 14:04
# @Author: cxa
# @File: motor_helper.py
# @Software: PyCharm
import asyncio
import datetime
from loguru import logger
from motor.motor_asyncio import AsyncIOMotorClient
from collections.abc import Iterable

try:
    import uvloop
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
except ImportError:
    pass

db_configs = {
    'host': '127.0.0.1',
    'port': '27017',
    'db_name': 'mafengwo',
    'user': '',
    'passwd': ''  # empty when authentication is not enabled
}


class MotorOperation:
    def __init__(self):
        self.__dict__.update(**db_configs)
        if self.user:
            self.motor_uri = (f"mongodb://{self.user}:{self.passwd}@{self.host}:"
                              f"{self.port}/{self.db_name}?authSource={self.db_name}")
        else:
            self.motor_uri = f"mongodb://{self.host}:{self.port}/{self.db_name}"
        self.client = AsyncIOMotorClient(self.motor_uri)
        self.mb = self.client[self.db_name]

    async def save_data_with_status(self, items, col="seed_data"):
        for item in items:
            data = dict()
            data["update_time"] = datetime.datetime.now()
            data["status"] = 0  # 0: initial status
            data.update(item)
            print("data", data)
            # update the document if the url already exists, otherwise insert it
            await self.mb[col].update_one(
                {"url": item.get("url")},
                {'$set': data,
                 '$setOnInsert': {'create_time': datetime.datetime.now()}},
                upsert=True)

    async def add_index(self, col="seed_data"):
        # add an index on url, which is used as the query condition
        await self.mb[col].create_index('url')
Because my crawler is written with the asynchronous network module aiohttp, I chose motor, the asynchronous counterpart of pymongo, for the database operations.
The hallmark of asynchronous code is the async/await pair: strip the async and await keywords from the code above and it looks almost exactly like pymongo. Async is not the point here, though; the point is how each piece of data is handled.
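As a rough usage sketch (the entry point and the example urls/titles are my own assumptions, not from the original code), the helper above might be driven from an asyncio program alongside the aiohttp crawler like this:

import asyncio

async def main():
    mo = MotorOperation()
    await mo.add_index()  # make sure the url index exists
    # hypothetical items produced by the list-page crawler
    items = [
        {"url": "https://example.com/item/1", "title": "item 1"},
        {"url": "https://example.com/item/2", "title": "item 2"},
    ]
    # upsert every list-page item with status 0 so the detail crawler can pick it up
    await mo.save_data_with_status(items)

if __name__ == "__main__":
    asyncio.run(main())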
In addition to the page's url, title and other information, I add three fields: create_time, status and update_time.
They record the insertion time, the task status and the last update time, respectively.
So why did I add three fields?
First, each task needs a way to decide whether it already exists: in my case, update it if it exists and insert it if it does not. That requires a query condition to use as the update filter, and the task url is an obvious unique key (you could also hash url + title with md5 and store that instead). With that, the query condition is settled.
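For illustration, here is a small sketch of the md5 alternative mentioned above (the helper name make_task_id is my own, not from the article):

import hashlib

def make_task_id(url: str, title: str) -> str:
    # hypothetical helper: a stable md5 fingerprint of url + title,
    # usable as the unique query key instead of the raw url
    return hashlib.md5(f"{url}{title}".encode("utf-8")).hexdigest()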
create_time is straightforward: it is the time the record was inserted. The interesting question is why update_time is needed, and that is tied to the status field. The key point is that status acts as a flag for subsequent crawler runs. Currently it takes five values, 0 through 4, defined as follows:
0: initial status
1: task currently being crawled
2: crawled successfully
3: crawl failed
4: crawled successfully but did not match the task
Later, as tasks are crawled, the status keeps changing, and update_time must be refreshed to the latest time on each change. On its own that may not look useful; its real scenario is catching repeated tasks. For example, suppose today's task list contains url1 and url2 and they are captured again the next day. To distinguish a fresh crawl from a re-crawl, compare create_time with update_time: if the two are equal and match the current date, the record has just been captured for the first time; if update_time is newer than create_time, a duplicate task has been picked up. That covers the design of the fields.
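To make the status flow concrete, here is a rough sketch of two methods that could be added to MotorOperation (my own addition, not part of the original motor_helper.py), showing how a follow-up crawler might fetch pending tasks and record results with these fields:

    async def fetch_pending(self, col="seed_data", limit=100):
        # status 0 = initial: tasks that have not been crawled yet
        cursor = self.mb[col].find({"status": 0}).limit(limit)
        return [doc async for doc in cursor]

    async def set_status(self, url, status, col="seed_data"):
        # status: 1 = crawling, 2 = success, 3 = failed, 4 = success but no match.
        # update_time is refreshed on every change, so comparing it with
        # create_time later reveals whether a task was re-crawled.
        await self.mb[col].update_one(
            {"url": url},
            {"$set": {"status": status, "update_time": datetime.datetime.now()}})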
Now for the implementation. We can use the update_one method to update the record if it exists, or insert it if it does not. Because url is the query condition, it is best to add an index on it when the data volume is large; that is what the add_index method above does.
The key part worth calling out is the update document passed to update_one:
{'$set': data, '$setOnInsert': {'create_time': datetime.datetime.now()}}
Fields listed under $setOnInsert are written only when the document does not yet exist; if it already exists, they are left untouched and only the fields under $set are updated.
In addition, a field used in $setOnInsert cannot also appear in $set.
upsert=True means that a new document is inserted when no existing document matches the query.
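Putting these pieces together, here is a minimal standalone sketch of the upsert call (the connection string, database name and url value are just examples following the configuration above):

import asyncio
import datetime
from motor.motor_asyncio import AsyncIOMotorClient

async def demo_upsert():
    col = AsyncIOMotorClient("mongodb://127.0.0.1:27017")["mafengwo"]["seed_data"]
    now = datetime.datetime.now()
    # First call: no document matches the url, so one is inserted and
    # create_time is set by $setOnInsert.
    # Later calls with the same url only refresh the $set fields;
    # create_time keeps the value from the original insert.
    await col.update_one(
        {"url": "https://example.com/item/1"},
        {"$set": {"title": "item 1", "status": 0, "update_time": now},
         "$setOnInsert": {"create_time": now}},
        upsert=True)

asyncio.run(demo_upsert())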
Thank you for reading. The above covers the techniques of Python crawler data operations; after studying this article you should have a deeper understanding of them, and the specific usage still needs to be verified in practice.