Example Analysis of the Python Crawler Scrapy Framework

2025-03-29 Update From: SLTechnology News&Howtos


This article walks through an example analysis of the Python crawler framework Scrapy, focusing on its media pipeline. The walkthrough is fairly detailed and should have some reference value; interested readers are encouraged to follow along to the end!

1. Media pipeline

1.1 Characteristics of the media pipeline

The media pipeline implements the following features:

Avoid re-downloading recently downloaded media

Specify storage location (file system directory, Amazon S3 bucket, Google Cloud Storage bucket)
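The "avoid re-downloading" rule works by age: a cached file is reused if it is newer than the expiration setting (90 days by default, as shown in the settings below). A minimal stand-alone sketch of that idea; `is_fresh` and its parameter names are illustrative, not Scrapy's internal API:

```python
# Illustrative sketch of the "avoid re-downloading recent media" rule.
# Scrapy's pipelines compare a stored file's timestamp against
# FILES_EXPIRES / IMAGES_EXPIRES; this helper is made up for clarity.
SECONDS_PER_DAY = 86400

def is_fresh(last_downloaded_ts, now_ts, expires_days=90):
    """Return True when the cached copy is recent enough to skip re-downloading."""
    age_days = (now_ts - last_downloaded_ts) / SECONDS_PER_DAY
    return age_days < expires_days

print(is_fresh(0, 30 * SECONDS_PER_DAY))   # 30 days old  -> True (reuse cache)
print(is_fresh(0, 120 * SECONDS_PER_DAY))  # 120 days old -> False (re-download)
```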

Image pipes have some additional image processing capabilities:

Convert all downloaded images to a common format (JPG) and mode (RGB)

Generate thumbnails

Check the width/height of each image to filter out images below a minimum size
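The minimum-size check amounts to a simple predicate over both dimensions; a sketch using the same `IMAGES_MIN_WIDTH` / `IMAGES_MIN_HEIGHT` values configured later (the function name is illustrative, not Scrapy's internal code):

```python
IMAGES_MIN_WIDTH = 110   # same setting names as in the media pipeline settings
IMAGES_MIN_HEIGHT = 110

def passes_size_filter(width, height,
                       min_width=IMAGES_MIN_WIDTH, min_height=IMAGES_MIN_HEIGHT):
    # an image is kept only when BOTH dimensions meet the minimum
    return width >= min_width and height >= min_height

print(passes_size_filter(270, 270))  # large enough -> True
print(passes_size_filter(300, 50))   # too short    -> False
```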

1.2 Media pipeline settings

```python
# Enable the image pipeline (the number is the pipeline's priority)
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 120}

FILES_STORE = '/path/to/valid/dir'    # file pipeline storage location
IMAGES_STORE = '/path/to/valid/dir'   # image pipeline storage location

FILES_URLS_FIELD = 'field_name_for_your_files_urls'           # custom file URL field
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'    # custom file result field
IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'         # custom image URL field
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'  # custom image result field

FILES_EXPIRES = 90    # file expiration time, default 90 days
IMAGES_EXPIRES = 90   # image expiration time, default 90 days

IMAGES_THUMBS = {'small': (50, 50), 'big': (270, 270)}  # thumbnail sizes
IMAGES_MIN_HEIGHT = 110       # filter out images shorter than this
IMAGES_MIN_WIDTH = 110        # filter out images narrower than this
MEDIA_ALLOW_REDIRECTS = True  # whether redirects are followed
```

2. Introduction to the ImagesPipeline class

```python
# The main methods of scrapy.pipelines.images.ImagesPipeline:

def __init__(self, store_uri, download_func=None, settings=None):
    # parses the configuration fields from settings
    ...

def image_downloaded(self, response, request, info):
    # handles a downloaded image
    ...

def get_images(self, response, request, info):
    # minimum-size filtering and thumbnail generation
    ...

def convert_image(self, image, size=None):
    # converts the image format
    ...

def get_media_requests(self, item, info):
    # turns each URL into a Request sent to the engine; can be overridden
    return [Request(x) for x in item.get(self.images_urls_field, [])]

def item_completed(self, results, item, info):
    # called once all requests for an item finish; override this method
    # to control the final file names
    ...

def file_path(self, request, response=None, info=None):
    # storage path of a downloaded file
    ...

def thumb_path(self, request, thumb_id, response=None, info=None):
    # storage path of a thumbnail
    ...
```

3. A small case: crawl Baidu Images using the image pipeline
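By default, `file_path` names each image after the SHA1 hash of its URL, stored under a `full/` subdirectory, which is exactly why the case below renames files afterwards. A small sketch of that default naming scheme, assuming the `.jpg` suffix produced by the pipeline's format conversion:

```python
import hashlib

def default_image_path(url):
    # mirrors the default ImagesPipeline naming: full/<sha1 of the url>.jpg
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return f'full/{guid}.jpg'

p = default_image_path('http://example.com/a.jpg')
print(p.startswith('full/'), p.endswith('.jpg'), len(p))  # True True 49
```

The 40-character hex digest plus the `full/` prefix and `.jpg` suffix gives a 49-character path, so file names never collide for distinct URLs.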

(Of course, you can also crawl Baidu Images without the image pipeline, but then you have to analyze the page's code yourself, which is a bit of a hassle. Using the image pipeline saves that step.)

3.1 The spider file

Note: since we need to send the full set of request headers, we have to override the start_requests function.

```python
import re

import scrapy

from ..items import DbimgItem


class DbSpider(scrapy.Spider):
    name = 'db'
    # allowed_domains = ['xxx.com']
    start_urls = ['https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111110&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E7%8B%97&oq=%E7%8B%97&rsp=-1']

    def start_requests(self):
        # because all the request headers need to be sent, start_requests is overridden
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Cache-Control": "max-age=0",
            "Connection": "keep-alive",
            # "Cookie": omitted here; the value is session-specific, so copy your own from the browser
            "Host": "image.baidu.com",
            "Referer": "https://image.baidu.com/",
            "sec-ch-ua": '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
            "sec-ch-ua-mobile": "?0",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "same-origin",
            "Sec-Fetch-User": "?1",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36",
        }
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers, callback=self.parse, dont_filter=True)

    def parse(self, response):
        img_urls = re.findall('"thumbURL":"(.*?)"', response.text)
        # print(img_urls)
        item = DbimgItem()
        item['image_urls'] = img_urls
        yield item
```

3.2 The items file

```python
import scrapy


class DbimgItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
```

3.3 The settings file

```python
ROBOTSTXT_OBEY = False

# open the pipeline we wrote
ITEM_PIPELINES = {
    # 'dbimg.pipelines.DbimgPipeline': 300,
    'dbimg.pipelines.ImgPipe': 300,
}

# image storage location
IMAGES_STORE = 'D:/python test/crawler/scrapy6/dbimg/imgs'
```

3.4 The pipelines file

```python
import os

from scrapy.pipelines.images import ImagesPipeline

import settings

# For reference, the default ImagesPipeline.item_completed looks like this:
#
# def item_completed(self, results, item, info):
#     with suppress(KeyError):
#         ItemAdapter(item)[self.images_result_field] = [x for ok, x in results if ok]
#     return item


class ImgPipe(ImagesPipeline):
    num = 0

    # override this function to rename the downloaded images;
    # otherwise each image name is a string of digits and letters (a hash)
    def item_completed(self, results, item, info):
        # print('results:', results)  # inspect results's format first, then pull out the values we need
        images_path = [x['path'] for ok, x in results if ok]
        for image_path in images_path:
            os.rename(settings.IMAGES_STORE + "/" + image_path,
                      settings.IMAGES_STORE + "/" + str(self.num) + ".jpg")
            self.num += 1
        return item
```

Results:

The above is the full content of "Example Analysis of the Python Crawler Scrapy Framework". Thank you for reading! I hope the content is helpful; for more related knowledge, welcome to follow the industry information channel!
