How do you store the saved path of downloaded images and write items out as JSON in Scrapy? This article walks through the analysis and solution step by step, hoping to help readers who want to solve this problem find a simpler, easier way.
1. The item_completed() method
Syntax: item_completed(results, item, info)
When all the image requests for a single item have completed (whether the downloads succeeded or failed), the ImagesPipeline.item_completed() method is called. item_completed() must return the output that will be sent to subsequent item pipeline stages, so you must return (or drop) the item; by default, item_completed() returns all items unchanged.
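For reference, results is a list of two-element tuples, one per image request. A sketch of its shape for a single successful download (the field values are illustrative placeholders, not taken from a real crawl):

# shape of the `results` argument received by item_completed()
results = [
    (True, {
        'url': 'https://cdn.dribbble.com/some-image.jpg',  # the image URL that was requested
        'path': '2019/<sha1-of-url>.jpg',                  # save path, as returned by file_path()
        'checksum': '<md5-of-image-contents>',             # checksum of the downloaded file
    }),
    # on failure the tuple is (False, failure) with a Twisted Failure object
]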
2. Override the item_completed method in the pipeline
Override the item_completed() method in a custom ImagePipeline (a subclass of ImagesPipeline) to get the saved path of the image; file_path() is overridden as well to control where each image is stored:
import hashlib
from datetime import datetime

from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes


class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
        # use the current year as the storage directory
        return '{}/{}.jpg'.format(datetime.now().year, image_guid)

    def item_completed(self, results, item, info):
        # collect the saved path of every successfully downloaded image
        values = [value['path'] for ok, value in results if ok]
        # assign the first path to the item, falling back to a default image
        item['image_path'] = values.pop(0) if values else 'default.jpg'
        return item
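For this overridden pipeline to run, it has to be enabled in the project settings. A minimal sketch, assuming the project package is named XKD_Dribbble_Spider (as in the registration in step 7 below) and that images should land in a local images/ directory, which is an assumption:

# settings.py (sketch)
ITEM_PIPELINES = {
    # priority 1 so it runs before the JSON pipeline registered in step 7
    'XKD_Dribbble_Spider.pipelines.ImagePipeline': 1,
}
# file_path() return values are relative to this directory (the path is an assumption)
IMAGES_STORE = './images'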
3. Create an md5 function
We can use hashlib.md5 to hash the URL. First create a package called utils in the same directory as the project settings file, then create an md5 module in that package. Import md5 from hashlib, instantiate an md5() object, feed it the URL with update(), and extract the digest with hexdigest(). isinstance() is used to check whether the incoming value is a str, and encode() converts it to bytes first, since md5 only accepts bytes.
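The resulting layout might look like this (a sketch; the module name md5_tool.py matches the import used in the spider below, and the __init__.py file makes utils an importable package):

XKD_Dribbble_Spider/
    settings.py
    items.py
    pipelines.py
    utils/
        __init__.py
        md5_tool.py

md5_tool.py then contains the helper: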
from hashlib import md5


def get_md5(url):
    if isinstance(url, str):
        # convert str to bytes first; md5 only accepts bytes
        url = url.encode()
        print(url)
    obj = md5()
    obj.update(url)
    return obj.hexdigest()


if __name__ == '__main__':
    print(get_md5('www.baidu.com'))

4. Add the fields to the item

import scrapy


class XkdDribbbleSpiderItem(scrapy.Item):
    title = scrapy.Field()
    image_url = scrapy.Field()
    date = scrapy.Field()
    # saved path of the downloaded image
    image_path = scrapy.Field()
    # URL of the page
    url = scrapy.Field()
    # md5 hash value of the URL
    url_id = scrapy.Field()

5. Return the item in the spider

import scrapy
from urllib import parse
from datetime import datetime

from scrapy.http import Request

from ..items import XkdDribbbleSpiderItem
from ..utils.md5_tool import get_md5


class DribbbleSpider(scrapy.Spider):
    name = 'dribbble'
    allowed_domains = ['dribbble.com']
    start_urls = ['https://dribbble.com/stories']

    def parse(self, response):
        # get the a-tag selectors
        a_selectors = response.css('div.teaser a')
        for a_selector in a_selectors:
            image_url = a_selector.css('img::attr(src)').extract()[0]
            page_url = a_selector.css('::attr(href)').extract()[0]
            yield Request(url=parse.urljoin(response.url, page_url),
                          callback=self.parse_analyse,
                          meta={'a_image_url': image_url})

    def parse_analyse(self, response):
        title = response.css('header h2::text').extract_first()
        image_url = response.meta.get('a_image_url')
        date_raw = response.css('p span.date::text').extract()[0]
        date_str = date_raw.strip()
        date = datetime.strptime(date_str, '%b %d, %Y').date()
        item = XkdDribbbleSpiderItem()
        item['title'] = title
        item['image_url'] = [image_url]
        item['date'] = date
        item['url'] = response.url
        item['url_id'] = get_md5(response.url)
        # hand the populated item to the pipelines for persistence
        yield item

6. Create a JsonSavePipeline to write items to a file

import codecs
import json


class JsonSavePipeline:
    def process_item(self, item, spider):
        file = codecs.open('blog.json', mode='a', encoding='utf-8')
        # convert the item returned from the spider to a dict
        dict_item = dict(item)
        # serialize to one JSON line; default=str handles the date field
        line = json.dumps(dict_item, ensure_ascii=False, default=str) + '\n'
        # write to the file, then return the item for later pipeline stages
        file.write(line)
        file.close()
        return item

7. Register the pipeline in the settings file

Add 'XKD_Dribbble_Spider.pipelines.JsonSavePipeline': 2 to the ITEM_PIPELINES setting so that it runs after the ImagePipeline.
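With both pipelines registered, a crawl should download each story's image and append one JSON object per item to blog.json. A quick check, assuming the standard Scrapy CLI and purely illustrative field values:

scrapy crawl dribbble

# one resulting line of blog.json (values illustrative)
# {"title": "Some story", "image_url": ["https://cdn.dribbble.com/some-image.jpg"], "date": "2019-08-02", "image_path": "2019/<sha1-of-url>.jpg", "url": "https://dribbble.com/stories/some-story", "url_id": "<md5-of-url>"}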