2025-04-03 Update From: SLTechnology News&Howtos shulou
Shulou(Shulou.com)06/02 Report--
This article shows how to implement a simple Sogou image downloader in Python with Scrapy. It should be a useful reference for anyone interested in the topic.
Target site description
The site to be collected this time is the Sogou Pictures channel. The channel's data is returned directly by an API. Two example requests are as follows:
    https://pic.sogou.com/napi/pc/recommend?key=homeFeedData&category=feed&start=10&len=10
    https://pic.sogou.com/napi/pc/recommend?key=homeFeedData&category=feed&start=20&len=10
Only the start parameter changes between requests (it grows by 10 while len stays at 10), so pagination is relatively simple to implement.
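The URL pattern can be sketched as follows. This is a minimal illustration, assuming that start always advances by the value of len (10) and that the first page starts at 0. The names BASE_URL, PAGE_SIZE and page_urls are made up for this example.

```python
# URL template taken from the article; only `start` varies.
BASE_URL = (
    "https://pic.sogou.com/napi/pc/recommend"
    "?key=homeFeedData&category=feed&start={}&len=10"
)
PAGE_SIZE = 10  # assumption: matches len=10 in the URL


def page_urls(pages):
    """Return the API URLs for the first `pages` pages."""
    return [BASE_URL.format(page * PAGE_SIZE) for page in range(pages)]


for url in page_urls(3):
    print(url)
```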
Write the core spider file
    import scrapy


    class SgSpider(scrapy.Spider):
        name = 'sg'
        allowed_domains = ['pic.sogou.com']
        base_url = "https://pic.sogou.com/napi/pc/recommend?key=homeFeedData&category=feed&start={}&len=10"
        # Only the first page (start=0) is requested here
        start_urls = [base_url.format(0)]

        def parse(self, response):
            # response.json() requires Scrapy 2.2 or later
            json_data = response.json()
            if json_data is not None:
                img_list = json_data["data"]["list"]
                for img in img_list:
                    # One item per feed entry, holding all of its image URLs
                    yield {'image_urls': [_["originImage"] for _ in img[0]["picList"]]}
            else:
                return None
The code above requests the first page of the interface directly and then extracts the image addresses from the returned JSON.
The most important line of code is as follows:
    yield {'image_urls': [_["originImage"] for _ in img[0]["picList"]]}
The image_urls key is a fixed name: by default it is the field that Scrapy's built-in ImagesPipeline reads to find the images to download.
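To make the extraction step easier to follow, here is a small standalone sketch that runs the same list comprehension without Scrapy. The sample payload is hypothetical: it only mimics the fields the spider reads (data, list, picList, originImage), and the example.com addresses are placeholders. The function name extract_image_urls is likewise made up.

```python
def extract_image_urls(json_data):
    """Return one list of originImage URLs per feed entry."""
    return [
        [pic["originImage"] for pic in entry[0]["picList"]]
        for entry in json_data["data"]["list"]
    ]


# Hypothetical payload, shaped like the fields the spider reads.
sample = {
    "data": {
        "list": [
            [{"picList": [{"originImage": "https://example.com/a.jpg"},
                          {"originImage": "https://example.com/b.jpg"}]}],
            [{"picList": [{"originImage": "https://example.com/c.jpg"}]}],
        ]
    }
}

print(extract_image_urls(sample))
```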
The settings.py file also needs to be modified, as follows:
    # User agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'

    # Do not obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Download interval set to 3 seconds
    DOWNLOAD_DELAY = 3

    # Default request headers
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'HOST': 'pic.sogou.com',
    }

    # Enable the ImagesPipeline image storage pipeline
    ITEM_PIPELINES = {
        'scrapy.pipelines.images.ImagesPipeline': 1,
    }

    # Image storage folder
    IMAGES_STORE = "images"
Run the code and the images are downloaded automatically and saved to the images directory. This run only collects the first page of data, which yields 40 images.
If your code does not download the target images after running, check whether the following error appears:
ImagesPipeline requires installing Pillow 4.0.0
The solution is simple: install the Pillow library (for example, with pip install Pillow).
Another problem is that the saved file names are generated dynamically and look a little messy.
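The messy names come from the way the built-in pipeline names files. As far as I know, recent Scrapy versions store each image under full/ using the SHA-1 hash of its URL as the file name. The sketch below only mimics that scheme with the standard library. The function name default_image_path is made up, and the exact behavior may differ between Scrapy versions.

```python
import hashlib


def default_image_path(url):
    """Mimic the default naming scheme: full/<sha1 of the URL>.jpg."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{image_guid}.jpg"


# The result is a 40-character hex digest, which is why the names look random.
print(default_image_path("https://example.com/a.jpg"))
```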
Add an item pipeline that sets a custom file name to the pipelines.py file.
    from scrapy import Request
    from scrapy.pipelines.images import ImagesPipeline


    class SogouImgPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            name = item["name"]
            for index, url in enumerate(item["image_urls"]):
                # Pass the title and position along with each download request
                yield Request(url, meta={'name': name, 'index': index})

        # Newer Scrapy versions also pass item as a keyword argument
        def file_path(self, request, response=None, info=None, *, item=None):
            # Name
            name = request.meta['name']
            # Index
            index = request.meta['index']
            filename = u'{0}_{1}.jpg'.format(name, index)
            print(filename)
            return filename
The main job of the code above is to rename the image files. For the new pipeline to take effect, register SogouImgPipeline in ITEM_PIPELINES in settings.py in place of the built-in ImagesPipeline. Then modify the corresponding code in the SgSpider class so that each item also carries a name:
    def parse(self, response):
        json_data = response.json()
        if json_data is not None:
            img_list = json_data["data"]["list"]
            for img in img_list:
                yield {
                    'name': img[0]['title'],
                    'image_urls': [_["originImage"] for _ in img[0]["picList"]],
                }
        else:
            return None
Run the code again. The saved images now have file names that are much easier to identify.
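One caveat: because the file name is now built from the title, a title containing characters such as / or : could produce an unexpected path. The following is a minimal sketch of a hypothetical helper, safe_filename, that replaces such characters before building the name. It is not part of the original code, and the set of characters to replace is an assumption.

```python
import re


def safe_filename(name, index):
    """Build '<name>_<index>.jpg', replacing characters unsafe in file names."""
    # Collapse runs of path separators, reserved characters and whitespace.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_")
    # Fall back to a placeholder when nothing usable is left.
    return "{0}_{1}.jpg".format(cleaned or "untitled", index)


print(safe_filename("sunset / beach: day 1", 0))
```

In the pipeline, file_path could return safe_filename(name, index) instead of formatting the string directly.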
Finally, implementing the next-page logic to complete this case is left to you as an exercise.
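As a starting point for that exercise, one possible approach is to derive the next URL from the URL of the current response. The sketch below uses only the standard library. The function name next_page_url and the max_start cap are assumptions made for this example, and the real API may signal the last page differently.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, max_start=50):
    """Return the URL of the next page, or None once max_start is passed."""
    parts = urlsplit(current_url)
    # parse_qs returns lists of values; keep the first value of each key.
    query = {k: v[0] for k, v in parse_qs(parts.query).items()}
    next_start = int(query["start"]) + int(query["len"])
    if next_start > max_start:
        return None  # hypothetical stop condition
    query["start"] = str(next_start)
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Inside parse, after yielding the items, the spider could call next_page_url(response.url) and, when the result is not None, yield scrapy.Request(next_url, callback=self.parse).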