2025-04-05 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article gives a detailed introduction to crawling images with Scrapy's Media Pipeline. Interested readers can refer to it; I hope it is helpful to you.
Preface
When scraping, besides grabbing text data, we sometimes need to grab images too. Can Scrapy crawl images? Of course it can. Embarrassingly, I only learned this last month, when some friends among the zone7 readers asked how to crawl image data with Scrapy and I had to look it up. Here is a summary to share.
Media Pipeline
Scrapy's item pipeline can handle more than text: it can also save files and images, via the built-in FilesPipeline and ImagesPipeline.
Files Pipeline
Avoid re-downloading data that has been downloaded recently
Specify storage path
A typical workflow for FilesPipeline is as follows:
In a spider, you scrape an item and put the file URLs into its file_urls field.
The item is returned from the spider and enters the item pipeline.
When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download by the Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), at high priority, so they are processed before other pages are crawled. The item remains "locked" in this particular pipeline stage until the file downloads complete (or fail for some reason).
When the files have been downloaded, another field (files) is populated with the results. It contains a list of dicts with information about each downloaded file, such as the download path, the source URL (taken from the file_urls field), and the file's checksum. The order of the files list matches the original file_urls field. If a file fails to download, an error message is logged and that file simply does not appear in files.
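As a rough sketch of this contract (the URLs and values here are made up, not real downloads), an item enters the pipeline with only file_urls set and leaves with a parallel files list:

```python
# Hypothetical item before the pipeline runs: only 'file_urls' is set.
item = {'file_urls': ['http://example.com/a.pdf', 'http://example.com/b.pdf']}

# After the pipeline runs, 'files' holds one dict per successfully
# downloaded file, in the same order as 'file_urls'; failed URLs are
# simply absent from the list.
item['files'] = [
    {'url': url, 'path': 'full/<sha1-of-url>.pdf', 'checksum': '<md5>'}
    for url in item['file_urls']
]

# The order of 'files' mirrors 'file_urls'.
assert [f['url'] for f in item['files']] == item['file_urls']
```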
Images Pipeline
Avoid re-downloading data that has been downloaded recently
Specify storage path
Convert all downloaded pictures to a common format (JPG) and mode (RGB)
Thumbnail generation
Detect the width / height of images to ensure that they meet the minimum limits
Enable Media Pipeline

import os

# enable both the image and file pipelines
ITEM_PIPELINES = {
    'girlScrapy.pipelines.ImgPipeline': 1,  # change this to the path of your own ImgPipeline
}
FILES_STORE = os.getcwd() + '/girlScrapy/file'  # file storage path
IMAGES_STORE = os.getcwd() + '/girlScrapy/img'  # image storage path
# skip files that were already downloaded in the last 90 days
FILES_EXPIRES = 90
# skip images that were already downloaded in the last 30 days
IMAGES_EXPIRES = 30
# thumbnail sizes to generate
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (250, 250),
}
# size filter: images smaller than this height/width are not downloaded
IMAGES_MIN_HEIGHT = 128
IMAGES_MIN_WIDTH = 128
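One pitfall with the last two settings: a stray trailing comma after the value (easy to pick up from a copy-pasted listing) silently turns the int into a one-element tuple, which breaks the size comparison:

```python
# with a trailing comma, this is the tuple (128,), not the int 128
IMAGES_MIN_HEIGHT = 128,
print(type(IMAGES_MIN_HEIGHT))  # <class 'tuple'>

# without the comma, it is a plain int, which is what the size filter expects
IMAGES_MIN_HEIGHT = 128
print(type(IMAGES_MIN_HEIGHT))  # <class 'int'>
```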
It is important to note that the downloaded image ends up named after the SHA-1 hash of the image URL, for example:
0bddea29939becd7ad1e4160bbb4ec2238accbd9.jpg
The final save address is:
Your/img/path/full/0bddea29939becd7ad1e4160bbb4ec2238accbd9.jpg
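You can predict the stored name by hashing the URL yourself. This sketch mirrors that naming scheme (SHA-1 hex digest of the URL plus the extension); the URL here is made up:

```python
import hashlib

def image_file_name(url, ext='.jpg'):
    """SHA-1 hex digest of the URL, plus the file extension."""
    return hashlib.sha1(url.encode('utf8')).hexdigest() + ext

name = image_file_name('http://example.com/pics/01.jpg')
print(name)       # 40 hex characters followed by '.jpg'
print(len(name))  # 44
```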
Use ImgPipeline
This is the ImgPipeline from my demo; it overrides two methods.

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class ImgPipeline(ImagesPipeline):  # inherit from ImagesPipeline
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
They are:
get_media_requests(self, item, info)
item_completed(self, results, item, info)
get_media_requests(self, item, info):
Here we receive the item produced in parse(), so we can read the image URLs from it. Yielding a scrapy.Request(image_url) here triggers the download of each image.
item_completed(self, results, item, info):
Printing item and info shows the list of URL addresses, while results prints as the value below.
# success
[(True, {'path': 'full/0bddea29939becd7ad1e4160bbb4ec2238accbd9.jpg',
         'checksum': '98eb559631127d7611b499dfed0b6406',
         'url': 'http://mm.chinasareview.com/wp-content/uploads/2017a/06/13/01.jpg'})]
# error
[(False,
  Failure(...))]
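The (ok, info) pairs in results can be filtered exactly the way item_completed does above; here is a self-contained sketch with fabricated values:

```python
# Fabricated results list in the same shape as the output shown above:
results = [
    (True,  {'path': 'full/abc.jpg', 'checksum': 'd41d8cd9...',
             'url': 'http://example.com/1.jpg'}),
    (False, Exception('download failed')),  # stands in for the real Failure object
]

# Keep only the paths of successful downloads, skipping failures.
image_paths = [info['path'] for ok, info in results if ok]
print(image_paths)  # ['full/abc.jpg']
```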
Grabbing the girl pictures
OK, the theory is done, so let's put it into practice.
Spider
The spider section is simple, as follows:
import scrapy
from bs4 import BeautifulSoup
from girlScrapy.items import ImgItem  # the item defined below

class GirlSpider(scrapy.spiders.Spider):
    name = 'girl'
    start_urls = ["http://www.meizitu.com/a/3741.html"]

    def parse(self, response):
        soup = BeautifulSoup(response.body, 'html5lib')
        pic_list = soup.find('div', id="picture").find_all('img')  # find all images on the page
        link_list = []
        item = ImgItem()
        for i in pic_list:
            pic_link = i.get('src')  # the concrete URL of the image
            link_list.append(pic_link)  # collect the image links
        item['image_urls'] = link_list
        print(item)
        yield item
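The extraction step in parse() (collect the src of every img inside the picture div) can also be sketched with just the standard library's html.parser, no BeautifulSoup needed; the HTML below is made up for illustration:

```python
from html.parser import HTMLParser

class ImgSrcParser(HTMLParser):
    """Collects the src of every <img> inside <div id="picture">."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside the target div
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and (self.depth or attrs.get('id') == 'picture'):
            self.depth += 1
        elif tag == 'img' and self.depth and 'src' in attrs:
            self.srcs.append(attrs['src'])

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

html = ('<div id="picture">'
        '<img src="http://example.com/01.jpg">'
        '<img src="http://example.com/02.jpg">'
        '</div>'
        '<img src="http://example.com/outside.jpg">')
parser = ImgSrcParser()
parser.feed(html)
print(parser.srcs)  # ['http://example.com/01.jpg', 'http://example.com/02.jpg']
```

Note that the image outside the div is excluded, matching what soup.find('div', id="picture").find_all('img') returns.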
Item

class ImgItem(scrapy.Item):
    image_urls = scrapy.Field()  # the image links
    images = scrapy.Field()      # filled in by the pipeline with download results
ImgPipeline

class ImgPipeline(ImagesPipeline):  # inherit from ImagesPipeline
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
Start

scrapy crawl girl
The final crawl result is as follows:
That's all on crawling images with Media Pipeline. I hope the content above is helpful and that you learned something from it. If you found the article useful, feel free to share it so more people can see it.