
How to crawl pictures with Scrapy's Media Pipeline

Updated 2025-04-05. From: SLTechnology News&Howtos (shulou)

Shulou (Shulou.com), 06/02

This article explains how to download pictures with Scrapy's Media Pipeline. The walkthrough is detailed, and I hope it is helpful to anyone interested.

Preface

When scraping data, you sometimes need to download pictures as well as text. Can Scrapy download pictures? Of course it can. I'm ashamed to say I only learned this last month, when some zone7 followers asked how to crawl image data with Scrapy and I had to look it up. Here is a summary of what I found.

Media Pipeline

Item pipelines can handle more than text. Scrapy also provides two media pipelines for saving files and pictures: FilesPipeline and ImagesPipeline.

Files Pipeline

Avoid re-downloading data that has been downloaded recently

Specify storage path

A typical FilesPipeline workflow is as follows:

In a spider, you scrape an item and put the URLs of the files to download in its file_urls field.

The item is returned from the spider and enters the item pipeline.

When the item reaches FilesPipeline, the URLs in the file_urls field are scheduled for download with the standard Scrapy scheduler and downloader (so the scheduler and downloader middlewares are reused), but with a higher priority, so they are processed before other pages are scraped. The item remains "locked" at this pipeline stage until the files have finished downloading (or have failed for some reason).

When the files are downloaded, another field (files) is populated with the results. This field contains a list of dicts with information about each downloaded file, such as the download path, the original URL (taken from the file_urls field), and the file checksum. The entries in the files list keep the same order as the file_urls field. If a file fails to download, an error is logged and the file does not appear in the files field.
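The bookkeeping in the last step can be sketched in plain Python. This is only an illustration of how the files field relates to file_urls, not Scrapy's actual implementation; the helper name build_files_field and the sample URLs, paths, and checksums are made up for the example.

```python
def build_files_field(file_urls, download_results):
    """Return the list that would be stored in item['files'].

    download_results maps each URL to a dict with 'path' and 'checksum'
    when the download succeeded, or to None when it failed.
    (Hypothetical helper; Scrapy does this bookkeeping internally.)
    """
    files = []
    for url in file_urls:  # walk file_urls in order so the order is preserved
        info = download_results.get(url)
        if info is None:
            # A failed download is only reported; it is left out of the result
            print(f"download failed: {url}")
            continue
        files.append({"url": url, "path": info["path"], "checksum": info["checksum"]})
    return files


item = {"file_urls": ["http://example.com/a.pdf", "http://example.com/b.pdf"]}
results = {
    "http://example.com/a.pdf": {"path": "full/aaa.pdf", "checksum": "111"},
    "http://example.com/b.pdf": None,  # pretend this download failed
}
item["files"] = build_files_field(item["file_urls"], results)
```

Here item["files"] ends up with a single entry for a.pdf, because the failed b.pdf is skipped.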

Images Pipeline

Avoid re-downloading data that has been downloaded recently

Specify storage path

Convert all downloaded pictures to a common format (JPG) and mode (RGB)

Thumbnail generation

Detect the width / height of images to ensure that they meet the minimum limits

Enable Media Pipeline

    import os

    # Enable the media pipeline; replace the path with your own ImgPipeline
    ITEM_PIPELINES = {
        'girlScrapy.pipelines.ImgPipeline': 1,
    }

    FILES_STORE = os.getcwd() + '/girlScrapy/file'   # file storage path
    IMAGES_STORE = os.getcwd() + '/girlScrapy/img'   # picture storage path

    # Do not re-download files that were downloaded in the last 90 days
    FILES_EXPIRES = 90

    # Do not re-download pictures that were downloaded in the last 30 days
    IMAGES_EXPIRES = 30

    # Thumbnail sizes to generate
    IMAGES_THUMBS = {
        'small': (50, 50),
        'big': (250, 250),
    }

    # Image filter: pictures below this minimum height or width are not saved
    IMAGES_MIN_HEIGHT = 128
    IMAGES_MIN_WIDTH = 128
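To make the effect of the expiry and size settings concrete, here is a small sketch in plain Python. It mirrors the idea behind the settings rather than Scrapy's own code; the function names is_fresh and passes_size_filter are made up for illustration.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def is_fresh(last_modified, expires_days, now=None):
    """True if a stored file is no older than expires_days, so it is not downloaded again."""
    now = time.time() if now is None else now
    age_days = (now - last_modified) / SECONDS_PER_DAY
    return age_days <= expires_days


def passes_size_filter(width, height, min_width=128, min_height=128):
    """True if the picture is at least min_width x min_height pixels."""
    return width >= min_width and height >= min_height


now = time.time()
downloaded_10_days_ago = now - 10 * SECONDS_PER_DAY
downloaded_45_days_ago = now - 45 * SECONDS_PER_DAY
```

With an expiry of 30 days, the picture stored 45 days ago would be downloaded again, while an expiry of 90 days would keep a file of the same age.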

Note that each downloaded picture is saved under a name derived from the SHA-1 hash of its URL, for example:

    0bddea29939becd7ad1e4160bbb4ec2238accbd9.jpg

The final save path, where your/img/path is the value of IMAGES_STORE, is:

    your/img/path/full/0bddea29939becd7ad1e4160bbb4ec2238accbd9.jpg
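The file name can be reproduced with the standard library. The sketch below assumes the default naming scheme (SHA-1 of the URL, stored under full/ with a .jpg extension); the function name default_image_path and the sample URL are made up, and a pipeline that overrides file_path will name files differently.

```python
import hashlib


def default_image_path(url):
    """Build the relative storage path from the SHA-1 hash of the picture URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{digest}.jpg"


path = default_image_path("http://example.com/picture.png")
print(path)  # full/<40 hex characters>.jpg
```

Because the name depends only on the URL, requesting the same URL again maps to the same file, which is what lets the pipeline skip recently downloaded pictures.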

Use ImgPipeline

This is the ImgPipeline from my demo. It overrides two methods.

    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline


    class ImgPipeline(ImagesPipeline):  # inherits from ImagesPipeline
        def get_media_requests(self, item, info):
            # Issue one download request per picture URL in the item
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            # Keep the storage paths of the pictures that downloaded successfully
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            return item

The two methods are:

get_media_requests(self, item, info)

item_completed(self, results, item, info)

get_media_requests(self, item, info):

Here we receive the item produced in parse, so we can read the picture URLs from it. Yield a scrapy.Request(image_url) for each URL to download the picture.

item_completed(self, results, item, info):

Printing item shows the list of picture URLs. results is a list of two-element tuples, one per request: a success flag followed by either the download details or the failure. Printed out, it looks like this:

    # success
    [(True, {'path': 'full/0bddea29939becd7ad1e4160bbb4ec2238accbd9.jpg',
             'checksum': '98eb559631127d7611b499dfed0b6406',
             'url': 'http://mm.chinasareview.com/wp-content/uploads/2017a/06/13/01.jpg'})]

    # error
    [(False, Failure(...))]
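Since results is a list of (success, detail) tuples, it can be split into successes and failures with ordinary unpacking. In the sketch below a plain Exception stands in for the Failure object Scrapy passes on error, the helper name split_results is made up, and the sample paths and URLs are placeholders.

```python
def split_results(results):
    """Separate pipeline results into stored paths and failure details."""
    paths = [detail["path"] for ok, detail in results if ok]
    errors = [detail for ok, detail in results if not ok]
    return paths, errors


results = [
    (True, {"path": "full/abc.jpg", "checksum": "123", "url": "http://example.com/1.jpg"}),
    (False, Exception("download failed")),  # stand-in for a Failure object
]
paths, errors = split_results(results)
```

This is the same filtering that item_completed does with its list comprehension; keeping the failures as well makes it possible to log which pictures were missed.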

Crawl the pictures

OK, that's the theory. Now let's put it into practice.

Spider

The spider is simple:

    import scrapy
    from bs4 import BeautifulSoup

    from girlScrapy.items import ImgItem  # assumed module path; adjust to your project


    class GirlSpider(scrapy.spiders.Spider):
        name = 'girl'
        start_urls = ["http://www.meizitu.com/a/3741.html"]

        def parse(self, response):
            soup = BeautifulSoup(response.body, 'html5lib')
            # find all the pictures on the page
            pic_list = soup.find('div', id="picture").find_all('img')
            link_list = []
            item = ImgItem()
            for i in pic_list:
                pic_link = i.get('src')  # URL of the picture
                link_list.append(pic_link)  # collect the picture link
            item['image_urls'] = link_list
            print(item)
            yield item
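Download requests need absolute URLs, so if a page uses relative src attributes they must be resolved against the page URL first. The sketch below shows the same extraction step using only the standard library (html.parser and urljoin) instead of BeautifulSoup; the class name PictureLinkParser, the sample HTML, and the example.com addresses are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PictureLinkParser(HTMLParser):
    """Collect the src of every <img> inside <div id="picture">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.depth = 0  # greater than 0 while we are inside the target div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth or attrs.get("id") == "picture":
                self.depth += 1  # count nested divs so we know when the target closes
        elif tag == "img" and self.depth and attrs.get("src"):
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


html = """
<div id="picture"><p><img src="/uploads/01.jpg"><img src="http://cdn.example.com/02.jpg"></p></div>
<div id="sidebar"><img src="/ads/banner.jpg"></div>
"""
parser = PictureLinkParser("http://example.com/a/1.html")
parser.feed(html)
link_list = parser.links
```

Inside a Scrapy spider the same resolution is available as response.urljoin(pic_link), which avoids passing the page URL around by hand.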

Item

    import scrapy


    class ImgItem(scrapy.Item):
        image_urls = scrapy.Field()  # links to the pictures
        images = scrapy.Field()      # result field used by ImagesPipeline

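ImgPipeline

    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline


    class ImgPipeline(ImagesPipeline):  # inherits from ImagesPipeline
        def get_media_requests(self, item, info):
            # Issue one download request per picture URL in the item
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            # Keep the storage paths of the pictures that downloaded successfully
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            return item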

Start

    scrapy crawl girl

The final crawl result is as follows:

That is all on how to crawl pictures with Media Pipeline. I hope it was helpful. If you found the article useful, feel free to share it so more people can see it.
