How to correctly use the FilesPipeline that comes with Scrapy

This article mainly explains how to correctly use the FilesPipeline that comes with Scrapy. The content is simple and clear and easy to learn and understand. Please follow the editor's train of thought to study "how to correctly use the FilesPipeline that comes with Scrapy".

Scrapy's built-in FilesPipeline and ImagesPipeline make it very convenient to download pictures and files, and according to the official documentation [1], enabling these two pipelines is easy.

If you just want to download pictures, you can use either FilesPipeline or ImagesPipeline; after all, pictures are files too. However, because ImagesPipeline requires the third-party library Pillow to be installed separately, we will take FilesPipeline as the example.

Suppose the crawler parses the source code of a web page and gets a picture whose address is https://kingname-1257411235.cos.ap-chengdu.myqcloud.com/640.gif; it could just as well be a png or jpg, or even a rar, pdf, or zip file.

To download this image with the FilesPipeline that comes with Scrapy, we need a few setup steps.

Define items

First, define an item. You need to make sure the item contains a file_urls field and a files field; in addition to these two required fields, you can also add other fields.
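
A minimal sketch of such an items.py (the class name FileDownloadItem is just an example):

import scrapy

class FileDownloadItem(scrapy.Item):
    file_urls = scrapy.Field()  # list of file URLs to download, read by FilesPipeline
    files = scrapy.Field()      # filled in by FilesPipeline with the download results
    # other optional fields can be added here as needed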

Start FilesPipeline

In settings.py, find the ITEM_PIPELINES configuration; if it is commented out, uncomment it. Then add the following entry:

'scrapy.pipelines.files.FilesPipeline': 1

Add another configuration item, FILES_STORE, whose value is the path of the folder where you want to save the downloaded files.

After the modification, settings.py contains both of these entries.
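
A minimal sketch of the relevant section (the folder name images is just an example):

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

# folder where downloaded files will be saved
FILES_STORE = 'images'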

Download pictures

Next, we come to the specific crawler logic. In any parse function of the crawler, extract the URLs of one or more pictures and put them into the item's file_urls field as a list.
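
A minimal sketch of such a parse method (FileDownloadItem is the example item class defined above; the URL is the example gif from earlier in this article):

def parse(self, response):
    item = FileDownloadItem()
    # put one or more picture URLs into file_urls as a list
    item['file_urls'] = ['https://kingname-1257411235.cos.ap-chengdu.myqcloud.com/640.gif']
    yield item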

Note that the files field does not need to be given any value at this point. Other, non-required fields can be set according to your needs.

Get the result

Because we set the priority of scrapy.pipelines.files.FilesPipeline to 1, the highest priority, it runs before all other pipelines. Therefore, when we inspect the item's files field in a later pipeline, we will find that the information about the downloaded picture is already in it.

The files field of the item becomes a list of dictionaries. Each dictionary has a key called path, whose value is the path of the downloaded file relative to FILES_STORE. For example, full/7f471f6dbc08c2db39125b20b0471c3b21c58f3e.gif refers to the file 7f471f6dbc08c2db39125b20b0471c3b21c58f3e.gif inside the full subfolder of the images folder.
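
For reference, in recent versions of Scrapy the files field looks roughly like this after a successful download (the checksum value is abbreviated here):

item['files'] = [
    {
        'url': 'https://kingname-1257411235.cos.ap-chengdu.myqcloud.com/640.gif',
        'path': 'full/7f471f6dbc08c2db39125b20b0471c3b21c58f3e.gif',
        'checksum': '...',  # MD5 checksum of the downloaded content
        'status': 'downloaded',
    },
]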

The file name is the SHA1 hash of the file's URL. If you want to rename the file, you can locate it in a subsequent pipeline using the value of path and then change its name.
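
A sketch of such a follow-up pipeline (the class name RenameFilePipeline and the new file name are just examples; register it in ITEM_PIPELINES with a number greater than 1 so that it runs after FilesPipeline):

import os

class RenameFilePipeline:
    def process_item(self, item, spider):
        files_store = spider.settings.get('FILES_STORE')
        for file_info in item.get('files', []):
            # file_info['path'] is relative to FILES_STORE, e.g. 'full/<sha1>.gif'
            old_path = os.path.join(files_store, file_info['path'])
            new_path = os.path.join(files_store, 'my_picture.gif')  # example new name
            os.rename(old_path, new_path)
        return item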

Modify request header

Reading this far, you may have a question: when Scrapy uses FilesPipeline, does it add request headers? And which request headers does it use?

In fact, Scrapy does not set any special request headers when it uses FilesPipeline and ImagesPipeline. So if a website inspects the request headers of requests for pictures or files, it can immediately tell that the request was initiated by Scrapy.

To prove this, we can look at the source code of FilesPipeline:

In the scrapy/pipelines/files.py file, you can see that FilesPipeline constructs the request object for the picture through the get_media_requests method. This request object does not set any request headers.

In older versions of Scrapy the code differs slightly; in newer versions, get_media_requests looks like this:

def get_media_requests(self, item, info):
    urls = ItemAdapter(item).get(self.files_urls_field, [])
    return [Request(u) for u in urls]

To add request headers manually, we can write a pipeline that inherits from FilesPipeline and overrides its get_media_requests method.
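
A minimal sketch (the class name HeaderedFilesPipeline and the User-Agent value are just examples):

from itemadapter import ItemAdapter
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline

class HeaderedFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        headers = {
            # an example browser User-Agent; use whatever your target site expects
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        }
        return [Request(u, headers=headers) for u in urls]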

Note that in practice, you may also add Host and Referer.

Then modify ITEM_PIPELINES in settings.py to point to our customized pipeline:
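
For example, assuming the project package is called myproject:

ITEM_PIPELINES = {
    'myproject.pipelines.HeaderedFilesPipeline': 1,
}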

In this way, the requests initiated by FilesPipeline will carry the request headers we set.

Thank you for reading. The above is the content of "how to correctly use the FilesPipeline that comes with Scrapy". After studying this article, I believe you have a deeper understanding of how to correctly use the FilesPipeline that comes with Scrapy, though the specific usage still needs to be verified in practice. The editor will continue to push more articles on related knowledge points for you; welcome to follow!
