What are the three ways to store and access Python pictures?

2025-02-26 Update From: SLTechnology News&Howtos


This article covers three ways of storing and accessing image data in Python: as PNG files on disk, in LMDB, and in HDF5. The content is detailed and the logic is clear; I hope you get something out of it. Let's take a look.

Preface

ImageNet is a well-known public image database used to train models for object classification, detection and segmentation tasks. It contains more than 14 million images.

When working with image data in Python, for example when training a convolutional neural network (CNN) on a large image dataset, we need to know how to store and read the data in the simplest way.

The comparison of image data handling should be quantitative: how long it takes to read and write files, and how much disk space each method uses.
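Before looking at the three methods, here is a minimal sketch of how a single read/write operation can be timed with timeit, the same pattern this article's benchmarks use. The toy workload and file name are hypothetical, not part of the original code.

```python
import os
import tempfile
from timeit import timeit

# Hypothetical toy workload: write 1 KB to a file, then read it back
path = os.path.join(tempfile.mkdtemp(), "toy.bin")

def write_and_read():
    with open(path, "wb") as f:
        f.write(b"\x00" * 1024)
    with open(path, "rb") as f:
        return f.read()

# number=1 times a single run; the result t is the elapsed seconds
t = timeit(write_and_read, number=1)
```

The benchmarks later in the article use the same `timeit(..., number=1)` call, passing the statement as a string together with `setup` and `globals`.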

Different approaches are then used to handle image storage and to optimize its performance.

Prepare a dataset to experiment with.

The well-known image dataset CIFAR-10 consists of 60000 32x32-pixel color images belonging to different object categories, such as dogs, cats, and airplanes. CIFAR is not a very large dataset, but the full TinyImages dataset would need about 400 GB of available disk space.

The code in this article works with the CIFAR-10 dataset, which can be downloaded from its official page.

This data is serialized and saved in batches using cPickle. The pickle module can serialize any Python object without any additional code or conversion. However, it has a potentially serious drawback: unpickling can execute arbitrary code, so it is a security risk when handling data from untrusted sources.
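As a quick illustration of the serialize/deserialize round trip that pickle provides, here is a minimal sketch; the record contents are hypothetical stand-ins for an image label and pixel values.

```python
import pickle

# Stand-in record: a label plus a few pixel values (hypothetical data)
record = {"label": 3, "pixels": [255, 0, 128]}

blob = pickle.dumps(record)    # serialize the object to bytes
restored = pickle.loads(blob)  # deserialize; only unpickle data you trust
```

The restored object compares equal to the original, which is why pickle is convenient for storing arbitrary Python objects such as the CIFAR_Image class used later in this article.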

Load the image into the NumPy array

import numpy as np
import pickle
from pathlib import Path

# Path to the unpacked CIFAR-10 batches
data_dir = Path("data/cifar-10-batches-py/")

# Decoding function
def unpickle(file):
    with open(file, "rb") as fo:
        return pickle.load(fo, encoding="bytes")

images, labels = [], []
for batch in data_dir.glob("data_batch_*"):
    batch_data = unpickle(batch)
    for i, flat_im in enumerate(batch_data[b"data"]):
        im_channels = []
        # Each image is flattened, with channels in the order R, G, B
        for j in range(3):
            im_channels.append(
                flat_im[j * 1024 : (j + 1) * 1024].reshape((32, 32))
            )
        # Rebuild the original image
        images.append(np.dstack(im_channels))
        # Save the label
        labels.append(batch_data[b"labels"][i])

print("Loaded CIFAR-10 training set:")
print(f"- np.shape(images): {np.shape(images)}")
print(f"- np.shape(labels): {np.shape(labels)}")

Settings for image storage

Install the third-party library Pillow for image processing.

pip install Pillow

LMDB

LMDB, sometimes called the "Lightning Database", stands for Lightning Memory-Mapped Database, because it is fast and uses memory-mapped files. It is a key-value store, not a relational database.

Install the third-party library lmdb.

pip install lmdb

HDF5

HDF5 stands for Hierarchical Data Format, a file format whose versions are called HDF4 and HDF5. It originated at the US National Center for Supercomputing Applications and is a portable, compact scientific data format.

Install the third-party library h5py.

pip install h5py

Storage of a single image

Set up the paths for the three different methods.

from pathlib import Path

disk_dir = Path("data/disk/")
lmdb_dir = Path("data/lmdb/")
hdf5_dir = Path("data/hdf5/")

Create the three directories, one per storage method, so the data can be saved separately.

disk_dir.mkdir(parents=True, exist_ok=True)
lmdb_dir.mkdir(parents=True, exist_ok=True)
hdf5_dir.mkdir(parents=True, exist_ok=True)

Store to disk

The input is a single image, image, held in memory as a NumPy array, plus a unique image ID, image_id; Pillow performs the save.

Save a single image to disk

from PIL import Image
import csv

def store_single_disk(image, image_id, label):
    """Store a single image on disk as a .png file.

    Parameters:
    - image     image array in (32, 32, 3) format
    - image_id  unique integer ID for the image
    - label     image label
    """
    Image.fromarray(image).save(disk_dir / f"{image_id}.png")

    with open(disk_dir / f"{image_id}.csv", "wt") as csvfile:
        writer = csv.writer(
            csvfile, delimiter=" ", quotechar="|", quoting=csv.QUOTE_MINIMAL
        )
        writer.writerow([label])

Store to LMDB

LMDB is a key-value store in which each entry is saved as a byte array; the key will be a unique identifier for each image, and the value will be the image itself.

Both the key and the value are expected to be strings, so the common practice is to serialize the value into a string and then deserialize it when reading it back.

The image size is needed for reconstruction, and some datasets contain images of different sizes, so the following class stores it alongside the pixel data.

class CIFAR_Image:
    def __init__(self, image, label):
        self.channels = image.shape[2]
        self.size = image.shape[:2]
        self.image = image.tobytes()
        self.label = label

    def get_image(self):
        """Return the image as a NumPy array."""
        image = np.frombuffer(self.image, dtype=np.uint8)
        return image.reshape(*self.size, self.channels)

Save a single image to LMDB

import lmdb
import pickle

def store_single_lmdb(image, image_id, label):
    """Store a single image in LMDB.

    Parameters:
    - image     image array in (32, 32, 3) format
    - image_id  unique integer ID for the image
    - label     image label
    """
    map_size = image.nbytes * 10

    # Create a new LMDB environment
    env = lmdb.open(str(lmdb_dir / "single_lmdb"), map_size=map_size)

    # Start a new write transaction
    with env.begin(write=True) as txn:
        # All key-value pairs need to be strings
        value = CIFAR_Image(image, label)
        key = f"{image_id:08}"
        txn.put(key.encode("ascii"), pickle.dumps(value))
    env.close()

Store to HDF5

An HDF5 file can contain multiple datasets. You can create two datasets, one for the image and one for the metadata.

import h5py

def store_single_hdf5(image, image_id, label):
    """Save a single image to an HDF5 file.

    Parameters:
    - image     image array in (32, 32, 3) format
    - image_id  unique integer ID for the image
    - label     image label
    """
    # Create a new HDF5 file
    file = h5py.File(hdf5_dir / f"{image_id}.h5", "w")

    # Create datasets in the file
    dataset = file.create_dataset(
        "image", np.shape(image), h5py.h5t.STD_U8BE, data=image
    )
    meta_set = file.create_dataset(
        "meta", np.shape(label), h5py.h5t.STD_U8BE, data=label
    )
    file.close()

Comparison of storage methods

Put all three functions that save a single image into a dictionary.

_store_single_funcs = dict(
    disk=store_single_disk, lmdb=store_single_lmdb, hdf5=store_single_hdf5
)

Store the first image from CIFAR, together with its label, using each of the three methods.

from timeit import timeit

store_single_timings = dict()

for method in ("disk", "lmdb", "hdf5"):
    t = timeit(
        "_store_single_funcs[method](image, 0, label)",
        setup="image=images[0]; label=labels[0]",
        number=1,
        globals=globals(),
    )
    store_single_timings[method] = t
    print(f"Storage method: {method}, time used: {t}")

Let's have a table to see the comparison.

Storage method    Time to store    Disk usage
Disk              2.1 ms           8 K
LMDB              1.7 ms           32 K
HDF5              8.1 ms           8 K

Storing multiple images

Similar to the single-image case, the code is modified to store multiple images.

Adjusting the code for multiple images

Saving multiple images as .png files can be understood as calling store_single_method() multiple times. However, this does not work for LMDB or HDF5, since you do not want a separate database file for each image.

Save a set of images to disk

def store_many_disk(images, labels):
    """Parameters:
    - images  image array in (N, 32, 32, 3) format
    - labels  label array in (N, 1) format
    """
    num_images = len(images)

    # Save each image, one file per image
    for i, image in enumerate(images):
        Image.fromarray(image).save(disk_dir / f"{i}.png")

    # Save all labels to a csv file
    with open(disk_dir / f"{num_images}.csv", "w") as csvfile:
        writer = csv.writer(
            csvfile, delimiter=" ", quotechar="|", quoting=csv.QUOTE_MINIMAL
        )
        for label in labels:
            writer.writerow([label])

Save a set of images to LMDB

def store_many_lmdb(images, labels):
    """Parameters:
    - images  image array in (N, 32, 32, 3) format
    - labels  label array in (N, 1) format
    """
    num_images = len(images)
    map_size = num_images * images[0].nbytes * 10

    # Create a new LMDB database for all the images
    env = lmdb.open(str(lmdb_dir / f"{num_images}_lmdb"), map_size=map_size)

    # Write all the images in a single transaction
    with env.begin(write=True) as txn:
        for i in range(num_images):
            # All key-value pairs must be strings
            value = CIFAR_Image(images[i], labels[i])
            key = f"{i:08}"
            txn.put(key.encode("ascii"), pickle.dumps(value))
    env.close()

Save a set of images to HDF5

def store_many_hdf5(images, labels):
    """Parameters:
    - images  image array in (N, 32, 32, 3) format
    - labels  label array in (N, 1) format
    """
    num_images = len(images)

    # Create a new HDF5 file
    file = h5py.File(hdf5_dir / f"{num_images}_many.h5", "w")

    # Create datasets in the file
    dataset = file.create_dataset(
        "images", np.shape(images), h5py.h5t.STD_U8BE, data=images
    )
    meta_set = file.create_dataset(
        "meta", np.shape(labels), h5py.h5t.STD_U8BE, data=labels
    )
    file.close()

Prepare the dataset for comparison

Use 100000 images for testing

cutoffs = [10, 100, 1000, 10000, 100000]

# Double the dataset so there are 100000 images and labels
images = np.concatenate((images, images), axis=0)
labels = np.concatenate((labels, labels), axis=0)

# Make sure there are 100000 images and labels
print(np.shape(images))
print(np.shape(labels))

Create a calculation method for comparison

_store_many_funcs = dict(
    disk=store_many_disk, lmdb=store_many_lmdb, hdf5=store_many_hdf5
)

from timeit import timeit

store_many_timings = {"disk": [], "lmdb": [], "hdf5": []}

for cutoff in cutoffs:
    for method in ("disk", "lmdb", "hdf5"):
        t = timeit(
            "_store_many_funcs[method](images_, labels_)",
            setup="images_=images[:cutoff]; labels_=labels[:cutoff]",
            number=1,
            globals=globals(),
        )
        store_many_timings[method].append(t)

        # Print out the method, cutoff, and elapsed time
        print(f"Method: {method}, Time usage: {t}")

Plot a single graph with multiple datasets and matching legends.

import matplotlib.pyplot as plt

def plot_with_legend(
    x_range, y_data, legend_labels, x_label, y_label, title, log=False
):
    """Parameters:
    - x_range        list of x values
    - y_data         list of lists containing y values
    - legend_labels  list of string legend labels
    - x_label        x axis label
    - y_label        y axis label
    """
    plt.style.use("seaborn-whitegrid")
    plt.figure(figsize=(10, 7))

    if len(y_data) != len(legend_labels):
        raise TypeError("The number of datasets does not match the number of labels")

    all_plots = []
    for data, label in zip(y_data, legend_labels):
        if log:
            temp, = plt.loglog(x_range, data, label=label)
        else:
            temp, = plt.plot(x_range, data, label=label)
        all_plots.append(temp)

    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.legend(handles=all_plots)
    plt.show()

# Get the store timings data to display
disk_x = store_many_timings["disk"]
lmdb_x = store_many_timings["lmdb"]
hdf5_x = store_many_timings["hdf5"]

plot_with_legend(
    cutoffs,
    [disk_x, lmdb_x, hdf5_x],
    ["PNG files", "LMDB", "HDF5"],
    "Number of images",
    "Seconds to store",
    "Storage time",
    log=False,
)

plot_with_legend(
    cutoffs,
    [disk_x, lmdb_x, hdf5_x],
    ["PNG files", "LMDB", "HDF5"],
    "Number of images",
    "Seconds to store",
    "Log storage time",
    log=True,
)

Reading a single image

Read from disk

def read_single_disk(image_id):
    """Parameters:
    - image_id  unique integer ID for the image

    Returns:
    - image  image array in (32, 32, 3) format
    - label  image label
    """
    image = np.array(Image.open(disk_dir / f"{image_id}.png"))

    with open(disk_dir / f"{image_id}.csv", "r") as csvfile:
        reader = csv.reader(
            csvfile, delimiter=" ", quotechar="|", quoting=csv.QUOTE_MINIMAL
        )
        label = int(next(reader)[0])

    return image, label

Read from LMDB

def read_single_lmdb(image_id):
    """Parameters:
    - image_id  unique integer ID for the image

    Returns:
    - image  image array in (32, 32, 3) format
    - label  image label
    """
    # Open the LMDB environment read-only
    env = lmdb.open(str(lmdb_dir / "single_lmdb"), readonly=True)

    # Start a new read transaction
    with env.begin() as txn:
        # Encode the key the same way it was stored
        data = txn.get(f"{image_id:08}".encode("ascii"))
        # Load the saved CIFAR_Image object
        cifar_image = pickle.loads(data)
        # Retrieve the relevant bits
        image = cifar_image.get_image()
        label = cifar_image.label
    env.close()

    return image, label

Read from HDF5

def read_single_hdf5(image_id):
    """Parameters:
    - image_id  unique integer ID for the image

    Returns:
    - image  image array in (32, 32, 3) format
    - label  image label
    """
    # Open the HDF5 file
    file = h5py.File(hdf5_dir / f"{image_id}.h5", "r+")

    image = np.array(file["/image"]).astype("uint8")
    label = int(np.array(file["/meta"]).astype("uint8"))

    return image, label

Comparison of read methods

_read_single_funcs = dict(
    disk=read_single_disk, lmdb=read_single_lmdb, hdf5=read_single_hdf5
)

from timeit import timeit

read_single_timings = dict()

for method in ("disk", "lmdb", "hdf5"):
    t = timeit(
        "_read_single_funcs[method](0)",
        setup="image=images[0]; label=labels[0]",
        number=1,
        globals=globals(),
    )
    read_single_timings[method] = t
    print(f"Read method: {method}, time used: {t}")

Storage method    Time to read
Disk              1.7 ms
LMDB              4.4 ms
HDF5              2.3 ms

Reading multiple images

Reading multiple .png files can be understood as calling read_single_method() multiple times. However, this does not work for LMDB or HDF5, since you do not want a separate database file for each image.

Adjusting the code for multiple images

Read multiple images from disk

def read_many_disk(num_images):
    """Parameters:
    - num_images  number of images to read

    Returns:
    - images  image array in (N, 32, 32, 3) format
    - labels  label array in (N, 1) format
    """
    images, labels = [], []

    # Loop over all IDs, reading each image one by one
    for image_id in range(num_images):
        images.append(np.array(Image.open(disk_dir / f"{image_id}.png")))

    with open(disk_dir / f"{num_images}.csv", "r") as csvfile:
        reader = csv.reader(
            csvfile, delimiter=" ", quotechar="|", quoting=csv.QUOTE_MINIMAL
        )
        for row in reader:
            labels.append(int(row[0]))

    return images, labels

Read multiple images from LMDB

def read_many_lmdb(num_images):
    """Parameters:
    - num_images  number of images to read

    Returns:
    - images  image array in (N, 32, 32, 3) format
    - labels  label array in (N, 1) format
    """
    images, labels = [], []
    env = lmdb.open(str(lmdb_dir / f"{num_images}_lmdb"), readonly=True)

    # Start a new read transaction
    with env.begin() as txn:
        # Read all images in a single transaction;
        # this could also be split into multiple transactions
        for image_id in range(num_images):
            data = txn.get(f"{image_id:08}".encode("ascii"))
            # The CIFAR_Image object was stored as the value
            cifar_image = pickle.loads(data)
            # Retrieve the relevant bits
            images.append(cifar_image.get_image())
            labels.append(cifar_image.label)
    env.close()

    return images, labels

Read multiple images from HDF5

def read_many_hdf5(num_images):
    """Parameters:
    - num_images  number of images to read

    Returns:
    - images  image array in (N, 32, 32, 3) format
    - labels  label array in (N, 1) format
    """
    images, labels = [], []

    # Open the HDF5 file
    file = h5py.File(hdf5_dir / f"{num_images}_many.h5", "r+")

    images = np.array(file["/images"]).astype("uint8")
    labels = np.array(file["/meta"]).astype("uint8")

    return images, labels

_read_many_funcs = dict(
    disk=read_many_disk, lmdb=read_many_lmdb, hdf5=read_many_hdf5
)

Prepare the dataset for comparison

Create a calculation method for comparison

from timeit import timeit

read_many_timings = {"disk": [], "lmdb": [], "hdf5": []}

for cutoff in cutoffs:
    for method in ("disk", "lmdb", "hdf5"):
        t = timeit(
            "_read_many_funcs[method](num_images)",
            setup="num_images=cutoff",
            number=1,
            globals=globals(),
        )
        read_many_timings[method].append(t)

        # Print out the method, cutoff, and elapsed time
        print(f"Read method: {method}, No. images: {cutoff}, time: {t}")

Comparing read and write operations

View read and write times on the same chart

# Get the read timings data to display
disk_x_r = read_many_timings["disk"]
lmdb_x_r = read_many_timings["lmdb"]
hdf5_x_r = read_many_timings["hdf5"]

plot_with_legend(
    cutoffs,
    [disk_x_r, lmdb_x_r, hdf5_x_r, disk_x, lmdb_x, hdf5_x],
    ["Read PNG", "Read LMDB", "Read HDF5", "Write PNG", "Write LMDB", "Write HDF5"],
    "Number of images",
    "Seconds",
    "Store and Read Times",
    log=False,
)

Disk space used by the various storage methods

Although both HDF5 and LMDB take up more disk space than PNG files, it is important to note that the disk usage and performance of LMDB and HDF5 depend largely on a variety of factors, including the operating system and, more importantly, the size of the data stored.
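To measure disk usage yourself, a small helper that sums file sizes under a directory is enough. This is a sketch, not part of the original benchmarks; the helper name dir_size_kb and the demo files are hypothetical.

```python
import tempfile
from pathlib import Path

def dir_size_kb(path):
    # Sum the sizes of all regular files under `path`, in KB
    total = sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())
    return total / 1024

# Demo on a throwaway directory with two files of known size
demo = Path(tempfile.mkdtemp())
(demo / "a.bin").write_bytes(b"\x00" * 2048)
(demo / "b.bin").write_bytes(b"\x00" * 1024)

size_kb = dir_size_kb(demo)  # → 3.0
```

Running this helper on data/disk/, data/lmdb/, and data/hdf5/ after the storage experiments would give the disk-usage column of the comparison tables.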

Parallel operation

Usually for large datasets, parallelization can be used to speed up operations. This is what we often call concurrent processing.

Storing to disk as a .png file actually allows full concurrency. As long as the image names are different, you can read multiple images from different threads, or write to multiple files at a time.

If you divide all of CIFAR into ten groups and set up ten processes, each reading one group, the processing time can be reduced to roughly a tenth of the original.
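The split-into-ten-groups idea above can be sketched with concurrent.futures. This is a minimal illustration, not the article's benchmark code: a thread pool is used because PNG writes are I/O-bound, and the placeholder bytes stand in for real Image.save() calls; out_dir, save_chunk, and the fake image data are all hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

out_dir = Path(tempfile.mkdtemp())

def save_chunk(chunk):
    # Write each (image_id, data) pair as its own file; distinct file
    # names mean the workers never contend for the same file
    for image_id, data in chunk:
        (out_dir / f"{image_id}.png").write_bytes(data)

# 100 fake images split into ten groups, one worker per group
fake_images = [(i, b"fake-png-bytes") for i in range(100)]
chunks = [fake_images[i::10] for i in range(10)]

with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(save_chunk, chunks))

written = len(list(out_dir.glob("*.png")))  # → 100
```

For CPU-bound work (decoding, preprocessing), a ProcessPoolExecutor would be the analogous choice, matching the ten-process idea described above.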

That covers the three ways of storing and accessing images in Python. Thank you for reading; I hope you found it useful.
