A Brief Discussion of Deduplication and Compression Technology (1)
As someone working in the enterprise storage market, I have been confronted with deduplication and compression again and again over the past two years. Opinions differ on whether the technology is good or bad, a real demand or a pseudo-demand. Today I can only offer my personal view.
For more information, please follow the official account "new_storage"
What are deduplication and compression?
Deduplication and compression are two completely different techniques that solve different problems.
Deduplication: when many copies of the same data exist, only one copy is actually stored; every other duplicate block keeps just a reference (an address) pointing to that unique stored block.
Compression: mark recurring substrings of a long string with very short codes, then replace each occurrence with its code, reducing the space needed to express the data.
For example, let 1 stand for "AB", 2 for "CD", and 255 for "hanfute". Any value from 1 to 255 needs only 8 bits, while "AB", "CD" or "hanfute" take far more space, so after several passes of scanning and substitution the data shrinks quickly.
In plain words: deduplication means the same thing is stored only once, while compression transforms the data representation, using an algorithm to find patterns in the data so that it can be stored more compactly.
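To make the substitution idea concrete, here is a minimal Python sketch in the spirit of the "AB"/"CD"/"hanfute" example above. The code table and strings are purely illustrative; real storage arrays use general-purpose lossless algorithms (LZ-family, zlib and the like), not a hand-written table.

```python
# Toy dictionary-substitution compression following the example above.
CODE_TABLE = {b"AB": bytes([1]), b"CD": bytes([2]), b"hanfute": bytes([255])}

def compress(data: bytes) -> bytes:
    # Assumes the raw data never already contains the code bytes 1, 2 or 255.
    out = data
    for pattern, code in sorted(CODE_TABLE.items(), key=lambda kv: -len(kv[0])):
        out = out.replace(pattern, code)   # longest patterns first
    return out

def decompress(data: bytes) -> bytes:
    out = data
    for pattern, code in CODE_TABLE.items():
        out = out.replace(code, pattern)
    return out

raw = b"ABCDhanfuteABCD"
packed = compress(raw)
print(len(raw), "->", len(packed))         # 15 -> 5 bytes
assert decompress(packed) == raw
```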
Implementation of deduplication
Deduplication is relatively simple to implement. The simplest example is a mail server: if I forward an email to 100 people, 100 identical copies are generated when they receive it. Assume everyone's mailbox sits on shared storage. When each person saves the file, the storage only needs to check whether it already holds a copy; if so, it does not store it again. That way only one copy of the file is kept on the storage. This is the simplest way to understand it.
This involves several questions:
1. How does the storage know that it has already stored this file?
2. What if what is being stored is not a whole file, but blocks in block storage?
How does the storage know it already has this file?
In computing there is a term for this: the "fingerprint". The metaphor is vivid: just as everyone's fingerprint is different, can we use a small amount of data to uniquely identify a file?
There are many algorithms that can quickly produce such an (effectively) unique value, for example MD5 and SHA.
SHA is a one-way cryptographic hash: it can compute a fingerprint from the data, but the content cannot be recovered from the fingerprint.
SHA-1, for example, can turn any piece of data shorter than 2^64 bits into a 160-bit fingerprint that, for practical purposes, never repeats, and, just as importantly, the computation is very fast.
So to check whether two pieces of data are the same, you can compute their fingerprints and compare those instead of comparing the data byte by byte, which is far more efficient.
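As a concrete illustration, here is a minimal sketch of fingerprint comparison using Python's standard hashlib module. SHA-256 is my choice for the example; the point applies to any strong hash.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    # 32-byte SHA-256 digest, rendered as 64 hex characters.
    return hashlib.sha256(data).hexdigest()

a = b"the same 8KB block " * 431   # two equal buffers
b = b"the same 8KB block " * 431
c = b"a different block " * 431

# Compare short fingerprints instead of whole buffers.
print(fingerprint(a) == fingerprint(b))   # True  -> treat as duplicate
print(fingerprint(a) == fingerprint(c))   # False -> different data
```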
Could the fingerprint ever repeat, the way two people might somehow share the same fingerprint?
With SHA-256, the probability of any two fingerprints colliding among 4.8 * 10^29 data blocks is less than 10^-18, which corresponds to what we would call eighteen nines of reliability.
Let's put that into storage terms. Suppose our storage writes 100,000 files per second and runs 7x24, 365 days a year; it then writes 365 * 24 * 3600 * 100,000 ≈ 3.15 * 10^12 files per year. For the storage to see a hash collision that corrupts deduplicated data (i.e., for the probability to exceed 10^-18), it would have to run on the order of 1.52 * 10^17 years before encountering one.
In fact, the reliability of mainstream storage devices is generally 99.9999%, the "six nines" we often talk about, which is far lower than the reliability of the hash. Many people worry that deduplication will delete their data and corrupt it; in reality there is no need to worry.
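For anyone who wants to check the arithmetic, here is the back-of-the-envelope calculation, using the assumed figures from the text (100,000 files per second, and the 4.8 * 10^29 hash count at which SHA-256's collision probability reaches roughly 10^-18):

```python
# Back-of-the-envelope check of the figures quoted above.
writes_per_second = 100_000
files_per_year = 365 * 24 * 3600 * writes_per_second   # ~3.15e12 files
collision_count = 4.8e29   # hashes at which SHA-256 collision odds ~ 1e-18

years_needed = collision_count / files_per_year
print(f"{files_per_year:.2e} files per year")            # ~3.15e+12
print(f"{years_needed:.2e} years to reach that count")   # ~1.52e+17
```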
But some people will still worry. What then? One option is to compute and store two hash values, using two different algorithms, for each new block, and to compare both hashes whenever a duplicate candidate is found.
If even that is not reassuring enough, the remedy is also simple: for blocks whose fingerprints match, do a final byte-by-byte comparison before deduplicating, at a slight cost in performance.
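Here is a minimal sketch of that belt-and-braces write path: a fingerprint index plus an optional byte-by-byte check whenever a fingerprint match is found. The in-memory dictionary merely stands in for a real fingerprint store; the names are illustrative.

```python
import hashlib

store = {}   # fingerprint -> the one stored copy of that block

def write_block(block: bytes, verify: bool = True) -> str:
    fp = hashlib.sha256(block).hexdigest()
    if fp in store:
        # Paranoid path: rule out a hash collision with a byte-by-byte check.
        if verify and store[fp] != block:
            raise RuntimeError("fingerprint collision detected")
        return fp                      # duplicate: keep only a reference
    store[fp] = block                  # new data: store it once
    return fp

r1 = write_block(b"A" * 8192)
r2 = write_block(b"A" * 8192)          # identical block is deduplicated
print(r1 == r2, len(store))            # True 1
```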
What if it is not a file? How is this done in block storage?
In block storage, deduplication can be implemented in several ways.
The simplest and most basic way is fixed-length deduplication: incoming data is sliced into fixed-length chunks, each chunk is hashed, and then written. Chunks that are not duplicates are written normally; duplicate chunks are recorded only as references.
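A minimal sketch of that fixed-length write path might look like this: slice the write into fixed 8KB chunks, hash each chunk, store only the unique chunks, and keep references for everything else. The chunk size and data structures are illustrative.

```python
import hashlib

CHUNK = 8 * 1024                 # fixed 8KB slices
unique_chunks = {}               # fingerprint -> chunk data (stored once)

def dedup_write(data: bytes) -> list:
    refs = []                    # what the volume records: one reference per slice
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        fp = hashlib.sha256(chunk).hexdigest()
        unique_chunks.setdefault(fp, chunk)   # write only if not seen before
        refs.append(fp)
    return refs

refs = dedup_write(b"X" * (CHUNK * 4))        # four identical 8KB slices
print(len(refs), "logical slices,", len(unique_chunks), "physical slice stored")
```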
However, the deduplication ratio of this approach is relatively low. For example, if we add just one character to a file and rewrite it, fixed-length slicing shifts the chunk boundaries, no identical chunks can be found, and nothing gets deduplicated. For this reason the industry also has many variable-length deduplication algorithms.
Variable-length deduplication, however, is demanding on the algorithm and on CPU and memory, which hurts real-time data processing; after all, a storage array's main job is responding to host read and write IO. It is therefore used mostly in backup and archiving, where space savings matter far more than fast response.
As an illustration, variable-length deduplication can reach a ratio of around 10:1, while fixed-length deduplication reaches only about 3:1.
Therefore, for all-flash storage with strict response requirements, fixed-length deduplication is recommended because it is fast; for cold storage such as archiving and backup, variable-length deduplication is recommended because its higher deduplication ratio saves more cost.
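For comparison, here is a toy sketch of content-defined (variable-length) chunking: the chunk boundary is chosen from the data content through a rolling window, so an insertion only disturbs nearby boundaries instead of every boundary after it. The window size, mask and size limits are made-up values, and Python's built-in hash() is only a stand-in for a proper deterministic rolling hash such as a Rabin fingerprint.

```python
WINDOW = 48                      # bytes fed into the boundary test
MASK = 0x1FFF                    # boundary when (hash & MASK) == 0, ~8KB average
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

def cdc_chunks(data: bytes):
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size < MIN_CHUNK:
            continue
        window = data[max(start, i - WINDOW + 1):i + 1]
        if (hash(window) & MASK) == 0 or size >= MAX_CHUNK:
            yield data[start:i + 1]          # boundary chosen by content
            start = i + 1
    if start < len(data):
        yield data[start:]                   # final partial chunk

sizes = [len(c) for c in cdc_chunks(bytes(range(256)) * 1024)]   # 256KB sample
print(len(sizes), "chunks, first sizes:", sizes[:5])
```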
Deduplication summary
In fact, deduplication is not very useful in the all-flash market, because in many cases the effect of fixed-length deduplication is very limited; in a typical database scenario, for example, the deduplication ratio is only about 1.05:1, which is almost negligible.
Compression is more effective for all-flash arrays, so let's take a look at compression technology.
Implementation of compression
Compression has a long history and falls into two categories: lossless and lossy.
Lossy compression is mainly used in image processing. For example, when I send a photo over WeChat, a 10MB high-definition picture on my phone arrives on my friend's phone as a 300KB picture; this mainly saves network traffic and WeChat's own storage space.
The compression used in storage systems is lossless. Since the algorithms are widely available, the mainstream storage vendors barely differ in the algorithms themselves; what differs is how compression is implemented, which is mainly a trade-off between performance and data reduction ratio.
How much does compression affect storage performance?
Based on the EMC Unity Sizer performance evaluation tool, we can roughly see that, compared with leaving compression off, enabling it reduces IOPS from about 200,000 to about 120,000, a performance drop of roughly 40%.
In fact, compression algorithms are already built into the latest Intel CPUs. I recently asked our test manager privately for the numbers: in a full-load stress test with compression enabled, storage CPU utilization was 75%, and less than 3% of the CPU was spent on compression itself. So why does storage performance drop so much?
The ROW architecture required by compression reduces efficiency
In traditional storage, when compression is not needed, every piece of data has its own fixed address on the hard disks. For example, LUN1's LBA range 0x64~0x128 lives at a fixed, contiguous location starting at a known offset on a few sectors of disk 5. If the storage uses 8KB as its minimum block size, then every 8KB logical block sits at a specific, fixed 8KB physical location, which it owns exclusively from the moment it is first written.
No matter how that 8KB is later rewritten or read, it remains 8KB, and recording where the data lives is very simple. If a LUN is 1TB in total, I only record how that 1TB is distributed across the disks, and the physical address on each disk can be worked out with a very simple algorithm: knowing how many disks there are, how many RAID groups, and each RAID group's stripe depth and starting address is enough to compute any block's physical address on the fly in memory.
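That "very simple algorithm" can be pictured like this: with a fixed layout, the physical location of any logical block is pure arithmetic over a handful of constants, so no per-block mapping table is needed. The disk count and stripe depth below are invented for the example.

```python
BLOCK = 8 * 1024        # 8KB logical block size
STRIPE_DEPTH = 8        # consecutive blocks placed on one disk per stripe
NUM_DISKS = 5           # data disks in the RAID group (parity ignored here)

def physical_location(lba_block: int):
    # Pure arithmetic: no per-block lookup table is required.
    stripe_row, within = divmod(lba_block, STRIPE_DEPTH * NUM_DISKS)
    disk = within // STRIPE_DEPTH
    block_on_disk = stripe_row * STRIPE_DEPTH + within % STRIPE_DEPTH
    return disk, block_on_disk * BLOCK        # (disk index, byte offset)

print(physical_location(0))    # (0, 0)
print(physical_location(9))    # (1, 8192): second disk, its second 8KB block
```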
This basic write mode is called COW (copy on write), literally "copy before writing".
Traditional RAID means that rewriting even a single bit requires reading the original data and the parity at the same time, recomputing in memory, and then writing back. One reason for the read is so that the data can be restored if the write fails.
"Copy before write" does not actually refer to that read-modify-write behavior, though; it refers to how writes are handled when a snapshot of the data exists. The snapshot's data must not be destroyed, so the old data is first copied from its original location into a dedicated snapshot area before being overwritten. That is COW, a term coined in contrast to ROW (redirect on write).
Many people in China jokingly call COW the "kao" (靠) architecture, after its pronunciation.
Once compressed, an 8KB block may become 1KB, 2KB, 3KB, or stay 8KB, so the data becomes variable-length. If I keep the one-to-one mapping between logical and physical addresses, I get no space savings: I compress an 8KB block down to 1KB, yet you still allocate 8KB of physical space for it, which defeats the purpose. Therefore compression is generally implemented on a ROW architecture.
Where do ROW's performance penalties come from? (A sketch of the per-block mapping this implies follows the list below.)
1. In a ROW architecture, every block needs its own logical-to-physical address mapping entry, so the larger the capacity, the more metadata is generated and the worse the performance becomes. To handle the data efficiently it is clearly best to cache all of that metadata in memory, so ROW-based storage needs a lot of memory.
2. Because a ROW architecture must record address metadata for every write, and that metadata must be persisted for reliability, each write sends metadata to disk as well. One host write therefore turns into two operations: writing the metadata and writing the data.
3. Because ROW always writes data to a new address, logically contiguous data is progressively scattered, and what used to be sequential IO eventually becomes random IO, which hurts performance badly.
4. ROW brings another problem: once a block is overwritten and no snapshot references it, the old copy becomes invalid data, but we do not delete it immediately at write time because that would hurt performance. These stale blocks have to be dealt with when contiguous free space runs out or when the system is idle. This is what we call garbage collection, and it has a large performance impact; many vendors do not actively reclaim space at all and instead write directly into the holes. Either way, reusing garbage space is an expensive operation.
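Pulling points 1 to 4 together, here is a minimal sketch of the redirect-on-write idea: every overwrite is appended at a new physical location, only the mapping entry is updated, and the superseded extent is left behind as garbage for later collection. All names and structures are illustrative.

```python
mapping = {}     # logical block number -> (offset in log, stored length)
garbage = []     # superseded extents waiting for garbage collection
log_head = 0     # next free offset in the append-only data area

def row_write(lbn: int, compressed: bytes):
    global log_head
    if lbn in mapping:
        garbage.append(mapping[lbn])            # the old copy becomes garbage
    mapping[lbn] = (log_head, len(compressed))  # operation 1: metadata update
    log_head += len(compressed)                 # operation 2: append the data

row_write(7, b"x" * 1024)    # an 8KB block that compressed down to 1KB
row_write(7, b"y" * 2048)    # overwrite is redirected; the old 1KB is now garbage
print(mapping[7], garbage)   # (1024, 2048) [(0, 1024)]
```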
These problems are more pronounced with traditional hard disks, which is one of the reasons NetApp's performance was criticized in the HDD era.
Data handling inside an SSD is similar; the performance drop an SSD suffers once garbage collection kicks in is known as the "write cliff".
Compression summary:
The impact of compression on storage performance hardly comes from compression itself at all; it comes from the architecture required to implement compression.
Given the current software architectures and efficiency of mainstream storage vendors, a ROW architecture generally performs about 35% worse than a COW architecture, while compression itself costs less than 5%, so for the system as a whole, enabling compression reduces performance by roughly 40%.
What impact does deduplication have on a ROW architecture?
Compared with compression, which is computed in memory and then written straight out, deduplication has a bigger impact:
1. Separate space is needed to store the fingerprints (so the amount of storage capacity a given amount of memory can support keeps shrinking).
2. Every write requires a fingerprint lookup and comparison (read and write latency increases).
3. Writing a new data block is greatly amplified (one write to the fingerprint database, one for the data block itself, and one for the metadata record), so most of deduplication's performance cost shows up as added latency.
An extreme case: in an HDD environment, assume the fixed chunk size of our ROW system is 8KB. A 128KB write is sliced into 16 chunks, and with three disk operations per new chunk (fingerprint, data, metadata), the total latency can reach 48 times that of a single HDD access. If one HDD access takes 5ms, the whole IO takes more than 200ms to complete, which is almost unacceptable for SAN storage.
How to implement efficient deduplication and compression
Now that we understand how deduplication and compression affect performance, the question of how to reduce that impact is what we will cover in detail in the next article. Stay tuned.