
An Analysis of Database Compression Technology


Today I will talk about database compression technology, which many people may not understand well. To make it easier to follow, I have summarized the main points below; I hope you get something out of this article.

Given a fixed budget of system resources (CPU, memory, SSD, disk, etc.), a database wants to:

Store more data: this calls for compression, and there are many compression algorithms to choose from.

Faster access: faster compression (write) / decompression (read) algorithms, larger cache.

Almost all compression algorithms rely heavily on context:

Data adjacent to each other generally have higher correlation and greater inherent redundancy.

The larger the context, the higher the upper limit of the compression ratio (there is a limit).

Block compression

Block compression in traditional databases

For ordinary data block / file compression, traditional (streaming) compression algorithms work well and have a long history, and everyone is used to this mode of compression. Algorithms based on this pattern keep emerging, including the most widely used gzip and bzip2, Google's Snappy, the newcomer Zstd, and so on.

Gzip is supported on almost all platforms and has become an industry standard, with a balanced compression ratio, compression speed and decompression speed.

Bzip2 is based on the BWT (Burrows-Wheeler Transform); it essentially divides the input into blocks and compresses each block separately. Its advantage is a very high compression ratio, but compression and decompression are relatively slow.

Snappy comes from Google; its advantage is fast compression and decompression, its disadvantage a relatively low compression ratio, which makes it suitable for real-time scenarios that do not require a high compression ratio.

LZ4 is a strong competitor to Snappy and is even faster, especially at decompression.

Zstd is a newcomer, with a much higher compression ratio than LZ4 and Snappy at a slightly lower compression / decompression speed; compared with gzip, its compression ratio is about the same, but its compression / decompression speed is much higher.

For databases, in the early days of computing the I/O-optimized B-tree was unshakeable, and the B-tree block/page size tuned for disk is relatively large, which happens to give traditional compression algorithms a larger context, so block/page-based compression was naturally applied to all kinds of databases. In that era the gap between memory and disk in performance and capacity was clear-cut, application performance demands were relatively modest, and everyone lived with the arrangement.

Now that we have SSD, PCIe SSD, 3D XPoint, etc., and memory is getting larger and larger, the disadvantages of block compression are becoming more and more prominent:

If the block size is small, the compression ratio suffers; if the block size is large, the performance is unbearable.

What is more deadly, block compression only saves disk and SSD space, which is larger and cheaper.

The more expensive and scarcer memory is not only not saved, but wasted even more (the double-cache problem).

Therefore, for many applications with high real-time requirements, compression can only be turned off.

The principle of block compression

Block compression applies general-purpose compression (Snappy, LZ4, gzip, bzip2, Zstd, etc.) block by block / page by page. The block size is usually 4KB~32KB (TokuDB, known for its compression ratio, uses 2MB~4MB blocks); this block is a logical block, not the physical block of memory paging or block devices.

When compression is enabled, access speed slows down because:

When writing, many records are packed together and compressed into a block. Increasing the block size gives the compression algorithm a larger context and thus a higher compression ratio; conversely, reducing the block size reduces the compression ratio.

When reading, even if only a very short piece of data is needed, the whole block must be decompressed first and the data read from the decompressed result. So the larger the block size, the more records one block contains, the more unnecessary decompression is done just to read a single record, and the worse the performance; conversely, the smaller the block size, the better the read performance (see the sketch below).
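A minimal sketch of this read amplification, using zlib as one possible block codec; the length-prefixed block layout and helper names here are illustrative, not any particular database's format:

```cpp
#include <zlib.h>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Pack many records into one block, then compress the whole block with zlib.
std::string compress_block(const std::vector<std::string>& records) {
    std::string raw;
    for (const auto& r : records) {
        uint32_t len = r.size();               // simple length-prefixed record layout
        raw.append(reinterpret_cast<const char*>(&len), sizeof(len));
        raw.append(r);
    }
    uLongf out_len = compressBound(raw.size());
    std::string out(out_len, '\0');
    if (compress(reinterpret_cast<Bytef*>(&out[0]), &out_len,
                 reinterpret_cast<const Bytef*>(raw.data()), raw.size()) != Z_OK)
        throw std::runtime_error("compress failed");
    out.resize(out_len);
    return out;
}

// Reading ONE record still requires decompressing the WHOLE block first.
std::string read_record(const std::string& block, uLong raw_size, size_t index) {
    std::string raw(raw_size, '\0');
    uLongf len = raw_size;
    if (uncompress(reinterpret_cast<Bytef*>(&raw[0]), &len,
                   reinterpret_cast<const Bytef*>(block.data()), block.size()) != Z_OK)
        throw std::runtime_error("uncompress failed");
    size_t pos = 0;
    for (size_t i = 0; ; ++i) {                // linear scan over decompressed records
        uint32_t rlen;
        if (pos + sizeof(rlen) > len) throw std::out_of_range("record index out of range");
        std::memcpy(&rlen, raw.data() + pos, sizeof(rlen));
        pos += sizeof(rlen);
        if (i == index) return raw.substr(pos, rlen);
        pos += rlen;
    }
}
```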

Once compression is enabled, to alleviate the above problems, traditional databases generally need a large dedicated cache for the decompressed data, which greatly improves hot-data access performance but creates the double-cache space problem: one copy is the compressed data in the operating system cache; the other is the decompressed data in the dedicated cache (such as the DB Cache in RocksDB). There is also an equally serious problem: the dedicated cache is, after all, a cache, and on a cache miss the whole block still has to be decompressed, which is a major source of slow queries (the other major source being misses in the operating system cache).

All this means that with existing traditional databases, access speed and space usage cannot both be fully satisfied, and we can only accept some compromise.

Block Compression of RocksDB

Take RocksDB as an example. BlockBasedTable in RocksDB is a block-compressed SSTable. With block compression, the index only locates the block; the block size is set in the options, and one block contains multiple (key, value) records, say M of them, so the index size is reduced to 1/M:

The larger M, the smaller the size of the index.

The larger the M, the larger the size of the Block, the larger the context available to the compression algorithms (gzip, Snappy, etc.), and the higher the compression ratio.

When creating a BlockBasedTable, Key-Value pairs are filled into a buffer one by one. When the buffer reaches the predetermined size (the block size; in general the buffer size will not be exactly equal to the preset block size), the buffer is compressed and written to the BlockBasedTable file, and the file offset and the first Key in the buffer (used for creating the index) are recorded. If a single record is larger than the preset block size, that record occupies a block by itself (a single record, no matter how large, is never split across multiple blocks). After all Key-Value pairs have been written, an index is created from the previously recorded starting Key and file offset of each block. So in a BlockBasedTable file the data comes before the index, and the end of the file contains the meta information (playing the role of the common FileHeader, but located at the end of the file, hence called the footer).
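A schematic sketch of that write path (toy record framing, a stand-in for the real codec, and invented names; this is not RocksDB's actual code):

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Records are buffered; each full buffer is compressed and written as one block,
// (first key, file offset) pairs are remembered to build the index at the end,
// and a footer with meta information closes the file.
class BlockTableBuilder {
public:
    explicit BlockTableBuilder(size_t block_size) : block_size_(block_size) {}

    void Add(const std::string& key, const std::string& value) {
        if (buffer_.empty()) first_key_ = key;           // first key of the current block
        buffer_ += key + '\t' + value + '\n';            // toy record framing
        if (buffer_.size() >= block_size_) FlushBlock(); // an oversized record simply
    }                                                    // occupies a block by itself

    void Finish() {
        if (!buffer_.empty()) FlushBlock();
        size_t index_offset = file_.size();
        for (const auto& e : index_)                     // data first, index after it
            file_ += e.first + '\t' + std::to_string(e.second) + '\n';
        file_ += "footer:index_offset=" + std::to_string(index_offset) + '\n';
    }

    const std::string& file() const { return file_; }

private:
    void FlushBlock() {
        std::string block = Compress(buffer_);
        index_.emplace_back(first_key_, file_.size());   // remember (first key, offset)
        file_ += block;
        buffer_.clear();
    }
    static std::string Compress(const std::string& raw) {
        return raw;  // stand-in: a real builder would call Snappy/Zstd/gzip here
    }

    size_t block_size_;
    std::string buffer_, first_key_, file_;
    std::vector<std::pair<std::string, size_t>> index_;
};
```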

When searching, the index is first used to find the block where the search key may reside, then that block is looked up in the DB Cache and the search key is searched for within it. If the block is not in the cache, it is read from disk / SSD, decompressed and put into the DB Cache. RocksDB has many DB Cache implementations, including LRU Cache, Clock Cache, Counting Cache (used to count cache hit rates, etc.), and other special caches.
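A matching schematic of the read path (again a toy, not RocksDB internals), showing that a cache miss forces the whole block to be read and decompressed before one key can be found:

```cpp
#include <map>
#include <optional>
#include <string>
#include <unordered_map>

struct BlockHandle { size_t offset = 0, size = 0; };

class BlockTableReader {
public:
    BlockTableReader(std::string file, std::map<std::string, BlockHandle> index)
        : file_(std::move(file)), index_(std::move(index)) {}

    std::optional<std::string> Get(const std::string& key) {
        auto it = index_.upper_bound(key);            // block whose first key <= key
        if (it == index_.begin()) return std::nullopt;
        --it;
        const BlockHandle& h = it->second;
        auto hit = cache_.find(h.offset);             // 1) DB cache lookup
        if (hit == cache_.end()) {                    // 2) miss: read + decompress block
            std::string raw = Decompress(file_.substr(h.offset, h.size));
            hit = cache_.emplace(h.offset, std::move(raw)).first;
        }
        return FindInBlock(hit->second, key);         // 3) search inside the block
    }

private:
    static std::string Decompress(const std::string& b) { return b; }  // codec stand-in
    static std::optional<std::string> FindInBlock(const std::string& block,
                                                  const std::string& key) {
        // toy framing: one "key\tvalue\n" record per line
        for (size_t pos = 0; pos < block.size();) {
            size_t tab = block.find('\t', pos);
            size_t nl  = block.find('\n', pos);
            if (tab == std::string::npos || nl == std::string::npos) break;
            if (block.compare(pos, tab - pos, key) == 0)
                return block.substr(tab + 1, nl - tab - 1);
            pos = nl + 1;
        }
        return std::nullopt;
    }

    std::string file_;
    std::map<std::string, BlockHandle> index_;         // first key -> block handle
    std::unordered_map<size_t, std::string> cache_;    // simplified DB cache
};
```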

In general the operating system also has a file cache, so the same piece of data may exist both in the DB Cache (decompressed) and in the operating system cache (compressed). This wastes memory, so RocksDB offers a tradeoff: setting the direct I/O options bypasses the operating system cache, so that only the DB Cache remains, which saves some memory but degrades performance to some extent.
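As a hedged illustration, recent RocksDB versions expose direct-I/O options roughly like the following (exact field names may vary by version; check your rocksdb/options.h):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Bypass the OS page cache so compressed blocks are not cached twice.
int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.use_direct_reads = true;                        // reads bypass the OS cache
    options.use_direct_io_for_flush_and_compaction = true;  // flush/compaction writes too

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
    if (s.ok()) delete db;
    return s.ok() ? 0 : 1;
}
```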

Traditional non-mainstream compression: FM-Index

FM-Index, whose full name is Full Text Matching Index, belongs to the Succinct Data Structure family, has a certain ability to compress data, and can perform search and access directly on compressed data.

FM-Index is feature-rich and has a long history; it is not a new technology and has long been used in some special scenarios, but for various reasons it has remained lukewarm. In recent years FM-Index has become somewhat more active: for one thing, a well-known developer on GitHub has implemented a full suite of Succinct algorithms, including FM-Index; for another, Berkeley's Succinct project also uses FM-Index.

FM-Index is an offline algorithm (all data is compressed in one pass and cannot be modified afterwards). It is generally based on the BWT transform (which in turn is based on a suffix array). A compressed FM-Index supports the following two main operations:

Data = extract (offset, length)

{offset} = search (string), which returns all positions / offsets at which string occurs

FM-Index also supports more other operations, and friends who are interested can do further research.
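Expressed as a hypothetical C++ interface (not the API of any particular FM-Index library), the two core operations look like this:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical interface sketch of the two core FM-Index operations described above.
class FMIndex {
public:
    // extract: rebuild a slice of the original text directly from the compressed index.
    std::string extract(uint64_t offset, uint64_t length) const;

    // search: return every offset at which `pattern` occurs in the indexed text.
    std::vector<uint64_t> search(const std::string& pattern) const;
};
```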

However, in my opinion, FM-Index has several fatal shortcomings:

The implementation is very complicated (this can be overcome by a few experts, so let's set it aside)

Low compression ratio (much worse than streaming compression such as gzip)

Search (search) and access (extract) are slow (on the fastest CPU of 2016, the i7-6700K, single-thread throughput does not exceed 7MB/s)

The compression process is slow and memory-consuming (the memory consumption of Berkeley's Succinct compression process is more than 50 times that of the source data)

The data model is Flat Text, not the KeyValue model of the database.

You can convert the Flat Text model into a KeyValue model in a simple way: pick a character "#" that appears in neither Key nor Value (if no such character exists, escape encoding is needed), insert it before and after each Key, and let the Value follow the Key. Then search (#key#) returns the locations where #key# appears, from which the Value is easily obtained.
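A toy illustration of that mapping, faithful to the description above; std::string::find and substr stand in for the FM-Index search and extract operations, which a real implementation would run directly on the compressed data:

```cpp
#include <string>
#include <utility>
#include <vector>

// Layout: #key#value#key#value...  '#' must not occur in keys or values
// (otherwise escape it); as described, a value that happens to equal some key
// could also match, so this is only an illustrative mapping.
std::string build_flat_text(const std::vector<std::pair<std::string, std::string>>& kvs) {
    std::string flat;
    for (const auto& kv : kvs)
        flat += "#" + kv.first + "#" + kv.second;
    return flat;
}

std::string lookup(const std::string& flat, const std::string& key) {
    std::string pattern = "#" + key + "#";
    size_t pos = flat.find(pattern);                 // stand-in for search("#key#")
    if (pos == std::string::npos) return "";
    size_t value_begin = pos + pattern.size();
    size_t value_end = flat.find('#', value_begin);  // value runs until the next '#'
    if (value_end == std::string::npos) value_end = flat.size();
    return flat.substr(value_begin, value_end - value_begin);  // stand-in for extract()
}
```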

Berkeley's Succinct project has implemented a richer Row-Column model on top of FM-Index's Flat Text model; it has made great efforts and achieved certain results, but it is still far from practical.

Interested friends can carefully investigate the FM-Index to verify the author's summary and judgment.

Terark's Searchable Compression

Terark put forward the concept of "searchable compression (Searchable Compression)". Its core is to perform search and access (extract) directly on compressed data, with the KeyValue model as the data model itself; according to its test reports, it is much faster than FM-Index (by two orders of magnitude). Specifically:

Abandon traditional databases' block compression and adopt global compression

Use different global compression techniques for Key and Value

For Keys, use CO-Index, a global compression technique with search capability (corresponding to FM-Index's search)

For Values, use PA-Zip, a global compression technique with point access (corresponding to FM-Index's extract).

Compression of Key: CO-Index

We need to index the Key in order to search effectively and access the data we need.

With ordinary indexing techniques, the index is much larger than the original Keys it contains. Some indexes use prefix compression, which alleviates the index bloat to a certain extent, but still cannot solve the problem of the index taking up too much memory.

We proposed the concept of CO-Index (Compressed Ordered Index) and put it into practice through a data structure called Nested Succinct Trie.

Compared with traditional index data structures, a Nested Succinct Trie occupies ten or even dozens of times less space. While maintaining this compression ratio, it also supports rich search functions (a hypothetical interface sketch follows the list):

Precise search

Range search

Sequential traversal

Prefix search

Regular expression search (without enumerating keys one by one).
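As a rough illustration of these capabilities, here is a hypothetical C++ interface sketch for a CO-Index; the class and method names are invented for illustration and are not Terark's actual API:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical CO-Index interface: keys are stored compressed and ordered,
// and each search returns internal record IDs rather than the keys themselves.
class COIndex {
public:
    // Precise search: the record ID for `key`, or UINT64_MAX if absent.
    uint64_t find(const std::string& key) const;

    // Range search: IDs of all keys in [lo, hi).
    std::vector<uint64_t> range(const std::string& lo, const std::string& hi) const;

    // Prefix search: IDs of all keys starting with `prefix`.
    std::vector<uint64_t> prefix(const std::string& prefix) const;

    // Regular-expression search, evaluated on the trie without enumerating all keys.
    std::vector<uint64_t> regex(const std::string& pattern) const;

    // Sequential (ordered) traversal would be exposed via an iterator, omitted here.
};
```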

CO-Index also has its advantages over FM-Index (assuming that all the data in FM-Index is Key).

Table 1 FM-Index vs. CO-Index

The principle of CO-Index

In fact, we have implemented many kinds of CO-Index, of which Nested Succinct Trie is the most widely applicable. Here is a brief introduction to its principle:

Succinct Data Structure introduction

A Succinct Data Structure expresses an object in space close to the information-theoretic lower bound; it is usually represented with bitmaps and navigated via rank and select operations on those bitmaps.

Although the memory footprint can be greatly reduced, succinct structures are more complex to implement and noticeably slower (the constant factor of the time complexity is large). SDSL-Lite is an open-source implementation; we use our own implementation of Rank-Select, whose performance is higher than the open-source one.
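To make rank and select concrete, here is a minimal, deliberately naive sketch of the two primitives on a plain bitmap; real Succinct implementations answer both in O(1) using small precomputed directories rather than linear scans:

```cpp
#include <cstdint>
#include <vector>

class BitVector {
public:
    explicit BitVector(std::vector<bool> bits) : bits_(std::move(bits)) {}

    // rank1(i): number of 1-bits in positions [0, i).
    size_t rank1(size_t i) const {
        size_t r = 0;
        for (size_t k = 0; k < i; ++k) r += bits_[k];
        return r;
    }

    // select1(n): position of the n-th 1-bit (1-based), or npos if there is none.
    size_t select1(size_t n) const {
        size_t seen = 0;
        for (size_t k = 0; k < bits_.size(); ++k)
            if (bits_[k] && ++seen == n) return k;
        return static_cast<size_t>(-1);
    }

private:
    std::vector<bool> bits_;
};
```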

Take a binary tree as an example

The traditional representation gives each node two pointers: struct Node { Node *left, *right; };

Each node takes up two pointers. If we optimize the traditional layout by expressing each pointer with the minimum number of bits, then with N nodes each node needs 2 * ceil(log2(N)) bits.

Comparing the traditional basic version with the traditional optimized version, assume there are 2^16 nodes (including null nodes): the optimized version needs 2 bytes per pointer, while the basic version needs 4 bytes (32-bit) or 8 bytes (64-bit) per pointer.

Comparing the traditional optimized version with Succinct, assume there are about 1 billion (2^30) nodes.

In the traditional optimized version, each pointer occupies ceil(log2(2^30)) = 30 bits, and the total memory footprint is (2 * 30 / 8) * 2^30 ≈ 7.5GB.

Using Succinct, the footprint is (2.5 / 8) * 2^30 ≈ 312.5MB (about 2.5 bits per node, of which 0.5 bits is the space occupied by the rank-select index).

Succinct Tree

There are many expressions for Succinct Tree, and here are two common ones:

Figure 1 example of Succinct Tree expression

Succinct Trie = Succinct Tree + Trie Label

A Trie can be used to implement an index. The Succinct Trie in figure 2 is expressed in LOUDS encoding and stores the four Keys: hat, is, it, and a.

Patricia Trie plus nesting

Using Succinct technology alone, the compression ratio is far from enough, so path compression and nesting are applied, which lifts the compression ratio to a new level.

Putting all the above techniques together gives us the Nested Succinct Trie.

Compression of Value: PA-Zip

We have developed a compression technology called PA-Zip (Point Accessible Zip): each piece of data is associated with an ID, and after the data is compressed, the corresponding ID can be used to access that data. Here, ID is the Point, so it's called Point Accessible Zip.

PA-Zip compresses all Values in the entire database (the collection of all Values in the KeyValue database) globally, rather than block by block / page by page. It is a compression algorithm designed specifically for the needs of databases (the KeyValue model), to solve the problems of traditional database compression:

The compression ratio is higher, and there is no double-caching problem. The compressed data only needs to be loaded into memory, no dedicated cache is required, and a single record can be read directly by its ID. If reading a single record is regarded as a kind of decompression, then:

When decompressing in ID order, the decompression speed (throughput) is generally around 500MB/s (single thread), peaking at about 7GB/s, which suits offline analysis needs; traditional database compression can also do this.

When decompressing by random ID, the decompression speed is generally around 300MB/s (single thread), peaking at about 3GB/s, which suits online service needs; this is where it beats traditional database compression. At 300MB/s of random decompression, if the average record length is 1K, that is equivalent to QPS = 300,000; if the average record length is 300 bytes, that is equivalent to QPS = 1,000,000.

Prefetch (warmup): in some special scenarios the database may need warming up. Because the dedicated cache has been removed, TerarkDB's prefetch is relatively simple and efficient: once the mmap'ed memory is warmed up after the database is loaded (to avoid page faults), the database is preheated. The warmup throughput is the sequential-read IO performance of the SSD (newer SSDs exceed 3GB/s for sequential reads).

Compared with FM-Index, PA-Zip only addresses FM-Index's extract operation, but with much better performance and compression ratio:

Table 2 FM-Index vs. PA-Zip

Combining Key and Value

Keys are stored in the CO-Index in globally compressed form, and Values are stored in PA-Zip in globally compressed form. Searching for a Key yields an internal ID, and with this ID the corresponding Value is accessed in PA-Zip. Throughout the process, only the data actually needed is touched, and nothing else.
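A hypothetical sketch of that read path (invented class and method names, not Terark's actual API): the CO-Index turns a Key into an internal ID, and PA-Zip extracts exactly the Value for that ID:

```cpp
#include <cstdint>
#include <optional>
#include <string>

class COIndex {
public:
    static constexpr uint64_t kNotFound = UINT64_MAX;
    uint64_t find(const std::string& key) const;     // search on compressed keys
};

class PAZipStore {
public:
    std::string extract(uint64_t id) const;          // point access on compressed values
};

std::optional<std::string> Get(const COIndex& index, const PAZipStore& values,
                               const std::string& key) {
    uint64_t id = index.find(key);                   // Key -> internal ID
    if (id == COIndex::kNotFound) return std::nullopt;
    return values.extract(id);                       // only this one record is decompressed
}
```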

So no dedicated cache (such as the DB Cache in RocksDB) is needed; only mmap is used, shared with the file-system cache. The whole DB has just this one layer of cache (the mmap'ed file-system cache), which, together with the ultra-high compression ratio, greatly reduces memory usage and greatly simplifies the system, finally delivering a significant improvement in database performance: ultra-high compression ratio combined with ultra-high random-read performance.

From a higher, more philosophical level, our storage engine is very much a constructive design: CO-Index and PA-Zip work closely together, matching the KeyValue model exactly, with functionality that is "just enough", performance that squeezes the hardware to its limit, and a compression ratio that approaches the information-theoretic lower bound. Compared with the alternatives:

Traditional block compression comes from general-purpose streaming compression. Streaming compression offers only two operations, compress and decompress; it has no effect on data blocks that are too small, and it cannot compress redundancy across blocks. Applying it to a database takes a lot of engineering effort, like strapping a car onto airplane wings and then trying to make it fly.

In contrast, FM-Index is so feature-rich that it has to pay a price in compression ratio and performance. In the KeyValue model we need only a very small subset of its rich functionality (and even that subset has to be adapted and transformed); the other functions are useless to us, yet we would still pay their cost. It is like building an airplane at great expense and then keeping it on the ground, running on its wheels as a car.

Fig. 2 Succinct Tree expressed in LOUDS

Fig. 3 path compression and nesting

Appendix

Compression ratio & performance Test comparison

Dataset: Amazon movie data

Amazon movie data (~8 million reviews): the total dataset size is about 9GB, with about 8 million records and an average record length of about 1K.

The benchmark code is open source: see the GitHub repository (https://github.com/Terark/terarkdb-benchmark/tree/master/doc/movies).

Compression ratio (see figure 4)

Fig. 4 Compression ratio comparison

Random read (see figure 5)

Fig. 5 comparison of random read performance

This is the performance of each storage engine with sufficient memory.

Latency curve (see figure 6)

Fig. 6 comparison of latency curves

Dataset: Wikipedia English version

All text data of the English Wikipedia, 109GB, compressed to 23GB.

Dataset: TPC-H

On TPC-H's lineitem data, a comparison test of TerarkDB against the original RocksDB (BlockBasedTable):

Table 3 comparison test between TerarkDB and original RocksDB

API interface

TerarkDB = Terark SSTable + RocksDB

RocksDB began as Facebook's fork of Google's LevelDB; it is compatible with LevelDB at the programming-interface level and adds many improvements.

What makes RocksDB useful to us is that its SSTable format is pluggable, so we implemented our own RocksDB SSTable to bring our technical advantages into play through RocksDB.

Although RocksDB provides a fairly complete KeyValue DB framework, it still falls somewhat short of fully accommodating our unique technology, so we had to make some changes to RocksDB itself. We may eventually submit these changes to the official RocksDB repository.

GitHub link: TerarkDB (https://github.com/Terark/terarkdb). TerarkDB consists of two parts:

terark-zip-rocksdb (https://github.com/terark/terark-zip-rocksdb), the Terark SSTable for rocksdb

Terark's fork of rocksdb (https://github.com/Terark/rocksdb); this modified version of rocksdb must be used

For better compatibility, TerarkDB makes no changes to RocksDB's API. To make TerarkDB even easier to adopt, we even provide a way that requires no recompilation: just replace librocksdb.so and set a few environment variables to try TerarkDB.

If users need finer control, they can use C++ API to configure various options for TerarkDB in detail.
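For reference, here is ordinary RocksDB client code of the kind that should work unchanged against TerarkDB, since the RocksDB API is untouched; the database path is just an example:

```cpp
#include <cassert>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/terarkdb_demo", &db);
    assert(s.ok());

    // Standard RocksDB Put/Get calls; the storage engine underneath may be
    // BlockBasedTable or a Terark SSTable without any change to this code.
    s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
    assert(s.ok());

    std::string value;
    s = db->Get(rocksdb::ReadOptions(), "key1", &value);
    assert(s.ok() && value == "value1");

    delete db;
    return 0;
}
```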

At present, you can try it for free and run performance evaluations, but it cannot be used in production, because the trial version randomly deletes 0.1% of the data.

Terark command line toolset

We provide a set of command-line tools that compress input data into different forms; the compressed files can be decompressed or point-accessed using the Terark API or other command-line tools in the toolset.

After reading the above, do you have a better understanding of database compression technology?
