
What are the commonly used compression algorithms for databases?


Many newcomers are not very clear about the compression algorithms commonly used in databases. To help solve that, the editor explains them in detail below; anyone who needs this can read on, and hopefully you will gain something.

How to interpret the compression algorithms commonly used in databases

The earliest columnar database, Sybase IQ, used a Decomposed model that simply stored each column separately and then marked each row's position with a rowid.

Later column databases largely stopped using a rowid to mark each column's position. Instead, for each Block, they first work out which column has the highest selectivity (the fewest unique values), then sort the data by that column, so that row positions can be marked without a rowid.

Because identical values are more likely to sit together after sorting, the compression ratio is many times higher than with plain, simple column storage (that is, the Decomposed model described above).

The common column database compression algorithms are as follows:

Run Length Encoding

Once the data is sorted, identical values sit together, and each value is compressed to a single entry within the same Block. Note that sorting here does not mean ordering by value size (such as 5 > 1 or B > A); rather, identical values are gathered together and the groups are ordered by frequency of occurrence (similar to each field after a group by). For example:

WWWWWWWWWWWWWWBBBBBBBBBZZZZZA

After compression, it becomes

W14B9Z5A1

The values are laid out in order of frequency of occurrence. By counting how many times each value occurs, you can derive the starting rowid and the run length of the current value. This is especially favorable for SQL operations such as IN, NOT IN, GROUP BY, and so on.
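To make this concrete, here is a minimal Python sketch of this frequency-sorted run-length encoding. The (value, starting rowid, run length) triples and the function names are illustrative assumptions, not the on-disk format of any particular database.

from collections import Counter

def rle_encode(values):
    # Group identical values, order the groups by frequency (descending),
    # and record each value once with its starting rowid and run length.
    runs, rowid = [], 0
    for value, count in Counter(values).most_common():
        runs.append((value, rowid, count))
        rowid += count
    return runs

def rle_contains(runs, needle):
    # An IN-style predicate scans one entry per distinct value
    # instead of every row.
    return any(value == needle for value, _, _ in runs)

column = list("WWWWWWWWWWWWWWBBBBBBBBBZZZZZA")
print(rle_encode(column))  # [('W', 0, 14), ('B', 14, 9), ('Z', 23, 5), ('A', 28, 1)]
print(rle_contains(rle_encode(column), "Z"))  # True

Note how the encoded form mirrors the W14B9Z5A1 example above, with each run's starting rowid derivable from the counts alone.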

BitMap Encoding

Write each value that occurs into the header of the Block, then use 1s and 0s to indicate, row by row, whether that value is present. This algorithm requires every occurring value to have very high selectivity, and the frequencies of different values must not differ by dozens of times; Run Length Encoding, by contrast, only needs high overall selectivity.


In a row database, a BitMap Index is generally one whole file (or a local index, if the table uses local partitioning). A column database organizes it block by block: the bitmap index for a Block covers exactly the values that appear in that Block. Some columnar databases add variants, for example applying Run Length Encoding to the runs of 1s and 0s for further compression, or adding Null Compression for columns with many null values.
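A minimal sketch of the per-block scheme described above, assuming Python integers as bit vectors (an illustrative stand-in for a real bitmap structure) and a separate null bitmap in the spirit of Null Compression:

def bitmap_encode(block):
    # Store each distinct value once, with one bit vector marking the
    # rows where it occurs; nulls go into their own bitmap.
    bitmaps, null_bitmap = {}, 0
    for rowid, value in enumerate(block):
        if value is None:
            null_bitmap |= 1 << rowid
        else:
            bitmaps[value] = bitmaps.get(value, 0) | (1 << rowid)
    return bitmaps, null_bitmap

def rows_matching(bitmaps, value):
    # Answer an equality predicate by reading one bit vector.
    bits = bitmaps.get(value, 0)
    return [i for i in range(bits.bit_length()) if bits >> i & 1]

bitmaps, nulls = bitmap_encode(["red", "blue", None, "red", "red", "blue"])
print(rows_matching(bitmaps, "red"))  # [0, 3, 4]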

Data Dictionary

Row databases generally choose this approach: the commonly used values of a Block are placed in the block header, and each place where such a value actually occurs is replaced by a tag. Row databases usually offer a compression level; the higher the level, the longer compression takes and the smaller the marginal gain, while decompression time stays the same. Data Dictionary compression in a column database also stores the actual pre-compression values in the block header, but unlike a row database, a column database stores all the values, even a value that appears only once. This lets it perform normal operations, through invisible indexes and delayed materialization, without decompressing the data.
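A minimal sketch of the column-store variant, assuming a block-header dictionary holding every distinct value (the function names are illustrative). An equality predicate is answered on the integer codes alone, without materializing the column:

def dict_encode(column):
    # Every distinct value, even one that appears only once, goes into
    # the block-header dictionary; the column becomes small integer codes.
    dictionary, code_of, codes = [], {}, []
    for value in column:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes

def filter_equals(dictionary, codes, needle):
    # Look the value up once, then compare integer codes; the column
    # itself is never decompressed (delayed materialization).
    target = dictionary.index(needle) if needle in dictionary else -1
    return [rowid for rowid, code in enumerate(codes) if code == target]

dictionary, codes = dict_encode(["DE", "US", "US", "FR", "US"])
print(dictionary, codes)                       # ['DE', 'US', 'FR'] [0, 1, 1, 2, 1]
print(filter_equals(dictionary, codes, "US"))  # [1, 2, 4]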

Delta Compression

Delta compression suits data whose leading part is the same, such as large integers and long, regular strings. It records a base value, and every value after that stores only the part that differs: phone numbers, URLs, addresses, IPs, and so on.
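A minimal front-coding sketch of this idea for strings such as URLs; each entry stores how many leading characters it shares with the previous value plus the remaining suffix (the tuple layout is an illustrative assumption):

def delta_encode(values):
    # Store the first value in full; for each later value, store the
    # length of the prefix shared with its predecessor plus the suffix.
    encoded, prev = [], ""
    for value in values:
        shared = 0
        while shared < min(len(prev), len(value)) and prev[shared] == value[shared]:
            shared += 1
        encoded.append((shared, value[shared:]))
        prev = value
    return encoded

def delta_decode(encoded):
    values, prev = [], ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        values.append(prev)
    return values

urls = ["http://example.com/a", "http://example.com/b", "http://example.org/a"]
print(delta_encode(urls))  # [(0, 'http://example.com/a'), (19, 'b'), (15, 'org/a')]
assert delta_decode(delta_encode(urls)) == urls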

LZO

LZO and other similar binary compression algorithms (gzip, zip, rar, 7zip, bzip2, and so on) are also used in column databases, but only for archived data that is rarely read (a very small number of rows at a time) and for large chunks of text (web page indexes, etc.).
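LZO itself needs a third-party binding in Python, so the sketch below uses the standard-library zlib as a stand-in for a generic byte-level codec. The point is the access pattern: whole blocks are compressed and decompressed at once, which is why such codecs suit rarely read archive data rather than hot columns.

import zlib

# Stand-in for an LZO/gzip-style block codec: compress a whole archived
# block at once; reading even one row means inflating the entire block.
page = ("<html><body>repetitive archived page text</body></html>" * 200).encode()

compressed = zlib.compress(page, 6)
print(len(page), "->", len(compressed))  # large, repetitive block shrinks well

restored = zlib.decompress(compressed)
assert restored == page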
