1. Decide which objects to compress
The amount of space that ROW and PAGE compression would save can be estimated with sp_estimate_data_compression_savings.
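As an illustration, the following call estimates the savings of PAGE compression for a hypothetical table dbo.Orders (the table name is an assumption; passing NULL for the index and partition covers all of them):

-- Estimate PAGE-compression savings for all indexes and partitions of dbo.Orders (hypothetical table).
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'Orders',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'PAGE';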
A table compresses well when it contains the following data patterns:
Numeric and fixed-length character columns in which most values do not use all the bytes of the type; for example, an INT column where most values are less than 1000.
Nullable columns that contain many NULL values.
Column values with many duplicates or many shared prefixes.
Compression is less effective when the table contains the following data patterns:
Numeric and fixed-length character columns in which most values use all the bytes of the type.
Very few duplicate values.
Duplicate values that do not share a common prefix.
Data stored off-row.
FILESTREAM data.
2. Evaluate the application workload pattern
Compressed pages stay compressed both on disk and in memory. Data is decompressed (not the whole page, only the relevant rows) in two cases:
When it is read by filtering, sorting, or join operations in a query
When it is updated by the application
Decompression consumes CPU, but data compression reduces physical and logical IO and improves cache efficiency. For data scan operations the IO reduction is considerable; for a singleton lookup the IO reduction is small.
The CPU overhead caused by row compression is usually no more than 10%. If current system resources are sufficient and a 10% CPU increase poses no pressure, it is recommended to enable row compression on all tables.
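For example, row compression could be enabled on a hypothetical table dbo.Orders with a minimal statement like this (the table name is an assumption):

-- Rebuild the table with row compression (hypothetical table name).
ALTER TABLE dbo.Orders REBUILD WITH (DATA_COMPRESSION = ROW);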
The CPU overhead of page compression is higher than that of row compression, so it is harder to decide whether to use it. A few simple guidelines help:
Start with tables and indexes that are used infrequently
If the system does not have enough CPU headroom, do not use page compression
Because filtering, join, aggregate, and sort operations work on decompressed data, data compression does not directly help such queries. If the workload consists mainly of very complex queries (multi-table joins, complex aggregations), page compression will not improve performance; its main value is the storage space it saves.
In large data warehouse systems, scan performance is the priority and storage hardware is expensive; provided CPU capacity allows, page compression is recommended for all tables.
Two finer-grained metrics can help evaluate which compression method to use:
U: update operations against a specific object (table, index, or partition) as a percentage of all operations on that object. The lower U is, the more suitable the object is for page compression.
S: scan operations against a specific object (table, index, or partition) as a percentage of all operations on that object. The higher S is, the more suitable the object is for page compression.
Query the U of all objects in the database through the following script:
SELECT o.name AS [Table_Name], x.name AS [Index_Name],
       i.partition_number AS [Partition], i.index_id AS [Index_ID],
       x.type_desc AS [Index_Type],
       i.leaf_update_count * 100.0 /
           (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count
            + i.leaf_update_count + i.leaf_page_merge_count
            + i.singleton_lookup_count) AS [Percent_Update]
FROM sys.dm_db_index_operational_stats(db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count
       + i.leaf_update_count + i.leaf_page_merge_count
       + i.singleton_lookup_count) != 0
  AND objectproperty(i.object_id, 'IsUserTable') = 1
Query the S of all objects in the database through the following script:
SELECT o.name AS [Table_Name], x.name AS [Index_Name],
       i.partition_number AS [Partition], i.index_id AS [Index_ID],
       x.type_desc AS [Index_Type],
       i.range_scan_count * 100.0 /
           (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count
            + i.leaf_update_count + i.leaf_page_merge_count
            + i.singleton_lookup_count) AS [Percent_Scan]
FROM sys.dm_db_index_operational_stats(db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count
       + i.leaf_update_count + i.leaf_page_merge_count
       + i.singleton_lookup_count) != 0
  AND objectproperty(i.object_id, 'IsUserTable') = 1
Both queries use the DMV sys.dm_db_index_operational_stats. The DMV only records values accumulated since the SQL Server instance was last started, so in practice choose an appropriate time to run the queries.
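One simple way to check how long those counters have been accumulating is to look at tempdb's creation time, since tempdb is recreated at every instance start (a small helper query, not from the white paper):

-- The instance start time equals tempdb's creation time.
SELECT create_date AS instance_start_time
FROM sys.databases
WHERE name = 'tempdb';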
Usually U < 20% and S > 75% is a reasonable starting point for considering page compression, but for append-only tables into which data is only inserted in order (for example, logging tables), page compression can be appropriate even if the S value is low.
3. Assess resource requirements
Compression is enabled on tables and indexes with ALTER TABLE ... REBUILD and ALTER INDEX ... REBUILD, and otherwise follows the same principles as rebuilding an index. The resources typically required are space, CPU, and IO. Space requirements:
During compression, the compressed and uncompressed copies of the table coexist; only after compression completes is the uncompressed copy dropped and its space released. If the rebuild is ONLINE, the mapping index also requires extra space.
The transaction log space requirement is determined by whether the operation runs ONLINE (ON or OFF) and by the database recovery model.
With SORT_IN_TEMPDB=ON (recommended), the mapping index, an internal structure kept in tempdb, is used to map old bookmarks to new ones so that concurrent DML can be supported. For the version store, tempdb usage is determined by the amount of data touched by concurrent DML and by transaction length.
Typically, the CPU overhead of a row compression operation is about 1.5 times that of rebuilding an index, and page compression costs 2 to 5 times as much as row compression. ONLINE mode requires additional CPU as well. Rebuild and compression operations can be parallelized, so consider them together with MAXDOP.
Considerations for parallelization:
When SQL Server creates, rebuilds, or compresses an index, it uses the statistics of the first (leftmost) column of the index to distribute the parallel work across CPUs. So when the first column is not very selective, or the data is so skewed that the first column has only a few distinct values, parallelism contributes little to performance.
Compressing or rebuilding a heap with ONLINE=ON is a single-threaded operation, although the table scan performed before the compression or rebuild is parallel and multithreaded.
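As a sketch of how these options combine, assuming a hypothetical index IX_Orders_OrderDate on dbo.Orders and a limit of 4 parallel workers:

-- Rebuild an index with page compression online, sorting in tempdb, capped at MAXDOP 4.
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON, SORT_IN_TEMPDB = ON, MAXDOP = 4);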
The resource costs of compressing versus rebuilding a clustered index can be compared in terms of the following quantities:
X = number of pages before compressing or rebuilding
P = number of pages after compression (P < X)
Y = number of pages newly added and updated (applies only to changes made by concurrent applications when ONLINE=ON)
M = size of the mapping index (an estimate based on the white paper)
C = CPU time required to rebuild the clustered index
Here are some reference points on when and how to compress data:
Online vs. Offline:
Offline is faster and requires fewer resources, but the table is locked during the compression operation. Online itself has some limitations.
Compressing one table/index/partition at a time vs. running multiple compression operations concurrently:
This depends on the headroom of current resources: if resources are sufficient, multiple compression operations can run in parallel; otherwise it is best to do one at a time.
Order of table compression operations:
Start with the small tables: compressing a small table requires fewer resources and finishes quickly, and the resources it releases help the compression of subsequent tables.
SORT_IN_TEMPDB= ON or OFF:
ON is recommended. This lets tempdb hold the mapping index and carry out its operations, reducing the space required in the user database.
Side effects of compression operation:
The compression operation includes a rebuild operation, so the fragmentation on the table or index is removed.
When compressing a heap that has nonclustered indexes: with ONLINE=OFF the index rebuilds run serially; with ONLINE=ON they run in parallel.
4. Maintain compressed data
Compression of newly inserted data
* Row-compressed pages can be converted to page compression by rebuilding the heap with page compression.
** With page compression, not all pages are page-compressed; a page is compressed only when the space saved by page compression exceeds an internal threshold.
Update and delete compressed rows
Rows updated in a row-compressed table/partition remain in row-compressed format. For a page-compressed table/partition, not every update of a data row causes the column prefixes and page dictionary to be recalculated; they are recalculated only when the number of updates on the page exceeds an internal threshold.
The behavior of auxiliary data structures
Table compression | Transaction log | Mapping index for rebuilding the clustered index | Sort pages for queries | Version store (with SI or RCSI isolation level)
ROW | ROW | NONE | NONE | ROW
PAGE | ROW | NONE | NONE | ROW
The non-leaf-level pages of a page-compressed index are only row-compressed.
The non-leaf levels of an index are relatively small, so even if page compression were applied the space savings would be insignificant. Non-leaf pages are also accessed frequently, so using only row compression keeps the decompression cost of each access low.
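A small helper query (not from the white paper) can be used to check the compression setting currently in effect for each index and partition; the filter on the table name dbo.Orders is an illustrative assumption:

-- data_compression_desc in sys.partitions reports NONE, ROW, or PAGE per partition.
SELECT o.name AS table_name, i.name AS index_name,
       p.partition_number, p.data_compression_desc
FROM sys.partitions AS p
JOIN sys.indexes AS i ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.objects AS o ON o.object_id = p.object_id
WHERE o.name = 'Orders';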
5. Reclaim the free space released by data compression
Do not reclaim the space; keep it for future data growth. This is not suitable for the read-only partitions of a partitioned table (where each partition is mapped to a different filegroup): old read-only partitions will not grow, and compressing them can release a lot of space.
DBCC SHRINKFILE (or DBCC SHRINKDATABASE). This operation brings a lot of fragmentation, and it is a single-threaded operation that can take a long time.
If all the tables on a filegroup are being compressed, create a new filegroup and move the tables and indexes onto it as they are compressed. The data movement can be done by creating/re-creating the clustered index, for example WITH (DATA_COMPRESSION=PAGE, DROP_EXISTING=ON, SORT_IN_TEMPDB=ON) ON [FG_NEW], as sketched after this list. After the data has been moved, drop the original filegroup. Note that LOB_DATA cannot be moved to a new filegroup this way.
Create compressed tables on the new filegroup, and then import data into those tables.
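A minimal sketch of the last two options, assuming a hypothetical table dbo.Orders with a clustered index CIX_Orders_OrderDate on OrderDate and an already-created filegroup FG_NEW with enough space:

-- Move and compress an existing table by re-creating its clustered index on the new filegroup.
CREATE CLUSTERED INDEX CIX_Orders_OrderDate
    ON dbo.Orders (OrderDate)
    WITH (DATA_COMPRESSION = PAGE, DROP_EXISTING = ON, SORT_IN_TEMPDB = ON)
    ON [FG_NEW];

-- Or create a page-compressed table directly on the new filegroup and import the data into it.
CREATE TABLE dbo.Orders_new
(
    OrderID   int      NOT NULL,
    OrderDate datetime NOT NULL
)
ON [FG_NEW]
WITH (DATA_COMPRESSION = PAGE);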
6. BULK INSERT and data compression
Importing data into a compressed table with BULK INSERT WITH (TABLOCK) is the fastest approach. Obviously, it locks the table.
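A minimal sketch, assuming a page-compressed table dbo.Orders and a hypothetical data file path:

-- TABLOCK takes a table-level lock and gives the fastest load into a compressed table.
BULK INSERT dbo.Orders
FROM 'C:\load\orders.dat'
WITH (TABLOCK);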
When loading data that is to be compressed, consider the order in which BULK INSERT runs and the clustered index is created:
No. | Approach | Comparison
1 | BULK INSERT into an uncompressed heap table, then CREATE CLUSTERED INDEX WITH (DATA_COMPRESSION = PAGE) | Time required: 13
3 | BULK INSERT into a page-compressed clustered index |
7. Data compression and partition table maintenance
1. The SWITCH operation requires the target partition (or target table) to use the same compression setting as the source partition (see the sketch at the end of this section).
2. The partition after Split inherits the compression mode of the original partition.
3. For the MERGE operation, the partition that is removed is called the source partition, and the partition that receives the data is called the destination partition:
Compression setting of the destination partition | How the data is merged into the destination partition
NONE | During the MERGE, the data is decompressed into the destination partition
ROW | During the MERGE, the data is converted to row-compressed format
PAGE | Heap: during the MERGE, the data is converted to row-compressed format; Clustered index: during the MERGE, the data is converted to page-compressed format
PS: rules for the partition MERGE operation
1. With RANGE LEFT, the boundary value belongs to the partition on its left; that partition is removed as the source, and its data moves into the neighboring partition on the right, which is kept as the destination.
2. With RANGE RIGHT, the boundary value belongs to the partition on its right; that partition is removed as the source, and its data moves into the neighboring partition on the left, which is kept as the destination.
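A minimal sketch of point 1 above, assuming a partitioned table dbo.Orders whose partition 3 is page compressed and a staging table dbo.Orders_Staging with an identical structure on the same filegroup:

-- The staging table must use the same compression setting as the source partition before the switch.
ALTER TABLE dbo.Orders_Staging REBUILD WITH (DATA_COMPRESSION = PAGE);
-- Switch partition 3 out of the partitioned table into the staging table.
ALTER TABLE dbo.Orders SWITCH PARTITION 3 TO dbo.Orders_Staging;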
8. Data compression and transparent data encryption (TDE)
TDE encrypts data pages as they are written to disk and decrypts them as they are read from disk into memory. Data compression/decompression operates on pages in memory, which are always already decrypted, so there is very little interaction between the two features.
Summary
1. This article is a brief translation and summary of the white paper, which is based on SQL Server 2008.
2. Data compression is an undervalued SQL Server feature; I personally think it should be adopted as one of the standard best practices.