What is the principle of idle block management mechanism in PostgreSQL? 07/13 Update SLTechnology News&Howtos

What is the principle of idle block management mechanism in PostgreSQL?

2025-07-13 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

What this article shares with you is about the principle of free data block management mechanism in PostgreSQL. The editor thinks it is very practical, so I share it with you to learn. I hope you can get something after reading this article.

Generation of data block free space

According to PostgreSQL's MVCC mechanism, all UPDATE and DELETE operations generate expired data, which needs to be cleaned up through the vacuum command. There are basically two kinds of vacuum commands:

VACUUM

Mark the disk space for an expired tuple as available, but it does not really free up space for the operating system, and other programs can no longer use it. This operation does not require an exclusive lock (EXCLUSIVE LOCK) and does not affect table read and write operations.

VACUUM FULL

Copy the normal tuple data into the new disk file, reorganize, delete the original data file, and return the unused disk space to the operating system. The operation needs to acquire an exclusive lock, which will affect the normal read and write operation. Therefore, you need to be careful when performing this operation, especially when the amount of table data is large, the execution time will be longer.

We know that the table (Relation) of PostgreSQL is actually composed of multiple physical data blocks (pages). When the vacuum operation is performed, the disk space in these data blocks with expired records (tuple) is marked as available, resulting in free space.

When a new record (tuple) is added, priority is given to reusing the free space of the block in the table rather than allocating a new block. However, when multiple blocks have free space, which block should be selected to hold the new record? The selected record must have enough space to store the new record.

Organizational structure of free data blocks

To solve the above problems, PostgreSQL designed a FSM (Free Space Map) structure to represent the amount of free disk space in each data block. After the pg8.4 version, each table (Relation) has its own FSM space, which is represented as a physical file with the suffix _ fsm:

-bash-4.2$ cd $PGDATA/ins2/base-bash-4.2$ ll * fsm-rw- 1 postgres postgres 24576 Jun 26 15:40 1247_fsm-rw- 1 postgres postgres 24576 Jun 26 15:40 1249_fsm-rw- 1 postgres postgres 24576 Jun 26 15:40 1255_fsm

The storage structure of the FSM file is as follows:

In order to quickly find the appropriate data block and reduce the IO overhead caused by the search (that is, saving FSM file size), the FSM structure uses only one byte to record the free disk size in a data block. Because of 1byte=8bits, it can record 2 ^ 8 kinds of free disk sizes. Assuming that a block size (BLCKSZ) is 8k (postgreSQL default is 8k), it can be divided into 256 (22 ^ 8) and so on. Each copy has BLCKSZ/256 bytes to represent the range. Examples are as follows:

Range Category 0-31 0 32-63 1. 8096-8127 253 8128-8163 2545 8164-8192

Data structures within FSM data blocks

Knowing the representation of the amount of free space in the data block, how to organize these presentation records and maintain efficient query efficiency? The answer is that PostgreSQL uses a binary tree structure (large root heap) to store these records that represent the size of free space, leaf nodes store the actual space size records, and non-leaf nodes only serve as auxiliary queries. When you need to query whether there is a suitable data block size, you only need to compare the root node of the tree first, which greatly reduces the number of queries. Examples of large root heap data structures are as follows:

4 4 2 3 4 0 2 = 1626)? 3: 4) # define FSM_ROOT_LEVEL (FSM_TREE_DEPTH-1) # define FSM_BOTTOM_LEVEL 0 typedef struct {int level; / * level * / int logpageno; / * page number within the level * /} FSMAddress

Where level represents the layer number of the FSM data block, and logpageno represents the sequence number in that layer, starting with 0. Similar to the storage method within a single data block of FSM, records are actually stored only in the FSM data block of the * * layer (level=0). Other layers serve as the auxiliary layer of the query, and the leaf node value of the upper layer represents the root node value of the lower layer.

How many layers of logical structure does it take to represent all block records? the answer is that when more than 1626 records (map value) are stored in a FSM block, use three layers, because 162616261626 > = 2 ^ 32.

The following is a schematic diagram to show the overall organizational structure. In order to simplify the schematic diagram, only 4 bytes of data are stored in each data block in the diagram, which is consistent with the principle of storing 1626 bytes. The schematic diagram of the logical organization structure among the data blocks of the FSM file is as follows:

As shown in the figure, the leaf node value 123 in the layer 2 data block represents the root node value of the next layer (layer 1) No. 0 data block, while the leaf node value 123 of the layer 1 data block represents the root node of the layer 0 data block, and the leaf node value 123 of the layer 0 data block represents the data block with a free space size of [3936, 3967] bytes. Each data block has a logical address, for example, the logical address of data block 1 {1 FSM 0} represents the No. 0 FSM data block of layer 1, which is actually the No. 1 data block of the corresponding FSM physical file. The data stored in the layer 2 and layer 1 FSM data blocks are only used as secondary layer indexes. In fact, only the leaf nodes in the layer 0 FSM data blocks store the map values of the free data blocks in the table, and the other nodes are index values.

Search algorithm for idle data blocks

The representation of free data blocks and the organization of data blocks in FSM files are introduced above, and then the search algorithm of free space data blocks is introduced.

First of all, the search algorithm in FSM data blocks is introduced. For large root heap binary tree lookup, the simple method is to start from the root node every time. If the root node is less than the value to be found, it means that there is no map value in the block that meets the conditions, otherwise you can continue to find a leaf node that meets the conditions. However, the design of PostgreSQL is not like this. Instead, the fp_next_slot of the FSMPageData structure described earlier is used to save the starting point (slot) of the next query. The search algorithm is as follows:

Compare the root node value, and if the value to be queried is greater than the root node, it will be returned directly, indicating that there is no map value that meets the condition in the FSM data block, otherwise proceed to the next step.

Compare the map values corresponding to the starting position (slot) of the query, and if the condition is not met, proceed to the next step, otherwise skip to step 5.

Set the new query location to the parent of the next slot (the slot sequence number + 1, where the slott value represents the sequence number in the leaf node), and then compare, and repeat this step if the condition is not met until the root node is found up. If you find an intermediate node that meets the criteria, proceed to the next step.

Look down, find the leaf node that meets the criteria, and then move on to the next step.

Reset the fp_next_slot variable for the next query and return the slot of the leaf node.

The core source code of the FSM intra-block search algorithm is as follows:

The core source code of the FSM intra-block search algorithm is as follows: int fsm_search_avail (Buffer buf, uint8 minvalue, bool advancenext, bool exclusive_lock_held) {. Restart: if (fsmpage- > fp_nodes [0])

< minvalue) //每次查询先检查根节点是否满足条件 return -1; target = fsmpage->

Fp_next_slot; if (target

< 0 || target >

= LeafNodesPerPage) target = 0; target + = NonLeafNodesPerPage; nodeno = target; / / slot location while (nodeno > 0) {if (fsmpage- > fp_ [nodeno] > = minvalue) break; nodeno = parentof (rightneighbor (nodeno)); / / return the parent node location of the next slot} while (nodeno)

< NonLeafNodesPerPage) //向下查找到叶子节点 { int childnodeno = leftchild(nodeno); //先查看左子节点 if (childnodeno < NodesPerPage && fsmpage->

Fp_ Nodes [childnodeno] > = minvalue) {nodeno = childnodeno; continue;} childnodeno++; / / left child node does not meet the condition to find the right child node if (childnodeno)

< NodesPerPage && fsmpage->

Fp_ Nodes [childnodeno] > = minvalue) {nodeno = childnodeno;} else {/ / not found, indicating that there may be a "torn page" situation (crash appears when IO writes disk data, only part of the data is written) / / update the page data and then query. Fsm_rebuild_page (page); Goto restart;}} slot = nodeno-NonLeafNodesPerPage; / / find the slot sequence number fsmpage- > fp_next_slot = slot + (advancenext? 1: 0); / / Save the slot location where the next query starts return slot;}

So far, we have found the leaf node in the FSM data block that meets the conditions. If the page is not in layer 0, then the leaf node is not our final query target. According to the organizational structure of the aforementioned FSM data blocks, the leaf node in the auxiliary layer corresponds to the root node of the next layer FSM data block, so we need to continue to find the corresponding leaf node in layer 0. The data block corresponding to the next layer of the leaf node is calculated by the returned slot value. The source code of the core search algorithm is as follows:

For (;;) {. Slot = fsm_search_avail (buf, min_cat, (addr.level = = FSM_BOTTOM_LEVEL), false); If (slot! =-1) / / find the leaf node that meets the condition, otherwise exit the loop {if (addr.level = = FSM_BOTTOM_LEVEL) / / find layer 0 and return the result return fsm_get_heap_blk (addr, slot); addr = fsm_get_child (addr, slot); / / non-layer 0, continue to find the subtree}. } static FSMAddress fsm_get_child (FSMAddress parent, uint16 slot) {FSMAddress child; Assert (parent.level > FSM_BOTTOM_LEVEL); child.level = parent.level-1; child.logpageno = parent.logpageno * SlotsPerFSMPage + slot; / / find the logpageno return child; of the corresponding data page in the lower layer according to the slot of the upper layer

The whole search algorithm is introduced, but why use fp_next_slot as the starting query location instead of the root node? There are several reasons:

When multiple back-end connections add tuple at the same time, write conflicts on the same data block can be avoided as far as possible and write parallelism can be improved. If you start looking from the root node every time, it is possible that multiple queries find the same data block at the same time.

What is obtained is the adjacent data block that returned the query result last time, which is more helpful to improve the disk IO efficiency.

Update free block space size

When the appropriate free data block in the table is found, the new record is written to the data block, and then the free space size of the data block needs to be updated. Compared with search, update is relatively simple, the core idea is to recalculate the map value of the idle data block, then update the value of the corresponding leaf node in the FSM data block, and then update it up in a "bubbling" way until the value of the parent node does not change or the root node. The core source code is as follows:

Fsmpage- > fp_ [nodeno] = value; / / update the current node do {. Nodeno = parentof (nodeno); lchild = leftchild (nodeno); rchild = lchild + 1; newvalue = fsmpage- > fp_ [lchild]; if (rchild

< NodesPerPage) //右子节点存在，则选取***值作为父节点的新值 newvalue = Max(newvalue, fsmpage->

Fp_ Nodes [rchild]; oldvalue = fsmpage- > fp_ Nodes [nodeno]; if (oldvalue = = newvalue) / / check whether the parent node has changed after the update break; fsmpage- > fp_ Nodes [newvalue;] / / change, update the parent node, continue to update} while (nodeno > 0) / / Update to root node to exit

Lock

Searching for free blocks only adds a shared lock (shared buffer locks) to the currently searched FSM block, and an exclusive lock (exclusive buffer lock) when the FSM block is updated. It is worth noting that during the search, the fp_next_slot variable is used to indicate the starting point of the next search, without adding an exclusive lock, because the cost of maintaining an exclusive lock is much higher than that of an exception in the fp_next_slot variable.

These are the principles of the free data block management mechanism in PostgreSQL. The editor believes that there are some knowledge points that we may see or use in our daily work. I hope you can learn more from this article. For more details, please follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.