PostgreSQL Source Code Interpretation (211) - Background Process #10 (checkpointer - BufferSync)

2025-01-17 Update. From: SLTechnology News&Howtos > Database


Shulou(Shulou.com)06/01 Report--

This section introduces BufferSync, the function that flushes dirty pages to disk during a checkpoint: it writes out all dirty buffers in the pool, persisting every dirty page in shared buffers to physical storage.

It is worth noting that a checkpoint only processes pages that were already dirty when the checkpoint began (marked BM_CHECKPOINT_NEEDED), not pages that become dirty while the checkpoint is running.
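To make that marking rule concrete, here is a minimal sketch. This is not PostgreSQL code: the flag values are hypothetical placeholders (the real definitions live in buf_internals.h and differ), and the real state word also carries refcounts and other bits. The point is only that a buffer gains the checkpoint flag solely during the initial scan, so a page dirtied afterwards is skipped by the checkpoint's write loop.

```c
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define BM_DIRTY              (1U << 0)
#define BM_CHECKPOINT_NEEDED  (1U << 1)

/* During the initial scan: a buffer that is dirty right now is
 * tagged as belonging to this checkpoint. */
static uint32_t mark_for_checkpoint(uint32_t buf_state)
{
    if (buf_state & BM_DIRTY)
        buf_state |= BM_CHECKPOINT_NEEDED;
    return buf_state;
}

/* Later, in the write loop: only tagged buffers are flushed.  A page
 * dirtied after the scan never gained the tag, so it is skipped. */
static int needs_checkpoint_write(uint32_t buf_state)
{
    return (buf_state & BM_CHECKPOINT_NEEDED) != 0;
}
```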

I. Data structures

Macro definitions

Checkpoint request flag bits.

/*
 * OR-able request flag bits for checkpoints.  The "cause" bits are used only
 * for logging purposes.  Note: the flags must be defined so that it's
 * sensible to OR together request flags arising from different requestors.
 */

/* These directly affect the behavior of CreateCheckPoint and subsidiaries */
#define CHECKPOINT_IS_SHUTDOWN      0x0001  /* Checkpoint is for shutdown */
#define CHECKPOINT_END_OF_RECOVERY  0x0002  /* Like shutdown checkpoint, but
                                             * issued at end of WAL recovery */
#define CHECKPOINT_IMMEDIATE        0x0004  /* Do it without delays */
#define CHECKPOINT_FORCE            0x0008  /* Force even if no activity */
#define CHECKPOINT_FLUSH_ALL        0x0010  /* Flush all pages, including those
                                             * belonging to unlogged tables */
/* These are important to RequestCheckpoint */
#define CHECKPOINT_WAIT             0x0020  /* Wait for completion */
#define CHECKPOINT_REQUESTED        0x0040  /* Checkpoint request has been made */
/* These indicate the cause of a checkpoint request */
#define CHECKPOINT_CAUSE_XLOG       0x0080  /* XLOG consumption */
#define CHECKPOINT_CAUSE_TIME       0x0100  /* Elapsed time */

II. Source code interpretation
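These bits are designed to be OR-ed together by different requestors. As a worked example, the flags=108 value printed in the gdb session later in this article is 0x6C, which decomposes as CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_REQUESTED (4 + 8 + 32 + 64). A small sketch, copying just the flag values needed from the definitions above; the helper function is illustrative, not PostgreSQL's:

```c
/* Flag values copied from the PostgreSQL definitions above. */
#define CHECKPOINT_IMMEDIATE   0x0004
#define CHECKPOINT_FORCE       0x0008
#define CHECKPOINT_WAIT        0x0020
#define CHECKPOINT_REQUESTED   0x0040
#define CHECKPOINT_CAUSE_XLOG  0x0080

/* Requestors OR their flags together; consumers test individual bits. */
static int checkpoint_is_immediate(int flags)
{
    return (flags & CHECKPOINT_IMMEDIATE) != 0;
}
```

Because every flag occupies a distinct bit, combining requests from different sources loses no information, and cause bits such as CHECKPOINT_CAUSE_XLOG can be tested independently of the behavior bits.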

BufferSync: persist all dirty pages in the buffer pool to physical storage. The main logic is as follows:

1. Perform sanity checks, e.g. make sure the resource owner can handle the pin taken inside SyncOneBuffer.

2. Set the state mask from the checkpoint flags (for shutdown, end-of-recovery, or flush-all checkpoints, unlogged buffers are flushed as well).

3. Scan all buffers, marking the pages that need flushing with BM_CHECKPOINT_NEEDED; if there are no pages to process, return.

4. Sort the dirty pages to be flushed, to avoid random I/O and improve performance.

5. Allocate a progress-status struct for each tablespace that has dirty pages to flush.

6. Build a min-heap over the per-tablespace write progress and compute how large a share of the total progress one processed buffer represents.

7. While ts_heap is not empty, loop:

7.1 Get the buf_id.

7.2 Call SyncOneBuffer to flush the buffer to disk.

7.3 Call CheckpointWriteDelay to sleep and throttle the I/O rate.

7.4 Release resources and update statistics.
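Steps 4 through 7 implement tablespace-balanced writing. The sketch below is illustrative C, not the PostgreSQL implementation: PostgreSQL keeps the per-tablespace structs in a binaryheap keyed by progress, while this sketch simply scans for the minimum. It shows why progress_slice = total_buffers / per-tablespace_buffers leads to proportional interleaving: each write advances a small tablespace's progress by a large slice, so always servicing the lowest-progress tablespace spreads its few writes evenly across the checkpoint instead of bunching them.

```c
/* Simplified per-tablespace bookkeeping, loosely mirroring CkptTsStatus. */
typedef struct
{
    int    num_to_scan;     /* dirty buffers in this tablespace */
    int    num_scanned;     /* buffers processed so far */
    double progress;        /* comparable across tablespaces */
    double progress_slice;  /* progress gained per processed buffer */
} TsStatus;

/* progress_slice = total / per-tablespace count, as in BufferSync. */
static void init_slices(TsStatus *ts, int n, int total)
{
    for (int i = 0; i < n; i++)
        ts[i].progress_slice = (double) total / ts[i].num_to_scan;
}

/* Stand-in for the min-heap: pick the unfinished tablespace with the
 * lowest progress; -1 when everything has been processed. */
static int pick_min_progress(const TsStatus *ts, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
    {
        if (ts[i].num_scanned == ts[i].num_to_scan)
            continue;           /* this tablespace is exhausted */
        if (best < 0 || ts[i].progress < ts[best].progress)
            best = i;
    }
    return best;
}

/* Process one buffer from the chosen tablespace. */
static void write_one(TsStatus *ts)
{
    ts->progress += ts->progress_slice;
    ts->num_scanned++;
}
```

With two tablespaces holding 3 and 1 dirty buffers (4 total), the slices are 4/3 and 4/1, and the pick order comes out ts0, ts1, ts0, ts0: the small tablespace's single write is interleaved early rather than deferred until the end.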

/*
 * BufferSync -- Write out all dirty buffers in the pool.
 *
 * This is called at checkpoint time to write out all dirty shared buffers.
 * The checkpoint request flags should be passed in.  If CHECKPOINT_IMMEDIATE
 * is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN,
 * CHECKPOINT_END_OF_RECOVERY or CHECKPOINT_FLUSH_ALL is set, we write even
 * unlogged buffers, which are otherwise skipped.  The remaining flags
 * currently have no effect here.
 */
static void
BufferSync(int flags)
{
    uint32      buf_state;
    int         buf_id;
    int         num_to_scan;
    int         num_spaces;
    int         num_processed;
    int         num_written;
    CkptTsStatus *per_ts_stat = NULL;
    Oid         last_tsid;
    binaryheap *ts_heap;
    int         i;
    int         mask = BM_DIRTY;
    WritebackContext wb_context;

    /* Make sure we can handle the pin inside SyncOneBuffer */
    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

    /*
     * Unless this is a shutdown checkpoint or we have been explicitly told,
     * we write only permanent, dirty buffers.  But at shutdown or end of
     * recovery, we write all dirty buffers.
     */
    if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                    CHECKPOINT_FLUSH_ALL))))
        mask |= BM_PERMANENT;

    /*
     * Loop over all buffers, and mark the ones that need to be written with
     * BM_CHECKPOINT_NEEDED.  Count them as we go (num_to_scan), so that we
     * can estimate how much work needs to be done.
     *
     * This allows us to write only those pages that were dirty when the
     * checkpoint began, and not those that get dirtied while it proceeds.
     * Whenever a page with BM_CHECKPOINT_NEEDED is written out, either by us
     * later in this function, or by normal backends or the bgwriter cleaning
     * scan, the flag is cleared.  Any buffer dirtied after this point won't
     * have the flag set.
     *
     * Note that if we fail to write some buffer, we may leave buffers with
     * BM_CHECKPOINT_NEEDED still set.  This is OK since any such buffer
     * would certainly need to be written for the next checkpoint attempt,
     * too.
     */
    num_to_scan = 0;
    for (buf_id = 0; buf_id < NBuffers; buf_id++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buf_id);

        /*
         * Header spinlock is enough to examine BM_DIRTY, see comment in
         * SyncOneBuffer.
         */
        buf_state = LockBufHdr(bufHdr);

        if ((buf_state & mask) == mask)
        {
            CkptSortItem *item;

            buf_state |= BM_CHECKPOINT_NEEDED;

            item = &CkptBufferIds[num_to_scan++];
            item->buf_id = buf_id;
            item->tsId = bufHdr->tag.rnode.spcNode;
            item->relNode = bufHdr->tag.rnode.relNode;
            item->forkNum = bufHdr->tag.forkNum;
            item->blockNum = bufHdr->tag.blockNum;
        }

        UnlockBufHdr(bufHdr, buf_state);
    }

    if (num_to_scan == 0)
        return;                 /* nothing to do */

    WritebackContextInit(&wb_context, &checkpoint_flush_after);

    TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);

    /*
     * Sort buffers that need to be written to reduce the likelihood of
     * random IO. The sorting is also important for the implementation of
     * balancing writes between tablespaces. Without balancing writes we'd
     * potentially end up writing to the tablespaces one-by-one; possibly
     * overloading the underlying system.
     */
    qsort(CkptBufferIds, num_to_scan, sizeof(CkptSortItem),
          ckpt_buforder_comparator);

    num_spaces = 0;

    /*
     * Allocate progress status for each tablespace with buffers that need
     * to be flushed. This requires the to-be-flushed array to be sorted.
     */
    last_tsid = InvalidOid;
    for (i = 0; i < num_to_scan; i++)
    {
        CkptTsStatus *s;
        Oid         cur_tsid;

        cur_tsid = CkptBufferIds[i].tsId;

        /*
         * Grow array of per-tablespace status structs, every time a new
         * tablespace is found.
         */
        if (last_tsid == InvalidOid || last_tsid != cur_tsid)
        {
            Size        sz;

            num_spaces++;

            /*
             * Not worth adding grow-by-power-of-2 logic here - even with a
             * few hundred tablespaces this should be fine.
             */
            sz = sizeof(CkptTsStatus) * num_spaces;

            if (per_ts_stat == NULL)
                per_ts_stat = (CkptTsStatus *) palloc(sz);
            else
                per_ts_stat = (CkptTsStatus *) repalloc(per_ts_stat, sz);

            s = &per_ts_stat[num_spaces - 1];
            memset(s, 0, sizeof(*s));
            s->tsId = cur_tsid;

            /*
             * The first buffer in this tablespace. As CkptBufferIds is
             * sorted by tablespace all (s->num_to_scan) buffers in this
             * tablespace will follow afterwards.
             */
            s->index = i;

            /*
             * progress_slice will be determined once we know how many
             * buffers are in each tablespace, i.e. after this loop.
             */

            last_tsid = cur_tsid;
        }
        else
        {
            s = &per_ts_stat[num_spaces - 1];
        }

        s->num_to_scan++;
    }

    Assert(num_spaces > 0);

    /*
     * Build a min-heap over the write-progress in the individual
     * tablespaces, and compute how large a portion of the total progress a
     * single processed buffer is.
     */
    ts_heap = binaryheap_allocate(num_spaces,
                                  ts_ckpt_progress_comparator,
                                  NULL);

    for (i = 0; i < num_spaces; i++)
    {
        CkptTsStatus *ts_stat = &per_ts_stat[i];

        ts_stat->progress_slice = (float8) num_to_scan / ts_stat->num_to_scan;

        binaryheap_add_unordered(ts_heap, PointerGetDatum(ts_stat));
    }

    binaryheap_build(ts_heap);

    /*
     * Iterate through to-be-checkpointed buffers and write the ones (still)
     * marked with BM_CHECKPOINT_NEEDED. The writes are balanced between
     * tablespaces; otherwise the sorting would lead to only one tablespace
     * receiving writes at a time, making inefficient use of the hardware.
     */
    num_processed = 0;
    num_written = 0;
    while (!binaryheap_empty(ts_heap))
    {
        BufferDesc *bufHdr = NULL;
        CkptTsStatus *ts_stat = (CkptTsStatus *)
        DatumGetPointer(binaryheap_first(ts_heap));

        buf_id = CkptBufferIds[ts_stat->index].buf_id;
        Assert(buf_id != -1);

        bufHdr = GetBufferDescriptor(buf_id);

        num_processed++;

        /*
         * We don't need to acquire the lock here, because we're only
         * looking at a single bit. It's possible that someone else writes
         * the buffer and clears the flag right after we check, but that
         * doesn't matter since SyncOneBuffer will then do nothing. However,
         * there is a further race condition: it's conceivable that between
         * the time we examine the bit here and the time SyncOneBuffer
         * acquires the lock, someone else not only wrote the buffer but
         * replaced it with another page and dirtied it. In that improbable
         * case, SyncOneBuffer will write the buffer though we didn't need
         * to. It doesn't seem worth guarding against this, though.
         */
        if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
        {
            /* only process pages marked BM_CHECKPOINT_NEEDED; call
             * SyncOneBuffer to flush one page at a time */
            if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
            {
                TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
                BgWriterStats.m_buf_written_checkpoints++;
                num_written++;
            }
        }

        /*
         * Measure progress independent of actually having to flush the
         * buffer - otherwise writing become unbalanced.
         */
        ts_stat->progress += ts_stat->progress_slice;
        ts_stat->num_scanned++;
        ts_stat->index++;

        /* Have all the buffers from the tablespace been processed? */
        if (ts_stat->num_scanned == ts_stat->num_to_scan)
        {
            binaryheap_remove_first(ts_heap);
        }
        else
        {
            /* update heap with the new progress */
            binaryheap_replace_first(ts_heap, PointerGetDatum(ts_stat));
        }

        /*
         * Sleep to throttle our I/O rate.
         */
        CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
    }

    /* issue all pending flushes */
    IssuePendingWritebacks(&wb_context);

    pfree(per_ts_stat);
    per_ts_stat = NULL;
    binaryheap_free(ts_heap);

    /*
     * Update checkpoint statistics. As noted above, this doesn't include
     * buffers written by other backends or bgwriter.
     */
    CheckpointStats.ckpt_bufs_written += num_written;

    TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
}

III. Tracking analysis

Test script

testdb=# update t_wal_ckpt set c2 = 'C4'||substr(c2, 4);
UPDATE 1
testdb=# checkpoint;

Tracking and analysis

(gdb) handle SIGINT print nostop pass
SIGINT is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal        Stop      Print   Pass to program Description
SIGINT        No        Yes     Yes             Interrupt
(gdb) b CheckPointGuts
Breakpoint 1 at 0x56f0ca: file xlog.c, line 8968.
(gdb) c
Continuing.

Program received signal SIGINT, Interrupt.

Breakpoint 1, CheckPointGuts (checkPointRedo=16953420440, flags=108) at xlog.c:8968
8968        CheckPointCLOG();
(gdb) n
8969        CheckPointCommitTs();
(gdb) 
8970        CheckPointSUBTRANS();
(gdb) 
8971        CheckPointMultiXact();
(gdb) 
8972        CheckPointPredicate();
(gdb) 
8973        CheckPointRelationMap();
(gdb) 
8974        CheckPointReplicationSlots();
(gdb) 
8975        CheckPointSnapBuild();
(gdb) 
8976        CheckPointLogicalRewriteHeap();
(gdb) 
8977        CheckPointBuffers(flags);   /* performs all required fsyncs */
(gdb) step
CheckPointBuffers (flags=108) at bufmgr.c:2583
2583        TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
(gdb) n
2584        CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
(gdb) 
2585        BufferSync(flags);
(gdb) step
BufferSync (flags=108) at bufmgr.c:1793
1793        CkptTsStatus *per_ts_stat = NULL;
(gdb) p flags
$1 = 108
(gdb) n
1797        int         mask = BM_DIRTY;
(gdb) 
1801        ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
(gdb) 
1808        if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
(gdb) 
1810            mask |= BM_PERMANENT;
(gdb) 
1828        num_to_scan = 0;
(gdb) 
1829        for (buf_id = 0; buf_id < NBuffers; buf_id++)
(gdb) 
1831            BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
(gdb) 
1837            buf_state = LockBufHdr(bufHdr);
(gdb) p buf_id
$2 = 0
(gdb) p NBuffers
$3 = 65536
(gdb) n
1839            if ((buf_state & mask) == mask)
(gdb) 
1853            UnlockBufHdr(bufHdr, buf_state);
(gdb) 
1829        for (buf_id = 0; buf_id < NBuffers; buf_id++)
(gdb) 
1831            BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
(gdb) 
1837            buf_state = LockBufHdr(bufHdr);
(gdb) 
1839            if ((buf_state & mask) == mask)
(gdb) 
1853            UnlockBufHdr(bufHdr, buf_state);
(gdb) 
1829        for (buf_id = 0; buf_id < NBuffers; buf_id++)
(gdb) b bufmgr.c:1856
Breakpoint 2 at 0x8a68b3: file bufmgr.c, line 1856.
(gdb) c
Continuing.

Breakpoint 2, BufferSync (flags=108) at bufmgr.c:1856
1856        if (num_to_scan == 0)
(gdb) p num_to_scan
$4 = 1
(gdb) n
1859        WritebackContextInit(&wb_context, &checkpoint_flush_after);
(gdb) 
1861        TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
(gdb) 
1870        qsort(CkptBufferIds, num_to_scan, sizeof(CkptSortItem),
(gdb) 
1873        num_spaces = 0;
(gdb) 
1879        last_tsid = InvalidOid;
(gdb) 
1880        for (i = 0; i < num_to_scan; i++)
(gdb) 
1885            cur_tsid = CkptBufferIds[i].tsId;
(gdb) 
1891            if (last_tsid == InvalidOid || last_tsid != cur_tsid)
(gdb) p cur_tsid
$5 = 1663
(gdb) n
1895                num_spaces++;
(gdb) 
1901                sz = sizeof(CkptTsStatus) * num_spaces;
(gdb) 
1903                if (per_ts_stat == NULL)
(gdb) 
1904                    per_ts_stat = (CkptTsStatus *) palloc(sz);
(gdb) 
1908                s = &per_ts_stat[num_spaces - 1];
(gdb) p sz
$6 = 40
(gdb) p num_spaces
$7 = 1
(gdb) n
1909                memset(s, 0, sizeof(*s));
(gdb) 
1910                s->tsId = cur_tsid;
(gdb) 
1917                s->index = i;
(gdb) 
1924                last_tsid = cur_tsid;
(gdb) 
1892            {
(gdb) 
1931            s->num_to_scan++;
(gdb) 
1880        for (i = 0; i < num_to_scan; i++)
(gdb) 
1934        Assert(num_spaces > 0);
(gdb) 
1941        ts_heap = binaryheap_allocate(num_spaces,
(gdb) 
1945        for (i = 0; i < num_spaces; i++)
(gdb) 
1947            CkptTsStatus *ts_stat = &per_ts_stat[i];
(gdb) 
1949            ts_stat->progress_slice = (float8) num_to_scan / ts_stat->num_to_scan;
(gdb) 
1951            binaryheap_add_unordered(ts_heap, PointerGetDatum(ts_stat));
(gdb) 
1945        for (i = 0; i < num_spaces; i++)
(gdb) 
1954        binaryheap_build(ts_heap);
(gdb) 
1962        num_processed = 0;
(gdb) p *ts_heap
$8 = {bh_size = 1, bh_space = 1, bh_has_heap_property = true, bh_compare = 0x8aa0d8 <ts_ckpt_progress_comparator>, bh_arg = 0x0, bh_nodes = 0x2d666d8}
(gdb) n
1963        num_written = 0;
(gdb) 
1964        while (!binaryheap_empty(ts_heap))
(gdb) 
1966            BufferDesc *bufHdr = NULL;
(gdb) 
1968                DatumGetPointer(binaryheap_first(ts_heap));
(gdb) 
1967            CkptTsStatus *ts_stat = (CkptTsStatus *)
(gdb) 
1970            buf_id = CkptBufferIds[ts_stat->index].buf_id;
(gdb) 
1971            Assert(buf_id != -1);
(gdb) p buf_id
$9 = 160
(gdb) n
1973            bufHdr = GetBufferDescriptor(buf_id);
(gdb) 
1975            num_processed++;
(gdb) 
1989            if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
(gdb) p *bufHdr
$10 = {tag = {rnode = {spcNode = 1663, dbNode = 16384, relNode = 221290}, forkNum = MAIN_FORKNUM, blockNum = 0}, buf_id = 160, state = {value = 3549691904}, wait_backend_pid = 0, freeNext = -2, content_lock = {tranche = 53, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}
(gdb) n
1991                if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
(gdb) 
1993                    TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
(gdb) 
1994                    BgWriterStats.m_buf_written_checkpoints++;
(gdb) 
1995                    num_written++;
(gdb) 
2003            ts_stat->progress += ts_stat->progress_slice;
(gdb) 
2004            ts_stat->num_scanned++;
(gdb) 
2005            ts_stat->index++;
(gdb) 
2008            if (ts_stat->num_scanned == ts_stat->num_to_scan)
(gdb) 
2010                binaryheap_remove_first(ts_heap);
(gdb) 
2021            CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
(gdb) 
1964        while (!binaryheap_empty(ts_heap))
(gdb) 
2025        IssuePendingWritebacks(&wb_context);
(gdb) 
2027        pfree(per_ts_stat);
(gdb) 
2028        per_ts_stat = NULL;
(gdb) 
2029        binaryheap_free(ts_heap);
(gdb) 
2035        CheckpointStats.ckpt_bufs_written += num_written;
(gdb) 
2037        TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
(gdb) 
2038    }
(gdb) 
CheckPointBuffers (flags=108) at bufmgr.c:2586
2586        CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();

IV. References

PG Source Code

Analysis of PostgreSQL checkpoint characteristics and scheduling


© 2024 shulou.com SLNews company. All rights reserved.
