Original link: http://www.postgres.cn/v2/news/viewone/1/385?tdsourcetag=s_pcqq_aiomsg
Exploring the Secrets of WAL, the PostgreSQL Transaction Log (Part I)
Original author: He Xiaodong (EthanHE) Created: 2019-01-02 09:03:46+08 Compiled by: redraiment
Release time: 09:03:46, 2019-01-02
Abstract
Transaction logs are an important part of a database. They store the history of all changes and operations in the database system so that the database does not lose data after a failure, such as a power outage or another fault that crashes the server. In PostgreSQL (hereinafter PG), the transaction log is called the Write-Ahead Log (hereinafter WAL).
This article briefly analyzes the structure of transaction log files in PG, covering basic WAL terminology, the composition of WAL files, the internal structure and content analysis of a WAL segment file, the in-memory organization of XLOG Records, and a brief introduction to the pg_waldump tool. This is the first part; it covers basic WAL terminology, the composition of WAL files, and the internal structure of a WAL segment file.
I. Basic terminology of WAL
To make WAL easier to understand and discuss, this section briefly introduces the relevant terms.
1. REDO log
The REDO log records every change before that change is written to the data files. Its purpose is to keep the full modification history of the database so that it can be used for crash recovery, incremental backup, PITR (Point-In-Time Recovery), and replication.
2. WAL segment file
For ease of management, PG divides the transaction log into N segments. Each segment is called a WAL segment file, and its size defaults to 16 MB.
3. XLOG Record
This is a logical concept: every change in PG corresponds to an XLOG Record, and these XLOG Records are stored in WAL segment files. PG reads them back for operations such as recovery and PITR.
4. WAL buffer
The WAL buffer is the staging area for WAL. Both the headers of a WAL segment file and the XLOG Records are first written to the WAL buffer and are then written to the WAL segment file by the WAL writer "at the right time."
5. LSN
LSN stands for Log Sequence Number. It identifies the location at which an XLOG Record is written in the transaction log. An LSN is an unsigned 64-bit integer (uint64). Within the transaction log, LSNs are unique and monotonically increasing.
6. checkpointer
The checkpointer is a PG background process that periodically executes checkpoints. When a checkpoint runs, the process writes an XLOG Record containing the checkpoint information to the current WAL segment file; that XLOG Record contains the location of the latest REDO point.
7. checkpoint
A checkpoint is performed by the checkpointer process. The main processing flow is as follows:
(1) Get the REDO point.
(2) Flush dirty pages to disk.
(3) Construct an XLOG Record containing the checkpoint information with this REDO point (see the CheckPoint structure for details) and write it to the WAL segment file.
(4) Update the REDO point and other information in the pg_control file.
8. REDO point
The REDO point is where PG starts recovery at startup. It is the location of the end of the transaction log at the moment the most recent checkpoint started, and it is recorded in that checkpoint's XLOG Record (here, the location can be understood as an offset in the transaction log).
9. pg_control
pg_control is a physical file on disk that holds basic checkpoint information and is used during database recovery. You can view its contents with the pg_controldata command.
II. Composition of WAL files
As mentioned earlier, the transaction log stores the history of all changes and operations in the database system. As the database runs, the transaction log keeps growing, so is there a limit on its size? In PG, the answer is yes.
PG uses an unsigned 64-bit integer (uint64) as the address space of the transaction log. In theory, the maximum transaction log space for PG is 2^64 bytes (that is, 16 EB). How big is that? Assuming a busy database generates 16 TB of log per day, reaching the upper limit would take 1024 * 1024 days, or 1024 * 1024 / 365 ≈ 2800 years. In other words, although the size is limited, it is more than sufficient at this stage.
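The estimate above is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: the 16 TB per day write rate is the assumption stated in the text, and the variable names are illustrative.

```python
# Back-of-the-envelope check of how long the 64-bit WAL address space lasts.
TOTAL_BYTES = 2 ** 64             # 64-bit address space = 16 EB
BYTES_PER_DAY = 16 * 2 ** 40      # assumed write rate: 16 TB of WAL per day

days = TOTAL_BYTES // BYTES_PER_DAY   # 1024 * 1024 days
years = days / 365

print(days)           # 1048576
print(round(years))   # 2873
```

The exact figure is a little over 2870 years, which the text rounds down to roughly 2800.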
Obviously, the OS cannot manage a single 16 EB file efficiently, so PG divides the transaction log into N WAL segment files of 16 MB each (by default). The overall structure is shown in the following figure:
Figure 1. Overall structure of the transaction log
1. WAL segment file
The name of a WAL segment file is 24 characters long and consists of three parts of 8 characters each; every character is a hexadecimal digit (0 to F). The parts are interpreted as follows (assuming a WAL segment file size of 16 MB):
- The first part is the TimeLineID, with a value range of 0x00000000 to 0xFFFFFFFF.
- The second part is the logical file ID, with a value range of 0x00000000 to 0xFFFFFFFF.
- The third part is the physical file ID, with a value range of 0x00000000 to 0x000000FF.
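As a rough illustration of this naming rule, the following Python sketch splits a segment file name into its three parts. The helper `parse_wal_file_name` and the `WalFileName` type are hypothetical names used only for this example; they are not part of PG.

```python
import string
from typing import NamedTuple


class WalFileName(NamedTuple):
    timeline_id: int
    logical_file_id: int
    physical_file_id: int


def parse_wal_file_name(name: str) -> WalFileName:
    """Split a 24-character WAL segment file name into its three 8-character parts."""
    if len(name) != 24 or any(ch not in string.hexdigits for ch in name):
        raise ValueError("expected 24 hexadecimal characters: %r" % name)
    return WalFileName(
        timeline_id=int(name[0:8], 16),
        logical_file_id=int(name[8:16], 16),
        physical_file_id=int(name[16:24], 16),
    )


parts = parse_wal_file_name("000000010000000100000042")
print(parts.timeline_id, parts.logical_file_id, hex(parts.physical_file_id))  # 1 1 0x42
```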
The combination of logical file ID, physical file ID, and file size yields the 64-bit address space:
- The logical file ID is a 32-bit unsigned integer (uint32).
- The physical file ID is an 8-bit unsigned integer (uint8).
- The 16 MB file size corresponds to a 24-bit offset within the file (uint24).
Together, the three form a uint64 (32 + 8 + 24 bits), which gives the maximum file address space of 64 bits.
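The 32 + 8 + 24 split can be sketched with plain bit shifts. This is a minimal illustration that assumes the default 16 MB segment size; `make_position` is a hypothetical helper, not a PG function.

```python
LOGICAL_ID_BITS = 32    # logical file ID
PHYSICAL_ID_BITS = 8    # physical file ID (valid for 16 MB segments)
OFFSET_BITS = 24        # offset inside one 16 MB segment: 2**24 bytes


def make_position(logical_file_id: int, physical_file_id: int, offset: int) -> int:
    """Pack the three parts into one 64-bit position in the transaction log."""
    if not 0 <= logical_file_id < 2 ** LOGICAL_ID_BITS:
        raise ValueError("logical file ID out of range")
    if not 0 <= physical_file_id < 2 ** PHYSICAL_ID_BITS:
        raise ValueError("physical file ID out of range")
    if not 0 <= offset < 2 ** OFFSET_BITS:
        raise ValueError("offset out of range")
    return (
        (logical_file_id << (PHYSICAL_ID_BITS + OFFSET_BITS))
        | (physical_file_id << OFFSET_BITS)
        | offset
    )


# The largest value of each part fills exactly 64 bits.
print(make_position(0xFFFFFFFF, 0xFF, 0xFFFFFF) == 2 ** 64 - 1)  # True
```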
2. LSN revisited
The LSN indicates the location at which an XLOG Record is written to the transaction log. It can be understood as the offset of the XLOG Record within the transaction log.
An LSN consists of three parts: the logical file ID, the physical file ID, and the offset within the file. Take the LSN 1/4288E228 as an example: 1 is the logical file ID, 42 is the physical file ID, and 88E228 is the offset within the WAL segment file (note: 3 bytes address exactly 16 MB).
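The decomposition of the example LSN can be reproduced as follows. This is a sketch that assumes the textual form `high/low` (two hexadecimal numbers separated by a slash) and 16 MB segments; `split_lsn` is a hypothetical helper name.

```python
def split_lsn(text: str):
    """Return (logical file ID, physical file ID, offset in segment) for an LSN like '1/4288E228'."""
    high, low = text.split("/")
    value = (int(high, 16) << 32) | int(low, 16)   # the LSN as a 64-bit integer
    logical_file_id = value >> 32
    physical_file_id = (value >> 24) & 0xFF
    offset = value & 0xFFFFFF
    return logical_file_id, physical_file_id, offset


logical, physical, offset = split_lsn("1/4288E228")
print(logical, format(physical, "X"), format(offset, "X"))  # 1 42 88E228
```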
According to this rule, given an LSN it is easy to work out the corresponding log file (assuming the TimeLineID is 1).
For example, the WAL segment file corresponding to LSN 1/4288E228 is 000000010000000100000042. The first 8 characters of the file name (00000001) are the TimeLineID, the middle 8 characters (00000001) are the logical file ID, and the last 8 characters (00000042) are the physical file ID.
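The same calculation can be written out in a few lines of Python. This is only an illustrative sketch of the rule described above, assuming timeline 1 and 16 MB segments by default; `wal_file_name` is a hypothetical helper, and it does not try to reproduce how the server treats an LSN that falls exactly on a segment boundary.

```python
def wal_file_name(lsn_text: str, timeline_id: int = 1,
                  segment_size: int = 16 * 1024 * 1024) -> str:
    """Build the 24-character segment file name that contains the given LSN."""
    high, low = lsn_text.split("/")
    lsn = (int(high, 16) << 32) | int(low, 16)
    segments_per_logical_file = 0x100000000 // segment_size   # 256 for 16 MB segments
    segment_number = lsn // segment_size
    return "%08X%08X%08X" % (
        timeline_id,
        segment_number // segments_per_logical_file,   # logical file ID
        segment_number % segments_per_logical_file,    # physical file ID
    )


print(wal_file_name("1/4288E228"))  # 000000010000000100000042
```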
In addition, PG provides a function that returns the log file name for a given LSN:
    testdb=# SELECT pg_walfile_name('1/4288E228');
         pg_walfile_name
    --------------------------
     000000010000000100000042
    (1 row)
III. Internal structure of WAL segment file
The default size of WAL segment file is 16MB, and its internal structure is shown in the following figure:
Fig. 2 Internal structure of WAL segment file
1. WAL segment file
A WAL segment file is divided into N pages (blocks) of 8192 bytes (8 KB) each. In the PG source code, the header of the first page of each WAL segment file is described by the XLogLongPageHeaderData structure, and the header of every other page by XLogPageHeaderData. Within a page, the page header is followed by N XLOG Records.
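To make the page layout concrete, the sketch below computes how many pages a default segment holds and which page a given offset falls into. It assumes the default 16 MB segment and 8192-byte page sizes; `locate_in_segment` is a hypothetical helper name.

```python
SEGMENT_SIZE = 16 * 1024 * 1024   # default WAL segment file size
PAGE_SIZE = 8192                  # default WAL page (block) size

PAGES_PER_SEGMENT = SEGMENT_SIZE // PAGE_SIZE


def locate_in_segment(offset: int):
    """Return (page number, offset within that page) for an offset inside a segment."""
    if not 0 <= offset < SEGMENT_SIZE:
        raise ValueError("offset is outside the segment")
    return divmod(offset, PAGE_SIZE)


print(PAGES_PER_SEGMENT)             # 2048
print(locate_in_segment(0x88E228))   # (1095, 552)
```

Page 0 is the one that carries the long page header; every other page starts with the short header.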
2. XLOG Record
An XLOG Record consists of two parts. The first part is the XLOG Record header, which has a fixed size (24 bytes) and corresponds to the XLogRecord structure. The second part is the XLOG Record data.
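The 24-byte figure can be checked by describing the header with Python's `struct` module. This is a sketch under stated assumptions: the field names and order follow the XLogRecord structure in the PG source, the byte order is assumed to be little-endian, and the sample values are made up for the round trip rather than taken from a real WAL file.

```python
import struct

# Assumed layout: uint32 xl_tot_len, uint32 xl_xid, uint64 xl_prev,
# uint8 xl_info, uint8 xl_rmid, 2 padding bytes, uint32 xl_crc.
XLOG_RECORD_HEADER = struct.Struct("<IIQBBxxI")
FIELD_NAMES = ("xl_tot_len", "xl_xid", "xl_prev", "xl_info", "xl_rmid", "xl_crc")


def unpack_record_header(raw: bytes) -> dict:
    """Decode the fixed-size header found at the start of an XLOG Record."""
    values = XLOG_RECORD_HEADER.unpack(raw[:XLOG_RECORD_HEADER.size])
    return dict(zip(FIELD_NAMES, values))


# Round trip with made-up sample values.
sample = XLOG_RECORD_HEADER.pack(58, 567, 0x14288E1F0, 0x00, 10, 0x12345678)
header = unpack_record_header(sample)

print(XLOG_RECORD_HEADER.size)   # 24
print(hex(header["xl_prev"]))    # 0x14288e1f0
```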
The overall layout of an XLOG Record is as follows:
    Fixed-size header (XLogRecord struct)
    XLogRecordBlockHeader struct
    XLogRecordBlockHeader struct
    ...
    XLogRecordDataHeader[Short|Long] struct
    block data
    block data
    ...
    main data
XLOG Records can be divided into three categories according to the data they store:
- Record for backup block: stores a full-page-write block. This type of Record exists to solve the partially written page problem. When a data page is modified for the first time after a checkpoint completes, the whole page is written to the transaction log along with the record of the change (this requires the corresponding configuration parameter to be set; it is on by default).
- Record for tuple data block: stores changes to tuples within a page.
- Record for Checkpoint: when a checkpoint occurs, records the checkpoint information (including the REDO point) in the transaction log.
XLOG Record data is where the actual data is stored. It consists of the following parts:
- 0..N XLogRecordBlockHeader structures, each corresponding to one block data
- XLogRecordDataHeader[Short|Long], such as the data size 0xD98
1. Definition of XLogPageHeaderData structure
    typedef struct XLogPageHeaderData
    {
        uint16      xlp_magic;      /* magic value for correctness checks */
        uint16      xlp_info;       /* flag bits, see below */
        /* TimeLineID is a uint32 */
        TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
        /* XLogRecPtr is a uint64: the offset of this page in the transaction log */
        XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */

        /*
         * When there is not enough space on current page for whole record, we
         * continue on the next page.  xlp_rem_len is the number of bytes
         * remaining from a previous page.
         *
         * Note that xlp_rem_len includes backup-block data; that is, it tracks
         * xl_tot_len not xl_len in the initial header.  Also note that the
         * continuation data isn't necessarily aligned.
         */
        uint32      xlp_rem_len;    /* total len of remaining data for record */
    } XLogPageHeaderData;

    #define SizeOfXLogShortPHD   MAXALIGN(sizeof(XLogPageHeaderData))

    typedef XLogPageHeaderData *XLogPageHeader;
2. Definition of XLogLongPageHeaderData structure
    /*
     * When the XLP_LONG_HEADER flag is set, we store additional fields in the
     * page header.  (This is ordinarily done just in the first page of an
     * XLOG file.)  The additional fields serve to identify the file accurately.
     */
    typedef struct XLogLongPageHeaderData
    {
        XLogPageHeaderData std;         /* standard header fields */
        uint64      xlp_sysid;          /* system identifier from pg_control */
        uint32      xlp_seg_size;       /* just as a cross-check */
        uint32      xlp_xlog_blcksz;    /* just as a cross-check */
    } XLogLongPageHeaderData;

    #define SizeOfXLogLongPHD   MAXALIGN(sizeof(XLogLongPageHeaderData))

    typedef XLogLongPageHeaderData *XLogLongPageHeader;

    /* When record crosses page boundary, set this flag in new page's header */
    #define XLP_FIRST_IS_CONTRECORD     0x0001
    /* This flag indicates a "long" page header */
    #define XLP_LONG_HEADER             0x0002
    /* This flag indicates backup blocks starting in this page are optional */
    #define XLP_BKP_REMOVABLE           0x0004
    /* All defined flag bits in xlp_info (used for validity checking of header) */
    #define XLP_ALL_FLAGS               0x0007

    #define XLogPageHeaderSize(hdr)     \
        (((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)
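The effect of the XLogPageHeaderSize macro can be mimicked in Python to see the two header sizes. This is a sketch that assumes MAXALIGN rounds up to 8 bytes (typical on 64-bit platforms, but platform dependent); `maxalign` and `page_header_size` are hypothetical helper names.

```python
import struct

XLP_LONG_HEADER = 0x0002


def maxalign(size: int, alignment: int = 8) -> int:
    """Round size up to the next multiple of alignment (assumed to be 8 bytes here)."""
    return (size + alignment - 1) & ~(alignment - 1)


# Short header fields: uint16, uint16, uint32, uint64, uint32 = 20 bytes before rounding.
SIZE_OF_SHORT_PHD = maxalign(struct.calcsize("<HHIQI"))
# Long header: the padded short header plus uint64, uint32, uint32.
SIZE_OF_LONG_PHD = maxalign(SIZE_OF_SHORT_PHD + struct.calcsize("<QII"))


def page_header_size(xlp_info: int) -> int:
    """Pick the long or short page header size from the xlp_info flag bits."""
    if xlp_info & XLP_LONG_HEADER:
        return SIZE_OF_LONG_PHD
    return SIZE_OF_SHORT_PHD


print(page_header_size(0x0002))  # 40
print(page_header_size(0x0001))  # 24
```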
3. Definition of XLogRecord structure
    /*
     * The overall layout of an XLOG record is:
     *      Fixed-size header (XLogRecord struct)
     *      XLogRecordBlockHeader struct
     *      XLogRecordBlockHeader struct
     *      ...
     *      XLogRecordDataHeader[Short|Long] struct
     *      block data
     *      block data
     *      ...
     *      main data
     *
     * There can be zero or more XLogRecordBlockHeaders, and 0 or more bytes of
     * rmgr-specific data not associated with a block.  XLogRecord structs
     * always start on MAXALIGN boundaries in the WAL files, but the rest of
     * the fields are not aligned.
     *
     * The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
     * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
     * used to distinguish between block references, and the main data structs.
     */
    typedef struct XLogRecord
    {
        uint32      xl_tot_len;     /* total len of entire record */
        TransactionId xl_xid;       /* xact id */
        XLogRecPtr  xl_prev;        /* ptr to previous record in log */
        uint8       xl_info;        /* flag bits, see below */
        RmgrId      xl_rmid;        /* resource manager for this record */
        /* 2 bytes of padding here, initialize to zero */
        pg_crc32c   xl_crc;         /* CRC for this record */

        /* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
    } XLogRecord;

    /* Size of the XLogRecord header */
    #define SizeOfXLogRecord    (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))

    /*
     * The high 4 bits in xl_info may be used freely by rmgr. The
     * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
     * XLogInsert caller. The rest are set internally by XLogInsert.
     */
    #define XLR_INFO_MASK           0x0F
    #define XLR_RMGR_INFO_MASK      0xF0

    /*
     * If a WAL record modifies any relation files, in ways not covered by the
     * usual block references, this flag is set. This is not used for anything
     * by PostgreSQL itself, but it allows external tools that read WAL and keep
     * track of modified blocks to recognize such special record types.
     */
    #define XLR_SPECIAL_REL_UPDATE  0x01

    /*
     * Enforces consistency checks of replayed WAL at recovery. If enabled,
     * each record will log a full-page write for each block modified by the
     * record and will reuse it afterwards for consistency checks. The caller
     * of XLogInsert can use this value if necessary, but if
     * wal_consistency_checking is enabled for a rmgr this is set unconditionally.
     */
    #define XLR_CHECK_CONSISTENCY   0x02
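How the xl_info byte is split between the resource manager and XLogInsert can be illustrated with the masks defined above. This is a minimal sketch; `split_xl_info` is a hypothetical helper, and the sample value 0x32 is made up.

```python
XLR_INFO_MASK = 0x0F           # low 4 bits: used by XLogInsert and its callers
XLR_RMGR_INFO_MASK = 0xF0      # high 4 bits: free for the resource manager
XLR_SPECIAL_REL_UPDATE = 0x01
XLR_CHECK_CONSISTENCY = 0x02


def split_xl_info(xl_info: int) -> dict:
    """Separate the rmgr-specific bits of xl_info from the generic flag bits."""
    return {
        "rmgr_bits": xl_info & XLR_RMGR_INFO_MASK,
        "special_rel_update": bool(xl_info & XLR_SPECIAL_REL_UPDATE),
        "check_consistency": bool(xl_info & XLR_CHECK_CONSISTENCY),
    }


flags = split_xl_info(0x32)
print(hex(flags["rmgr_bits"]), flags["special_rel_update"], flags["check_consistency"])
# 0x30 False True
```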
4. XLogRecordBlockHeader structure definition
    /*
     * Header info for block data appended to an XLOG record.
     *
     * 'data_length' is the length of the rmgr-specific payload data associated
     * with this block. It does not include the possible full page image, nor
     * XLogRecordBlockHeader struct itself.
     *
     * Note that we don't attempt to align the XLogRecordBlockHeader struct!
     * So, the struct must be copied to aligned local storage before use.
     */
    typedef struct XLogRecordBlockHeader
    {
        uint8       id;             /* block reference ID */
        uint8       fork_flags;     /* fork within the relation, and flags */
        uint16      data_length;    /* number of payload bytes (not including page
                                     * image) */

        /* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
        /* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
        /* BlockNumber follows */
    } XLogRecordBlockHeader;

    #define SizeOfXLogRecordBlockHeader (offsetof(XLogRecordBlockHeader, data_length) + sizeof(uint16))
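Because the block header is not aligned inside a record, a reader has to decode it from an arbitrary byte position. The sketch below shows one way to do that in Python; it assumes little-endian byte order, `read_block_header` is a hypothetical helper, and the sample bytes are made up.

```python
import struct

# uint8 id, uint8 fork_flags, uint16 data_length: 4 bytes, no padding.
BLOCK_HEADER = struct.Struct("<BBH")


def read_block_header(buf: bytes, pos: int = 0):
    """Decode an XLogRecordBlockHeader that starts at any (possibly unaligned) position."""
    raw = bytes(buf[pos:pos + BLOCK_HEADER.size])   # copy the bytes out before decoding
    block_id, fork_flags, data_length = BLOCK_HEADER.unpack(raw)
    return block_id, fork_flags, data_length


# Made-up sample: three unrelated bytes followed by a header at an odd offset.
sample = b"\xaa\xbb\xcc" + bytes([0x00, 0x20, 0x98, 0x0D])
print(BLOCK_HEADER.size)              # 4
print(read_block_header(sample, 3))   # (0, 32, 3480)
```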
5. XLogRecordDataHeader[Short|Long] structure definition
    /*
     * XLogRecordDataHeaderShort/Long are used for the "main data" portion of
     * the record. If the length of the data is less than 256 bytes, the short
     * form is used, with a single byte to hold the length. Otherwise the long
     * form is used.
     *
     * (These structs are currently not used in the code, they are here just for
     * documentation purposes).
     */
    typedef struct XLogRecordDataHeaderShort
    {
        uint8       id;             /* XLR_BLOCK_ID_DATA_SHORT */
        uint8       data_length;    /* number of payload bytes */
    } XLogRecordDataHeaderShort;

    #define SizeOfXLogRecordDataHeaderShort (sizeof(uint8) * 2)

    typedef struct XLogRecordDataHeaderLong
    {
        uint8       id;             /* XLR_BLOCK_ID_DATA_LONG */
        /* followed by uint32 data_length, unaligned */
    } XLogRecordDataHeaderLong;

    #define SizeOfXLogRecordDataHeaderLong (sizeof(uint8) + sizeof(uint32))

    /*
     * Block IDs used to distinguish different kinds of record fragments. Block
     * references are numbered from 0 to XLR_MAX_BLOCK_ID. A rmgr is free to use
     * any ID number in that range (although you should stick to small numbers,
     * because the WAL machinery is optimized for that case). A couple of ID
     * numbers are reserved to denote the "main" data portion of the record.
     *
     * The maximum is currently set at 32, quite arbitrarily. Most records only
     * need a handful of block references, but there are a few exceptions that
     * need more.
     */
    #define XLR_MAX_BLOCK_ID            32

    #define XLR_BLOCK_ID_DATA_SHORT     255
    #define XLR_BLOCK_ID_DATA_LONG      254
    #define XLR_BLOCK_ID_ORIGIN         253

    #endif                          /* XLOGRECORD_H */
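The choice between the short and long form can be sketched as follows. This is an illustration of the rule stated in the comment above, not PG's own code; `encode_main_data_header` is a hypothetical helper and little-endian byte order is assumed.

```python
import struct

XLR_BLOCK_ID_DATA_SHORT = 255
XLR_BLOCK_ID_DATA_LONG = 254


def encode_main_data_header(data_length: int) -> bytes:
    """Build the header that precedes the main data of a record."""
    if not 0 <= data_length < 2 ** 32:
        raise ValueError("data_length must fit in an unsigned 32-bit integer")
    if data_length < 256:
        # Short form: id byte plus a 1-byte length (2 bytes in total).
        return struct.pack("<BB", XLR_BLOCK_ID_DATA_SHORT, data_length)
    # Long form: id byte plus an unaligned 4-byte length (5 bytes in total).
    return struct.pack("<BI", XLR_BLOCK_ID_DATA_LONG, data_length)


print(len(encode_main_data_header(100)))     # 2
print(len(encode_main_data_header(0xD98)))   # 5
```

The first byte of the result is the 'id' that lets a reader tell the main data header apart from a block reference, whose ID is at most XLR_MAX_BLOCK_ID.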
6. Definition of xl_heap_header structure
    /*
     * We don't store the whole fixed part (HeapTupleHeaderData) of an inserted
     * or updated tuple in WAL; we can save a few bytes by reconstructing the
     * fields that are available elsewhere in the WAL record, or perhaps just
     * plain needn't be reconstructed.  These are the fields we must store.
     * NOTE: t_hoff could be recomputed, but we may as well store it because
     * it will come for free due to alignment considerations.
     */
    typedef struct xl_heap_header
    {
        uint16      t_infomask2;
        uint16      t_infomask;
        uint8       t_hoff;
    } xl_heap_header;

    /* Size of the heap header */
    #define SizeOfHeapHeader    (offsetof(xl_heap_header, t_hoff) + sizeof(uint8))
7. xl_heap_insert structure definition
    /*
     * xl_heap_insert/xl_heap_multi_insert flag values, 8 bits are available.
     */
    /* PD_ALL_VISIBLE was cleared */
    #define XLH_INSERT_ALL_VISIBLE_CLEARED (1