
PostgreSQL's clog: starting from transaction rollback speed


Original: https://www.enmotech.com/web/detail/1/701/1.html

If you come from other databases, one PostgreSQL behavior will strike you as very strange: a rollback in PostgreSQL completes immediately, no matter how large the transaction is.

What's strange about that? Let me give an example. I once handled a MySQL incident: a developer imported data into the production database with a Python script, but he failed to notice one thing: Python's MySQLdb does not enable autocommit by default, so everything he imported (not an ordinary load data, but SQL statements carrying business logic, which is why a script was needed in the first place) piled up in a single transaction. The script ran all day, and the state of the whole database became dreadful: the table he was importing into was a core business table that a pile of business operations needed to touch, and after the import had run for a day and accumulated a huge number of row locks (millions of them), the whole system's external services could no longer obtain locks (compounded by the pitfalls of MySQL's gap locks) and business effectively ground to a halt. So, as the DBAs, the final decision was to kill this "big" transaction. After the kill was issued, the two of us DBAs could only count slowly to pass the time (little ant crawls and crawls, meets one big bean sprout, meets two big bean sprouts...).

In the end, after nearly three hours of rollback, the transaction was fully undone and the business system was restored.

So when I saw this description of PostgreSQL, my immediate reaction was, why? How? What?

Hence this article: starting from PG's transaction visibility judgment, I will sort out the mechanism and role of the clog, one of PG's core files.

Note: since pg 10, clog has been renamed xact, mainly because many people habitually delete log files with patterns like *log and kept accidentally wiping out the xlog and clog files, leaving the database unusable; hence the renames to wal and xact. The rest of this article still uses clog as the term of discussion, which you need to keep in mind.

Introduction to clog

The first question is, what is clog? Or to put it another way, what logs does PG have and what are they for?

Besides the plain-text records (the server's running error log and so on), PG has two main kinds of binary log files. One is the redo log of traditional database theory: in principle, every data modification operation is recorded in this log, and all of a transaction's records are guaranteed to be on disk at commit time, so that even if the server goes down, the database can be brought back without losing data.

However, every database implements this part differently. MySQL, for example, additionally has a binlog for master-slave replication across storage engines, while in PG master-slave replication is driven directly by the redo log (XLOG in PG terminology). And to deal with the series of problems that come from having no undo, the job of visibility judgment is handed over to the clog files.

The clog records, for each transaction xid (remember the complaints about the freeze problem caused by the growth of this counter), the commit status of that transaction. There are four statuses: in progress, committed, aborted, and sub-committed. Reading this far, you can already see that simply setting the status to committed at commit time and to aborted at rollback time achieves the goal: it indeed avoids the huge cost of suddenly having to roll back a transaction that touched millions of rows.
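As an aside, these statuses can also be observed from SQL: txid_status(), available since PG 10, reports whether a given transaction is in progress, committed, or aborted. A minimal sketch (the xid value 12345 is hypothetical, just for illustration):

BEGIN;
SELECT txid_current();               -- suppose this returns 12345 (hypothetical value)
SELECT txid_status(txid_current());  -- 'in progress' while the transaction is still open
COMMIT;
-- later, from any session:
SELECT txid_status(12345);           -- 'committed'; an aborted xid would show 'aborted'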

But reading this, a doubt comes up: when I query data and see the xid on a row, I need to determine its visibility right away, which means checking the clog. Such lookups are bound to be extremely frequent and extremely random. How is that solved?

#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)

The answer lies in how small the clog is: each transaction's status takes only 2 bits, so one byte covers 4 transactions and, with the default 8 KB block size, a single clog page covers 32,768 transactions; even a million transactions occupy only about 256 KB. On top of that, clog pages are cached in shared memory (the SLRU buffers), so the vast majority of these lookups never touch disk.

The call chain for marking a transaction committed is roughly: RecordTransactionCommit (record that the transaction has committed) -> TransactionIdCommitTree (synchronously mark the transaction as committed) / TransactionIdAsyncCommitTree (asynchronously mark it as committed; an lsn is required for the next step) -> TransactionIdSetTreeStatus (set the status of the transaction and its sub-transactions) -> TransactionIdSetPageStatus (set transaction statuses within a single page) -> TransactionIdSetPageStatusInternal (set the actual file page) -> TransactionIdSetStatusBit (set the bits)

Of these, the one mainly worth discussing is the TransactionIdSetTreeStatus method.

This brings in a concept: sub-transactions. In PG, a sub-transaction essentially means that between the start and the end of a transaction you can declare a savepoint, and later roll back to that savepoint rather than to the start of the transaction; this is used a lot in practice. Because of that, the statuses of the parent transaction and its child transactions (say a transaction that ultimately commits but had a rollback in the middle, or one with several savepoints along the way) must be written as atomically as possible, otherwise transaction visibility becomes problematic.
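A minimal sketch of how savepoints create sub-transactions (the table t is hypothetical and assumed to already exist):

BEGIN;
INSERT INTO t VALUES (1);
SAVEPOINT sp1;                  -- opens a sub-transaction
INSERT INTO t VALUES (2);
ROLLBACK TO SAVEPOINT sp1;      -- aborts only the sub-transaction, not the whole transaction
INSERT INTO t VALUES (3);
COMMIT;                         -- rows 1 and 3 become visible; row 2 never does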

The code comments give a more intuitive example of how these writes are ordered:

For example, transaction t has sub-transactions t1, t2, t3 and t4, where t and t1 map to clog page p1, t2 and t3 are on page p2, and t4 is on page p3. The writes then happen in this order:

1. On p2, set t2 and t3 as sub-committed; then on p3, set t4 as sub-committed.

2. On p1, set t1 as sub-committed, then set t as committed, then set t1 as committed.

3. On p2, set t2 and t3 as committed; then on p3, set t4 as committed.

For a rollback, the TransactionIdSetTreeStatus method is likewise what gets called, except that the upper-level function is TransactionIdAbortTree and the flag written is TRANSACTION_STATUS_ABORTED, i.e. the transaction is recorded as aborted. Semantically this is fine: thanks to transaction atomicity, the data of an aborted transaction is simply invisible.

Data row visibility judgment and the clog

As we all know, pg never updates a row in place: it marks the old row as deleted and writes the new version next to it. Naturally, each row version carries transaction flag bits (and, of course, the transaction ids for that row version) so that visibility can be enforced and rows written by rolled-back transactions are never seen.
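Those per-row transaction ids can be seen from SQL through the hidden system columns xmin and xmax; a quick sketch (t is a hypothetical table):

SELECT xmin, xmax, ctid, * FROM t;
-- xmin: xid of the transaction that created this row version
-- xmax: xid of the transaction that deleted or updated it (0 if none has)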

First of all, at the moment a row is written, while its transaction has not yet finished, the status is necessarily "in progress". And when the transaction later commits or rolls back, pg does not go back and rewrite the flags on the rows it touched, because that would make commit and rollback exactly as expensive as we were trying to avoid.

As mentioned earlier, pg judges visibility by taking the row's transaction id, locating the corresponding bits in the clog, and checking them. A very natural optimization is that this judgment only needs to be done once per row: once the outcome is known, the row itself is marked (via its hint bits) as committed or aborted, and subsequent reads no longer need to consult the clog at all.

Of course that's what PG does.
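If the pageinspect extension is available, those flag bits can even be inspected directly; a sketch assuming a hypothetical table t, using the HEAP_XMIN_COMMITTED mask (0x0100) defined in htup_details.h:

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lp, t_xmin, t_xmax,
       (t_infomask & 256) <> 0 AS xmin_committed_hint   -- 256 = 0x0100 = HEAP_XMIN_COMMITTED
FROM heap_page_items(get_raw_page('t', 0));
-- the hint bit only appears after some reader has already consulted the clog for that row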

In other words, the transaction that modified the data does not update these marks on all its rows at commit or rollback time; it leaves that mess to whoever comes along later.

This raises another question: now that the row's flag has been changed, the checksum of the data block it sits in has changed as well, and once the checksum has changed, does that block have to be pushed into the WAL buffers and go through the usual process, even though no real data was modified? And considering that a query on a standby can just as well do the clog-based judgment on the standby itself, does this data block really have to be shipped to the standby?

So here is a ready-made interview question: can a plain SELECT in PostgreSQL generate WAL?

In fact, the checksum problem caused by these transaction flag bits is handled in a rather special way in PG.

In PostgreSQL, WAL is written for setting these flag bits if and only if wal_log_hints is enabled, or checksums were turned on at initdb time.

And even then, the flag bits are not written to WAL every time they change.

The entire data block is written to WAL only when the hint-bit update is the first modification of that block since the previous checkpoint.
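A rough way to check this on a running instance (wal_log_hints and data_checksums are real parameters; the table t is hypothetical and the outcome depends on your setup):

-- are hint-bit changes WAL-logged at all on this cluster?
SHOW wal_log_hints;
SHOW data_checksums;

-- compare WAL positions around a plain SELECT on freshly committed, never-read rows
SELECT pg_current_wal_lsn() AS before_lsn;
SELECT count(*) FROM t;
SELECT pg_current_wal_lsn() AS after_lsn;
-- a positive pg_wal_lsn_diff(after_lsn, before_lsn) means the SELECT did generate WAL,
-- typically a full-page image triggered by setting hint bits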

Some further thoughts derived from the clog

When it comes to cleaning up expired data, MySQL also effectively does delete+insert rather than update in place, yet its cleanup does not cost anywhere near as much as vacuum: the running of MySQL's purge thread, for example, rarely needs special attention. PostgreSQL's vacuum, by contrast, although it can run in parallel across tables, is serial within a single table, and the patches contributed for intra-table parallel vacuum have been held up by assorted bugs and still have not been merged (it is not going to happen in PG 12 as things stand). Why is that?

Because, after all, the clog is only a record of transaction visibility; it does not associate a transaction with what it modified. In a traditional undo-style implementation, the modified data and its associated transactions are stored sequentially in the undo, so when purge runs, the data blocks that need processing can be found directly from the undo.

But PG has only the transaction marks, so vacuum must scan through the data blocks of the data files to do this work. Even though pg folds vacuum and statistics collection into a single pass (statistics collection being the biggest whole-database scan in a traditional database), the cost of sweeping the entire database is not any smaller.
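The bookkeeping side of this is easy to see from SQL (these are standard statistics views; the table name t is hypothetical):

-- how many dead tuples are waiting to be scanned away, and when vacuum last ran
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 't';

-- manual single-table cleanup, reporting each pass it makes over the table
VACUUM (VERBOSE) t;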

Therefore vacuum is very slow on very large tables, and in extreme cases the whole database may well become unavailable during a vacuum freeze (no new transactions can be accepted until the freeze finishes).

To cope with very large tables, the traditional recommendation is partitioned tables, but PostgreSQL's built-in partitioning has long been immature and under-supported, so the external component pg_pathman has to be brought in to help, which pushes operation and maintenance complexity up another notch.
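For reference, a minimal sketch of the declarative partitioning that has been built in since PG 10 (table and column names here are hypothetical):

CREATE TABLE orders (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2019 PARTITION OF orders
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

-- each partition is a separate table, so vacuum and freeze operate on much smaller units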

That said, as of PostgreSQL 12 the interfaces at the storage engine level are gradually opening up, and although the undo-based storage engine being built in the community did not make it into this release because it is not yet finished, we can expect problems like vacuum to be handled much better in the future.

