Atomic Commit in SQLite
1.0 Introduction
Atomic commit is an important property of SQLite, a database that supports transactions. Atomic commit means that either all of the changes made to the database within a transaction take effect, or none of them do. It is as if writes to many different parts of the database file all happened at a single instant in time.
In reality, hardware serializes writes to mass storage, and writing a single sector takes a small but finite amount of time, so it is impossible to truly write different parts of a database file simultaneously or instantaneously. SQLite's atomic commit logic makes the changes in a transaction appear as if they all happened at once.
Transaction atomicity is an important property of SQLite, and it holds even if the transaction is interrupted by an operating system crash or a power failure.
This article describes the techniques SQLite uses to implement atomic commit.
2.0 Hardware Assumptions
Throughout this article, we will call the mass storage device "hard disk", even though it might really be flash memory.
We assume that the hard disk is written in units of sectors, and that it is impossible to modify a region of the disk smaller than a sector. To change a part of the disk smaller than a sector, you must read the entire sector containing it, modify it in memory, and then write the whole sector back to the disk.
On a traditional spinning disk, the sector is the minimum transfer unit in both directions, for reads as well as writes. On flash memory, however, the minimum read size is usually much smaller than the minimum write size. SQLite only cares about the minimum write size, so in this article "sector" means the minimum amount of data that can be written to storage in a single operation.
In SQLite versions prior to 3.3.14, a sector was assumed to be 512 bytes in all cases. This value could be changed at compile time, but was never tested with larger values. The 512-byte assumption made sense when disk drives used 512-byte sectors internally, but disks have since grown to 4KiB sectors, and the sector size of flash memory is usually larger than 512 bytes as well. So, starting with version 3.3.14, SQLite has a method in its OS interface for querying the true sector size of the underlying file system. In the current implementation (3.5.0), this method still simply returns 512, because on both win32 and unix there is no standard way to discover the true sector size. But the method is there for those who need to tune SQLite for their own applications.
SQLite does not assume that sector writes are atomic. It does, however, assume that sector writes are linear. "Linear" means that when the hardware starts writing a sector, it begins at one end and writes byte by byte until it reaches the other end; the write may run from beginning to end or from end to beginning. If power fails in the middle of a sector write, part of the sector may be modified while the rest is untouched. The key assumption SQLite makes is this: if any part of a sector changes, then either the first or the last byte changes; the hardware never starts writing in the middle of a sector. We do not know whether this assumption is always true, but it seems reasonable.
The previous paragraph says that SQLite does not assume sector writes are atomic. That is true by default. But SQLite 3.5.0 introduced a new VFS (Virtual File System) interface, through which SQLite interacts with the actual file system. SQLite ships with default VFS implementations for windows and unix, and users can supply a custom VFS implementation at run time. The VFS interface includes a method called xDeviceCharacteristics, which reports properties of the underlying file system. One of those properties is whether sector writes are atomic, and if a VFS reports that they are, SQLite will take advantage of it. But the default xDeviceCharacteristics implementations on windows and unix do not report atomic sector writes, so these optimizations are normally disabled.
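As a concrete illustration, here is a minimal sketch of how a custom VFS might advertise device characteristics through this method. The SQLITE_IOCAP_* flags are real constants from sqlite3.h; the function itself and the device properties it claims are hypothetical, shown only to make the mechanism visible.

#include <sqlite3.h>

/* Hypothetical xDeviceCharacteristics implementation for a device whose
** 512-byte sector writes are atomic and whose filesystem writes data
** before growing the file size. A real VFS would install this function
** in the xDeviceCharacteristics slot of its sqlite3_io_methods struct. */
static int myDeviceCharacteristics(sqlite3_file *pFile){
  (void)pFile;                      /* a real VFS might inspect the file */
  return SQLITE_IOCAP_ATOMIC512     /* 512-byte sector writes are atomic */
       | SQLITE_IOCAP_SAFE_APPEND;  /* data is written before size grows */
}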
SQLite assumes that the operating system buffers writes, so a write request may return before the data has actually reached the storage medium. SQLite further assumes that such buffered writes will eventually be committed by the operating system. It therefore issues "flush" or "fsync" calls at key points, and assumes that flush or fsync does not return until all pending writes for the file have actually reached the disk. Unfortunately, flush and fsync are known to be broken on some versions of windows and unix; this can corrupt a database file if power fails while SQLite is in the middle of a commit. There is nothing SQLite can do to test for or remedy the situation: SQLite has to assume that the operating system works as advertised. If that assumption fails, all you can do is hope your power never goes out at the wrong moment.
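The "write, then flush at key points" pattern described above looks roughly like the following on a POSIX system. This is a generic sketch of the durability idiom, not SQLite's actual I/O code; on win32 the flush step would be FlushFileBuffers instead of fsync.

#include <unistd.h>

/* Write a buffer, then block until the OS claims the data has reached
** nonvolatile storage. Returns 0 on success, -1 on failure. */
static int write_durably(int fd, const void *buf, size_t n){
  if (write(fd, buf, n) != (ssize_t)n) return -1;  /* may only reach OS cache */
  return fsync(fd);  /* force the cached data out to the disk itself */
}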
SQLite assumes that when a file grows, the newly allocated space initially contains garbage, and only later is filled with the data actually written. In other words, SQLite assumes the file size increases before the file content is written. This is a pessimistic assumption, and SQLite has to do extra work to guard against database corruption if power is lost between the moment the file size grows and the moment the content is written. The xDeviceCharacteristics method of the VFS can indicate that the file system always writes data before enlarging the file (this is the SQLITE_IOCAP_SAFE_APPEND property, for those who want to look at the code).
When xDeviceCharacteristics indicates that content is written before the file size changes, SQLite can omit some of its defensive database protection and error handling, which substantially reduces the disk I/O needed for a commit. In the current version, however, the default VFS implementations for windows and unix make no such assumption.
SQLite assumes that file deletion is atomic from the point of view of a user process. That is, if power fails while a file is being deleted, then once power is restored exactly one of two things is true: the file exists in full, exactly as before, or the file is gone. If after power is restored the file is only partially deleted, or some of its content has been altered or zeroed, or the file has merely been truncated to empty, database corruption is the likely result.
SQLite assumes that the operating system and hardware detect and correct bit errors caused by cosmic rays, thermal noise, quantum fluctuations, device driver bugs, and the like. SQLite adds no redundancy of its own to guard against such problems: it assumes that the data it reads back is exactly the data it wrote.
3.0 Single File Commit
Let us begin by looking at the steps SQLite takes to guarantee an atomic commit of a transaction against a single database file. The file format details that guard against damage from power failures, and the techniques for atomic commit across multiple database files, are discussed in later sections.
3.1 Initial State
The state of the computer when a database connection is first opened can be pictured as in the figure in the original article. The right side of the figure (labeled "Disk") represents content on the storage device; each box is a sector, and blue blocks indicate sectors holding original data. The middle area of the figure is the operating system's disk cache, which at this point in our example is unused, so its boxes are blank. The left side of the figure is the memory of the SQLite user process; since the database connection has only just been opened, no records have been read yet, and this memory is empty too.
3.4 Obtaining a Reserved Lock
Before modifying a database, SQLite must first obtain a "reserved" lock on the database file. A reserved lock is similar to a shared lock in that it allows other database connections to go on reading. A single reserved lock can coexist with any number of shared locks held by other processes, but only one reserved lock can exist on a database file at a time, so only one process can be attempting to write at any given moment.
The presence of a reserved lock announces that a process intends to modify the database file in the near future, but has not yet started. Because the modification has not started, other processes can continue to read the database, but they should not begin trying to modify it themselves.
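From SQL, a reserved lock can be requested up front with BEGIN IMMEDIATE, which starts a write transaction and takes the reserved lock immediately. A minimal sketch, assuming a database file named test.db:

#include <sqlite3.h>
#include <stdio.h>

int main(void){
  sqlite3 *db;
  if (sqlite3_open("test.db", &db) != SQLITE_OK) return 1;
  /* BEGIN IMMEDIATE acquires the reserved lock right away; readers
  ** holding shared locks can keep reading, but no second writer can
  ** obtain a reserved lock until this transaction ends. */
  if (sqlite3_exec(db, "BEGIN IMMEDIATE", 0, 0, 0) != SQLITE_OK){
    fprintf(stderr, "busy: another connection holds the reserved lock\n");
  }else{
    sqlite3_exec(db, "COMMIT", 0, 0, 0);
  }
  sqlite3_close(db);
  return 0;
}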
3.5 Creating a Rollback Log File
Before making any change to the database file, SQLite creates a separate rollback log file and writes into it the original content of the pages about to be modified. The idea is that the rollback log contains everything needed to restore the database file to its original state.
The rollback log has a small header (shown in green in the original figure) that records the original size of the database file. So even if the database file grows, we still know its original size. The page numbers of the modified pages are written into the rollback log together with their original content.
When a new file is first created, most desktop operating systems (windows, linux, macOS X) do not actually write anything to the disk right away. The new file exists only in the operating system's disk cache, and is written to the storage device some time later, when the operating system gets around to it. So to the user the file appears to be created well before the corresponding disk I/O actually happens. We illustrate this in the figure: the newly created rollback log file sits only in the operating system disk cache and has not yet reached the hard disk.
3.8 Obtaining an Exclusive Lock
Before changes can be made to the database file itself, we must obtain an exclusive lock on it. Obtaining an exclusive lock is a two-step process: SQLite first obtains a "pending" lock, and then escalates the pending lock to an exclusive lock.
A pending lock allows all processes that already hold shared locks to continue reading the database file, but it prevents new shared locks from being taken. The purpose of the pending lock is to prevent a writer from being starved by a steady stream of readers. There may be dozens, hundreds, or even thousands of readers; each takes a shared lock before it starts to read, reads what it needs, then releases its shared lock. With enough readers, it could happen that a new reader always takes a shared lock before the previous one releases its own; then there would never be a moment with no shared locks on the file, and the writer would never get a chance to acquire an exclusive lock. The pending lock lets existing readers finish while blocking new readers; once every reader is done, the pending lock can be escalated to an exclusive lock.
3.10 Flushing Changes to Storage
Another flush is needed to make sure the changes to the database file have actually reached nonvolatile storage. This, too, is an essential step for ensuring the data survives a power failure. But because writing to disk is inherently slow, this step, like the flush of the log file in section 3.7 above, accounts for most of the time taken by a SQLite transaction commit.
3.11 Deleting the Rollback Log File
Once the changes are safely on the hard disk, the rollback log file is no longer needed, so it is deleted immediately. If power fails or the system crashes before the deletion, the recovery process (described later) writes the original content from the log file back over the database file, making it appear as if the database had never been changed. If the crash or power failure happens after the deletion, it appears as if all the changes were written to disk. So whether SQLite considers the changes committed depends entirely on whether the rollback log file exists.
Deleting a file is not truly an atomic operation, but it looks atomic from the point of view of a user process: a process can always ask the operating system whether a file exists, and the only possible answers are "yes" and "no". If the system crashes or loses power in the middle of a transaction commit, then afterwards SQLite asks the operating system whether the rollback log file exists. If it does, the transaction is incomplete and the database file must be rolled back; if it does not, the commit succeeded.
The existence of a transaction thus hinges on whether the rollback log file exists, and file deletion is atomic from the perspective of a user process; therefore the whole transaction also appears to be an atomic operation.
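The "ask the operating system whether the file exists" step can be pictured with the small POSIX sketch below. This is an illustration, not SQLite's actual hot-journal detection, which also examines the log header and file locks before trusting the log; the "-journal" suffix, however, is the real naming convention for rollback logs.

#include <stdio.h>
#include <sys/stat.h>

/* Does a rollback log exist next to the given database file?
** Returns 1 if it exists (commit may be incomplete), 0 otherwise. */
static int journal_exists(const char *dbPath){
  char jrnl[512];
  struct stat st;
  snprintf(jrnl, sizeof jrnl, "%s-journal", dbPath);
  return stat(jrnl, &st) == 0;
}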
4.4 Rolling Back Incomplete Changes
Once a process holds an exclusive lock, it is safe to write to the database file. It reads the original page content out of the log file and writes it back to where it came from in the database file. Recall that the original size of the database file was recorded in the log file header at the start of the (now aborted) transaction; SQLite uses that value to truncate the database file back to its original size, in case the incomplete transaction made the file grow. At the end of this step, the database file is the same size and holds the same content as it did before the interrupted transaction began.
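A much-simplified sketch of the playback step on POSIX follows: each saved page image is written back at its original offset, and the file is then truncated to the size recorded in the log header. Parsing the log itself is omitted; the function and parameter names here are illustrative only.

#include <unistd.h>

/* Write one saved page image back at its original position.
** Pages are numbered starting from 1, as in SQLite's file format. */
static int rollback_page(int dbFd, const unsigned char *img,
                         int pageSize, long pgno){
  off_t off = (off_t)(pgno - 1) * pageSize;
  return pwrite(dbFd, img, pageSize, off) == pageSize ? 0 : -1;
}

/* Undo any growth caused by the aborted transaction by restoring the
** original file size recorded in the rollback log header. */
static int restore_size(int dbFd, off_t origSize){
  return ftruncate(dbFd, origSize);
}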
6.0 Implementation Details of Atomic Commit
Section 3.0 gave an overview of how atomic commit works in SQLite, but it glossed over a number of important details. The following subsections fill in some of the gaps.
6.1 Always Journal Complete Sectors
When the original content of a database page is written into the log file (section 3.5), SQLite always writes a complete sector of data, even if the page size of the database file is smaller than the sector size. Historically, the sector size in SQLite was hard-coded to 512 bytes, and since the minimum page size is also 512 bytes, this was never an issue. Since version 3.3.14, though, SQLite can work with mass storage devices whose sector size is larger than 512 bytes. So, from 3.3.14 on, whenever any page within a sector is written into the rollback log, all pages in that same sector are written with it.
Journaling every page of a sector matters because it protects the database from corruption if a power failure interrupts a sector write. Suppose pages 1, 2, 3, and 4 all share sector 1, and page 2 is modified. To write the change to page 2, the underlying hardware must also rewrite the content of pages 1, 3, and 4, because the hardware writes whole sectors. If this write is interrupted by a power failure, any of pages 1, 3, and 4 might be left incomplete or incorrect. That is why all pages in the sector must be journaled: so that every one of them can be restored on recovery.
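The arithmetic for finding the pages that share a sector is simple when the sector size is a multiple of the page size. A small sketch, using the example from the paragraph above (512-byte pages, a 2048-byte sector, page 2 modified):

#include <stdio.h>

int main(void){
  int page = 2, pageSize = 512, sectorSize = 2048;
  int perSector = sectorSize / pageSize;            /* 4 pages per sector */
  int first = ((page - 1) / perSector) * perSector + 1;
  /* All of pages first..first+perSector-1 must go into the log. */
  printf("journal pages %d..%d\n", first, first + perSector - 1);
  return 0;  /* prints: journal pages 1..4 */
}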
6.2 Dealing with Garbage Written into Log Files
When appending data to the log file, SQLite pessimistically assumes that the file grows first, that the newly grown region initially holds invalid garbage, and that only afterwards is the garbage replaced by the correct data. In other words, SQLite assumes the file size is increased before the content is written. If power fails after the file has grown but before the content is written, the log file is left holding garbage. On the next power-up, another SQLite process would see the garbage-filled log file, roll the garbage back into the database file, and the database would be corrupted.
SQLite takes two precautions against this. First, SQLite records in the log file header the number of pages the log contains, and this count starts out as zero. So if a rollback is attempted against an incomplete (or garbage-filled) log file, the recovering process sees a log containing zero pages and makes no change to the database. Just before a commit, the log file is flushed to the hard disk to make sure all of its content is on disk with no lingering garbage; only then is the page count in the header changed from zero to its true value. The header is always kept in a sector of its own, separate from all page data, so that it can be overwritten and flushed without risk to any data page even if power fails mid-write. Note that this means the log file is flushed twice: once after the page data is written, and again after the page count is written into the header.
The preceding paragraphs describe what happens when the synchronous pragma is set to "full":
PRAGMA synchronous=FULL
The default synchronous setting is FULL, so the description above is what normally happens. If synchronous is set to NORMAL, however, SQLite flushes the log file only once, after the page count has been written, and accepts a risk of corruption in exchange: the (now nonzero) page count might reach the disk before all of the page data does. SQLite issues the data writes first, but it assumes the underlying file system may reorder write requests, so the page count can land on disk first even though its write request was issued last. As a second line of defense, then, SQLite stores a 32-bit checksum with every page in the log file, and verifies these checksums when rolling pages back (section 4.4). If a bad checksum is found, the rollback is abandoned. Note that the checksum does not guarantee the page data is correct, since there is a small chance that a garbage page will checksum correctly; but the checksums make it very unlikely that corruption slips through.
The checksums are not strictly needed when synchronous is FULL; we rely on them only in NORMAL mode. Still, the checksums do no harm, so they are included in the log file no matter what the synchronous setting is.
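In spirit, the per-page check works like the sketch below. To be clear, this is a simplified stand-in: SQLite's real journal checksum algorithm differs (for one thing it seeds from a random nonce stored in the log header, mirrored by the nonce parameter here, and it does not simply sum every byte).

#include <stddef.h>
#include <stdint.h>

/* Simplified page checksum: seed with a per-log nonce, then fold in
** the page bytes. On rollback, a mismatch means the page is garbage. */
static uint32_t page_cksum(const unsigned char *page, size_t n,
                           uint32_t nonce){
  uint32_t sum = nonce;
  for(size_t i = 0; i < n; i++) sum += page[i];
  return sum;
}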
6.3 Cache Spill Prior to Commit
The commit process described in section 3.0 assumes that all the changes of a transaction fit in the in-memory page cache until commit time. That is the common case. But sometimes a large change overflows the user-space cache before the transaction commits. In that case, part of the cache must be spilled to the database file before the transaction completes.
At the start of a cache spill, the state of the database connection is as in section 3.6: the original page content has been saved in the rollback log, and the modified pages are held in user memory. To spill the cache, SQLite performs steps 3.7 through 3.9: the rollback log is flushed to disk, an exclusive lock is acquired, and the modified pages are written to the database file. But the remaining steps are deferred until the transaction actually commits. A new log header is appended to the end of the rollback log (in its own sector, as always), the exclusive lock is retained, and processing otherwise returns to the state of section 3.6. When the transaction commits, or another cache spill occurs, steps 3.7 and 3.9 are repeated (step 3.8 is skipped on the second and later passes because the exclusive lock is already held).
A cache spill escalates the reserved lock on the database to an exclusive lock, which reduces concurrency. It also costs extra disk flushes (fsyncs), which are slow, so a cache spill can seriously hurt performance. Cache spills should therefore be avoided whenever possible.
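One practical way to avoid spills for a known-large transaction is to enlarge the page cache with the cache_size pragma before starting. A minimal sketch; the value of 100000 pages is purely illustrative and should be sized to the workload and available memory:

#include <sqlite3.h>

/* db is an already-open connection. A larger page cache lets bigger
** transactions stay in memory until COMMIT, avoiding cache spills. */
static void enlarge_cache(sqlite3 *db){
  sqlite3_exec(db, "PRAGMA cache_size=100000", 0, 0, 0);
}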
7.0 Optimizations
Profiling shows that on most operating systems and in most environments, SQLite spends most of its time doing disk I/O. Reducing the number of disk I/O operations is therefore the most promising way to improve SQLite's performance. This section describes some of the techniques SQLite uses to reduce disk I/O without compromising atomic commit.
7.1 Keeping the Cache Between Transactions
Step 3.12 of the commit process says that once the shared lock is released, every user-space cached image of database content must be discarded. This is because, without a shared lock, other processes are free to modify the database file, so any user-space cache of its content may have gone stale. Consequently, each new transaction would have to re-read any data it had read before. This is less painful than it sounds, because the data is probably still in the operating system's disk cache, so the "re-read" is really just a copy from kernel space into user space; but it still costs CPU time.
Beginning with SQLite 3.3.14, a mechanism was added to avoid these needless re-reads. In newer versions of SQLite, the user-space page cache is retained after the shared lock is released. Later, when a new transaction starts and a shared lock is acquired, SQLite checks whether any other process has modified the database file in the meantime. If the file has changed since the lock was released, the user-space cache is discarded; but usually the database file is unchanged, so the cache is kept and some unnecessary reads are avoided.
To decide whether the database file has changed, SQLite uses a counter stored in the database file header (at byte offset 24) that is incremented on every change to the database. SQLite saves a copy of this counter before releasing its lock. After next acquiring a lock, it compares the saved counter value with the current one: if the values differ, the cache must be discarded; if they match, the cache can be reused.
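That counter is a 4-byte big-endian integer at offset 24 of the database header, so it can be read directly, as in this sketch. Reading the file by hand like this is only for illustration; applications should let SQLite do the comparison itself.

#include <stdio.h>
#include <stdint.h>

/* Return the file change counter: the 4-byte big-endian value at
** byte offset 24 of the database file header (0 on any error). */
static uint32_t change_counter(const char *path){
  unsigned char h[28];
  FILE *f = fopen(path, "rb");
  if (!f) return 0;
  if (fread(h, 1, sizeof h, f) != sizeof h){ fclose(f); return 0; }
  fclose(f);
  return ((uint32_t)h[24] << 24) | ((uint32_t)h[25] << 16)
       | ((uint32_t)h[26] << 8)  |  (uint32_t)h[27];
}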
7.2 Exclusive Access Mode
SQLite 3.3.14 added the notion of "exclusive access mode". In exclusive access mode, SQLite keeps the exclusive lock after a transaction completes. This prevents other processes from accessing the database at all; but since most deployments have only a single process using the database, that is usually not a problem. Exclusive access mode reduces disk I/O in three ways:
1) The change counter in the file header no longer needs to be incremented when a transaction completes, saving one page write each to the rollback log and to the database file.
2) No other process can change the database, so there is no need to check the change counter, or to discard the user-space cache, at the start of a transaction.
3) When a transaction completes, the log file can be emptied by zeroing its header rather than deleted. This avoids modifying the log file's directory entry, and the disk sectors backing the log need not be deallocated. Better still, the next transaction overwrites the existing log file content rather than appending to a new file, and on most operating systems overwriting is much faster than appending.
The third optimization, zeroing the log file header rather than deleting the log file, does not actually depend on holding an exclusive lock at all times. It can be enabled independently of exclusive access mode using the journal_mode pragma, as described in section 7.6 below.
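Exclusive access mode is requested with the locking_mode pragma, and can be combined with a persistent log as the text suggests. A short sketch; both pragmas are real, and the pairing shown is one reasonable configuration, not the only one:

#include <sqlite3.h>

/* Hold the file locks across transactions, and zero the log header on
** commit instead of deleting the log file (sections 7.2 and 7.6). */
static void set_exclusive_mode(sqlite3 *db){
  sqlite3_exec(db, "PRAGMA locking_mode=EXCLUSIVE", 0, 0, 0);
  sqlite3_exec(db, "PRAGMA journal_mode=PERSIST", 0, 0, 0);
}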
7.3 Free Pages Need Not Be Journaled
When information is deleted from a SQLite database, the pages that held it are added to a free-page list, and subsequent inserts draw from this list before growing the file.
Some free pages do carry important data, namely the locations of other free pages. But most free pages, called "leaf" pages, contain nothing of interest; their content can be changed at will without harming the database.
Because the content of leaf pages is unimportant, SQLite avoids saving them in the rollback log (section 3.5). If a leaf page is changed and the change is not rolled back during transaction recovery, the database is none the worse for it. Likewise, the content of a newly freed page is never written to the database in step 3.9, nor read from the database in step 3.3. When the changes in a transaction touch free pages, these optimizations can greatly reduce the total disk I/O.
7.4 Single-Page Updates and Atomic Sector Writes
Beginning with SQLite 3.5.0, the new VFS interface includes the xDeviceCharacteristics method, which reports properties of the underlying file system. Among the properties xDeviceCharacteristics can report is whether sector writes are atomic.
Recall that by default SQLite assumes sector writes are linear but not atomic. A linear write begins at one end of a sector and proceeds byte by byte to the other end; if it is interrupted partway, one part of the sector is modified and the rest is left unchanged. With an atomic sector write, either the entire sector is overwritten or nothing in the sector changes at all.
We believe that most modern disk drives implement atomic sector writes: when power is lost, the drive can use the energy stored in capacitors and/or the angular momentum of the spinning platter to finish any write in progress. But there are so many layers between a system write call and the disk electronics that the safe choice for the default VFS implementations on unix and win32 is to assume sector writes are not atomic. On the other hand, device manufacturers with more control over their filesystems might want to enable the atomic-write property in xDeviceCharacteristics if their hardware really does write sectors atomically.
When sector writes are atomic, and the page size equals the sector size, and a database change touches only a single page, SQLite skips the entire journaling process and simply writes the modified page straight into the database file. The change counter on the first page of the database is updated separately, since no harm comes to the database if power is lost before the counter is updated.
7.5 Filesystems with Safe Append Semantics
Another optimization introduced in SQLite 3.5.0 takes advantage of "safe append" behavior in the underlying disk system. Recall that SQLite assumes that when data is appended to a file (the rollback log file in particular), the file size grows before the content is written, so that if power fails after the size increases but before the content lands, the file is left holding invalid garbage. The xDeviceCharacteristics method of the VFS can indicate that the file system implements "safe append" semantics, meaning content is written before the file size is increased, so that garbage can never appear in the rollback log after a crash or power failure.
When a file system has safe append semantics, SQLite always stores the special value -1 as the page count in the log header. A page count of -1 tells any process attempting a rollback to compute the number of pages from the size of the log file. And because the -1 is never changed, on commit we save one flush operation and one sector write of the log's first page. Moreover, when a cache spill occurs, there is no need to append a new log header to the log file; we simply continue appending new pages to the existing log.
7.6 Persistent Rollback Logs
Deleting a file is an expensive operation on many systems, so as an optimization SQLite can be configured to avoid the delete operation of section 3.11. Instead of being deleted, the log file is neutralized on commit, either by truncating it to zero length or by overwriting its header with zeros. Truncating to zero length saves a modification of the directory containing the file, since the file stays in the directory. Overwriting the header has further advantages: the file's length (stored in the inode on many systems) need not be updated, and newly freed disk sectors need not be dealt with. Furthermore, the log for the next transaction is created by overwriting existing content rather than appending to the end of the file, and overwriting is usually faster than appending.
SQLite can be told to commit by zeroing the log file header, instead of deleting the log file, by setting the journal mode to PERSIST. For example:
PRAGMA journal_mode=PERSIST
On many systems, persistent journal mode gives a measurable performance improvement. The drawback, of course, is that the log file lingers on disk long after the transaction commits, consuming space and cluttering the directory. The only safe way to delete a persistent log file is to commit a transaction with the journal mode set back to DELETE:
PRAGMA journal_mode=DELETE
BEGIN EXCLUSIVE
COMMIT
Caution: deleting a persistent log file by any other means is likely to corrupt the corresponding database file, because the log file may be "hot" (still needed to roll the database back).
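Wrapped in the C API, the safe deletion recipe above looks like this sketch (error checking omitted for brevity):

#include <sqlite3.h>

/* Safely remove a persistent rollback log: switch back to DELETE
** mode, then commit one (empty) exclusive transaction so SQLite
** itself deletes the log file. */
static void delete_persistent_log(sqlite3 *db){
  sqlite3_exec(db, "PRAGMA journal_mode=DELETE", 0, 0, 0);
  sqlite3_exec(db, "BEGIN EXCLUSIVE", 0, 0, 0);
  sqlite3_exec(db, "COMMIT", 0, 0, 0);
}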
TRUNCATE journal mode is supported beginning with SQLite 3.6.4:
PRAGMA journal_mode=TRUNCATE
In TRUNCATE mode, the log file is truncated to zero length on commit, instead of being deleted (as in DELETE mode) or having its header zeroed (as in PERSIST mode). Like PERSIST, TRUNCATE avoids having to update the directory containing the log file and the database, and truncating a file is usually faster than deleting it. TRUNCATE has the further property that it does not follow the truncation with a synchronizing system call such as fsync(); doing so would be safer, but on many modern file systems a truncate is an atomic and synchronous operation, and so we believe TRUNCATE is usually safe in the face of power failures. If you are uncertain whether truncation is synchronous and atomic on your file system, and it is important to you that the database survives a power failure or system crash, consider using a different journal mode.
On embedded systems with synchronous file systems, TRUNCATE behaves more slowly than PERSIST. The commit itself is equally fast, but the transaction that follows a TRUNCATE is slower, because it must append content to the end of the empty log file, whereas after PERSIST the next transaction overwrites existing content, and overwriting is faster than appending.
8.0 Testing Atomic Commit Behavior
The developers of SQLite are confident in its robustness across power failures and system crashes because the automated test procedures do extensive checks of SQLite's recovery behavior under simulated power loss. These are called "crash tests".
SQLite's crash tests use a modified VFS that simulates the many kinds of file system damage a power loss or system crash can cause. The crash-test VFS can simulate incomplete sector writes, pages filled with garbage because a write did not finish, and out-of-order writes, all varied within a single test scenario. Crash tests execute transactions over and over, arranging for a simulated power failure or system crash at a different point each time and producing a different kind of damage. After the simulated event, the test reopens the database and checks that the transaction either took effect completely or not at all, and that the database is in a consistent state.
SQLite's crash tests have uncovered a number of very subtle bugs (all now fixed) in the recovery mechanism, some of them so obscure that they would likely never have been found by code inspection and analysis alone. This experience gives the SQLite developers confidence that other database systems that do not use a similar crash test probably contain undetected bugs that will lead to database corruption after a system crash or power failure.
9.0 Things That Can Go Wrong
SQLite's atomic commit mechanism has proven robust, but it can be subverted by incomplete or broken operating system implementations. This section describes some of the ways a SQLite database could still be damaged by a power failure or system crash.
9.1 Broken Locking Implementations
SQLite uses file-system locks to ensure that only one process and database connection at a time tries to modify the database. The file locking mechanism is implemented in the VFS layer and differs across operating systems, and SQLite depends on that implementation being correct. If something goes wrong and two or more processes are able to write the same database file at the same time, severe damage is likely to result.
We have received reports of implementations of both windows network file systems and NFS in which locking is subtly broken. We cannot verify these reports, but given how hard locking is to implement correctly on a network file system, we have no reason to doubt them. Since performance is poor there anyway, you are advised to avoid using SQLite on a network file system in the first place. If you must keep SQLite data files on a network file system, consider using a secondary locking mechanism of your own to prevent simultaneous writes even if the native file locking fails.
The version of SQLite that Apple preinstalls on Mac OS X is extended with alternative locking strategies that work on all network file systems Apple supports. These extensions work well as long as all processes access the database file in the same way. Unfortunately, the locking mechanisms do not exclude one another: if one process accesses a file using (for example) AFP locking and another process (perhaps on another machine) uses dot-file locking, the two may collide, because AFP locks do not exclude dot-file locks, and vice versa.
9.2 Incomplete Disk Flushes
SQLite uses fsync on unix and FlushFileBuffers on win32 to force file content out to the disk (sections 3.7 and 3.10). Unfortunately, we have received reports that neither works as advertised on many platforms. We hear that FlushFileBuffers can be completely disabled on some versions of windows by a registry setting, and that on some historical Linux versions, fsync was a no-op on certain file systems. Even on systems where FlushFileBuffers and fsync are said to work, IDE disks often lie, reporting that data has reached the platter while it actually still sits in the volatile cache of the disk controller.
On a Mac, you can set:
PRAGMA fullfsync=ON
Setting fullfsync on a Mac guarantees that the data really is flushed through to the disk surface. But the implementation of fullfsync involves resetting the disk controller, so not only is it profoundly slow in its own right, it also slows down other, unrelated disk I/O. Its use is therefore not recommended.
9.3 Partial File Deletions
SQLite assumes that, from the point of view of a user process, file deletion is atomic. If power fails in the middle of a deletion, then after power is restored SQLite expects the file either to exist in full or to be gone entirely. If the operating system cannot deliver that, transactions may not be atomic.
9.4 Garbage Written into Files
A SQLite database is an ordinary disk file that can be opened and written by any sufficiently privileged user process. A rogue process can open a SQLite database and fill it with corrupt data; corrupt data can also be written into a database by operating system bugs. There is nothing SQLite can do to defend against these cases.
9.5 Deleting or Renaming a Hot Log File
If a power failure or system crash leaves a "hot" log file on disk, the original database file and that hot log file must both remain on disk, under their original names, until the database is next opened by another SQLite process and rolled back; only then is the database restored to a consistent state (section 4.2). During recovery, SQLite looks for a hot log file in the same directory as the database being opened, under a name derived from the name of the database file. If either the database file or the hot log file is moved, renamed, or deleted, the rollback never happens and the database may be left corrupt and unusable.
We suspect a common failure mode goes like this: power fails, power comes back, and a well-meaning user or system administrator starts inspecting the disk for damage. They see a database file named "important.data", or some such, which they know about. But because of the power failure, there is also a hot log file named "important.data-journal" sitting next to it. The user deletes the hot log file, thinking they are tidying up the system. There is no defense against this except user education.
If there are multiple links (hard or symbolic) to a database file, the log file is named after the link through which the file was opened. If a crash occurs and the database is next opened through a different link, the hot log file will not be found, and no rollback will occur.
Sometimes a power failure causes file system damage in which a recently changed file loses its name and is moved into a directory such as "/lost+found". When that happens, the hot log file is never found and recovery never runs. SQLite tries to prevent this by opening and syncing the directory containing the log file at the same time it syncs the log file itself. However, files can also be moved into "/lost+found" because some other, unrelated process was creating unrelated files in the same directory as the main database file. Since that is outside SQLite's control, there is nothing SQLite can do about it. If you run on a file system vulnerable to this kind of namespace corruption, you are best advised to put each SQLite database file in its own private subdirectory.
10.0 Summary and the Way Forward
Every so often someone discovers a new failure mode in the atomic commit mechanism and the developers have to patch it. This is happening less and less, and the failure modes are becoming ever more obscure; but it would still be foolish to suppose that SQLite's atomic commit logic is entirely bug-free. The developers are committed to fixing bugs as quickly as they are found.
The developers are also thinking about new ways to optimize the commit mechanism. The current VFS implementations for linux, macOS X, and win32 make pessimistic assumptions about those systems. Perhaps, after talking with experts on how these systems work, some of those assumptions can be relaxed and SQLite allowed to run faster. In particular, we suspect that most modern file systems already exhibit the safe append property, and that many of them may support atomic sector writes. But until that is known for certain, SQLite will keep taking the conservative approach and assume the worst.