
How to understand Delta Lake's transaction log in depth


In this issue, the editor takes you through Delta Lake's transaction log in depth. The article is rich in content and analyzes the topic from a professional point of view; I hope you get something out of it after reading.

Understanding Delta Lake's transaction log is worthwhile because it is the backbone of many of its features, such as ACID transactions, scalable metadata handling, and time travel. The following details what Delta Lake's transaction log is, how it works at the file level, and how it offers an elegant solution to the problem of multiple concurrent reads and writes.

1. What is the Delta Lake transaction log

The Delta Lake transaction log, also known as the DeltaLog, is an ordered record of every transaction that has been executed on a Delta Lake table since it was created.
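As a minimal sketch of where this log lives (assuming a local Spark session with the delta-spark package, and a hypothetical /tmp/demo_table path), writing a Delta table creates a _delta_log subdirectory alongside the data files:

from pyspark.sql import SparkSession

# Build a Spark session with the Delta Lake extensions enabled.
spark = (SparkSession.builder
         .appName("deltalog-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Writing a Delta table automatically creates its transaction log.
spark.range(0, 5).write.format("delta").save("/tmp/demo_table")

import os
print(os.listdir("/tmp/demo_table/_delta_log"))
# e.g. ['00000000000000000000.json'] -- on disk the commit file
# names are zero-padded to 20 digits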

2. The role of the transaction log

2.1 The transaction log is the single source of truth

Delta Lake is built on top of Apache Spark to allow multiple readers and writers to work on a Delta Lake table at the same time. The Delta Lake transaction log always shows users a correct view of the data, because it records every change a user has ever made to the table.

When a user reads a Delta Lake table for the first time, or runs a new query against a table that has been modified since it was last read, Spark checks the transaction log to see whether new transactions have been committed to the table; if so, it updates the user's view of the table with those new changes. This ensures that a user's version of the table is always synchronized with the master record as of the most recent query, and that users cannot make divergent, conflicting changes to the table.
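As a small sketch (reusing the hypothetical demo table above), every read consults the transaction log, so a second read immediately reflects a committed append:

# The first read reflects the 5 rows committed so far.
print(spark.read.format("delta").load("/tmp/demo_table").count())   # e.g. 5

# Another writer appends 5 more rows in a new commit...
spark.range(5, 10).write.format("delta").mode("append").save("/tmp/demo_table")

# ...and the next read picks up the new commit from the log automatically.
print(spark.read.format("delta").load("/tmp/demo_table").count())   # e.g. 10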

2.2 How Delta Lake implements atomicity

Atomicity, one of the four ACID properties, guarantees that an operation performed on the data lake (such as an INSERT or UPDATE) either completes fully or aborts completely. Without this property, a hardware failure or software bug could easily cause only part of the data to be written to the table, leaving the data messy or corrupted.

The transaction log is the mechanism through which Delta Lake provides this atomicity guarantee. For all intents and purposes, a change that is not recorded in the transaction log never takes effect. By recording only transactions that have executed fully and completely, and using that record as the single source of truth, the Delta Lake transaction log lets users process petabyte-scale data while trusting the reliability of that processing.

3. How the transaction log works

3.1 Breaking transactions down into atomic commits

When a user modifies a Delta Lake table, for example with an insert, update, or delete operation, Delta Lake breaks the operation down into a series of discrete steps made up of one or more of the following actions:

a) add file: adds a data file.

b) remove file: removes a data file.

c) update metadata: updates the table's metadata, for example changing the table's name, schema, or partitioning.

d) set transaction: records that a Structured Streaming job has committed a micro-batch with the given ID.

e) change protocol: switches the Delta Lake transaction log to the newest version of the protocol so that new features take effect.

f) commit info: records information about the commit itself.

These actions are then recorded in the transaction log as ordered, atomic units known as commits.

For example, suppose a user creates a transaction that adds a new column to a table and appends more data to it. Delta Lake breaks that transaction down into its component actions, and once the transaction completes, adds the following commit to the transaction log:

Update metadata: change the schema to include the new column.

Add file: add each new data file.
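To make this concrete, here is a hedged peek inside a commit file (reusing the hypothetical /tmp/demo_table from above). Each line of a commit file is a single JSON action such as commitInfo, protocol, metaData, add, or remove:

import json, glob

# The first commit of the demo table created earlier.
first_commit = sorted(glob.glob("/tmp/demo_table/_delta_log/*.json"))[0]
with open(first_commit) as f:
    for line in f:
        action = json.loads(line)
        print(list(action.keys()))
# e.g. ['commitInfo'], ['protocol'], ['metaData'], ['add'], ...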

4. The transaction log at the file level

When a user creates a Delta Lake table, that table's transaction log is automatically created in the _delta_log subdirectory. As the user makes changes to the table, each change is recorded in the transaction log as an ordered, atomic commit. Each commit is written out as a JSON file, starting with 000000.json. Subsequent changes to the table generate subsequent JSON files in ascending numerical order: 000001.json, 000002.json, and so on.

So, as an example, we might add records to the table from the data files 1.parquet and 2.parquet. That transaction is automatically added to the transaction log and saved to disk as the commit file 000000.json. We might then change our minds, remove those files, and add a new file instead (3.parquet). Those actions are recorded as the next commit file in the transaction log, 000001.json.

5. Quickly recomputing state with checkpoint files

Once a table has accumulated a number of commits (by default, every 10 commits), Delta Lake saves a checkpoint file in the same _delta_log subdirectory. These checkpoint files save the entire state of the table at a point in time, in native Parquet format that is quick and easy for Spark to read. In other words, they offer the Spark reader a "shortcut" to fully reproducing the table's state, so that Spark can avoid reprocessing what could be thousands of tiny, inefficient JSON files.

To get up to speed, Spark can run a listFrom operation to view all the files in the transaction log, quickly skip ahead to the newest checkpoint file, and process only the JSON commits made since that checkpoint was saved.
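As a sketch of how a reader can find that shortcut (assuming the demo table has written at least one checkpoint), Delta Lake records the latest checkpoint version in a _last_checkpoint file:

import json

# _delta_log/_last_checkpoint points at the newest checkpoint version.
with open("/tmp/demo_table/_delta_log/_last_checkpoint") as f:
    last = json.load(f)          # e.g. {"version": 10, "size": ...}

# Checkpoints are plain Parquet, so Spark can load the full table state directly.
ckpt = "/tmp/demo_table/_delta_log/%020d.checkpoint.parquet" % last["version"]
state = spark.read.parquet(ckpt)
state.select("add.path").where("add is not null").show()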

To demonstrate how this works, suppose commits have been created up to 000007.json. Spark has automatically cached the most recent version of the table in memory to speed things up. In the meantime, however, several other writers have written new data to the table, adding commits all the way up to 000012.json.

To incorporate these new transactions and bring its view of the table up to date, Spark then runs a listFrom version 7 operation to see the new changes to the table.

Instead of processing every intermediate JSON file, Spark can skip ahead to the most recent checkpoint file, since it contains the entire state of the table at commit #10. Now Spark only has to perform incremental processing of 000011.json and 000012.json, after which it caches version 12 of the table in memory. By following this workflow, Delta Lake uses Spark to keep an up-to-date view of the table's state at all times, efficiently.
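A rough sketch of that listFrom idea (reusing the hypothetical paths and the last checkpoint found above) is to replay only commits newer than the checkpoint version:

import glob, os

# Replay only the JSON commits made after the latest checkpoint.
ckpt_version = last["version"]
commits = sorted(glob.glob("/tmp/demo_table/_delta_log/*.json"))
newer = [p for p in commits
         if int(os.path.splitext(os.path.basename(p))[0]) > ckpt_version]
print(newer)   # e.g. the commit files for versions 11 and 12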

6. Dealing with multiple concurrent reads and writes

The preceding sections should give you a high-level sense of how the Delta Lake transaction log works; what follows is a discussion of concurrency. So far we have focused on transactions that are linear, or at least conflict-free. But what happens when Delta Lake must handle multiple concurrent reads and writes?

The answer is simple: to implement concurrency control, Delta Lake uses optimistic concurrency control.

6.1 What is optimistic concurrency control

Optimistic concurrency control is a method of handling concurrent transactions that assumes the transactions (changes) different users make to a table can complete without conflicting with one another. It is fast because, when working with petabyte-scale data, different users are very likely to be working on different parts of it, allowing them to complete non-conflicting transactions simultaneously.

For example, imagine that two people, A and B, are working on a jigsaw puzzle together. As long as they work on different parts of it (A on the top half, say, and B on the bottom half), both can work at once and finish the puzzle twice as fast. Conflicts arise only when they need the same pieces at the same time. That, in a nutshell, is optimistic concurrency control.

Of course, even with optimistic concurrency control, users do sometimes try to modify the same parts of the data at the same time. Fortunately, Delta Lake has a protocol for that.

6.2 Optimistic conflict resolution

To provide ACID transactions, Delta Lake has a protocol for figuring out how commits should be ordered (the concept of serializability in databases) and for deciding what to do if two or more commits are attempted at the same time. Delta Lake handles these cases by enforcing a rule of mutual exclusion and then attempting to resolve any conflicts optimistically. This protocol allows Delta Lake to deliver on the ACID principle of isolation, ensuring that the resulting state of the table after multiple concurrent writes is the same as if the writes had occurred serially.

In general, the process proceeds as follows (a minimal sketch of this loop follows the list):

1. Record the starting table version.

2. Record the reads and writes.

3. Attempt a commit.

4. If someone else wins, check whether anything you read has changed.

5. Repeat.
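Here is a minimal, illustrative sketch of that loop in Python. It models the log as an in-memory dict and is not the Delta Lake API; commit_with_retry and conflicts_with are hypothetical names:

import threading

log = {}                      # version -> list of actions
log_lock = threading.Lock()

def try_commit(actions, read_version):
    """Attempt to commit at read_version + 1; fail if another writer won."""
    with log_lock:
        next_version = max(log, default=-1) + 1
        if next_version != read_version + 1:
            return False, next_version - 1   # lost the race; report latest version
        log[next_version] = actions
        return True, next_version

def commit_with_retry(actions, conflicts_with):
    read_version = max(log, default=-1)      # 1. record the starting version
    while True:
        ok, latest = try_commit(actions, read_version)   # 3. attempt a commit
        if ok:
            return latest
        # 4. someone else won: check whether anything we read has changed
        for v in range(read_version + 1, latest + 1):
            if conflicts_with(log[v]):
                raise RuntimeError("irreconcilable conflict")
        read_version = latest                # 5. repeat on top of the new commits

# Two non-conflicting appends commit as versions 0 and 1.
print(commit_with_retry(["add 1.parquet"], lambda acts: False))   # 0
print(commit_with_retry(["add 2.parquet"], lambda acts: False))   # 1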

To see how this plays out in practice, let's walk through an example of how Delta Lake manages a conflict when one occurs. Imagine that two users read from the same table, then each tries to add some data to it.

Delta Lake records the starting version of the table (version 0) that is read before any changes are made.

Users 1 and 2 both try to append data to the table at the same time. Here we hit a conflict, because only one commit can come next and be recorded as 000001.json.

Delta Lake resolves the conflict with the concept of "mutual exclusion": only one user can succeed in committing 000001.json. User 1's commit is accepted, while user 2's is rejected.

Rather than throwing an error for user 2, Delta Lake prefers to handle the conflict optimistically. It checks whether any new commits have been made to the table, silently updates its view of the table to reflect those changes, and then simply retries user 2's commit on top of the newly updated table (without any data reprocessing), successfully committing 000002.json.

In the vast majority of cases, this reconciliation happens silently, seamlessly, and successfully. But if there is an irreconcilable problem that Delta Lake cannot resolve optimistically (for example, if user 1 deleted a file that user 2 also deleted), the only option is to throw an error.

As a final note, because all of the transactions on a Delta Lake table are stored directly to disk, this process satisfies the ACID property of durability, meaning the effects of an operation persist even in the event of a system failure.

7. Other use cases

7.1 Time travel

Every table is the result of the sum total of all the commits recorded in its Delta Lake transaction log. The transaction log provides a step-by-step record of change, detailing exactly how the table got from its original state to its current state.

Therefore, we can recreate the state of the table at any point in time by starting from the original table and processing only the commits made before that point. This powerful capability is known as "time travel," or data versioning.
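As a sketch (reusing the hypothetical demo table), Delta Lake's Spark reader exposes this through the versionAsOf option:

# Read the table as it was at version 0, before any later appends.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_table")
print(v0.count())   # e.g. 5, the row count at the first commit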


7.2 Data lineage and debugging

As the definitive record of every change ever made to a table, the Delta Lake transaction log offers users verifiable data lineage, useful for governance, audit, and compliance purposes. It can also be used to trace an unintended change, or a bug in a pipeline, back to the exact action that caused it. Users can run DESCRIBE HISTORY to see metadata about the changes that have been made.
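For example (under the same assumptions as above), the history is also available through the DeltaTable API:

from delta.tables import DeltaTable

# Show who changed what, when, and how, one row per commit.
history = DeltaTable.forPath(spark, "/tmp/demo_table").history()
history.select("version", "timestamp", "operation", "operationParameters").show()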

The above is the editor's walkthrough of how to understand the Delta Lake transaction log in depth. If you happen to have similar questions, you may find the analysis above a useful reference.
