

TiKV Source Code Parsing Series (11): Storage and Transaction Control




Background knowledge

TiKV is a strongly consistent, distributed KV store that supports transactions. TiKV uses Raft to ensure strong consistency among multiple replicas. Its transaction model is based on Google's Percolator, with some optimizations.

When the Service layer of TiKV receives a request, it forwards the request to different modules according to the request type. Read requests pushed down from TiDB, such as sum and avg operations, are forwarded to the Coprocessor module, while KV requests are forwarded directly to Storage.

KV operations fall into two categories by function: Raw KV operations and Txn KV operations. Raw KV operations include raw put, raw get, raw delete, raw batch get, raw batch put, raw batch delete, raw scan and other common KV operations. Txn KV operations are a series of operations designed to implement the transaction mechanism; for example, prewrite and commit correspond to the prepare and commit phases of 2PC, respectively.

This article introduces the Storage module in the TiKV source code. It sits between Service and the underlying KV storage engine and is mainly responsible for transaction concurrency control; the transaction-related implementations on the TiKV side all live in the Storage module.

Source code parsing

Next, we will walk through the Storage-related source code from several angles: Engine, Latches, Scheduler and MVCC.

1. Engine trait

TiKV abstracts the underlying KV storage engine into an Engine trait (a trait is similar to an interface in other languages), defined in storage/kv/mod.rs. The Engine trait mainly provides two interfaces, one for writing and one for reading: async_write and async_snapshot. The caller sends the content to be written to async_write, and async_write notifies the caller through a callback that the write succeeded or hit an error. Similarly, async_snapshot returns a snapshot of the database to the caller through a callback for reading, or returns the error it encountered.

pub trait Engine: Send + Clone + 'static {
    type Snap: Snapshot;
    fn async_write(&self, ctx: &Context, batch: Vec<Modify>, callback: Callback<()>) -> Result<()>;
    fn async_snapshot(&self, ctx: &Context, callback: Callback<Self::Snap>) -> Result<()>;
}
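To make the callback contract concrete, here is a self-contained toy engine. It is a sketch for illustration only: ToyEngine, its String error type, and the synchronous callbacks are inventions of this article, not TiKV types; a real engine would invoke the callbacks asynchronously.

use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

type Callback<T> = Box<dyn FnOnce(Result<T, String>) + Send>;

#[derive(Clone, Default)]
struct ToyEngine {
    data: Arc<Mutex<BTreeMap<Vec<u8>, Vec<u8>>>>,
}

impl ToyEngine {
    // Mirrors async_write: the caller hands over mutations and learns the
    // outcome only through the callback.
    fn async_write(&self, batch: Vec<(Vec<u8>, Vec<u8>)>, cb: Callback<()>) {
        let mut data = self.data.lock().unwrap();
        for (k, v) in batch {
            data.insert(k, v);
        }
        cb(Ok(())); // a real engine would call this from another thread
    }

    // Mirrors async_snapshot: the callback receives a point-in-time view.
    fn async_snapshot(&self, cb: Callback<BTreeMap<Vec<u8>, Vec<u8>>>) {
        let snap = self.data.lock().unwrap().clone();
        cb(Ok(snap));
    }
}

fn main() {
    let engine = ToyEngine::default();
    engine.async_write(
        vec![(b"k".to_vec(), b"v".to_vec())],
        Box::new(|res| assert!(res.is_ok())),
    );
    engine.async_snapshot(Box::new(|res| {
        assert_eq!(res.unwrap().get(&b"k".to_vec()), Some(&b"v".to_vec()));
    }));
}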

Anything that implements these two interfaces can serve as the underlying KV storage engine for TiKV. In version 3.0, TiKV supports three different KV storage engines: the stand-alone RocksDB engine, the in-memory B-tree engine, and the RaftKV engine, located respectively in rocksdb_engine.rs, btree_engine.rs, and raftkv.rs under the storage/kv folder. The stand-alone RocksDB engine and the in-memory B-tree engine are mainly used for unit testing and benchmarking; the RaftKV engine is the one TiKV really uses. When RaftKV's async_write is called for a write operation and async_write reports success through the callback, the write has been replicated to a majority of replicas through Raft and has completed on the leader node (the caller's TiKV), so subsequent reads on the leader node can see the previously written content.

2. Raw KV execution process

The Raw KV interfaces manipulate the underlying data directly, without transaction control, and are therefore relatively simple. Before introducing the more complex transactional KV execution flow, we first walk through the Raw KV execution flow.

Raw put

A raw put operation needs no extra work from the Storage module; it simply sends the content to be written to the underlying KV storage engine through engine's async_write interface. The call stack is service/kv.rs: raw_put -> storage/mod.rs: async_raw_put.

impl<E: Engine> Storage<E> {
    pub fn async_raw_put(
        &self,
        ctx: Context,
        cf: String,
        key: Vec<u8>,
        value: Vec<u8>,
        callback: Callback<()>,
    ) -> Result<()> {
        // Omit some limit checks about key and value here...
        self.engine.async_write(
            &ctx,
            vec![Modify::Put(
                Self::rawkv_cf(&cf),
                Key::from_encoded(key),
                value,
            )],
            Box::new(|(_, res)| callback(res.map_err(Error::from))),
        )?;
        Ok(())
    }
}

Raw get

Similarly, raw get only needs to call engine's async_snapshot to obtain a database snapshot and read from it directly. For the RaftKV engine, async_snapshot checks whether the currently accessed replica is the leader before returning the snapshot (version 3.0 only supports reading from the leader; follower read is still under development), and also checks whether the region version information in the request is new enough.

3. Latches

In transactional mode, to prevent multiple requests from writing to the same key at the same time, a request must acquire the in-memory lock of a key before writing it. To distinguish it from the lock in the transaction, we call this in-memory lock a latch; it corresponds to the Latch structure in the storage/txn/latch.rs file. Each Latch contains a waiting queue: requests that fail to acquire the latch are inserted into the queue in order, and the request at the head of the queue is considered to hold the latch.

#[derive(Clone)]
struct Latch {
    pub waiting: VecDeque<u64>,
}
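A minimal self-contained model of this queue discipline follows; the acquire/release method names here are stand-ins for illustration, not TiKV's actual API.

use std::collections::VecDeque;

// Toy model of a Latch's waiting queue: command ids (cids) queue up,
// and the cid at the front is the current owner of the latch.
struct Latch {
    waiting: VecDeque<u64>,
}

impl Latch {
    fn new() -> Latch {
        Latch { waiting: VecDeque::new() }
    }

    // Returns true if `cid` now owns the latch (it is at the front).
    fn acquire(&mut self, cid: u64) -> bool {
        if !self.waiting.contains(&cid) {
            self.waiting.push_back(cid);
        }
        self.waiting.front() == Some(&cid)
    }

    // The owner releases the latch; the next waiter (if any) is returned
    // so the caller can wake it up.
    fn release(&mut self, cid: u64) -> Option<u64> {
        assert_eq!(self.waiting.pop_front(), Some(cid));
        self.waiting.front().copied()
    }
}

fn main() {
    let mut latch = Latch::new();
    assert!(latch.acquire(1));             // 1 is at the head: owns the latch
    assert!(!latch.acquire(2));            // 2 queues up behind 1
    assert_eq!(latch.release(1), Some(2)); // releasing 1 wakes 2
    assert!(latch.acquire(2));             // 2 is now the owner
}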

Latches is a structure containing multiple Latch instances. It holds a fixed-length Vector, and each slot of the Vector corresponds to one Latch. The length of this internal Vector is 2048000 by default. Each TiKV has exactly one Latches instance, which lives in the Scheduler inside Storage.

pub struct Latches {
    slots: Vec<Latch>,
    size: usize,
}

The gen_lock interface of Latches computes all the latches a write request must acquire before it executes. gen_lock hashes every key, takes each hash modulo the Vector length to obtain a slot, then sorts and deduplicates the slots to get all the latches the command needs. The sorting guarantees a consistent latch acquisition order, which prevents deadlock.
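Before looking at TiKV's actual gen_lock, excerpted just below, here is a self-contained model of the slot computation showing why sorting prevents deadlock. DefaultHasher and the SLOTS constant are stand-ins for whatever TiKV actually uses.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SLOTS: usize = 2048;

// Map a key to a latch slot, like calc_slot: hash, then modulo the slot count.
fn calc_slot<H: Hash>(key: &H) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % SLOTS
}

// Sorted + deduplicated slots give every command a canonical acquisition order.
fn gen_lock(keys: &[&str]) -> Vec<usize> {
    let mut slots: Vec<usize> = keys.iter().map(|k| calc_slot(k)).collect();
    slots.sort();
    slots.dedup();
    slots
}

fn main() {
    // Two commands touch the same keys in opposite orders...
    let a = gen_lock(&["k1", "k2"]);
    let b = gen_lock(&["k2", "k1"]);
    // ...yet acquire latches in the same order, so neither can wait on a
    // latch the other already holds: no deadlock.
    assert_eq!(a, b);
}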

impl Latches {
    pub fn gen_lock<H: Hash>(&self, keys: &[H]) -> Lock {
        // prevent deadlock: sort and deduplicate the slot indices
        let mut slots: Vec<usize> = keys.iter().map(|x| self.calc_slot(x)).collect();
        slots.sort();
        slots.dedup();
        Lock::new(slots)
    }
}

4. Storage and transaction Scheduler

Storage

Storage is defined in the storage/mod.rs file. Let's introduce several important members of Storage:

Engine: represents the underlying KV storage engine.

Sched: transaction scheduler, responsible for scheduling concurrent transaction requests.

Read_pool: the read thread pool in which all read-only KV requests, both transactional and non-transactional ones such as raw get and txn kv get, are eventually executed. Because read-only requests do not need to acquire latches, they get a separate thread pool for direct execution rather than sharing the transaction scheduler with writes.

Gc_worker: since version 3.0, TiKV supports distributed GC; each TiKV has a gc_worker thread responsible for periodically fetching the safepoint from PD and then doing the GC work.

Pessimistic_txn_enabled: version 3.0 also supports pessimistic transactions; when this field is true, TiKV starts in a mode that supports pessimistic transactions. A later source-reading article will cover pessimistic transactions, so we skip them here.

pub struct Storage<E: Engine> {
    engine: E,
    sched: Scheduler<E>,
    read_pool: ReadPool,
    gc_worker: GCWorker<E>,
    pessimistic_txn_enabled: bool,
    // Other fields...
}

For read-only requests, including txn get and txn scan, Storage calls engine's async_snapshot to take a database snapshot and hands it to the read_pool thread pool for processing. Write requests, including prewrite, commit, rollback, etc., are handed directly to the Scheduler. The Scheduler is defined in storage/txn/scheduler.rs.

Scheduler

pub struct Scheduler<E: Engine> {
    engine: Option<E>,
    inner: Arc<SchedulerInner>,
}

struct SchedulerInner {
    id_alloc: AtomicU64,
    task_contexts: Vec<Mutex<HashMap<u64, TaskContext>>>,
    latches: Latches,
    sched_pending_write_threshold: usize,
    worker_pool: SchedPool,
    high_priority_pool: SchedPool,
    // Some other fields...
}

Next, let's briefly introduce several important members of Scheduler:

Id_alloc: every request that arrives at the Scheduler is assigned a unique command id (cid).

Latches: after a write request arrives at the Scheduler, it tries to acquire the latches it needs. If a needed latch is unavailable, the corresponding command id is inserted into that latch's waiting list; when the earlier request finishes executing, it wakes up the requests in the waiting list to continue. This logic is described in the next section on the prewrite execution flow.

Task_contexts: stores the context of every request inside the Scheduler; for example, requests that temporarily cannot acquire their latches are parked in task_contexts.

Sched_pending_write_threshold: a threshold on the total write traffic of all pending write requests in the Scheduler, used to apply flow control to Scheduler writes.

Worker_pool, high_priority_pool: two thread pools in which write requests are checked against transaction constraints before engine's async_write is called.

The execution flow of a prewrite request in Scheduler

Let's take the prewrite request as an example to explain how the write request is handled in Scheduler:

1) When the Scheduler receives a prewrite request, it first performs a flow-control judgment: if there are too many requests in the Scheduler, it directly returns a SchedTooBusy error, hinting that the request should be retried later; otherwise it proceeds to the next step. (A toy sketch of this flow control follows the list below.)

2) It then tries to acquire the required latches. If it succeeds, it proceeds directly to the next step. If it fails, another request currently occupies the latch, which means other requests may be operating on the same key; the prewrite request is then parked, and its context is stored in the Scheduler's task_contexts. When the earlier request finishes executing, it wakes the prewrite request to continue.

impl<E: Engine> Scheduler<E> {
    fn try_to_wake_up(&self, cid: u64) {
        if self.inner.acquire_lock(cid) {
            self.get_snapshot(cid);
        }
    }

    fn release_lock(&self, lock: &Lock, cid: u64) {
        let wakeup_list = self.inner.latches.release(lock, cid);
        for wcid in wakeup_list {
            self.try_to_wake_up(wcid);
        }
    }
}

3) After the latches are acquired, the Scheduler's get_snapshot API is called to obtain a database snapshot from engine; internally, get_snapshot simply calls engine's async_snapshot interface. The prewrite request and the snapshot just obtained are then handed to worker_pool for processing; if the request's priority field is high, it is dispatched to high_priority_pool instead. high_priority_pool is designed for high-priority requests, such as requests inside the TiDB system that need TiKV to respond quickly and must not be stuck behind a busy worker_pool. Note that currently high_priority_pool and worker_pool are just two thread pools that differ semantically; their threads have the same operating-system scheduling priority.

4) After worker_pool receives the prewrite request, its main task is to confirm from the database snapshot whether the prewrite can execute, for example whether a transaction with a larger ts has already modified the data. For details, refer to the Percolator paper or our official blog "Overview of the TiKV Transaction Model". Once prewrite is determined to be executable, engine's async_write interface is called to perform the actual write. The code for this step is in the process_write_impl function in storage/txn/process.rs.

5) When async_write finishes, whether it succeeds or fails, the Scheduler's release_lock function is called to release the latches and wake up the requests waiting on them so they can continue executing.
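As mentioned in step 1, here is a toy sketch of byte-based flow control against sched_pending_write_threshold. FlowControl and its method names are inventions for illustration; TiKV's real accounting differs in detail.

// Toy sketch of the step-1 flow control: admit a write only if the bytes
// already pending stay under the threshold (names are stand-ins).
struct FlowControl {
    running_write_bytes: usize,
    sched_pending_write_threshold: usize,
}

impl FlowControl {
    fn admit(&mut self, request_bytes: usize) -> Result<(), &'static str> {
        if self.running_write_bytes + request_bytes > self.sched_pending_write_threshold {
            return Err("SchedTooBusy: please retry later");
        }
        self.running_write_bytes += request_bytes;
        Ok(())
    }

    // Called when a write finishes (step 5), freeing its budget.
    fn finish(&mut self, request_bytes: usize) {
        self.running_write_bytes -= request_bytes;
    }
}

fn main() {
    let mut fc = FlowControl { running_write_bytes: 0, sched_pending_write_threshold: 100 };
    assert!(fc.admit(80).is_ok());
    assert!(fc.admit(40).is_err()); // 120 bytes would exceed the threshold
    fc.finish(80);
    assert!(fc.admit(40).is_ok()); // budget freed, admitted again
}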

5. MVCC

The code related to TiKV MVCC is located in the storage/mvcc folder. It is strongly recommended that you read the Percolator paper or our official blog "Overview of the TiKV transaction Model" before reading this part of the code.

There are two key structures under mvcc: MvccReader and MvccTxn. MvccReader, located in the storage/mvcc/reader/reader.rs file, mainly provides read functionality and hides the multi-version processing details internally. For example, MvccReader's get API takes the key and the ts to read at, and returns the version visible at that ts, or a KeyIsLocked error.

impl MvccReader {
    pub fn get(&mut self, key: &Key, mut ts: u64) -> Result<Option<Value>>;
}
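A self-contained model of the visibility rule that get hides follows. ToyMvccReader is an invention of this article: it keeps one key's versions in memory, whereas the real MvccReader reads locks and versions from the underlying snapshot.

// Toy model of MvccReader::get: versions are (commit_ts, value) pairs,
// newest first; a pending lock with lock_ts <= ts blocks the read.
struct ToyMvccReader {
    lock_ts: Option<u64>,
    versions: Vec<(u64, Vec<u8>)>, // sorted by commit_ts, descending
}

impl ToyMvccReader {
    fn get(&self, ts: u64) -> Result<Option<&[u8]>, String> {
        // A lock taken at or below our read ts may commit below it too,
        // so the read cannot safely proceed: report KeyIsLocked.
        if let Some(lock_ts) = self.lock_ts {
            if lock_ts <= ts {
                return Err(format!("key is locked at ts {}", lock_ts));
            }
        }
        // Return the newest version visible at `ts`.
        Ok(self
            .versions
            .iter()
            .find(|&&(commit_ts, _)| commit_ts <= ts)
            .map(|(_, v)| v.as_slice()))
    }
}

fn main() {
    let r = ToyMvccReader {
        lock_ts: None,
        versions: vec![(30, b"v3".to_vec()), (10, b"v1".to_vec())],
    };
    assert_eq!(r.get(20).unwrap(), Some(&b"v1"[..])); // sees commit_ts = 10
    assert_eq!(r.get(40).unwrap(), Some(&b"v3"[..])); // sees commit_ts = 30
    assert_eq!(r.get(5).unwrap(), None);              // nothing visible yet
}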

MvccTxn, located in the storage/mvcc/txn.rs file, mainly provides the transaction constraint checks performed before writing. Step 4 of the prewrite flow in the previous section checks transaction constraints by calling MvccTxn's prewrite interface.
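A toy model of the two checks the Percolator model prescribes for prewrite follows: reject if another transaction holds the lock, and reject if a version committed after our start_ts. KeyState, PrewriteError, and the simplified state are stand-ins of this article; the real MvccTxn handles many more cases.

// Toy per-key MVCC state: an optional lock plus committed versions.
#[derive(Default)]
struct KeyState {
    lock: Option<u64>,       // start_ts of the transaction holding the lock
    writes: Vec<(u64, u64)>, // (commit_ts, start_ts) of committed versions
}

#[derive(Debug, PartialEq)]
enum PrewriteError {
    KeyIsLocked { lock_ts: u64 },
    WriteConflict { commit_ts: u64 },
}

// The two Percolator-style prewrite checks described above.
fn prewrite(state: &mut KeyState, start_ts: u64) -> Result<(), PrewriteError> {
    // Check 1: another transaction still holds the lock on this key.
    if let Some(lock_ts) = state.lock {
        if lock_ts != start_ts {
            return Err(PrewriteError::KeyIsLocked { lock_ts });
        }
    }
    // Check 2: a version committed after our start_ts means a newer
    // transaction already wrote this key -> write conflict.
    if let Some(&(commit_ts, _)) = state.writes.iter().find(|&&(c, _)| c > start_ts) {
        return Err(PrewriteError::WriteConflict { commit_ts });
    }
    state.lock = Some(start_ts); // checks passed: lock the key
    Ok(())
}

fn main() {
    let mut key = KeyState::default();
    key.writes.push((15, 10)); // a version committed at ts 15
    assert_eq!(
        prewrite(&mut key, 12),
        Err(PrewriteError::WriteConflict { commit_ts: 15 })
    );
    assert!(prewrite(&mut key, 20).is_ok()); // a newer txn may proceed
    assert_eq!(
        prewrite(&mut key, 25),
        Err(PrewriteError::KeyIsLocked { lock_ts: 20 })
    );
}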

Summary

The transaction-related implementations on the TiKV side live in the Storage module. This article gave a brief overview of its key points; readers who want more detail can read the source code of this part (code talks XD). In addition, since version 3.0, TiDB and TiKV support pessimistic transactions; the corresponding code on the TiKV side is mainly in storage/lock_manager and the MVCC module mentioned above.



