RocksDB, as an open-source storage engine, supports transactions with ACID properties, and concurrency control is required to provide the I (Isolation) in ACID. This article discusses the implementation of the RocksDB lock mechanism, with details drawn from source-code analysis; the hope is that readers come away with a solid understanding of how RocksDB concurrency control works. The discussion proceeds in four parts: first the basic structure of RocksDB locks, then the space overhead of the row-lock data structure, then the locking process in several typical scenarios, and finally the deadlock detection mechanism that no locking scheme can do without.
1. Row lock data structure
The minimum locking granularity in RocksDB is the row. For a KV store the lock object is the key, and each key corresponds to a LockInfo structure. All keys are managed through a hash table; to look up a lock, the key is located directly through the hash table to determine whether it is already locked. However, a single global hash table would suffer heavy contention and hurt concurrency. RocksDB therefore splits the table first by column family: the locks of each column family are managed by one LockMap, and each LockMap is in turn divided into several shards, each managed by a LockMapStripe. The hash table (std::unordered_map) lives inside the Stripe structure, which also contains a mutex and a condition_variable. Their main purpose is to serialize access to the hash table, suspend a thread on a lock conflict, and wake suspended threads on unlock.
This design is simple, but it has an obvious drawback: many unrelated locks share one condition_variable, so releasing a lock may needlessly wake a number of threads that, after retrying, still have to wait, causing useless context switches. Compare this with the InnoDB lock mechanism discussed earlier: InnoDB reuses a lock across records within a page (the reuse is conditional: the same transaction must be locking several records of the same page), and its lock wait queue waits precisely, down to the record level, so it never produces spurious wakeups. Although the RocksDB lock design is coarse, it does include some optimizations. For example, each thread keeps a local copy of the lock maps in lock_maps_cache_, and the per-thread caches are chained together through a global linked list. When the LockMaps change (e.g., a column family is dropped), every thread's copy is invalidated globally. Since column-family changes are rare, most accesses to the LockMaps need no locking, which improves concurrency.
The relevant data structures are as follows:
struct LockInfo {
  bool exclusive;                      // exclusive lock or shared lock
  autovector<TransactionID> txn_ids;   // transaction list; for shared locks, one key can map to multiple transactions
  // Transaction locks are not valid after this time in us
  uint64_t expiration_time;
};

struct LockMapStripe {
  // Mutex must be held before modifying keys map
  std::shared_ptr<TransactionDBMutex> stripe_mutex;
  // Condition Variable per stripe for waiting on a lock
  std::shared_ptr<TransactionDBCondVar> stripe_cv;
  // Locked keys mapped to the info about the transactions that locked them.
  std::unordered_map<std::string, LockInfo> keys;
};

struct LockMap {
  const size_t num_stripes_;                      // number of shards
  std::atomic<int64_t> lock_cnt{0};               // number of locks
  std::vector<LockMapStripe*> lock_map_stripes_;  // lock shards
};

class TransactionLockMgr {
  using LockMaps = std::unordered_map<uint32_t, std::shared_ptr<LockMap>>;
  LockMaps lock_maps_;
  // Thread-local cache of entries in lock_maps_. This is an optimization
  // to avoid acquiring a mutex in order to look up a LockMap
  std::unique_ptr<ThreadLocalPtr> lock_maps_cache_;
};
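To make the two-level lookup concrete (column family to LockMap, then hash of the key to a stripe), here is a minimal self-contained sketch. The GetStripe name, the use of std::hash, and the toy lock-info type are simplifications for illustration, not the actual RocksDB code:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified model of the lookup described above:
// column family id -> LockMap, then hash(key) -> stripe within that map.
struct LockMapStripe {
  std::mutex stripe_mutex;                    // guards `keys`
  std::unordered_map<std::string, int> keys;  // key -> (toy) lock info
};

struct LockMap {
  explicit LockMap(size_t n) : num_stripes(n), stripes(n) {}
  size_t num_stripes;
  std::vector<LockMapStripe> stripes;
  LockMapStripe& GetStripe(const std::string& key) {
    // RocksDB hashes the key and reduces it to a stripe index;
    // std::hash stands in for the real hash function here.
    return stripes[std::hash<std::string>{}(key) % num_stripes];
  }
};

int main() {
  // One LockMap per column family, keyed by column family id.
  std::unordered_map<uint32_t, std::shared_ptr<LockMap>> lock_maps;
  lock_maps[0] = std::make_shared<LockMap>(16);

  auto& stripe = lock_maps[0]->GetStripe("user:42");
  std::lock_guard<std::mutex> guard(stripe.stripe_mutex);  // serialize hash-table access
  stripe.keys.emplace("user:42", /*toy lock info*/ 1);
  return 0;
}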
2. Row lock space cost
Since lock information is resident in memory, let us briefly estimate the memory occupied by RocksDB locks. Each lock is an element of the unordered_map, so a lock consumes roughly key_length + 8 + 8 + 1 bytes. Assuming the key is a bigint (8 bytes), 1 million locked rows consume about 22 MB of memory. However, because consumption grows with key_length, the memory usage of RocksDB locks is not bounded by a constant. We can estimate the range of key_length when RocksDB is used as a MySQL storage engine: for a single-column index the maximum is 2048 bytes (see the max_supported_key_part_length implementation); for a composite index the maximum index length is 3072 bytes (see the max_supported_key_length implementation). In the worst case, with key_length = 3072, locking 1 million rows consumes about 3 GB of memory, and locking 100 million rows would consume about 300 GB, at which point memory is at risk of being exhausted. RocksDB therefore provides the max_row_locks parameter to keep memory under control; the default, RDB_MAX_ROW_LOCKS, is 1G (roughly one billion locks), which for the common bigint-key case still amounts to about 22 GB in the extreme. InnoDB is friendlier in this respect: the key of its lock hash table is (space_id, page_no), so the memory consumed by the key part is constant no matter how large the index key is. As mentioned earlier, InnoDB is also optimized for transactions that take many locks: multiple records can share one lock, which indirectly reduces memory.
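To make the arithmetic concrete, here is a quick back-of-the-envelope check of the figures above. The per-lock overhead of key_length + 8 + 8 + 1 bytes is the article's estimate and ignores hash-table overhead, so the results differ slightly from the rounded numbers in the text:

#include <cstdio>

// Per-lock cost per the article's estimate: key bytes + 8 (txn id)
// + 8 (expiration time) + 1 (exclusive flag); hash-table overhead ignored.
static double lock_bytes(double key_length, double rows) {
  return (key_length + 8 + 8 + 1) * rows;
}

int main() {
  std::printf("bigint key, 1M rows:  ~%.0f MB\n", lock_bytes(8, 1e6) / 1e6);     // ~25 MB
  std::printf("3072B key, 1M rows:   ~%.1f GB\n", lock_bytes(3072, 1e6) / 1e9);  // ~3.1 GB
  std::printf("3072B key, 100M rows: ~%.0f GB\n", lock_bytes(3072, 1e8) / 1e9);  // ~309 GB
  return 0;
}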
3. Locking process analysis
Having covered the design of the RocksDB lock data structure and the memory it consumes, this section focuses on how RocksDB takes locks in several typical scenarios. Like InnoDB, RocksDB supports MVCC, so reads take no locks. For convenience, the following discussion assumes RocksDB as a MySQL storage engine and covers three main cases: updates by primary key, updates by secondary index, and range updates by primary key. Before starting the discussion, note one important difference from InnoDB: RocksDB updates are based on snapshot reads, while InnoDB updates are based on current reads, which leads to different behavior in practice at the same isolation level. For RocksDB at the RC isolation level, a new snapshot is taken at the beginning of each statement; at the RR isolation level, a snapshot is taken only at the beginning of the first statement of the transaction, and all statements share it until the transaction ends.
3.1. Update based on primary key
The main interface here is TransactionBaseImpl::GetForUpdate.
1). Try to lock the key; if the lock is held by another transaction, wait.
2). Create a snapshot.
3). Call ValidateSnapshot: read the key and compare its Sequence against the snapshot to determine whether the key has been updated.
4). Since the snapshot is acquired after locking, the check is guaranteed to succeed.
5). Perform the update.
Note the mechanism for delaying snapshot acquisition here. Normally acquire_snapshot would be called at the beginning of the statement to take the snapshot, but to avoid retries caused by conflicts, the key is locked first and the snapshot taken afterwards, which guarantees that a primary-key update never hits a ValidateSnapshot failure.
The stack is as follows:
1-myrocks::ha_rocksdb::get_row_by_rowid
2- myrocks::ha_rocksdb::get_for_update
3-  myrocks::Rdb_transaction_impl::get_for_update
4-   rocksdb::TransactionBaseImpl::GetForUpdate
{
// lock the key
5-    rocksdb::TransactionImpl::TryLock
6-      rocksdb::TransactionDBImpl::TryLock
7-        rocksdb::TransactionLockMgr::TryLock
// take the snapshot lazily, used together with acquire_snapshot
6-      SetSnapshotIfNeeded()
// check whether the snapshot corresponding to the key has expired
6-      ValidateSnapshot
7-        rocksdb::TransactionUtil::CheckKeyForConflict
8-          rocksdb::TransactionUtil::CheckKey
9-            rocksdb::DBImpl::GetLatestSequenceForKey // first read
// read the key
5-    rocksdb::TransactionBaseImpl::Get
6-      rocksdb::WriteBatchWithIndex::GetFromBatchAndDB
7-        rocksdb::DB::Get
8-          rocksdb::DBImpl::Get
9-            rocksdb::DBImpl::GetImpl // second read
}
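For reference, here is a minimal, hedged sketch of the same flow through RocksDB's public pessimistic-transaction API. The database path and key are placeholders, and error handling is abbreviated:

#include <cassert>
#include <string>

#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::TransactionDBOptions txn_db_options;
  rocksdb::TransactionDB* db = nullptr;
  // The path is a placeholder for this sketch.
  rocksdb::Status s = rocksdb::TransactionDB::Open(
      options, txn_db_options, "/tmp/rocksdb_txn_example", &db);
  assert(s.ok());

  rocksdb::WriteOptions write_options;
  rocksdb::ReadOptions read_options;
  rocksdb::Transaction* txn = db->BeginTransaction(write_options);

  // GetForUpdate locks the key before reading it (steps 1-3 above);
  // if another transaction holds the lock, this call waits or times out.
  std::string value;
  s = txn->GetForUpdate(read_options, "key1", &value);
  if (s.ok() || s.IsNotFound()) {
    s = txn->Put("key1", "new_value");  // step 5: perform the update
    assert(s.ok());
    s = txn->Commit();
  } else {
    s = txn->Rollback();  // e.g. lock timeout or conflict
  }
  assert(s.ok());

  delete txn;
  delete db;
  return 0;
}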
3.2. Range update based on primary key
1). Create a snapshot and scan the primary key with an iterator.
2). Try to lock the key through get_row_by_rowid.
3). Call ValidateSnapshot: read the key and compare its Sequence to determine whether the key has been updated.
4). If the key has been updated by another transaction (its SequenceNumber is newer than the snapshot), trigger a retry.
5). On retry, release the old snapshot and the lock, then, via tx->acquire_snapshot(false), delay the snapshot (it will be taken after locking).
6). Call get_for_update again; since the key is locked this time, the retry is sure to succeed.
7). Perform the update.
8). Jump back to step 1 and continue until the primary key no longer satisfies the condition. (A toy sketch of this loop follows the list.)
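The control flow above can be illustrated with a toy model in which a snapshot is just a sequence number and validation compares a key's last-write Sequence against it. This is purely illustrative of the retry logic and is not the MyRocks code:

#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Toy model: keys carry a last-write Sequence, a "snapshot" is a sequence
// number, and validation fails when the key is newer than the snapshot.
struct ToyDB {
  std::map<std::string, uint64_t> seq;  // key -> last-write Sequence
  uint64_t latest = 0;
  uint64_t snapshot() const { return latest; }
  bool validate(const std::string& k, uint64_t snap) const {
    auto it = seq.find(k);
    return it == seq.end() || it->second <= snap;  // unchanged since snapshot?
  }
};

int main() {
  ToyDB db;
  db.seq = {{"a", 1}, {"b", 2}, {"c", 3}};
  db.latest = 3;

  uint64_t snap = db.snapshot();  // step 1: snapshot at scan start
  db.seq["b"] = ++db.latest;      // a concurrent transaction updates "b"

  for (auto& [key, s] : db.seq) {
    // step 2: TryLock(key) would happen here; locking is omitted in the toy.
    if (!db.validate(key, snap)) {  // steps 3-4: Sequence newer than snapshot
      snap = db.snapshot();         // step 5: drop old snapshot, take a new one
      // step 6: with the key locked, re-validation against `snap` must pass.
    }
    s = ++db.latest;  // step 7: perform the update
    std::printf("updated %s at seq %llu\n", key.c_str(), (unsigned long long)s);
  }
  return 0;
}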
3.3. Update based on secondary index
This scenario is similar to 3.2, except for one extra step: locating the primary key from the secondary index.
1). Create a snapshot and scan the secondary index with an iterator.
2). Look up the primary key from the secondary-index entry; this is again a call to get_row_by_rowid, which tries to lock the key.
3). Continue to the next secondary-index entry and try to lock the corresponding primary key.
4). End when the returned secondary-index entries no longer satisfy the condition.
3.4 Differences in locking from InnoDB
As mentioned earlier, RocksDB differs from InnoDB in that its updates are still snapshot reads while InnoDB's are current reads, which produces behavioral differences. Consider a range update at the RC isolation level where a transaction needs to update 1000 records. Because locking happens while scanning, the scan may reach, say, the 999th record and find that its Sequence is newer than the scan snapshot (the key has been updated by another transaction). A new snapshot is then taken, and the latest value of the key is read through it. InnoDB has no such problem: with current reads, if the 999th record was updated during the scan, InnoDB directly sees the latest version. In this case RocksDB and InnoDB end up seeing the same result. Now consider another case: a key newly inserted into the scan range. Its Sequence is necessarily newer than the scan snapshot, so the key is filtered out during the scan; no conflict detection ever happens, and the key is never found. For example, suppose records with id 1 and id 900 are inserted during the update; the id 900 record ends up not being updated because it is not visible to the snapshot. InnoDB, reading the current version, sees the newly inserted id 900 record and updates it, and this is where it differs from RocksDB.
Besides the snapshot-based update difference, RocksDB locking is also simpler: locking only ever involves unique indexes. Specifically, during an update only the primary key is locked; when an updated column carries a unique constraint, that constraint also needs to be locked; ordinary secondary indexes are never locked. The purpose of locking is to avoid unique-constraint conflicts, so any updated unique constraint (the primary key or a unique index) must be locked. InnoDB, by contrast, must lock every index involved: for example, an update located through a secondary index locks the secondary index as well. The reason for this difference is that InnoDB must implement the RR isolation level. A word on isolation levels: the RR level as defined by MySQL differs somewhat from the isolation levels defined by the SQL standard. The SQL standard defines the RR level to solve non-repeatable reads and the Serializable level to solve phantom reads; non-repeatable read is about the value of the same record not changing between reads, while phantom read is about the set of records returned by two reads neither growing nor shrinking. MySQL's RR level solves both non-repeatable reads and phantoms, and InnoDB's implementation of RR relies on GAP locks. RocksDB does not support GAP locks (it supports only unique-constraint checking, which locks a key even if it does not exist), because its snapshot-based mechanism can effectively filter out newly inserted records, whereas InnoDB, using current reads, must forbid concurrent inserts through gap locks; its secondary-index locks exist mainly to lock the gaps, since otherwise two current reads might return different results. Of course, at the RC isolation level, InnoDB does not need to lock ordinary secondary indexes either.
4. Deadlock detection algorithm
Deadlock detection uses DFS (Depth-First Search). The basic idea: when a lock request must wait, record the wait-for relationship, then follow the wait-for relationships of the transactions being waited on; if a cycle is found, a deadlock has occurred. Of course, in a highly concurrent system the wait-for relationships can be very complex, so to bound the resources deadlock detection consumes, the search depth can be limited via deadlock_detect_depth. Alternatively, in business scenarios where deadlocks are believed impossible, deadlock detection can be turned off entirely, which helps system concurrency to some extent. Note that if detection is disabled, the lock wait timeout should be set small, so that when a deadlock does occur, transactions do not hang for long. The basic process of deadlock detection is as follows:
1. Locate the stripe for the key and acquire the stripe mutex.
2. Call AcquireLocked to try to take the lock.
3. If locking fails, deadlock detection is triggered.
4. Call IncrementWaiters to record the new waiter.
5. If the transaction being waited on is not itself waiting in the map, there can be no deadlock; return.
6. Otherwise, follow the wait-for relationships through wait_txn_map_ to check for a cycle. (A minimal sketch of this cycle check appears after the data structures below.)
7. If a cycle is found, call DecrementWaitersImpl to remove the just-added wait relationship and report a deadlock error.
Related data structures:
class TransactionLockMgr {
  // Must be held when modifying wait_txn_map_ and rev_wait_txn_map_.
  std::mutex wait_txn_map_mutex_;
  // Maps from waitee -> number of waiters.
  HashMap<TransactionID, int> rev_wait_txn_map_;
  // Maps from waiter -> waitee.
  HashMap<TransactionID, TransactionID> wait_txn_map_;
  // IncrementWaiters / DecrementWaiters maintain these two maps.
};

struct TransactionOptions {
  bool deadlock_detect = false;        // whether to detect deadlocks
  int64_t deadlock_detect_depth = 50;  // deadlock detection search depth
  int64_t lock_timeout = -1;           // lock wait timeout; commonly set to 5s online
  int64_t expiration = -1;             // lock hold time
};
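As promised above, here is a minimal self-contained sketch of the cycle check in step 6, assuming the simplified waiter -> waitee map and a depth limit analogous to deadlock_detect_depth. Because wait_txn_map_ stores a single waitee per waiter, a simple walk suffices in this model; the real RocksDB implementation differs in detail:

#include <cstdint>
#include <cstdio>
#include <unordered_map>

using TransactionID = uint64_t;

// Follow waiter -> waitee edges starting from `me`; a path that leads back
// to `me` within `depth` steps is a cycle, i.e. a deadlock.
bool DetectDeadlock(
    const std::unordered_map<TransactionID, TransactionID>& wait_txn_map,
    TransactionID me, int64_t depth) {
  TransactionID cur = me;
  for (int64_t i = 0; i < depth; ++i) {
    auto it = wait_txn_map.find(cur);
    if (it == wait_txn_map.end()) return false;  // chain ends: no cycle
    cur = it->second;
    if (cur == me) return true;  // looped back to the start: deadlock
  }
  return false;  // depth exhausted: give up, like deadlock_detect_depth
}

int main() {
  // txn 1 waits on 2, 2 waits on 3, 3 waits on 1 -> cycle.
  std::unordered_map<TransactionID, TransactionID> wait = {{1, 2}, {2, 3}, {3, 1}};
  std::printf("deadlock: %s\n", DetectDeadlock(wait, 1, 50) ? "yes" : "no");
  return 0;
}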