How to understand SSTable in LevelDB source code 07/16 Update SLTechnology News&Howtos

How to understand SSTable in LevelDB source code

2025-07-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

How to understand LevelDB source code in SSTable, many novices are not very clear about this, in order to help you solve this problem, the following editor will explain in detail for you, people with this need can come to learn, I hope you can gain something.

MemTable is a memory table, and when the memory table grows to a certain extent (memtable.size > Options::write_buffer_size), the current MemTable data is persisted (there are actually two MemTable in LevelDB, which will be mentioned later in the LevelDB database memo). Persistent files (sst files) called Table in Table,LevelDB are divided into different levels. The level of the current version is 7 (0-6), the data of level0 in table is * *, and the data of level6 is the oldest.

The Compaction action is responsible for triggering the conversion from memory table to SSTable, which is also performed when the LOG is restored. You don't care about any details of Compaction or recovery here, and you will remember it separately later.

LevelDB completes the construction of the SSTable through the BuildTable method, which creates the SSTable file and writes the records in the memtable to the file in turn. BuildTable takes an output parameter, FileMetaData:

1 struct FileMetaData {2 int refs; 3 int allowed_seeks; / / Seeks allowed until compaction 4 uint64_t number; 5 uint64_t file_size; / / File size in bytes 6 InternalKey smallest; / / Smallest internal key served by table 7 InternalKey largest / / Largest internal key served by table 8 9 FileMetaData (): refs (0), allowed_seeks (1 closed); 4 if (! ok ()) return; 5 if (r-> num_entries > 0) {6 assert (r-> options.comparator- > Compare (key, Slice (r-> last_key)) > 0); 7} 8 9 / / Index Block:Data Block index metadata. 10 if (r-> pending_index_entry) {11 assert (r-> data_block.empty ()); 12 r-> options.comparator- > FindShortestSeparator (& r-> last_key, key); 13 std::string handle_encoding; 14 r-> pending_handle.EncodeTo (& handle_encoding); 15 r-> index_block.Add (r-> last_key, Slice (handle_encoding)) 16 r-> pending_index_entry = false; 17} 18 19 r-> last_key.assign (key.data (), key.size ()); 20 r-> num_entries++; 21 r-> data_block.Add (key, value); 22 23 const size_t estimated_block_size = r-> data_block.CurrentSizeEstimate () 24 if (estimated_block_size > = r-> options.block_size) {25 Flush (); / / exceeds the single block size and writes to the file. 26} 27}

The Add method creates Data Block, and IndexBlock,DataBlcok is brushed into the disk file through Flush.

Let's look at the Finish method:

1 Status TableBuilder::Finish () {2 / / Data Block 3 Rep* r = rep_; 4 Flush (); 5 6 assert (! r-> closed); 7 r-> closed = true; 8 9 / / Meta Block 10 BlockHandle metaindex_block_handle; 11 BlockHandle index_block_handle 12 if (ok ()) 13 {14 BlockBuilder meta_index_block (& r-> options); 15 / / TODO (postrelease): Add stats and other meta blocks 16 WriteBlock (& meta_index_block, & metaindex_block_handle) 17} 18 19 / / Index Block 20 if (ok ()) {21 if (r-> pending_index_entry) {22 r-> options.comparator- > FindShortSuccessor (& r-> last_key); 23 std::string handle_encoding; 24 r-> pending_handle.EncodeTo (& handle_encoding) 25 r-> index_block.Add (r-> last_key, Slice (handle_encoding)); 26 r-> pending_index_entry = false; 27} 28 WriteBlock (& r-> index_block, & index_block_handle); 29} 30 31 / / Footer 32 if (ok ()) 33 {34 Footer footer 35 footer.set_metaindex_handle (metaindex_block_handle); / / 36 footer.set_index_handle (index_block_handle); 37 std::string footer_encoding; 38 footer.EncodeTo (& footer_encoding); 39 r-> status = r-> file- > Append (footer_encoding) 40 if (r-> status.ok ()) {41 r-> offset + = footer_encoding.size (); 42} 43} 44 return r-> status; 45}

Finish writes in turn: a piece of Data Block that has not yet been written and Meta Block, Index Block, Footer. Meta Block is not in use, while Footer saves the location information of meta block and index block.

Block

Data Block, Meta Block and Index Block are service divisions, which represent user data block, metadata block and user data index block respectively. Its storage format is Block structure:

Record represents a piece of data, and the blue and red parts (collectively referred to as "restart points") are additional information, but what do these do? Two points: performance optimization and space saving.

Let's first look at the logic for building the Restart list:

1 void BlockBuilder::Add (const Slice& key, const Slice& value) {2 Slice last_key_piece (last_key_); 3. 4 size_t shared = 0; 5 if (counter_

< options_->

Block_restart_interval) {6 / / See how much sharing to do with previous string 7 const size_t min_length = std::min (last_key_piece.size (), key.size ()); 8 while ((shared)

< min_length) && (last_key_piece[shared] == key[shared])) { 9 shared++; 10 } 11 } 12 else { //restart时保存完整的key值 13 // Restart compression 14 restarts_.push_back(buffer_.size()); 15 counter_ = 0; 16 } 17 const size_t non_shared = key.size() - shared; 18 19 //Record信息 20 // shared size | no shared size | value size | no shared key data | value data 21 // Add "" to buffer_ 22 PutVarint32(&buffer_, shared); 23 PutVarint32(&buffer_, non_shared); 24 PutVarint32(&buffer_, value.size()); 25 // Add string delta to buffer_ followed by value 26 buffer_.append(key.data() + shared, non_shared); 27 buffer_.append(value.data(), value.size()); 28 29 // Update state 30 last_key_.resize(shared); 31 last_key_.append(key.data() + shared, non_shared); 32 assert(Slice(last_key_) == key); 33 counter_++; 34 } 每隔一定间隔(block_restart_interval)Record就会创建一个新的重启点，重启点内容为当前block的大小，即重启点在block内的偏移。每当添加一个新的重启点时，重启点指向位置的Record中一定保存了完整的key值(shared size = 0),随后的Record中保存的key值仅为和上一个Record的差异值。因为Key在Block中是有序排列的，所以相邻key值重叠区域节省的空间还是非常可观的。基于上述实现，问题来了：当需要定位一条记录时，因为record中key的信息是不完整的，仅包含了和上一条的差异项，但上一条记录本身也只包含了和再上一条的差异项，那么定位一条记录时如何做key比较?如果需要一直向上查找完成key值拼接，性能上会不会有损伤? 分析这个问题就要了解重启点的定位：Block的一级索引，SSTable的二级索引(Index Block是SSTable的一级索引)。本文将每个重启点记录位置所属的Record列表称为一个Restart Block 假设每条record记录的都是完整的key值时，从SSTable中查找一条记录的工作流如下：根据Key值从Index Block中找到所属的Data Block 根据Key值从"重启点"列表中找到所属的Restart Block，从Restart Block的起始位置进行key值比较，找到正确的记录。在上述流程中，我们必定会先找到一个Restart Point，随后进行key值比较，而Restart Point记录本身包含了完整的key值信息，后续key值均可基于此key得到。 Restart列表本身做为索引，提升了查找性能，而key值存储的小技巧又降低了空间使用率，在不损伤性能的情况小降低空间利用率，这是一个很好的例子。即使这样，作者觉得还不够，空间利用率还可以进一步优化，并且不损伤任何读取数据的性能。做法和Restart列表的做法类似，是在Index Block中，通过调用FindShortestSeparator / FindShortSuccessor方法实现。 1 // If *start < limit, changes *start to a short string in [start,limit). 2 // Simple comparator implementations may return with *start unchanged, 3 // i.e., an implementation of this method that does nothing is correct. 4 virtual void FindShortestSeparator(std::string* start, const Slice& limit) const = 0; 5 6 // Changes *key to a short string >

= * key. 7 / / Simple comparator implementations may return with * key unchanged, 8 / / i.e., an implementation of this method that does nothing is correct. 9 virtual void FindShortSuccessor (std::string* key) const = 0

FindShortestSeparator finds the shortest key value between start and limit, for example, the shortest key value between "helloworld" and "hellozoomer" can be "hellox". FindShortSuccessor is even more extreme, and is used to find the minimum key that is larger than the key value. For example, if "helloworld" is passed in, the returned key value may be "I".

Is it helpful for you to read the above content? If you want to know more about the relevant knowledge or read more related articles, please follow the industry information channel, thank you for your support.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.