What is the overall structure of LevelDB
In this article, the editor introduces in detail "what is the overall structure of LevelDB". The content is detailed, the steps are clear, and the details are handled carefully. I hope this article can help you resolve your doubts.
A metaphor
LevelDB is a bit like a building, divided into two parts: the part below ground and the part above it, that is, disk and memory. The underground part, like the earth's crust, is divided into many levels, and data regularly moves from top to bottom, sinking layer by layer. If cold data at the bottom of the disk is modified, it enters memory again; after a while it is persisted back to a shallow layer of the disk files, and then slowly migrates down toward the bottom, over and over, like the earth's water cycle.
Memory structure
LevelDB maintains two skip lists in memory: a read-only rtable and a writable wtable. (The skip list is explained in detail in my other book, Redis Deep Adventures, so I won't repeat the full explanation here.) Simply put, a skip list is a Key-ordered Set, with the ordering determined by a global "comparator" that defaults to lexicographic order. Both lookup and update on a skip list run in O(log n) time.
A skip list consists of multiple levels of linked lists, of which the lowest one stores all the Keys in order. An ordinary linked list does not support fast binary search, but the special structure of the skip list lets the lowest list locate a given node with roughly the efficiency of binary search. Intuitively, the skip list combines the fast positioning of a sorted array with the cheap insertion and deletion of a linked list, at the price of extra space and some implementation complexity.
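As an illustration, here is a minimal skip list sketch in Python. It shows the idea only, not LevelDB's actual C++ implementation: the max height of 8 and the 1/2 promotion probability are assumptions of this sketch (LevelDB itself uses a max height of 12 and a lower promotion probability).

```python
import random

MAX_LEVEL = 8  # assumed cap for this sketch

class Node:
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height   # forward[i] = next node on level i

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)
        self.level = 1

    def _random_height(self):
        # Promote to each higher level with probability 1/2 (sketch value).
        h = 1
        while h < MAX_LEVEL and random.random() < 0.5:
            h += 1
        return h

    def search(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):      # walk top level down
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]                       # candidate on level 0
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * MAX_LEVEL             # rightmost node per level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        height = self._random_height()
        self.level = max(self.level, height)
        new = Node(key, height)
        for i in range(height):                      # splice into each level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

sl = SkipList()
for k in ["banana", "apple", "cherry"]:
    sl.insert(k)
print(sl.search("apple"), sl.search("durian"))       # True False
```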
If the skip list stores only Keys, where does the Value go? The answer is that the Value also lives inside the skip list's Key. The Key stored in the skip list is special: it is a composite string that contains both the Key and the Value of the key-value pair.

internal_key = key + sequence + type
composite_key = internal_key_size + internal_key + value_size + value

Here sequence is a global auto-incrementing sequence number; every time LevelDB performs a modification, the global sequence number increases by one. A Key in LevelDB can store multiple versions of its Value, and LevelDB uses the sequence number to mark the version of each key-value pair: the larger the sequence number, the newer the pair.
type is the operation type, marking the entry as a Put or a Delete. It has only two values: 0 means Delete and 1 means Put.
For a delete operation, the value_size field is 0 and the value field is empty; in other words, a Delete is encoded as a special kind of Put. To save storage space, both internal_key_size and value_size are encoded as varint integers.
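The layout above can be sketched as follows, assuming LevelDB's usual convention of packing sequence and type into an 8-byte trailer (sequence << 8 | type) appended to the user key; the function names here are illustrative, not LevelDB's API.

```python
import struct

TYPE_DELETION, TYPE_VALUE = 0, 1       # 0 means Delete, 1 means Put

def encode_varint32(n: int) -> bytes:
    # 7 bits per byte; the high bit marks "more bytes follow".
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_entry(key: bytes, value: bytes, sequence: int, type_: int) -> bytes:
    # internal_key = key + sequence + type (trailer packed into 8 bytes)
    trailer = struct.pack("<Q", (sequence << 8) | type_)
    internal_key = key + trailer
    # composite_key = internal_key_size + internal_key + value_size + value
    return (encode_varint32(len(internal_key)) + internal_key +
            encode_varint32(len(value)) + value)

put = encode_entry(b"name", b"leveldb", sequence=7, type_=TYPE_VALUE)
# A delete carries no value: value_size is 0 and the value field is empty.
delete = encode_entry(b"name", b"", sequence=8, type_=TYPE_DELETION)
print(put.hex(), delete.hex())
```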
If the same key is modified multiple times, the skip list contains multiple such "composite Keys" for it, and they sort next to each other, ordered by sequence. When a Get arrives, it locates the key's position in the skip list, picks the composite Key with the largest sequence among the entries for that key, extracts its value, and returns it.
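A small sketch of that version selection, with entries modeled as plain (key, sequence, type, value) tuples rather than encoded strings; a real lookup would seek inside the skip list instead of scanning a list.

```python
TYPE_DELETION, TYPE_VALUE = 0, 1

# Composite keys sort by user key ascending, then sequence descending,
# so for a given key the first entry found is the newest version.
entries = sorted(
    [(b"name", 7, TYPE_VALUE, b"leveldb"),
     (b"name", 9, TYPE_DELETION, b""),
     (b"name", 8, TYPE_VALUE, b"rocksdb")],
    key=lambda e: (e[0], -e[1]),
)

def get(key: bytes):
    for k, seq, type_, value in entries:
        if k == key:
            # Largest sequence wins; a Delete marker means "not found".
            return None if type_ == TYPE_DELETION else value
    return None

print(get(b"name"))   # None: seq 9 is a Delete and shadows the older Puts
```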
After the operation log of a Put or Delete has been written to the log file, its key-value pair is assembled into a composite Key and inserted at the appropriate position in the wtable.
When the wtable reaches a size threshold, LevelDB freezes it into a read-only rtable and creates a new wtable to keep accepting writes. The rtable is then flushed to disk by an asynchronous thread. A Get queries the wtable first; if the key is not there, it tries the rtable; if it is not in the rtable either, it goes to the disk files.
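A minimal sketch of this read path, using a hypothetical MemLayer class with plain dicts standing in for the two skip lists:

```python
class MemLayer:
    def __init__(self):
        self.wtable = {}      # stand-in for the writable skip list
        self.rtable = None    # frozen, read-only table awaiting flush

    def put(self, key, value, threshold=4096):
        self.wtable[key] = value
        if len(self.wtable) >= threshold and self.rtable is None:
            # Freeze the full wtable and start a fresh one.
            self.rtable, self.wtable = self.wtable, {}

    def get(self, key):
        if key in self.wtable:            # 1. writable table first
            return self.wtable[key]
        if self.rtable is not None and key in self.rtable:
            return self.rtable[key]       # 2. then the frozen table
        return self.search_disk(key)      # 3. finally the sst files

    def search_disk(self, key):
        return None                       # placeholder for the disk lookup
```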
Because the wtable supports concurrent reads and writes, access to it must be protected by a lock. The rtable is read-only and needs no lock, and its lifetime is short: once created, it is soon serialized to disk by the asynchronous thread and then emptied. But serialization takes time. What if the wtable fills up quickly and needs to be frozen while the previous rtable has not finished serializing? In that case the writing thread blocks, waiting for the asynchronous serialization to complete. This is one of LevelDB's stall points, and one of the things RocksDB later optimized.
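A sketch of that stall, assuming a condition variable guards the rotation (the real implementation differs in detail):

```python
import threading

lock = threading.Lock()
flushed = threading.Condition(lock)
rtable = None                          # non-None while a flush is pending

def freeze_wtable(full_wtable):
    """Writer side: rotate wtable -> rtable, stalling if a flush is pending."""
    global rtable
    with lock:
        while rtable is not None:      # previous rtable not flushed: stall
            flushed.wait()
        rtable = full_wtable           # freeze; caller starts a fresh wtable

def flush_worker():
    """Background side: serialize the rtable to layer 0, then clear it."""
    global rtable
    with lock:
        # ... serialize rtable into a layer 0 sst file here ...
        rtable = None
        flushed.notify_all()           # wake any stalled writer
```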
The figure also shows a log file, which records recent write operations. If LevelDB suffers a sudden outage, the unpersisted data in the wtable and rtable is lost and must be recovered by replaying the operations recorded in the log file. Note that there are two log files as well, corresponding to the two in-memory skip lists: when the wtable is frozen, the log file is rotated too, and once the rtable has been safely flushed to disk, its read-only log file can be deleted.
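A sketch of the idea of write-ahead logging and replay; the record format here (a 4-byte length prefix per operation) is made up for illustration and is not LevelDB's actual log format.

```python
import os, struct

def log_append(path, op: bytes):
    # Append one length-prefixed record and force it to stable storage
    # before the in-memory table is updated.
    with open(path, "ab") as f:
        f.write(struct.pack("<I", len(op)) + op)
        f.flush()
        os.fsync(f.fileno())

def log_replay(path):
    # On recovery, re-read every record and re-apply it to a fresh wtable.
    ops = []
    with open(path, "rb") as f:
        while len(header := f.read(4)) == 4:
            (size,) = struct.unpack("<I", header)
            ops.append(f.read(size))
    return ops
```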
Disk structure
On disk, LevelDB stores many sst files. sst stands for Sorted String Table: all the Keys inside a file are ordered. Each file belongs to a level, and each level has multiple files. The content of lower-level files comes from the level above; ultimately everything originates from the layer 0 files, which in turn come from serializing the in-memory rtable. One rtable is serialized into one complete layer 0 file. This is the "sinking" we described earlier.
Serializing the in-memory rtable into a layer 0 sst file is called Minor Compaction; sinking from layer n sst files into layer n+1 sst files is called Major Compaction. The reason for distinguishing them is that a Minor Compaction is fast and cheap: it simply serializes the rtable wholesale into a single sst file. A Major Compaction, by contrast, involves merging multiple files, which is slow and resource-intensive. The deeper the level, the larger its total capacity; the LevelDB source contains a per-level capacity formula showing that capacity grows exponentially with the level. Individual sst files are all roughly the same size; what differs between levels is the number of files.
capacity = level > 0 && 10^(level+1) MB
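The same rule expressed as a function, transcribed directly from the formula above (not a quote of the LevelDB source):

```python
def level_capacity_mb(level: int) -> int:
    # Layer 0 is limited by file count (4 by default), not by total bytes.
    assert level > 0
    return 10 ** (level + 1)    # capacity grows exponentially with depth

for lv in range(1, 5):
    print(f"level {lv}: {level_capacity_mb(lv)} MB")
```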
The Keys inside each file are ordered, which means each file covers a definite range of Key values. The obvious difference between layer 0 and the other layers is that within any other layer, file ranges do not overlap: the files are strictly partitioned by Key order. Layer 0 files, however, are dumped directly from memory, so the Key ranges of the multiple layer 0 files can overlap one another.
When a read misses in memory and has to search the disk, it starts from layer 0 and goes deeper only if the key is not found.
At the other levels the search can be fast, because the Key range quickly determines which single file the key could be in. At layer 0, because file ranges overlap, the key may live in multiple files, and all of them have to be searched. For this reason LevelDB limits the number of layer 0 files; once the count exceeds the default of 4, they must be "sunk" into layer 1, and this sinking is a Major Compaction.
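A sketch of candidate-file selection during a read, with file metadata modeled as (smallest, largest, name) tuples; the file names are illustrative.

```python
import bisect

def candidates_level0(files, key):
    # Layer 0 ranges may overlap, so every covering file must be checked.
    return [f for f in files if f[0] <= key <= f[1]]

def candidate_deeper(files, key):
    # Deeper layers are disjoint and sorted by range, so one binary search
    # over the largest-key column pins down at most one candidate file.
    largest = [f[1] for f in files]
    i = bisect.bisect_left(largest, key)
    if i < len(files) and files[i][0] <= key:
        return files[i]
    return None

level0 = [(b"a", b"m", "000011.sst"), (b"f", b"t", "000012.sst")]
level1 = [(b"a", b"f", "000020.sst"), (b"g", b"p", "000021.sst")]
print(candidates_level0(level0, b"h"))   # both layer 0 files cover key "h"
print(candidate_deeper(level1, b"h"))    # exactly one candidate file
```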
The Key range, level, and other metadata of every file are stored in the MANIFEST file in the database directory. When the database is opened, this file is read to learn the level and Key range of every file.
The MANIFEST file carries a version number, reflected in its name, such as MANIFEST-000361. Each time the database is reopened, a new MANIFEST with a fresh version number is generated, and the old MANIFEST file then needs to be deleted.
The database directory also contains a file named CURRENT. Its content is very simple: the file name of the current MANIFEST. LevelDB reads CURRENT first to know which MANIFEST file is the valid one. After a power outage, there is a small chance of an intermediate state in which old and new MANIFEST files coexist in the database directory.
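A sketch of that startup step: read CURRENT to find the live MANIFEST, then remove any stale ones left behind by a crash. Paths and cleanup policy here are illustrative; real recovery does more validation.

```python
import os

def current_manifest(dbdir):
    with open(os.path.join(dbdir, "CURRENT")) as f:
        return f.read().strip()        # e.g. "MANIFEST-000361"

def remove_stale_manifests(dbdir):
    live = current_manifest(dbdir)
    for name in os.listdir(dbdir):
        # Any MANIFEST not named by CURRENT is a leftover and can go.
        if name.startswith("MANIFEST-") and name != live:
            os.remove(os.path.join(dbdir, name))
```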
We know that a LevelDB database directory does not allow concurrent access from multiple processes. So how does it stop another process from accidentally reading or writing the directory's files? If you look closely at the directory, you will find a file named LOCK; it is the key to controlling multi-process access. When a process opens the database, it takes an exclusive lock on this file, and the lock is released automatically when the process exits.
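A sketch of the LOCK protocol on Unix using fcntl.flock (POSIX-only; LevelDB's POSIX Env uses a similar fcntl-based file lock):

```python
import fcntl, os

def lock_db(dbdir):
    fd = os.open(os.path.join(dbdir, "LOCK"), os.O_RDWR | os.O_CREAT)
    try:
        # Non-blocking exclusive lock; raises OSError if another process
        # already holds the database open.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise RuntimeError("database is locked by another process")
    return fd   # keep it open: the lock dies with the process
```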
Finally, there is the less important operation log file, LOG, which records key events in the database's life, such as information about each Minor and Major Compaction.
Multi-way merging
Compaction is a resource-intensive operation. So as not to disturb online reads and writes, LevelDB hands Compaction work to a single asynchronous thread. Under heavy load, this single thread can struggle, and when it falls behind, online memory reads and writes suffer too: only when the rtable has sunk to disk can the wtable be frozen, and only when the wtable is frozen can a new wtable be created to absorb more key-value pairs. In short, the stages form a chain, each linked to the next.
Let's look at Compaction more closely. Minor Compaction is easy to understand: memory is limited, so the data in the rtable must be dumped into a layer 0 file on disk. But why does layer 0 then need to be compacted down into layer 1? Because too many layer 0 files hurt lookup efficiency. As mentioned above, the Key ranges of layer 0 files overlap, so a single Key may appear in multiple files, and the number of read IOs is multiplied by the number of files. Major Compaction reduces the layer 0 file count and improves read efficiency. Then is sinking into layer 1 enough? Why does LevelDB need so many levels at all?
Suppose LevelDB had only two layers, 0 and 1. Over time, layer 1 would accumulate a large number of files. When layer 0 files need to sink, that is, when a Major Compaction comes, sinking even a single layer 0 file is not simply a matter of changing its level in the file metadata from 0 to 1. Layer 1 must remain ordered, with no overlap between the Key ranges of its files. The key-value pairs of a layer 0 file cannot simply be inserted or appended into layer 1 files, because sst files are compactly stored and insertion would require shifting disk blocks. There are also the delete operations to handle: deleted key-value pairs must be purged from the layer 1 files so they stop taking up space.
So how does LevelDB actually do it? It uses a multi-way merge algorithm: it takes the relevant layer 0 files and layer 1 sst files as input, merges them into several new layer 1 sst files, deletes the old sst files, and generates a new MANIFEST file. For each layer 0 file, it finds the layer 1 sst files whose Key ranges overlap with its own. If layer 1 has too many files, each merge involves too many inputs and becomes very expensive, so LevelDB also limits the number of files in layer 1: when layer 1 is full, it continues to sink into layer 2, layer 3, layer 4, and so on.
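A sketch of the multi-way merge at the heart of Major Compaction, assuming each input yields (key, sequence, type, value) tuples in composite-key order (key ascending, sequence descending). heapq.merge performs the k-way merge; only the newest version of each key survives, and a Delete marker at the newest version drops the key. This is simplified: the real LevelDB discards a deleted key only when no deeper level could still contain it.

```python
import heapq

TYPE_DELETION, TYPE_VALUE = 0, 1

def compact(*sorted_inputs):
    merged = heapq.merge(*sorted_inputs, key=lambda e: (e[0], -e[1]))
    last_key = None
    for key, seq, type_, value in merged:
        if key == last_key:
            continue                        # older version: shadowed, dropped
        last_key = key
        if type_ == TYPE_VALUE:
            yield (key, seq, type_, value)  # newest live version survives
        # TYPE_DELETION at the newest version: the key is garbage-collected

level0 = [(b"a", 9, TYPE_VALUE, b"1"), (b"b", 8, TYPE_DELETION, b"")]
level1 = [(b"a", 3, TYPE_VALUE, b"0"), (b"b", 2, TYPE_VALUE, b"old")]
print(list(compact(level0, level1)))        # only (b"a", 9, 1, b"1") remains
```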
Multi-way merges at the non-zero levels consume fewer resources, because a single file covers a limited Key range and therefore overlaps only a limited number of files in the next layer, so fewer input files take part in each merge. But there is a catch in this reasoning: adjacent levels differ in capacity by a factor of 10, which means that on average the Key range of one file at a given level covers about 10 files at the next level. So the cost of non-zero-level merges is actually not low either, and Major Compaction remains a resource-intensive operation.
This concludes the article "what is the overall structure of LevelDB". To truly master these points, you still need to practice and apply them yourself. If you want to read more related articles, welcome to follow the industry information channel.