Introduction
Databases, operating systems, and compilers are known as the three major systems, and together they can be regarded as the cornerstone of computer software. Among them, the database is closest to the application layer and underpins a great many businesses. After decades of development, this field continues to see new progress.
Many people have used a database, but few have implemented one, especially a distributed one. Understanding the implementation principles and details of a database improves one's own engineering skill and helps in building other systems; it also helps one use databases well.
The best way to study a technology is to study an open source project in that area, and databases are no exception. When it comes to distributed databases, however, there are not many good open source projects. TiDB has received a lot of attention, especially from technology enthusiasts who want to participate in the project. Because of the inherent complexity of a distributed database, many people find it hard to understand the whole project well, so I hope to write a series of articles, working from the top down and from shallow to deep, about the technical principles of TiDB, covering both the technologies visible to users and the large number of technical points hidden behind the SQL interface.
Saving Data
The most basic function of a database is to save data, so let's start there.
There are many ways to save data. The easiest is to build a data structure directly in memory to hold the data users send: for example, use an array, appending a record to it each time a piece of data is received. This scheme is very simple, meets the most basic requirement, and will certainly perform well, but apart from that it is riddled with problems. The biggest one is that the data lives entirely in memory: once the machine crashes or the service restarts, the data is lost forever.
To solve the problem of data loss, we can put the data on a non-volatile storage medium (such as a hard disk). The improved scheme is to create a file on disk and append a line to it for each piece of data received. OK, we now have a scheme for persistent data storage. But it is not good enough. Suppose something goes wrong with the disk? We can use RAID (Redundant Array of Independent Disks) for single-machine redundant storage. What if the whole machine goes down? Say there is a fire: RAID cannot save the data then. We can switch to network storage instead, or replicate the data through hardware or software. It seems we have now solved the data-safety problem and can breathe a sigh of relief. But can we guarantee consistency between replicas during replication? That is, on the premise that data is not lost, we must also guarantee that the data is correct. Guaranteeing that data is not lost is only the most basic requirement; there are more headaches waiting to be solved:
- Can disaster recovery across data centers be supported?
- Is the write speed fast enough?
- After the data is saved, is it easy to read?
- How can saved data be modified? How can concurrent modifications be supported?
- How can multiple records be modified atomically?
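As an aside, the naive append-to-file scheme described above really is only a few lines. A minimal sketch in Rust, using only the standard library (names are illustrative):

    use std::fs::OpenOptions;
    use std::io::Write;

    // Append one record per line and fsync, so the data survives a process
    // restart (though not a disk failure, let alone a fire).
    fn append_record(path: &str, record: &[u8]) -> std::io::Result<()> {
        let mut f = OpenOptions::new().create(true).append(true).open(path)?;
        f.write_all(record)?;
        f.write_all(b"\n")?;
        f.sync_all()?; // force the OS to flush to the storage medium
        Ok(())
    }

Every question in the list above is something this sketch cannot answer.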
Each of the problems in the list above is very difficult, but to build an excellent data storage system, every one of them must be solved. To solve the problem of data storage, we developed the TiKV project. Next, I would like to introduce some of the design ideas and basic concepts of TiKV.
Key-Value
For a system that saves data, the first thing to decide is the storage model, that is, in what form the data will be saved. TiKV's choice is the Key-Value model, together with a method for ordered traversal. In a nutshell, think of TiKV as a huge Map, in which Key and Value are both raw Byte arrays and Keys are ordered by the raw binary comparison of their Byte arrays. There are two things to remember about TiKV:
- It is a huge Map; that is, what it stores are Key-Value pairs.
- The Key-Value pairs in this Map are ordered by the binary order of the Keys; that is, we can Seek to the position of some Key and then repeatedly call Next to fetch the Key-Value pairs larger than that Key in increasing order (see the sketch below).
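To make these two points concrete, here is a minimal sketch in Rust of the same semantics using the standard library's ordered map. This illustrates the logical model only; it is not TiKV code:

    use std::collections::BTreeMap;

    fn main() {
        // The logical model: an ordered map from raw byte keys to byte values.
        let mut kv: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
        kv.insert(b"apple".to_vec(), b"1".to_vec());
        kv.insert(b"banana".to_vec(), b"2".to_vec());
        kv.insert(b"cherry".to_vec(), b"3".to_vec());

        // "Seek" to the first key >= "b", then keep calling Next: pairs come
        // back in ascending binary order of the key.
        for (k, v) in kv.range(b"b".to_vec()..) {
            println!("{} -> {}", String::from_utf8_lossy(k), String::from_utf8_lossy(v));
        }
    }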
Having said all this, one might ask: what is the relationship between this storage model and a table in SQL? Here is an important thing that must be said four times:
The storage model here has nothing to do with Table in SQL! The storage model here has nothing to do with Table in SQL! The storage model here has nothing to do with Table in SQL! The storage model here has nothing to do with Table in SQL!
Now let's forget any concepts from SQL and focus on how to implement TiKV, a huge (distributed) Map with high performance and high reliability.
RocksDB
In any persistent storage engine, data is ultimately saved on disk, and TiKV is no exception. But TiKV does not write data directly to disk; instead, it saves data in RocksDB, which is responsible for actually persisting it. The reason for this choice is that developing a standalone storage engine is a great deal of work, especially a high-performance one, which requires all kinds of detailed optimization. RocksDB is an excellent open-source standalone storage engine that meets all of our requirements for a standalone engine, and Facebook's team keeps optimizing it, so with very little effort of our own we get a very powerful and continually improving engine. Of course, we have also contributed some code to RocksDB, and we hope the project gets better and better. Here you can simply think of RocksDB as a standalone Key-Value Map.
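As a taste of what "a standalone Key-Value Map" means in practice, here is a minimal sketch using the community rust-rocksdb bindings (the crate named rocksdb; the API is shown roughly as of version 0.21, and the exact iterator item type varies between versions):

    use rocksdb::{DB, Direction, IteratorMode};

    fn main() -> Result<(), rocksdb::Error> {
        let db = DB::open_default("/tmp/rocksdb-demo")?;

        // RocksDB is itself an ordered, persistent Key-Value store.
        db.put(b"key1", b"value1")?;
        db.put(b"key2", b"value2")?;

        if let Some(v) = db.get(b"key1")? {
            println!("key1 -> {}", String::from_utf8_lossy(&v));
        }

        // Seek + Next: scan keys >= "key1" in ascending binary order.
        for item in db.iterator(IteratorMode::From(b"key1", Direction::Forward)) {
            let (k, v) = item?;
            println!("{} -> {}", String::from_utf8_lossy(&k), String::from_utf8_lossy(&v));
        }
        Ok(())
    }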
Raft
Well, the first step of the long march has been taken: we have an efficient and reliable local storage solution for the data. As the saying goes, everything is hard at the beginning, in the middle, and at the end. Next we face a harder problem: how do we ensure that data is neither lost nor corrupted when a single machine fails? Put simply, we need to replicate data to multiple machines, so that when one machine dies we still have copies on others; put less simply, we need the replication scheme to be reliable, efficient, and able to handle replica failure. This sounds difficult, but the good news is that we have the Raft protocol. Raft is a consensus algorithm equivalent to Paxos, but easier to understand. The Raft paper is worth reading if you are interested; this article gives only a brief introduction, so please refer to the paper for details. In addition, the Raft paper presents only a basic scheme; implemented strictly according to the paper, performance would be very poor. We have made many optimizations in our implementation of the Raft protocol; for the specific details, see the article by our chief architect Tang Liu.
Raft is a consensus protocol that provides several important functions:
- Leader election
- Membership changes
- Log replication
TiKV uses Raft for data replication: each data change is recorded as a Raft log entry. Through Raft's log replication, data is synchronized safely and reliably to the majority of nodes in the Raft Group.
To sum up so far: through standalone RocksDB, we can store data on disk quickly; through Raft, we can replicate data to multiple machines to guard against single-machine failure. Data is written through the interface of the Raft layer rather than directly to RocksDB. By implementing Raft, we have a distributed KV, and we no longer have to worry about one machine dying.
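The write path just described can be summarized in a small sketch. This is not TiKV's actual API, just the shape of the idea: clients propose changes to the Raft layer, and only committed log entries are applied to each replica's local RocksDB:

    // Hypothetical types for illustration only.
    enum Op {
        Put { key: Vec<u8>, value: Vec<u8> },
        Delete { key: Vec<u8> },
    }

    trait RaftGroup {
        // Propose an operation; returns once a majority of replicas have
        // persisted the corresponding Raft log entry. Yields the log index.
        fn propose(&mut self, op: Op) -> Result<u64, String>;
    }

    trait StateMachine {
        // Called on every replica, in log order, for committed entries only;
        // this is where the change finally lands in the local RocksDB.
        fn apply(&mut self, index: u64, op: &Op);
    }

    fn write(raft: &mut dyn RaftGroup, key: Vec<u8>, value: Vec<u8>) -> Result<u64, String> {
        // Writes never touch RocksDB directly; they go through the Raft log.
        raft.propose(Op::Put { key, value })
    }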
Region
At this point, we can introduce a very important concept: Region. It is the basis for understanding the series of mechanisms that follow, so please read this section carefully.
Scattering data across multiple machines, as discussed here, is not the same concept as Raft's data replication. In this section, let's forget Raft for a moment and assume that all data has only one replica, which makes things easier to understand.
For a KV system, there are two typical schemes for distributing data across multiple machines: one is to hash by Key and choose the corresponding storage node according to the hash value; the other is to split by Range, with a contiguous segment of Keys stored on one storage node. TiKV chooses the second way: the entire Key-Value space is divided into many segments, each of which is a series of consecutive Keys. We call each segment a Region, and we try to keep the data stored in each Region under a certain size (this size is configurable; the default is 64MB). Each Region can be described by a left-closed, right-open interval from StartKey to EndKey.
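A Region is thus nothing more than a key interval. A minimal sketch of the idea in Rust (illustrative names, not TiKV's actual types), using the common convention that an empty boundary key stands for infinity:

    struct Region {
        start_key: Vec<u8>, // inclusive; empty means "from the very beginning"
        end_key: Vec<u8>,   // exclusive; empty means "to the very end"
    }

    impl Region {
        // Membership test for the left-closed, right-open interval
        // [start_key, end_key).
        fn contains(&self, key: &[u8]) -> bool {
            key >= self.start_key.as_slice()
                && (self.end_key.is_empty() || key < self.end_key.as_slice())
        }
    }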
Note that a Region here still has nothing to do with a table in SQL! Please keep forgetting SQL and stick to KV. After dividing the data into Regions, we will do two important things:
- Distribute the data across all nodes in the cluster in units of Regions, trying to keep the number of Regions served by each node roughly equal.
- Perform Raft replication and membership management in units of Regions.
These two points are very important. Let's talk about them one by one.
Let's look at the first point. The data is divided by Key into many Regions, and each Region's data is saved on only one node. Our system will have a component responsible for spreading Regions as evenly as possible across all nodes in the cluster. On the one hand, this scales storage capacity horizontally (after a new node is added, Regions on other nodes are automatically scheduled onto it); on the other hand, it achieves load balancing (one node will not end up with lots of data while others have none). At the same time, to let upper-layer clients access the data they need, the system will also have a component that records the distribution of Regions over nodes; that is, given any Key, we can find out which Region the Key belongs to and which node that Region currently lives on. Which component is responsible for these two jobs will be introduced later.
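A routing table like the one just described can be sketched as an ordered map from each Region's EndKey to its metadata: the Region covering a Key is the first entry whose EndKey is strictly greater than the Key. This is an illustration only, not the real component's API; the node_id field is a hypothetical stand-in for "which node currently serves this Region":

    use std::collections::BTreeMap;
    use std::ops::Bound;

    struct Router {
        // end_key -> (start_key, node_id). For brevity, this sketch ignores
        // the "infinite" last Region whose end_key is empty.
        by_end_key: BTreeMap<Vec<u8>, (Vec<u8>, u64)>,
    }

    impl Router {
        // Given any key, find the Region covering it and the node serving it.
        fn locate(&self, key: &[u8]) -> Option<(&[u8], u64)> {
            // First Region whose end_key > key.
            let (end_key, (start_key, node_id)) = self
                .by_end_key
                .range::<[u8], _>((Bound::Excluded(key), Bound::Unbounded))
                .next()?;
            // Confirm the key actually falls inside [start_key, end_key).
            if key >= start_key.as_slice() {
                Some((end_key.as_slice(), *node_id))
            } else {
                None
            }
        }
    }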
For the second point, TiKV replicates data in units of Regions: the data of one Region is kept in multiple copies, and we call each copy a Replica. Replicas are kept consistent through Raft (so Raft and Region are finally connected): the Replicas of one Region are saved on different nodes and together form a Raft Group. One Replica serves as the Leader of the Group and the others as Followers. All reads and writes go through the Leader and are then replicated to the Followers.
With Regions as the unit for distributing and replicating data, we have a distributed Key-Value system with some disaster-recovery capability, and we no longer need to worry about data not fitting on a disk or being lost to a disk failure. That is already cool, but not perfect; we need more features.
MVCC
Many databases implement multi-version concurrency control (MVCC), and TiKV is no exception. Imagine a scenario in which two Clients modify the Value of the same Key at the same time. Without MVCC, the data would need to be locked, which in a distributed scenario can cause performance problems and deadlocks. TiKV implements MVCC by appending a Version to the Key. Briefly, before MVCC you can think of TiKV's keys as laid out like this:
    Key1 -> Value
    Key2 -> Value
    ...
    KeyN -> Value
With MVCC, the Key arrangement of TiKV looks like this:
    Key1-Version3 -> Value
    Key1-Version2 -> Value
    Key1-Version1 -> Value
    ...
    Key2-Version4 -> Value
    Key2-Version3 -> Value
    Key2-Version2 -> Value
    Key2-Version1 -> Value
    ...
    KeyN-Version2 -> Value
    KeyN-Version1 -> Value
    ...
Note that for multiple versions of the same Key, we put the larger version number first and the smaller one later (recall from the Key-Value section that Keys are stored in order). This way, when a user fetches the Value for a Key + Version, the Key and Version can be combined into the MVCC Key, Key-Version, and we can directly Seek(Key-Version) to locate the first position greater than or equal to that Key-Version.
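One common way to get this "larger version first" ordering under plain ascending binary comparison is to append the bitwise complement of the version number as a big-endian suffix. A minimal sketch in Rust of that idea (illustrative; TiKV's real encoding also has to escape the user key so the suffix cannot be confused with key bytes):

    // Encode Key + Version so that, for the same key, larger versions sort first.
    fn mvcc_key(key: &[u8], version: u64) -> Vec<u8> {
        let mut k = key.to_vec();
        // !version equals u64::MAX - version, so version 3 gets a smaller
        // suffix than version 2 and therefore sorts in front of it.
        k.extend_from_slice(&(!version).to_be_bytes());
        k
    }

Reading a Key at a given Version is then exactly the Seek described above: Seek(mvcc_key(key, version)) lands on the newest stored version that is no greater than the requested one.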
Transactions
TiKV's transactions use the Percolator model, with many optimizations. The details of transactions will not be covered here; you can refer to the paper and our other articles. The one point to mention here is that TiKV transactions use optimistic locking: write conflicts are not detected during transaction execution but only during the commit phase. Whichever party finishes committing first writes successfully; the other party retries the whole transaction. When write conflicts in the workload are not severe, this model performs very well, for example when randomly updating single rows of a very large table. When write conflicts are severe, however, performance is very poor. An extreme example is a counter: many clients modify a small number of rows at the same time, causing serious conflicts and a large number of futile retries.
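The commit-time conflict check can be sketched as follows. This is a deliberately simplified illustration of the optimistic scheme, not TiKV's Percolator implementation (which uses two-phase commit with a primary lock); latest_version is a hypothetical stand-in for a storage lookup of a key's newest committed version:

    use std::collections::HashMap;

    struct Txn {
        start_version: u64,
        // Writes are buffered locally during execution and are invisible to
        // other transactions until commit.
        writes: HashMap<Vec<u8>, Vec<u8>>,
    }

    impl Txn {
        fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
            // Optimistic: no locks taken, no conflict check here.
            self.writes.insert(key, value);
        }

        fn commit(&self, latest_version: impl Fn(&[u8]) -> u64) -> Result<(), String> {
            for key in self.writes.keys() {
                if latest_version(key) > self.start_version {
                    // Someone committed this key after we started: abort, and
                    // the caller re-executes the whole transaction.
                    return Err("write conflict, retry transaction".into());
                }
            }
            // No conflicts: the buffered writes would now be committed
            // atomically at a new commit version.
            Ok(())
        }
    }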
Other
The next article describes how to build an SQL layer on top of this Key-Value storage model.