
The Inside Story of TiDB Technology: Computing


Mapping from Relational Model to Key-Value Model

For simplicity, we can think of the relational model as tables plus SQL statements. The problem then becomes how to store a Table on top of a KV structure and how to run SQL statements on top of that KV structure. Suppose we have a table defined like this:

CREATE TABLE User (
    ID int,
    Name varchar(20),
    Role varchar(20),
    Age int,
    PRIMARY KEY (ID),
    KEY idxAge (Age)
);

SQL and KV structures differ enormously, so mapping between them conveniently and efficiently is a critical problem. A good mapping scheme must serve the data-manipulation operations well, so let's first look at the requirements and characteristics of those operations.

For a Table, the data to be stored consists of three parts:

1. Meta-information of the table
2. Rows in the table
3. Index data

We will set the table's meta-information aside for now; a later section covers it. For rows, we can choose either row-oriented or column-oriented storage, each with its own strengths and weaknesses. TiDB primarily targets OLTP workloads, which need to read, save, modify, and delete a single row of data quickly, so row storage is the better fit.

As for indexes, TiDB needs to support not only the Primary Index but also Secondary Indexes. Indexes exist to assist queries, improve query performance, and guarantee certain Constraints. Queries use them in two modes. One is the point lookup, such as a query on an equality condition over the Primary Key or a Unique Key, for example select name from user where id=1;, which must locate a single row of data quickly through the index. The other is the Range query, such as select name from user where age > 30 and age < 35;, which must use the idxAge index to find the rows whose age lies between 30 and 35. Indexes also come in two kinds, Unique Index and non-Unique Index, and both must be supported.

Having analyzed the characteristics of the data to be stored, let's look at the operations on that data, mainly the four statements Insert/Update/Delete/Select.

For an Insert statement, the Row must be written into KV and the index data must be created.

For an Update statement, the Row must be updated and the index data updated along with it (if necessary).

For a Delete statement, the indexes must be deleted at the same time the Row is deleted.

The three statements above are all simple to handle. A Select statement is more complicated. First, we need to be able to read a single row of data simply and quickly, so every Row needs an ID (explicit or implicit). Second, we may need to read multiple consecutive rows, as in Select * from user;. Finally, there is the need to read data through an index, where the index may serve either a point lookup or a range query.

The rough requirements are now laid out, so let's see what we have in hand: a globally ordered, distributed Key-Value engine. "Globally ordered" is important, because it helps us solve quite a few problems. For example, to fetch one row quickly: if we can construct one or several Keys that locate the row, we can use the Seek method provided by TiKV to jump directly to where that row lives. Likewise, for a full table scan: if the table maps to a Key Range, we can simply scan from StartKey to EndKey and obtain the whole table's data that way. Operating on Index data follows the same idea. Now let's see how TiDB does it.

TiDB assigns each table a TableID, each index an IndexID, and each row a RowID (if the table has an integer Primary Key, the Primary Key's value is used as the RowID). The TableID is unique across the whole cluster, the IndexID and RowID are unique within a table, and all of these IDs are of type int64.

Each row of data is encoded into a Key-Value pair according to the following rule:

Key:   tablePrefix_rowPrefix_tableID_rowID
Value: [col1, col2, col3, col4]

The tablePrefix/rowPrefix parts of the Key are fixed string constants that separate this data from other data in the KV space.

Index data is encoded into a Key-Value pair according to the following rule:

Key:   tablePrefix_idxPrefix_tableID_indexID_indexColumnsValue
Value: rowID

Index data must also cover both the Unique Index and the non-Unique Index case. A Unique Index can use the encoding rule above as-is. For a non-Unique Index, however, this encoding cannot produce a unique Key, because the tablePrefix_idxPrefix_tableID_indexID_ part is the same for every entry of a given index and several rows may share the same ColumnsValue. So the encoding of a non-Unique Index is adjusted slightly:

Key:   tablePrefix_idxPrefix_tableID_indexID_ColumnsValue_rowID
Value: null

This produces a unique Key for every entry in the index.

Note that the various xxPrefix constants in the Keys above are string constants whose purpose is to separate namespaces, so that different kinds of data do not conflict with one another. They are defined as follows:

var (
    tablePrefix     = []byte{'t'}
    recordPrefixSep = []byte("_r")
    indexPrefixSep  = []byte("_i")
)

Also note that in this scheme, for both the Row and the Index Key encodings, all Rows of one table share the same Key prefix, and all entries of one Index likewise share a common prefix. Data with the same prefix is therefore laid out together in TiKV's Key space. So as long as we design the encoding of the suffix carefully, ensuring that the comparison order before encoding is preserved after encoding, Rows and Index data can be stored in TiKV in order. A scheme that preserves the comparison relation across encoding is called Memcomparable: for values of any type, comparing two objects in their original form gives the same result as comparing their encoded byte arrays (note that Keys and Values in TiKV are both raw byte arrays). For the concrete encoding, see TiDB's codec package. With this encoding, all Rows of a table are arranged in TiKV's Key space in RowID order, and the data of a given Index is arranged in the order of that Index's ColumnValue.

Now let's check the mapping scheme against the requirements raised at the beginning. First, the scheme converts both Row and Index data into Key-Value data, and every row and every index entry has a unique Key. Second, the scheme is friendly to both point lookups and range queries: we can easily construct the Key for a particular row or a particular index entry, or the Key range covering a block of adjacent rows or adjacent index values. Finally, when guaranteeing some of the table's Constraints, we can construct a Key and check whether that Key exists to decide whether the Constraint is satisfied.

We have now covered how a Table maps onto KV. Here is a simple example to make it concrete, using the table definition above. Suppose the table contains 3 rows of data:

1, "TiDB", "SQL Layer", 10
2, "TiKV", "KV Engine", 20
3, "PD", "Manager", 30

First, each row of data maps to a Key-Value pair. Since this table has an Int Primary Key, the RowID is just the value of that Primary Key. Suppose the table's TableID is 10; then its Row data is:

t_r_10_1 --> ["TiDB", "SQL Layer", 10]
t_r_10_2 --> ["TiKV", "KV Engine", 20]
t_r_10_3 --> ["PD", "Manager", 30]

Besides the Primary Key, the table also has one Index, idxAge. Assuming the ID of this Index is 1, its data is:

t_i_10_1_10_1 --> null
t_i_10_1_20_2 --> null
t_i_10_1_30_3 --> null

You can work through this example with the encoding rules above, and hopefully see why this mapping scheme was chosen and what it accomplishes.
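To tie the example together, here is a small runnable Go sketch that regenerates exactly the Row and Index pairs listed above. It uses the article's human-readable key notation; real keys are memcomparable byte arrays, and the user struct is invented for the illustration:

package main

import "fmt"

// Illustrative only: real keys are memcomparable byte arrays, not
// decimal strings. User has TableID 10, idxAge has IndexID 1, and the
// RowID equals the integer Primary Key.
type user struct {
    id   int64
    name string
    role string
    age  int64
}

func main() {
    rows := []user{
        {1, "TiDB", "SQL Layer", 10},
        {2, "TiKV", "KV Engine", 20},
        {3, "PD", "Manager", 30},
    }
    for _, r := range rows {
        // Row: tablePrefix_rowPrefix_tableID_rowID --> row values
        fmt.Printf("t_r_10_%d --> [%q, %q, %d]\n", r.id, r.name, r.role, r.age)
    }
    for _, r := range rows {
        // Non-unique index: ..._tableID_indexID_columnValue_rowID --> null
        fmt.Printf("t_i_10_1_%d_%d --> null\n", r.age, r.id)
    }
}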

Meta-information management

The previous section described how the data and indexes in a table map to KV; this section describes the storage of meta-information. Each Database/Table has meta-information, that is, its definition and attributes, which also needs to be persisted, and we store this information in TiKV as well. Each Database/Table is assigned a unique ID that serves as its unique identifier. When encoding to Key-Value, this ID is encoded into the Key, with an m_ prefix. This constructs a Key whose corresponding Value stores the serialized meta-information. Beyond that, a dedicated Key-Value pair stores the version of the current Schema information. TiDB uses Google F1's Online Schema Change algorithm: a background thread constantly checks whether the Schema version stored on TiKV has changed, and guarantees that if it changes, the new version will be picked up within a bounded period of time. For the specific implementation of this part, see the article on TiDB's asynchronous schema change implementation.
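As a rough, hypothetical illustration of the m_ namespace (TiDB's real meta encoding is more structured; the key layout and tableInfo type below are invented for this sketch), the idea is that the unique ID goes into an m_-prefixed Key and the serialized definition becomes the Value:

package main

import (
    "encoding/json"
    "fmt"
)

// tableInfo stands in for a table's meta-information (its definition
// and attributes); the real structure has far more fields.
type tableInfo struct {
    ID      int64    `json:"id"`
    Name    string   `json:"name"`
    Columns []string `json:"columns"`
}

// metaKey encodes the unique ID into an m_-prefixed key. The decimal
// form is illustrative; a real key would use the memcomparable codec.
func metaKey(tableID int64) []byte {
    return []byte(fmt.Sprintf("m_%d", tableID))
}

func main() {
    info := tableInfo{ID: 10, Name: "User", Columns: []string{"ID", "Name", "Role", "Age"}}
    value, err := json.Marshal(info)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%s --> %s\n", metaKey(info.ID), value)
}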

SQL on KV architecture

The overall architecture of TiDB is shown in the following figure

The main job of the TiKV Cluster is to store data, acting as the KV engine; the previous article described it in detail, so it is not covered here. This article mainly introduces the SQL layer, that is, the TiDB Servers layer. The nodes of this layer are stateless: they store no data and are completely equivalent to one another. The most important work of the TiDB Server layer is handling user requests and executing SQL operation logic, which the next sections briefly introduce.

SQL operation

Having understood the mapping scheme from SQL to KV, we understand how relational data is stored; next we need to understand how this data is used to satisfy users' queries, that is, how a query statement operates on the underlying stored data. The simplest solution that comes to mind is to map the SQL query to a KV query through the mapping scheme of the previous section, fetch the corresponding data through the KV interface, and then perform the various computations. For example, for the statement select count(*) from user where name = "TiDB";, we would need to read all the data in the table, then check for each row whether the Name field is "TiDB", and if so, return that row. This flow translates into the following KV operation flow:

1. Construct the Key Range: all RowIDs in a table fall in [0, MaxInt64), so using 0 and MaxInt64 with the Row Key encoding rule we can construct a left-closed, right-open scan interval [StartKey, EndKey).
2. Scan the Key Range: read the data in TiKV according to the Key Range constructed above.
3. Filter the data: for each row read, evaluate the expression name = "TiDB"; if it is true, return the row upward, otherwise discard it.
4. Compute Count: for each row that meets the condition, add it to the running Count value.

This scheme certainly works, but it does not work well, for reasons that are obvious:

1. While scanning, every row has to be read from TiKV through a KV operation, costing at least one RPC round trip each; if a lot of data needs to be scanned, this overhead becomes very large.
2. Not all rows are useful: rows that fail the condition need not be read out at all.
3. The values of the matching rows are meaningless here; all we actually need is the number of matching rows.
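Here is a sketch of this naive flow in Go. The scan function and row type are hypothetical stand-ins for TiKV's client interface, and the decimal-string keys stand in for the real encoded keys; the point is where the work happens, not the API:

package main

import (
    "fmt"
    "math"
)

// Hypothetical stand-ins for a TiKV range read; each row fetched this
// way costs at least one RPC round trip.
type row struct {
    name string
    age  int64
}

func scan(startKey, endKey []byte) []row {
    return []row{ /* ... every row of the table in [startKey, endKey) ... */ }
}

func naiveCount() int64 {
    // 1. Key Range: RowIDs lie in [0, MaxInt64), so under the Row
    //    encoding rule the whole table is one [StartKey, EndKey) interval.
    start := []byte(fmt.Sprintf("t_r_10_%d", 0))
    end := []byte(fmt.Sprintf("t_r_10_%d", int64(math.MaxInt64)))

    var count int64
    // 2. Scan the Key Range, pulling every row back to tidb-server.
    for _, r := range scan(start, end) {
        // 3. Evaluate the filter name = "TiDB" on tidb-server.
        if r.name == "TiDB" {
            count++ // 4. Accumulate the Count.
        }
    }
    return count
}

func main() { fmt.Println(naiveCount()) }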

Distributed SQL operation

How to avoid the defects above is also fairly obvious. First, computation needs to move as close to the storage nodes as possible, to avoid a large number of RPC calls. Second, the Filter needs to be pushed down to the storage nodes, so that only valid rows are returned and meaningless network transfers are avoided. Finally, the aggregate function and GroupBy can also be pushed down to the storage nodes for pre-aggregation: each node returns only one Count value, and tidb-server then sums these Count values. Here is a schematic diagram of the data being returned and aggregated layer by layer:
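A sketch of the pushed-down version of the same count query: countOnNode stands for the work a storage node does next to its data (in TiKV this role is played by the coprocessor), and tidb-server only sums the partial results. The per-node row slices are invented for illustration:

package main

import "fmt"

// Each inner slice stands in for the rows held by one storage node;
// invented data for illustration.
var nodeRows = [][]string{
    {"TiDB", "TiKV"}, // node 1
    {"PD", "TiDB"},   // node 2
    {"TiDB"},         // node 3
}

// countOnNode runs next to the data: the filter name = "TiDB" is
// evaluated on the storage node and only a single Count comes back.
func countOnNode(rows []string) int64 {
    var c int64
    for _, name := range rows {
        if name == "TiDB" {
            c++
        }
    }
    return c
}

func main() {
    // tidb-server fans the pushed-down request out to all nodes and
    // sums the partial Count values; no row data crosses the network.
    var total int64
    for _, rows := range nodeRows {
        total += countOnNode(rows)
    }
    fmt.Println(total) // 3
}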

Here is an article detailing how TiDB makes SQL statements run faster, which you can refer to.

SQL layer architecture

The sections above briefly introduced some functions of the SQL layer; hopefully you now have a basic understanding of how SQL statements are handled. In reality, TiDB's SQL layer is far more complex, with many modules and layers. The following figure lists the important modules and their invocation relationships:

The user's SQL request is sent to tidb-server, either directly or through a Load Balancer. tidb-server parses the MySQL Protocol Packet to obtain the request content, then performs syntax parsing, query plan formulation and optimization, and query plan execution to fetch and process the data. All the data is stored in the TiKV cluster, so during this process tidb-server needs to interact with tikv-server to fetch data. Finally, tidb-server returns the query results to the user.
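That flow can be summarized as a pipeline. The skeleton below is purely illustrative; the stage names are shorthand for the modules in the figure, not TiDB's actual function names:

package main

import "fmt"

// Illustrative stages of tidb-server's request handling.
type stmtNode struct{ sql string } // result of syntax parsing
type physicalPlan struct{}         // result of plan formulation/optimization
type resultSet struct{ rows int }  // data to return to the user

func parseProtocol(packet []byte) string { return string(packet) } // MySQL Protocol Packet -> SQL text
func parseSQL(sql string) stmtNode       { return stmtNode{sql} }  // syntax parsing
func optimize(s stmtNode) physicalPlan   { return physicalPlan{} } // query plan formulation and optimization
func execute(p physicalPlan) resultSet   { return resultSet{} }    // runs the plan; talks to tikv-server for data
func writeResult(r resultSet)            { fmt.Println(r.rows) }   // send results back over the connection

func handleRequest(packet []byte) {
    sql := parseProtocol(packet) // arrives directly or via a Load Balancer
    ast := parseSQL(sql)
    plan := optimize(ast)
    rs := execute(plan) // all data lives in the TiKV cluster
    writeResult(rs)
}

func main() { handleRequest([]byte("select count(*) from user")) }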

Summary

So far, we have looked from the SQL perspective at how data is stored and used for computation. More detailed information about the SQL layer, such as how the optimizer works and the details of the distributed execution framework, will come in future articles. In the next article, we will introduce some information about PD, which should be interesting: much of what PD does is invisible in the course of using TiDB, yet it is very important to the cluster as a whole. It mainly concerns cluster management and scheduling.

-- end [Tony.Tang] 2018.3.8 --
