TiDB Technology Internals: On Scheduling

Why is scheduling necessary?

Recall some information from the first article: a TiKV cluster is the distributed KV storage engine of the TiDB database. Data is replicated and managed in units of Regions, and each Region has multiple Replicas distributed across different TiKV nodes. The Leader handles reads and writes, while the Followers replicate the Raft log sent by the Leader. With this in mind, consider the following questions:

- How do we ensure that multiple Replicas of the same Region are distributed across different nodes? Going further, what problems arise if multiple TiKV instances are started on a single machine?
- When a TiKV cluster is deployed across data centers for disaster recovery, how do we ensure that if one data center goes offline, no Raft Group loses multiple Replicas?
- After a node is added to the TiKV cluster, how can data be moved onto it from the other nodes?
- What happens when a node goes offline? What does the whole cluster need to do? What if the node is only offline temporarily (e.g. a service restart)? What if it is offline for a long time (e.g. a disk failure with all data lost)?
- Suppose the cluster requires N Replicas per Raft Group. A single Raft Group may end up with too few Replicas (e.g. a node went down and a Replica was lost) or too many (e.g. a downed node came back and automatically rejoined the cluster). How should the number of Replicas be adjusted?
- Reads and writes go through the Leader, so what happens to the cluster if Leaders are concentrated on only a few nodes?
- Not all Regions are accessed frequently; the hotspots may be concentrated in just a few Regions. What should be done in that case?
- Load balancing often requires moving data around. Does this data migration consume a lot of network bandwidth, disk IO and CPU, and in turn affect online services?

Taken individually, each of these problems may have a simple solution, but mixed together they are not easy to solve. Some of them seem to require only the internal state of a single Raft Group, for example deciding whether to add a Replica based on whether there are enough of them. But where that Replica should be added actually requires global information. The whole system is also changing dynamically: Regions split, nodes join, nodes fail, access hotspots shift, and so on, so the scheduling system must continuously move toward an optimal state. Without a component that holds global information, can schedule globally, and is configurable, it is hard to meet these requirements. We therefore need a central node to control and adjust the overall state of the system, and that is the PD module.

Scheduling requirements

Many problems were listed above, so let's organize and categorize them first. Overall, they fall into two groups:

As a distributed, highly available storage system, there are four requirements that must be met:

- The number of Replicas must be correct, neither more nor fewer.
- Replicas must be distributed on different machines.
- After a new node is added, Replicas can be migrated to it from other nodes.
- After a node goes offline, the data on that node must be migrated away.

As a good distributed system, it also needs optimizations, including:

- Keep the distribution of Leaders across the cluster uniform.
- Keep the storage usage of each node uniform.
- Keep the distribution of access hotspots uniform.
- Control the speed of balancing so that online services are not affected.
- Manage node status, including manually bringing nodes online or offline, and automatically taking failed nodes offline.

Once the first group of requirements is met, the system has multi-replica fault tolerance, dynamic scale-out and scale-in, tolerance of node failures, and automatic disaster recovery. Once the second group is met, the load across the system becomes more uniform and easier to manage.

To meet these requirements, we first need to collect enough information, such as the status of each node, information about each Raft Group, and statistics on business access patterns. Second, we need to define some policies; based on this information and the scheduling policies, PD works out a scheduling plan that satisfies the requirements above as far as possible. Finally, we need some basic operations with which to carry out the scheduling plan.

Basic scheduling operations

Let's start with the simplest part: the basic scheduling operations, that is, what we can actually do to implement a scheduling policy. This is the foundation of the whole scheduling system. Only when you know what kind of hammer you have in hand do you know how to swing it at the nail.

The scheduling requirements above may look complex, but once sorted out they ultimately come down to three operations:

- Add a Replica
- Remove a Replica
- Transfer the Leader role between different Replicas of a Raft Group

The Raft protocol happens to meet these needs: the AddReplica, RemoveReplica and TransferLeader commands support exactly these three basic operations.
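To make these three operations concrete, here is a minimal sketch in Go of how a scheduler might represent them; the type and field names are illustrative assumptions, not PD's actual API.

```go
// A minimal sketch (not PD's real code) of the three basic scheduling
// operations that PD can ask a Raft Group to perform.
package main

import "fmt"

type OpKind int

const (
	AddReplica     OpKind = iota // add a Replica on a target Store
	RemoveReplica                // remove a Replica from a Store
	TransferLeader               // move the Leader role to another Replica
)

// Operator describes one basic step PD may suggest to a Region Leader.
type Operator struct {
	Kind     OpKind
	RegionID uint64
	StoreID  uint64 // target Store for Add/TransferLeader, source Store for Remove
}

func (o Operator) String() string {
	names := map[OpKind]string{
		AddReplica:     "AddReplica",
		RemoveReplica:  "RemoveReplica",
		TransferLeader: "TransferLeader",
	}
	return fmt.Sprintf("%s(region=%d, store=%d)", names[o.Kind], o.RegionID, o.StoreID)
}

func main() {
	fmt.Println(Operator{Kind: TransferLeader, RegionID: 2, StoreID: 5})
}
```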

Information collection

Scheduling relies on collecting information about the whole cluster. Simply put, we need to know the status of every TiKV node and the status of every Region. The TiKV cluster reports two kinds of messages to PD:

Each TiKV node regularly reports the overall information of the node to the PD.

There is a heartbeat between each TiKV node (Store) and PD. On one hand, PD uses the heartbeats to detect whether each Store is alive and whether new Stores have joined; on the other hand, the heartbeat also carries the Store's status information, which mainly includes:

- Total disk capacity
- Available disk capacity
- Number of Regions hosted
- Data write speed
- Number of Snapshots being sent / received (Replicas may synchronize data via Snapshots)
- Whether the Store is overloaded
- Label information (labels are a series of Tags with a hierarchical relationship)

The Leader of each Raft Group reports information to PD on a regular basis.

There is a heartbeat between the Leader of each Raft Group and PD, used to report the status of that Region. It mainly includes the following information:

- The position of the Leader
- The positions of the Followers
- The number of Replicas
- The data write / read speed
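Putting the two kinds of heartbeats together, the fields listed above could be modeled roughly as follows. This is only a sketch; the struct and field names are illustrative and do not mirror PD's actual wire format.

```go
// A rough sketch of the two heartbeat payloads described above.
package main

import "fmt"

// StoreHeartbeat is what each TiKV node (Store) reports periodically.
type StoreHeartbeat struct {
	StoreID        uint64
	Capacity       uint64            // total disk capacity
	Available      uint64            // available disk capacity
	RegionCount    int               // number of Regions hosted
	WriteBytesRate float64           // data write speed
	SendingSnaps   int               // Snapshots being sent
	ReceivingSnaps int               // Snapshots being received
	IsBusy         bool              // whether the Store is overloaded
	Labels         map[string]string // hierarchical labels, e.g. zone/rack/host
}

// RegionHeartbeat is what the Leader of each Raft Group reports periodically.
type RegionHeartbeat struct {
	RegionID       uint64
	LeaderStore    uint64   // where the Leader is
	FollowerStores []uint64 // where the Followers are
	ReplicaCount   int
	WriteBytesRate float64
	ReadBytesRate  float64
}

func main() {
	hb := RegionHeartbeat{RegionID: 1, LeaderStore: 2, FollowerStores: []uint64{3, 4}, ReplicaCount: 3}
	fmt.Printf("%+v\n", hb)
}
```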

Through these two kinds of heartbeats, PD continuously collects information about the whole cluster and uses it as the basis for its decisions. In addition, PD can accept extra information through its management interface to make more accurate decisions. For example, when the heartbeats from a Store stop arriving, PD cannot tell whether the node has failed temporarily or permanently. It can only wait for a period of time (30 minutes by default); if there are still no heartbeats, it considers the Store offline and decides that all the Regions on that Store need to be moved elsewhere. Sometimes, however, an operator deliberately takes a machine offline. In that case the operator can tell PD, through its management interface, that the Store is unavailable, and PD can immediately decide to move all the Regions off that Store.
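The offline-detection logic described here can be sketched as a timeout check plus a manual override flag. The names and structure below are assumptions made for illustration; they do not mirror PD's real code.

```go
// A simplified sketch of deciding that a Store is down: either heartbeats have
// been missing longer than a threshold (30 minutes by default, per the text),
// or an operator has explicitly marked the Store offline.
package main

import (
	"fmt"
	"time"
)

const storeDownTimeout = 30 * time.Minute

type StoreState struct {
	LastHeartbeat   time.Time
	ManuallyOffline bool // set via PD's management interface (e.g. pd-ctl)
}

// NeedsEvacuation reports whether all Regions on the Store should be rescheduled.
func NeedsEvacuation(s StoreState, now time.Time) bool {
	if s.ManuallyOffline {
		return true // the operator told PD the Store is going away: act immediately
	}
	return now.Sub(s.LastHeartbeat) > storeDownTimeout // otherwise wait out the timeout
}

func main() {
	now := time.Now()
	fmt.Println(NeedsEvacuation(StoreState{LastHeartbeat: now.Add(-10 * time.Minute)}, now)) // false
	fmt.Println(NeedsEvacuation(StoreState{LastHeartbeat: now.Add(-40 * time.Minute)}, now)) // true
	fmt.Println(NeedsEvacuation(StoreState{LastHeartbeat: now, ManuallyOffline: true}, now)) // true
}
```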

Scheduling strategy

After PD has collected this information, it also needs some policies to produce a concrete scheduling plan.

A Region has the correct number of Replicas

When PD learns from a Region Leader's heartbeat that the number of Replicas of that Region does not meet the requirement, it adjusts the Replica count through Add/Remove Replica operations. Possible causes are:

- A node went offline and all the data on it was lost, leaving some Regions short of Replicas.
- A previously offline node recovered and automatically rejoined the cluster; Regions whose Replicas had already been replenished now have one Replica too many, and one needs to be removed.
- An administrator adjusted the replica policy and changed the max-replicas configuration.
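The replica-count check itself boils down to comparing a Region's current Replica count with the configured max-replicas. The toy version below is an illustration only; the real logic also has to choose where to add a Replica or which one to remove.

```go
// A sketch of the replica-count decision described above.
package main

import "fmt"

func replicaAdjustment(replicaCount, maxReplicas int) string {
	switch {
	case replicaCount < maxReplicas:
		return "AddReplica" // e.g. a node went down and its data was lost
	case replicaCount > maxReplicas:
		return "RemoveReplica" // e.g. a downed node rejoined after the Region was already repaired
	default:
		return "NoOp"
	}
}

func main() {
	fmt.Println(replicaAdjustment(2, 3)) // AddReplica
	fmt.Println(replicaAdjustment(4, 3)) // RemoveReplica
	fmt.Println(replicaAdjustment(3, 3)) // NoOp
}
```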

Multiple Replicas in a Raft Group are not in the same location

Note that this says "the same location" rather than "the same node". In general, PD only ensures that multiple Replicas do not land on the same node, to avoid losing several Replicas when a single node fails. In real deployments, the following requirements may also arise:

- Multiple TiKV nodes are deployed on the same physical machine.
- TiKV nodes are spread across multiple racks, and the system should remain available if a single rack loses power.
- TiKV nodes are spread across multiple IDCs, and the system should remain available if a single data center loses power.

In essence, these requirements mean that a group of nodes shares a common location attribute and forms a minimum fault-tolerant unit, and we want no more than one Replica of a Region inside such a unit. To achieve this, you can configure labels on the nodes and tell PD which labels identify location by configuring location-labels. When allocating Replicas, PD tries to ensure that no two Replicas of a Region are placed on nodes that share the same location identifier.
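As a rough illustration of the location-labels idea, two Stores can be treated as belonging to the same smallest fault-tolerant unit when their label values match on every configured location label. The helper below is a simplified sketch, not PD's actual matching rule.

```go
// A sketch of location matching given ordered location-labels (e.g. zone, rack).
package main

import "fmt"

func sameLocation(locationLabels []string, a, b map[string]string) bool {
	for _, key := range locationLabels {
		if a[key] != b[key] {
			return false
		}
	}
	return true
}

func main() {
	locationLabels := []string{"zone", "rack"}
	s1 := map[string]string{"zone": "z1", "rack": "r1", "host": "h1"}
	s2 := map[string]string{"zone": "z1", "rack": "r1", "host": "h2"}
	s3 := map[string]string{"zone": "z1", "rack": "r2", "host": "h3"}

	// s1 and s2 share a rack, so PD should avoid putting two Replicas of the
	// same Region on them; s3 is an acceptable second location.
	fmt.Println(sameLocation(locationLabels, s1, s2)) // true
	fmt.Println(sameLocation(locationLabels, s1, s3)) // false
}
```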

Replicas are evenly distributed among Stores

As mentioned earlier, each Replica has a fixed upper limit on the amount of data it stores, so keeping the number of Replicas balanced across the nodes makes the overall load more balanced.

Leaders are evenly distributed among Stores

In the Raft protocol, both reads and writes go through the Leader, so the computation load falls mainly on the Leaders, and PD spreads Leaders across the nodes as evenly as possible.

Access hotspots are evenly distributed among Stores

Each Store and each Region Leader includes information about the current access load in its reports, such as the read / write rate of Keys. PD detects access hotspots and spreads them across the nodes.

Each Store takes up roughly the same amount of storage space

Each Store specifies a Capacity parameter when it starts, indicating the upper limit of its storage space. PD takes the remaining storage space of each node into account when scheduling.
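As a toy illustration of capacity-aware balancing, a candidate Store for receiving a Replica could be scored by its free-space ratio and its current Region count. This scoring function is an assumption for illustration only; PD's real scoring is considerably more elaborate.

```go
// A toy score: prefer Stores with more free space and fewer Regions.
package main

import "fmt"

type StoreStat struct {
	ID          uint64
	RegionCount int
	Capacity    uint64 // total space in bytes
	Available   uint64 // free space in bytes
}

// score: higher means a better target for a new Replica.
func score(s StoreStat) float64 {
	free := float64(s.Available) / float64(s.Capacity)
	return free / float64(s.RegionCount+1)
}

func bestTarget(stores []StoreStat) StoreStat {
	best := stores[0]
	for _, s := range stores[1:] {
		if score(s) > score(best) {
			best = s
		}
	}
	return best
}

func main() {
	stores := []StoreStat{
		{ID: 1, RegionCount: 120, Capacity: 1 << 40, Available: 200 << 30},
		{ID: 2, RegionCount: 40, Capacity: 1 << 40, Available: 700 << 30},
	}
	fmt.Println("best target store:", bestTarget(stores).ID) // 2
}
```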

Control scheduling speed to avoid affecting online services

Scheduling operations consume CPU, memory, disk IO and network bandwidth, so we need to avoid affecting online services too much. PD limits the number of operations in progress at any one time, and the default rate control is conservative. If you want scheduling to go faster (for example, the service has been stopped for an upgrade, or new nodes have been added and you want data rebalanced as soon as possible), you can speed up scheduling manually through pd-ctl.
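The speed control can be pictured as a cap on the number of scheduling operations in flight at once. The sketch below is illustrative; the actual limits in PD are tuned through pd-ctl rather than this made-up API.

```go
// A sketch of limiting concurrent scheduling operations.
package main

import "fmt"

type Scheduler struct {
	maxInFlight int
	inFlight    int
}

// TryStart returns true if a new scheduling operation may begin now.
func (s *Scheduler) TryStart() bool {
	if s.inFlight >= s.maxInFlight {
		return false // defer the operation so online traffic is not disturbed
	}
	s.inFlight++
	return true
}

// Finish marks one operation as completed, freeing a slot.
func (s *Scheduler) Finish() { s.inFlight-- }

func main() {
	s := &Scheduler{maxInFlight: 2}
	fmt.Println(s.TryStart(), s.TryStart(), s.TryStart()) // true true false
	s.Finish()
	fmt.Println(s.TryStart()) // true
}
```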

Support taking nodes offline manually

When a node is taken offline manually through pd-ctl, PD moves the data off that node under rate control. When this scheduling is complete, the node is placed in the offline state.

Implementation of scheduling

With the information above, let's look at the whole scheduling flow.

PD continuously collects information through the heartbeats from Stores and Region Leaders to obtain a detailed picture of the whole cluster, and generates a sequence of scheduling operations based on this information and its scheduling policies. Each time it receives a heartbeat from a Region Leader, PD checks whether there is a pending operation for that Region. It returns the required operation to the Region Leader in the heartbeat reply and checks the result in subsequent heartbeats. Note that the operation is only a suggestion to the Region Leader; there is no guarantee it will be executed. The Region Leader decides, based on its own current state, whether and when to execute it.
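The dispatch flow can be sketched as a lookup of pending operations keyed by Region: on each heartbeat, PD returns any pending suggestion in the reply and keeps it until a later heartbeat shows it took effect. All names below are illustrative, not PD's real interfaces.

```go
// A sketch of replying to Region heartbeats with pending scheduling suggestions.
package main

import "fmt"

type Heartbeat struct{ RegionID uint64 }

type Reply struct{ Suggestion string } // empty means "nothing to do"

type PD struct {
	pending map[uint64]string // RegionID -> suggested operation
}

func (pd *PD) HandleHeartbeat(hb Heartbeat) Reply {
	if op, ok := pd.pending[hb.RegionID]; ok {
		// The suggestion stays pending until a later heartbeat shows it took effect.
		return Reply{Suggestion: op}
	}
	return Reply{}
}

func main() {
	pd := &PD{pending: map[uint64]string{7: "TransferLeader to store 3"}}
	fmt.Println(pd.HandleHeartbeat(Heartbeat{RegionID: 7})) // {TransferLeader to store 3}
	fmt.Println(pd.HandleHeartbeat(Heartbeat{RegionID: 9})) // {}
}
```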

Summary

What this article covers is something you may rarely see elsewhere; every design has considerations behind it. I hope you now understand what needs to be considered when scheduling a distributed storage system, and how to decouple policy from implementation so that policies can be extended more flexibly.

With these three articles finished, I hope you have understood the basic concepts and implementation principles of TiDB as a whole. We will write more articles introducing TiDB's internals from the architecture down to the code level. If you have any questions, you are welcome to email shenli@pingcap.com.

-- end [Tony.Tang] 2018.3.8 --
