
TiDB Architecture and a Comparison of the Distributed Protocols Paxos and Raft


Recently, NewSQL has become very popular, especially TiDB. PingCAP continues to polish TiDB; the version has now been iterated to 3.0, and the product has largely matured.

The overall architecture of TiDB is shown in the diagram below.

It is composed of four modules: TiDB Server, PD Server, TiKV Server, and TiSpark.

TiDB Server is responsible for accepting SQL requests, handling SQL logic, locating (through PD) the TiKV addresses that hold the data needed for a computation, interacting with TiKV to obtain that data, and finally returning the result. TiDB Server is stateless: it does not store data and is only responsible for computation, so it can be scaled out indefinitely and can provide a unified access address through a load-balancing component (such as LVS, HAProxy, or F5). It is recommended to deploy two instances, with the load balancer providing the service to the front end. When a single instance fails, the sessions running on that instance are affected: from the application's point of view, a single request fails, but service resumes after reconnecting.

Placement Driver (PD for short) is the management module of the whole cluster. It has three main tasks: first, storing the meta-information of the cluster (which TiKV node a given Key is stored on); second, scheduling and load-balancing the TiKV cluster (data migration, Raft group leader migration, and so on); and third, allocating globally unique, monotonically increasing transaction IDs. PD protects its own data through the Raft protocol: the Raft leader handles all operations, and the remaining PD servers exist only to guarantee high availability. When a single instance fails, the service is unaffected if that instance is not the Raft leader; if it is the leader, a new Raft leader is elected and the service recovers automatically. PD cannot provide service during the election, which takes about three seconds. It is recommended to deploy at least three PD instances.

TiKV Server is responsible for storing data. Externally, TiKV is a distributed Key-Value storage engine that provides transactions. The basic unit of data storage is the Region: each Region stores the data of one Key range (a left-closed, right-open interval from StartKey to EndKey), and each TiKV node hosts multiple Regions. TiKV uses the Raft protocol for replication to maintain data consistency and disaster recovery. Replicas are managed per Region: the replicas of a Region on different nodes form a Raft Group and are copies of each other. Load balancing of data across TiKV nodes is scheduled by PD, also on a per-Region basis. TiKV thus maintains data consistency through Raft (the number of replicas is configurable; three are kept by default), with load balancing scheduled by PD. When a single node fails, all Regions stored on that node are affected: for Regions where the node is the Leader, service is interrupted until re-election; for Regions where it is a Follower, service is not affected. When a failed TiKV node cannot be recovered within a certain period (30 minutes by default), PD migrates its data to other TiKV nodes.

TiSpark, the main component with which TiDB addresses users' complex OLAP needs, runs Spark SQL directly on the TiDB storage layer, combining the advantages of the TiKV distributed cluster while integrating into the big-data community ecosystem. With TiSpark, a single TiDB system can support both OLTP and OLAP, eliminating the I/O cost of synchronizing user data between systems.
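To make the Region concept concrete, here is a minimal Go sketch of how a key might be routed to the Region whose [StartKey, EndKey) range contains it, which is conceptually what PD's meta-information lookup does. The types and function names are invented for illustration and are not TiDB's actual API.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Region describes one key range [StartKey, EndKey) and the TiKV store
// that currently holds its Raft leader. Illustrative only.
type Region struct {
	StartKey    []byte
	EndKey      []byte // empty means unbounded on the right
	LeaderStore string
}

// locateRegion mimics the PD lookup conceptually: given Regions sorted by
// StartKey, find the one whose range contains key.
func locateRegion(regions []Region, key []byte) *Region {
	// First region whose EndKey is unbounded or strictly greater than key.
	i := sort.Search(len(regions), func(i int) bool {
		r := regions[i]
		return len(r.EndKey) == 0 || bytes.Compare(key, r.EndKey) < 0
	})
	if i < len(regions) && bytes.Compare(key, regions[i].StartKey) >= 0 {
		return &regions[i]
	}
	return nil
}

func main() {
	regions := []Region{
		{StartKey: []byte(""), EndKey: []byte("g"), LeaderStore: "tikv-1"},
		{StartKey: []byte("g"), EndKey: []byte("p"), LeaderStore: "tikv-2"},
		{StartKey: []byte("p"), EndKey: nil, LeaderStore: "tikv-3"},
	}
	r := locateRegion(regions, []byte("k"))
	fmt.Printf("key %q -> region [%q, %q) on %s\n", "k", r.StartKey, r.EndKey, r.LeaderStore)
}
```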

Distributed protocols Paxos and Raft

Algorithm evolution process

Paxos

The Paxos algorithm is a message-passing-based consensus algorithm proposed by Leslie Lamport in 1990. Because the algorithm is difficult to understand, it did not attract much attention at first. Lamport published the paper in ACM TOCS in 1998, but even then Paxos was not taken seriously. In 2001, Lamport described the algorithm again in a more readable narrative form.

Google published three papers in 2006; in the Chubby lock service, Paxos was used as the consensus algorithm within a Chubby cell, and the popularity of Paxos has soared ever since.

The biggest difference between Paxos-based data synchronization and the traditional active/standby mode is that Paxos needs only more than half of the replicas to be online and communicating normally to keep the service continuously available without losing data.

Basic-Paxos

The problem solved by Basic-Paxos: how to agree on a proposal in a distributed system.

It is implemented in two phases:

Prepare phase:

The Proposer selects a proposal number n and sends a prepare request to the Acceptors.

When an Acceptor receives the prepare message, if the proposal number is greater than that of every prepare message it has already responded to, the Acceptor replies to the Proposer with the proposal it last accepted (if any) and promises not to respond to any proposal numbered less than n.

Accept phase:

When the Proposer receives prepare responses from a majority of Acceptors, it enters the accept phase. It sends an accept request to each Acceptor that responded to the prepare request, carrying the number n and a value determined by the prepare phase (if no accepted value was reported in the prepare responses, the Proposer is free to choose the value).

As long as doing so does not break the promise it has made to other Proposers, an Acceptor accepts the accept request as soon as it receives it.
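As a rough illustration of the acceptor side of these two phases, here is a minimal Go sketch. The struct fields and message types are invented for illustration; persistence, networking, and learner logic are omitted.

```go
package paxos

// Acceptor holds the minimal Basic-Paxos acceptor state. In a real system
// this state must be persisted before replying.
type Acceptor struct {
	promisedN int    // highest proposal number this acceptor has promised
	acceptedN int    // number of the last accepted proposal (0 = none)
	acceptedV string // value of the last accepted proposal
}

// PrepareReply carries the promise plus any previously accepted proposal.
type PrepareReply struct {
	OK        bool
	AcceptedN int
	AcceptedV string
}

// Prepare handles phase 1: promise not to handle proposals numbered <= promisedN,
// and report the most recently accepted proposal, if any.
func (a *Acceptor) Prepare(n int) PrepareReply {
	if n <= a.promisedN {
		return PrepareReply{OK: false}
	}
	a.promisedN = n
	return PrepareReply{OK: true, AcceptedN: a.acceptedN, AcceptedV: a.acceptedV}
}

// Accept handles phase 2: accept the proposal unless it would break
// a promise already made to a higher-numbered proposer.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promisedN {
		return false
	}
	a.promisedN = n
	a.acceptedN = n
	a.acceptedV = v
	return true
}
```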

Multi-Paxos

The problem solved by Multi-Paxos: how to agree on a batch of proposals in a distributed system.

When there is a batch of proposals, Basic-Paxos could be used to decide them one by one, but each proposal would go through both phases, which is clearly inefficient. An implementation of Basic-Paxos needs at least three network interactions per proposal (per redo log entry): 1. generating the log ID; 2. the Prepare phase; 3. the Accept phase.

Multi-Paxos therefore optimizes on top of Basic-Paxos: the Paxos protocol itself is used to elect a unique leader in the Paxos cluster, and while the leader's lease is valid, all proposals can only be initiated by the leader.

This strengthens the protocol's assumption that no other server will issue proposals while the leader's lease is valid. For subsequent proposals, the log-ID generation and Prepare phases can therefore be simplified away: the sole leader generates the log ID and goes straight to the Accept phase, and the proposal is considered agreed once a majority accepts it (each redo log entry corresponds to one proposal). See the sketch below.
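A minimal Go sketch of that leader-side shortcut, assuming the leader has already won a single Prepare round for its proposal number; the Accept RPC is replaced by a hypothetical callback, and per-slot acceptor state and failure handling are elided.

```go
package paxos

// acceptFn stands in for an Accept RPC for one log slot: it carries the log
// ID, the leader's proposal number, and the value, and reports acceptance.
type acceptFn func(logID, n int, v string) bool

// MultiPaxosLeader is a hypothetical leader that has already completed one
// Prepare round for proposal number n and now drives every new log entry
// through the Accept phase only.
type MultiPaxosLeader struct {
	n         int        // proposal number established by the one-time Prepare
	nextLogID int        // log IDs are generated locally by the sole leader
	acceptors []acceptFn // stand-ins for the remote acceptors
}

// Propose assigns the next log ID and sends only Accept requests; the entry
// is chosen once a majority of acceptors accept it.
func (l *MultiPaxosLeader) Propose(v string) (logID int, chosen bool) {
	logID = l.nextLogID
	l.nextLogID++

	acks := 0
	for _, accept := range l.acceptors {
		if accept(logID, l.n, v) {
			acks++
		}
	}
	return logID, acks > len(l.acceptors)/2
}
```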

Related products

X-DB, OceanBase and Spanner all use Multi-Paxos to ensure data consistency.

Multi-Paxos is also used in MySQL Group Replication's xcom.

PaxosStore

PaxosStore is a distributed consistency middleware built on Paxos by Tencent WXG. It is widely used in WeChat scenarios such as user account management, user relationship management, instant messaging, social networking, and online payment.

Raft

Raft is a consensus algorithm designed by Diego Ongaro and John Ousterhout of Stanford University and published in 2013 in the paper "In Search of an Understandable Consensus Algorithm". Since its release in 2013, Raft implementations have appeared in more than a dozen languages, including etcd, and Google's Kubernetes uses etcd as its service-discovery framework.

Compared with Paxos, Raft emphasizes being easy to understand and easy to implement. Like Paxos, Raft can provide service as long as more than half of its nodes are healthy.

As is well known, a complex problem can be divided into several smaller ones. Raft takes this divide-and-conquer approach and splits the algorithm into three sub-problems:

Leader election

Log replication

Safety

After this decomposition, the whole Raft algorithm becomes easier to understand, prove, and implement.

Related products

Etcd uses Raft to ensure data consistency.

Multi-Raft

Many NewSQL databases target data sizes above 100 TB, and data is sharded for load balancing, so multiple Raft clusters are needed (i.e. Multi-Raft), with each Raft cluster corresponding to one shard.

Coordination can be added across the Raft clusters to reduce resource overhead and improve performance (for example, sharing communication links and merging messages), as the sketch below illustrates.
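Here is a minimal Go sketch of the "merge messages / share links" idea: outgoing messages from many Raft groups are grouped by destination node so they can share one network trip per peer. The message type and field names are hypothetical.

```go
package multiraft

// RaftMessage is a hypothetical outgoing message emitted by one Raft group
// (identified by GroupID) for a peer node.
type RaftMessage struct {
	GroupID uint64
	To      uint64 // destination node ID
	Payload []byte
}

// Batch groups outgoing messages by destination node so that messages from
// many Raft groups can be sent over one shared link per peer.
func Batch(msgs []RaftMessage) map[uint64][]RaftMessage {
	byPeer := make(map[uint64][]RaftMessage)
	for _, m := range msgs {
		byPeer[m.To] = append(byPeer[m.To], m)
	}
	return byPeer
}
```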

Related products

TiDB, CockroachDB and PolarDB all use Multi-Raft to ensure data consistency.

The difference between Raft and Multi-Paxos

Raft can be seen as Multi-Paxos with two additional restrictions:

Log appends are sequential: Raft's append operations must be contiguous, whereas Paxos allows them to be concurrent (the concurrency here refers only to appending log entries; application to the state machine is still ordered).

Leader election in Raft is restricted: only the node whose log is the most up to date and complete can be elected leader. Multi-Paxos has no such restriction; a node with an incomplete log can also become leader.

Raft can be thought of as a simplified version of Multi-Paxos.

Multi-Paxos allows log entries to be written concurrently, so when the leader fails, the remaining nodes may have holes in their logs. After a new leader is elected, the entries missing from the new leader's log must be filled in and then applied to the state machine in order.

Raft election process

In the Raft protocol, a node can be in one of three states: Leader, Follower, or Candidate, and only one at a time. A Raft election means electing a Leader, and the election is initiated by a Candidate itself, not by any third party.

Only the Leader can accept read and write requests, and only a Candidate can initiate an election. If a Follower loses contact with its Leader (for more than one Term), it automatically becomes a Candidate and initiates an election.

In the election, the Candidate requests that all other nodes vote for it. If the Candidate receives votes from a majority of nodes, it automatically becomes the Leader. This process is called Leader election.

In the Raft protocol, the Leader normally sends AppendEntries RPCs to all nodes periodically (at an interval no greater than one Term) to maintain its Leader status.

Accordingly, if a Follower does not receive an AppendEntries RPC from the Leader within one Term, it waits a random delay (150 ms to 300 ms) and then initiates an election toward all other nodes.

The randomized delay prevents multiple Followers from starting elections at the same time. Simultaneous elections can easily leave every Candidate short of a majority (a split vote: for example, each gets half the votes, no one reaches a majority, and the invalid election must be re-run), so randomizing the delay improves the chance that an election succeeds.
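To make the election flow above concrete, here is a minimal, single-process Go sketch of a Follower timing out, becoming a Candidate, and counting votes. The RequestVote RPC is replaced by a hypothetical in-memory callback, and term comparison, log up-to-dateness checks, and persistence are omitted.

```go
package raft

import (
	"math/rand"
	"time"
)

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Node is a heavily simplified Raft node used only to illustrate the
// election flow; AppendEntries handling and log replication are omitted.
type Node struct {
	id          int
	state       State
	currentTerm int
	peers       []func(term, candidateID int) bool // RequestVote stand-ins
}

// electionTimeout returns a randomized timeout in the 150-300 ms range,
// which reduces the chance of split votes.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// startElection converts the node to Candidate, votes for itself, and asks
// every peer for a vote; it becomes Leader on a majority.
func (n *Node) startElection() {
	n.state = Candidate
	n.currentTerm++
	votes := 1 // vote for self
	for _, requestVote := range n.peers {
		if requestVote(n.currentTerm, n.id) {
			votes++
		}
	}
	if votes > (len(n.peers)+1)/2 {
		n.state = Leader
	} else {
		n.state = Follower // split vote or lost; wait for the next timeout
	}
}

// runFollower waits for heartbeats; if none arrives before the randomized
// timeout fires, the Follower starts an election.
func (n *Node) runFollower(heartbeat <-chan struct{}) {
	for n.state == Follower {
		select {
		case <-heartbeat:
			// Leader is alive; loop again, which resets the timeout.
		case <-time.After(electionTimeout()):
			n.startElection()
		}
	}
}
```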

