
What is the principle of high availability of SequoiaDB?


What is the principle behind SequoiaDB's high availability? This article analyzes the question in detail and walks through the answer, hoping to give readers who want to understand it a simple and practical path.

1

This article covers three topics: the traditional approaches, namely the master/standby structure and the cluster architecture; the well-known RAFT algorithm; and, most importantly, how the SequoiaDB distributed database keeps its data consistent and ensures that no data is lost in a cluster environment.

2 A Review of Database High Availability Technology

The database system stores the business data of an IT system and can fairly be called its brain. The availability of the database therefore largely determines the availability of the IT system as a whole: if the database goes down, the entire IT system stops, and if the database is damaged, business data may be lost, causing serious economic damage.

Traditional database systems such as Oracle, DB2, SQL Server, and MySQL were designed for a single-machine environment, and the usual architecture is one server plus a disk array. In production, even with very reliable minicomputers and high-end disk arrays, it is hard to protect the database from server outages, power failures, network failures, or disk array failures. In other words, it is certain that the database system will fail at some point while it is running. So what do we do? The binary-trained brains of IT "engineers" are rather straightforward: if one network can fail, add another network; if one server can fail, add another server; if one disk array can fail, add another disk array. Engineer: "Our slogan is to eliminate single points of failure and keep the database available." PM: "More money!"

Through the engineers' practice, the high availability of traditional databases evolved into three architectures: the cold standby architecture, the hot standby architecture, and the cluster architecture.

2.1 Cold standby architecture

The cold standby architecture is a type of active/standby architecture. It eliminates single points of failure in the network and the server by adding redundant networks and servers plus cluster management software such as IBM HACMP. As shown in the figure:

Database software and cluster management software are installed on both the primary and the standby server, and both can access the disk array over the SAN network. Under normal conditions, the primary server runs the database process (the standby server does not) and accesses the database stored on the disk array; the cluster management software runs on both servers, monitors the server, network, I/O, and database process status, and provides a virtual IP address (bound to the primary server) for application access.

If the cluster management software finds the primary server unavailable (for any of many reasons: power outage, network outage, CPU or memory failure, inaccessible disk array, database process crash, and so on), the switchover process begins. Standby server: "Heh heh, it's finally my turn on stage; my legs have gone numb from squatting here."

Unmount the disk array devices and file systems that were originally mounted on the primary server.

After checking the database file system, mount it on the standby server.

Remove the virtual IP configured on the primary server and configure it on the standby server.

Start the database process on the standby server; the database process checks the database log and redoes or rolls back in-flight transactions.

Once the database on the standby server is healthy, it begins serving requests and the switchover is complete.
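To make the switchover sequence concrete, here is a minimal Python sketch of the steps above, assuming hypothetical device paths, a hypothetical virtual IP, and a generic database service name; a real cluster manager such as HACMP performs these steps with fencing and health checks that are omitted here.

```python
# A minimal sketch of the cold-standby switchover, run on the standby server.
# Device paths, the IP address, and the service name are all hypothetical.
import subprocess

def run(cmd):
    """Run a shell command and raise if it fails."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

def failover_to_standby():
    # 1. (If the failed primary is still reachable, its mounts would be released
    #    first; that step is skipped here because the primary is assumed down.)
    # 2. Check the database file system, then mount it on the standby server.
    run("fsck -n /dev/san/dbvol")              # hypothetical SAN volume
    run("mount /dev/san/dbvol /dbdata")
    # 3. Bring the virtual IP up on the standby server.
    run("ip addr add 10.0.0.100/24 dev eth0")  # hypothetical service IP
    # 4. Start the database; crash recovery redoes or rolls back transactions.
    run("systemctl start mydb")                # hypothetical database service
    # 5. The standby now serves requests; the switchover is complete.

if __name__ == "__main__":
    failover_to_standby()
```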

Advantages and disadvantages

Advantages: simple architecture and configuration, relatively low cost.

Disadvantages: the file system must be remounted, the database process restarted, and the database log checked, so switchover takes minutes; if many transactions have to be rolled back it can take tens of minutes, which is unacceptable for businesses with high availability requirements. The cold standby server sits idle, wasting a server's worth of resources, and there is only one disk array and only one copy of the data, so both remain single points of failure.

2.2 Hot standby architecture

The hot standby architecture is another kind of active/standby architecture. Compared with the cold standby architecture it adds a second disk array to store another copy of the database, and the database process on the standby server is kept running, continuously replaying the logs sent by the primary server. Implementations of the hot standby architecture include IBM DB2 HADR, Oracle Data Guard, MySQL binlog replication, and so on.

As shown in the figure:

Under normal conditions, the business application accesses the primary server through the virtual IP, and the primary server accesses the primary database on its own disk array. The primary server's database process has log replication enabled and ships its transaction log to the database process on the standby server; the standby server's database process replays the log against the standby database to keep the data synchronized.

If the primary server becomes unavailable, the cluster management software starts the switchover process.

The cluster management software removes the virtual IP from the primary server and configures it on the standby server.

The database process on the standby server performs the switchover operation and is promoted from standby to primary.

Once the standby server's database process finishes replaying the log, it begins providing services.

Advantages: because the standby server's database process is running and continuously replaying the transaction log, switchover is fast (on the order of seconds); there are two copies of the data, so the disk array and the data are no longer single points; the standby server generally cannot accept writes but can serve reads.

Disadvantages: an extra disk array is required, which raises the cost.

2.3 Cluster architecture

The first two architectures solve the high availability problem, but the database system can only scale up (add CPU and memory to a single server); it cannot scale out (add servers) to grow performance.

Customer: "I want to scale out".

Engineer: "it's hard for me, but I can have it."

By adding shared-storage capability to the database system, as in IBM DB2 pureScale and Oracle RAC, database processes on multiple servers can access the database data on shared storage in parallel.

As shown in the figure:

Under normal conditions, business applications can connect to any of the database servers to read and write; the database processes use parallel access control to read and write the database on the shared disk array concurrently.

If a server is not available, applications connected to that server will automatically reconnect to another database server for high availability.

Advantages:

Short switchover time (seconds); scale-out capability; load balancing.

Disadvantages:

Scale-out capacity is limited: two servers is the norm and more than three is rare, and adding too many servers degrades performance rather than improving it. The cost is high, and configuration and management are complex, demanding very strong database administration skills from the DBA.

3 High Availability in the SequoiaDB Distributed Database

Readers may ask: "Traditional databases already achieve high availability and scale-out, so why use a distributed database?"

Engineer: "answer. First of all, the scale-out capacity of traditional databases is limited, generally 2-3 servers, which is unable to cope with the current big data scenario. Secondly, it is expensive. Spend millions to build a core database with 2 minicomputers, high-end disk arrays and ORACLE RAC. It is fine for the sake of stability, but what about a database system that needs to handle hundreds of TB data and requires dozens of servers?"

Giant sequoia database: "it's my turn. No disk consolidation, no SAN network, use PC server plus built-in disk and distributed database software to achieve low-cost / high-availability / high-scalability / high-performance database system. Oye!"

Because SequoiaDB runs on PC servers with built-in disks, data consistency and high availability are the most important aspects of its distributed design. This section discusses how SequoiaDB implements high availability. The principle is similar to the RAFT algorithm: SequoiaDB borrows RAFT's election algorithm and optimizes it to better support distributed database scenarios.

3.1 Raft algorithm

RAFT is an algorithm for managing the consistency of replicated logs. Where did it come from? Before RAFT appeared, the Paxos algorithm dominated the field of consensus algorithms, but Paxos has the obvious drawback of being very hard to understand and, of course, hard to implement. In 2013, Stanford's Diego Ongaro and John Ousterhout designed the RAFT consensus algorithm with understandability as a goal and published the paper "In Search of an Understandable Consensus Algorithm". Because RAFT is easy to understand and easy to implement, it has been widely adopted in industry, and those deployments have borne out the correctness of the algorithm. Credit to the authors, whose attitude was: "If the world can't give me what I want, I'll change the world."

Here we briefly introduce the RAFT algorithm to help you understand the high availability of the SequoiaDB distributed database. If you want to learn more about RAFT, see the article on GitHub:

https://github.com/maemual/raft-zh_cn/blob/master/raft-zh_cn.md

The design idea of RAFT is to simplify a complex problem by decomposing the data consistency of a group of servers into three smaller problems:

Leader election: when the existing leader fails, a new leader must be elected.

Log replication: the leader receives log entries from clients, replicates them to the other nodes in the cluster, and forces the other nodes' logs to agree with its own.

Safety: the key safety property in Raft is state machine safety: if any server has applied a particular log entry to its state machine, no other server may apply a different command at the same log index.

By solving these three smaller problems, Raft solves the larger problem of consistency.

Raft also simplifies the states a server can be in: each server is in exactly one of three states, leader (Leader), candidate (Candidate), or follower (Follower). The state transitions are shown in the figure:

Followers only respond to requests from other servers. If a follower receives no messages, it becomes a candidate and starts an election. The candidate that receives votes from a majority of the servers becomes the new leader. A leader keeps its leadership until it fails.
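As a small illustration of these state transitions, here is a minimal Python sketch; the class, method names, and simulated events are illustrative, not taken from any Raft implementation.

```python
# A minimal sketch of the three Raft server states and their transitions;
# timers and RPCs are simulated by plain method calls.
from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class RaftNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.role = Role.FOLLOWER   # every server starts as a follower
        self.current_term = 0

    def on_election_timeout(self):
        # A follower that hears nothing from a leader becomes a candidate
        # and starts a new term.
        if self.role is Role.FOLLOWER:
            self.role = Role.CANDIDATE
            self.current_term += 1

    def on_won_election(self):
        if self.role is Role.CANDIDATE:
            self.role = Role.LEADER

    def on_higher_term_seen(self, term):
        # Any server that sees a higher term steps back to follower.
        if term > self.current_term:
            self.current_term = term
            self.role = Role.FOLLOWER
```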

3.1.1 Leader election

The RAFT paper recommends that a group include at least five servers, so that the cluster can keep serving even if two servers fail. A server is always in exactly one of the three states, and a cluster has at most one leader. The servers learn each other's status and trigger elections through heartbeat messages. When the server programs start, they are all followers. A server stays in the follower state as long as it receives valid RPCs from a leader or candidate. The leader periodically sends heartbeats (AppendEntries RPCs that carry no log entries) to all followers to maintain its authority. If a follower receives no messages for a period of time, the election timeout, it assumes there is no available leader and starts an election to choose a new one.

To start an election, a follower increments its current term number and transitions to the candidate state. It then sends RequestVote RPCs in parallel to the other servers in the cluster, voting for itself. The candidate stays in this state until one of three things happens: (a) it wins the election, (b) another server becomes leader, or (c) a period of time passes with no winner. A candidate must receive votes from more than half of the servers in the cluster to become leader.
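The majority rule itself is simple; the following sketch (illustrative names only) shows the vote-counting check a candidate applies.

```python
# A sketch of the vote-counting rule: a candidate wins only with votes from a
# strict majority of the cluster, its own vote included.
def wins_election(votes_received: int, cluster_size: int) -> bool:
    return votes_received > cluster_size // 2

# Example: in a 5-node cluster the candidate needs at least 3 votes.
assert wins_election(3, 5) is True
assert wins_election(2, 5) is False
```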

3.1.2 Log replication

Once a leader has been elected, it begins serving clients. Each client request contains a command to be executed by the replicated state machine. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to the other servers, asking them to replicate the entry. When the entry has been safely replicated (as described below), the leader applies it to its state machine and returns the result to the client. If a follower crashes, runs slowly, or the network drops packets, the leader keeps retrying the AppendEntries RPCs (even after it has responded to the client) until all followers eventually store all log entries.

The leader decides when it is safe to apply a log entry to the state machine; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all available state machines. A log entry is committed once the leader that created it has replicated it to a majority of servers; this also commits all preceding entries in the leader's log, including entries created by previous leaders. The leader keeps track of the highest index it knows to be committed and includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually learn the commit point. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).

The log replication mechanism exhibits the desired consistency properties: Raft can accept, replicate, and apply new log entries as long as a majority of the machines are up; in the common case a new entry can be replicated to a majority of the cluster with a single round of RPCs; and a single slow follower does not drag down overall performance.
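The commit rule can be sketched in a few lines; the function name and the match_index structure below are illustrative, and the extra Raft requirement about the current term is noted in a comment.

```python
# A sketch of the commit rule: the leader marks an index committed once a
# majority of servers (itself included) have stored the entry at that index.
def majority_commit_index(match_index, leader_last_index):
    """match_index maps each follower id to the highest log index known to be
    replicated on that follower; the leader's own log counts as well."""
    indexes = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    return indexes[len(indexes) // 2]   # highest index stored on a majority

# Example: leader at index 7, followers at 7, 5, 5, 3 -> index 5 is committed.
print(majority_commit_index({"f1": 7, "f2": 5, "f3": 5, "f4": 3}, 7))  # prints 5
# (Raft additionally requires that the entry at this index belong to the
#  leader's current term before declaring it committed; see the Safety section.)
```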

3.1.3 Safety

However, the mechanisms described so far are not sufficient to guarantee that every state machine executes the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries; that follower could later be elected leader and overwrite those entries, so different state machines might execute different command sequences.

Raft completes the algorithm by adding a restriction to leader election. The restriction guarantees that the leader for any given term contains all of the entries committed in previous terms. With this election restriction, the commit rules also become clearer. Finally, the Raft paper gives a sketch of a proof of the Leader Completeness property and shows how it leads the replicated state machines to behave correctly.

Election restriction

Raft uses a simple approach to guarantee that all entries committed in previous terms are already present on the new leader at election time, without shipping those entries to the leader. This means log entries flow in only one direction, from leader to followers, and a leader never overwrites entries that already exist in its own log.

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. To win, a candidate must contact a majority of the cluster, which means every committed entry must be present on at least one of those servers. If the candidate's log is at least as up-to-date as the logs of a majority of the servers (this notion of "up-to-date" is defined below), then it holds all committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate's log, and a voter denies its vote if its own log is more up-to-date than the candidate's.

Raft determines which of two logs is more up-to-date by comparing the index and term of their last entries. If the last entries have different terms, the log with the later term is more up-to-date. If they have the same term, the longer log is more up-to-date.
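The comparison fits in a few lines; the function below is a sketch with illustrative names.

```python
# A sketch of the "whose log is more up-to-date" rule: compare the term of the
# last entry first; if the terms are equal, the longer log wins.
def log_is_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

# A voter grants its vote only when this returns True for the candidate's log.
print(log_is_up_to_date(3, 10, 2, 15))  # True: later term beats a longer log
print(log_is_up_to_date(3, 10, 3, 12))  # False: same term, shorter log
```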

Committing entries from previous terms

A Raft leader knows that an entry from its current term is committed once it is stored on a majority of servers. If a leader crashes before committing an entry, future leaders will keep trying to replicate it. Raft's election mechanism guarantees that the new leader's log contains the latest term and the entries from previous terms that have already been replicated to a majority of servers, so the new leader can replicate those earlier entries to the other servers and, through the Log Matching property, commit them indirectly.

3.2 High availability implementation of SequoiaDB

A distributed database must provide the ACID properties (atomicity, consistency, isolation, durability) and distributed transactions, and it must also deliver high-performance parallel computation, so its scenarios are far more complex than those addressed by the RAFT algorithm.

For example, the RAFT algorithm keeps node data consistent by keeping logs in order, but when a database executes transactions in parallel, each transaction may generate multiple log records, and the records of different transactions are typically interleaved rather than sequential. To preserve transaction atomicity in a distributed environment, all log records generated by a transaction must be traceable, and they must exist on every node so that the transaction can be committed or rolled back atomically on each node. A transaction may also update a large amount of data (a batch transaction); in a distributed database it is not acceptable to wait for the master node to finish executing and only then replicate to the slave nodes, because the performance of that approach is far too poor for business needs. SequoiaDB's approach is that every write operation in a transaction produces a log record, and the master node replicates these records to the slave nodes for redo as they are produced, without waiting for the transaction to commit; when the transaction does commit, a two-phase commit within the replication group guarantees the atomicity and consistency of the transaction across the group.

In addition, RAFT treats a command as successfully committed once a majority of the servers in the cluster hold its log record. RAFT can guarantee that the cluster is eventually consistent, but it does not guarantee that the cluster is consistent at every moment, and this causes problems for disaster recovery. For example, take a cluster of five servers with three in data center A and two in data center B. If the leader is in center A, the three servers in A are on the same local network and usually replicate the log faster than the two remote servers in B; in other words, there is no guarantee that the servers in center B hold the latest log records. If a disaster strikes center A and it cannot be recovered, the data on the servers in center B is incomplete. For some businesses (bank accounts, for example) this is unacceptable. SequoiaDB's approach is a replica-count parameter that dictates how many nodes must hold the latest log records before a write is considered successful.

Therefore, the SequoiaDB distributed database optimizes the RAFT algorithm in its node election, while its data synchronization and log replication are entirely its own design. In SequoiaDB's architecture, the master node corresponds to RAFT's leader, the slave node to the follower, and the candidate is the same; SequoiaDB's replication log corresponds to RAFT's log records; and a SequoiaDB replication group corresponds to one group of servers in RAFT. A SequoiaDB cluster supports multiple replication groups, and each replication group can be viewed as one RAFT server group.

SequoiaDB has three node types: coordination nodes, catalog nodes, and data nodes. The coordination nodes receive commands and dispatch them to the catalog and data nodes; they store no data and are independent of one another, so their availability needs no special treatment. A catalog node is really a special kind of data node, and its high availability works the same way as a data node's. The coordination node can be thought of as a client of the data node replication groups.

SequoiaDB likewise solves the larger problem of cluster data consistency by solving three smaller problems.

3.2.1 Node election

A SequoiaDB replication group can contain 1 to 7 data nodes, but to provide high availability a replication group must have at least 3 nodes. Each replication group has exactly one master node at any time; the rest are slave nodes. While a master election is in progress, any slave node may be a candidate. SequoiaDB also elects the master by voting among the nodes of the group, and the node that receives more than half of the votes becomes the master.

To ensure the election succeeds and the elected master holds the latest database logs, SequoiaDB makes some optimizations on top of the RAFT algorithm, especially in how candidate nodes qualify.

To become a candidate and initiate an election, a slave node must satisfy the following conditions: it is not the master node, the number of nodes it can still reach by heartbeat is more than half of the group, and its own LSN (log sequence number) is at least as new as the LSNs of the other nodes it can see. A replication group cannot automatically start an election while a master exists (the master can be switched manually); a slave node may initiate an election only when the master is unavailable. If the cluster splits, say five nodes separate into a 2-node group and a 3-node group that cannot reach each other, then if the current master is in the 3-node group no election occurs, because that side still has a majority of the nodes. If the current master is in the 2-node group, it automatically demotes itself to a slave, and the 3-node group initiates an election to choose a new master.
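These candidacy conditions can be summarized in a small check like the following sketch; the function and its parameters are illustrative, not SequoiaDB internals.

```python
# A sketch of the candidacy conditions: not the master, still able to reach a
# majority of the group over heartbeats, and holding an LSN at least as new as
# every node it can see.
def can_start_election(is_master, reachable_peers, group_size, my_lsn, peer_lsns):
    has_majority = (reachable_peers + 1) > group_size // 2   # +1 counts this node
    lsn_is_newest = all(my_lsn >= lsn for lsn in peer_lsns)
    return (not is_master) and has_majority and lsn_is_newest

# Example: a 5-node group split 3/2 - only a node on the 3-node side may elect.
print(can_start_election(False, 2, 5, 120, [118, 120]))  # True
print(can_start_election(False, 1, 5, 120, [119]))       # False: only 2 reachable
```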

When all nodes in a replication group are healthy, each node shares its status through heartbeat messages called sharing-beat. A shared heartbeat carries: the heartbeat ID, the node's starting LSN, its ending LSN, a timestamp, the data group version number, the node's current role, and its synchronization status. As shown in the figure:

Each node maintains a status-sharing table recording the status of the other nodes. Sharing-beat messages are sent every 2 seconds to collect responses; if no reply is received for two consecutive rounds, the node is considered down. The ReplReader (replication listener thread) in each node process is responsible for sending and receiving node status information.
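A rough Python sketch of the sharing-beat payload and the status table might look like this; all field and function names are illustrative, not SequoiaDB's actual structures.

```python
# A sketch of the shared heartbeat (sharing-beat) payload and the per-node
# status table described above.
from dataclasses import dataclass

@dataclass
class SharingBeat:
    beat_id: int
    start_lsn: int       # oldest LSN this node still holds
    end_lsn: int         # newest LSN this node has written
    timestamp: float
    group_version: int
    role: str            # "master" or "slave"
    sync_status: str

status_table = {}        # node id -> {"beat": SharingBeat, "missed": int}

def on_beat_received(node_id, beat):
    status_table[node_id] = {"beat": beat, "missed": 0}

def on_heartbeat_round():
    # Heartbeats go out every 2 seconds; a node missing two rounds in a row
    # is treated as down.
    down = []
    for node_id, entry in status_table.items():
        entry["missed"] += 1
        if entry["missed"] >= 2:
            down.append(node_id)
    return down
```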

Before a slave node initiates an election, it compares its own LSN with the LSNs of the other slave nodes in the shared heartbeat information. If its LSN is greater than or equal to the LSNs of the other slave nodes, it may initiate the election request; otherwise it does not.

During the election, if multiple candidate slave nodes have the same LSN and all initiate requests, their configured weight parameters (weight) are compared. Each node in a replication group can be given its own weight, from 0 to 100; the larger the number, the higher the weight. Among slave nodes with the same LSN, the one with the higher weight wins the election.

If several slave nodes have the same LSN and the same weight, their node numbers are compared. Each node is assigned a distinct node number when it is created, and the node with the larger number wins the election.

As can be seen, SequoiaDB optimizes the election process by adding weight and node-number comparisons on top of the RAFT algorithm, so the split-vote situation in RAFT, where several nodes receive the same number of votes and the election fails, does not occur.
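The tie-break order can be captured as a sort key; the sketch below uses illustrative dictionaries and plays out the same situation as the dialogue that follows.

```python
# A sketch of the election tie-break: compare LSN first, then configured weight
# (0-100), then node number, with "larger wins" at every step.
def election_key(candidate):
    return (candidate["lsn"], candidate["weight"], candidate["node_id"])

candidates = [
    {"name": "B", "lsn": 500, "weight": 80, "node_id": 1006},
    {"name": "C", "lsn": 500, "weight": 80, "node_id": 1008},
]
winner = max(candidates, key=election_key)
print(winner["name"])   # "C" wins on the node number, as in the dialogue below
```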

Here is a simple example of SequoiaDB's election process.

As shown in the figure, a 3-node replication group consists of master node A and slave nodes B and C. Under normal conditions the three nodes share status information through heartbeats. If master node A fails for some reason (server power loss, network failure, disk failure, and so on), it is automatically demoted to a slave node and the group begins a new election.

If the two slave nodes fail to receive the master's heartbeat for a second consecutive round, they consider the master to be down.

Slave node B: "Hey hey, the master node is dead, I have a chance to be the boss. Let me see, I am the slave node, LSN is the latest. Hi, brother C, my LSN is XXX, I want to be the boss, do you agree?"

From Node C: "Gaga, the boss is dead! not being the eldest brother for many years, I miss a little. Let me see, I am from the node, LSN is the latest, there is a chance. Hi, Brother B, my LSN is XXX, I want to be the boss, do you agree?"

From Node B: "Brother C also wants to be the boss, LSN is the same as me. Hi Brother C, my LSN is also XXX, my weight is 80, or should I be?"

From node C: "Brother B, I'm sorry, my weight is also 80, I also want to be." my node number is 1008. "

From node B: "Brother C's node number is 1008, mine is 1006; lose, brother C, you are the boss."

From Node C: "I voted for myself, and Brother B voted for me. Now it's 2 votes, which is greater than 3. That's enough. Hey, Brother B, I'm the boss."

From node B: "copy that, boss."

After the new election succeeds, the new master node notifies the catalog nodes to update the master information; the coordination nodes synchronize the node status from the catalog nodes and connect to the new master; a coordination node can also discover the new master by polling the nodes. In this way high availability is preserved after the master goes down. When the original master node A recovers, it automatically becomes a slave node.

3.2.2 Log replication

SequoiaDB's log replication mechanism differs from RAFT's. First, RAFT logs always flow from the leader (Leader) to the followers (Follower), and followers never exchange log data with each other. In SequoiaDB, log replication happens between a source node (the node that sends the log) and a target node (the node that requests the log): the target node actively requests log data from the source node. The source node is usually the master, but other slave nodes can also act as source nodes, while the target node can only be a slave node. Second, in RAFT, if the leader receives no replication acknowledgement from a follower, it keeps sending logs to that follower until it gets one, which is incremental replication; in SequoiaDB, data synchronization can be either incremental log replication or a triggered full data synchronization.

SequoiaDB does not use RAFT-style log replication because it does not suit the distributed database scenario. For example, if a slave node is down and the master keeps pushing log data to it as RAFT prescribes, and the volume of log data is large, it consumes a great deal of the master's CPU, memory, and network resources, seriously hurting the master's performance and possibly causing blocking. SequoiaDB instead has the slave node pull: a slave requests new log data only when it is in a healthy state and states exactly which log data it needs, so nothing is sent repeatedly, and because a slave node can also act as the source node, the master's workload is greatly reduced. Furthermore, in real environments storage space is limited, and so is the log space a database can be configured with, so SequoiaDB's data synchronization has two modes: incremental log synchronization and full data synchronization. If the data the target node needs is still within the source node's replication log space, incremental log synchronization is used; if it is no longer in the replication log space, a full synchronization is required. Most importantly, a distributed database must support ACID and transactions, a scenario more complex than the log replication described by RAFT; SequoiaDB uses a two-phase commit algorithm to provide transactions within a replication group while balancing performance and transactional consistency.

Incremental log synchronization

On data nodes and catalog nodes, every insert, delete, or update operation is written to the log. SequoiaDB first writes the log to a log buffer and then asynchronously flushes it to local disk.

Each data replication occurs between two nodes:

Source node: a node that holds the newer data. The master node is not necessarily the source of a replication.

Target node: the node that requests data to replicate.

During replication, the target node selects the closest node (the shared node status table contains each node's starting and ending LSN, from which the target can work out which node is closest to its own log position) and sends it a replication request. On receiving the request, the source node packages the log records after the synchronization point requested by the target and sends them over; the target node replays all the operations in the log once it receives the package.

Replication between nodes has two states:

Peer state (PEER): when the log requested by the target node is still in the source node's log buffer, the two nodes are in the peer state.

Remote catch-up state (RemoteCatchup): when the log requested by the target node is no longer in the source node's log buffer but still exists in the source node's log files, the two nodes are in the remote catch-up state.

If the log requested by the target node no longer exists even in the source node's log files, the target node enters full synchronization.

When two nodes are in the peer state, a synchronization request can be served directly from the source node's memory. For this reason, when a target node chooses a replication source it always tries to pick the node whose data is closest to its current log position, so that logs can be served from memory as far as possible.
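A small sketch of how the requested log position maps to these three situations; the LSN values and names are illustrative.

```python
# A sketch of the sync-mode decision: peer state if the requested log is still
# in the source's buffer, remote catch-up if only in its log files, otherwise
# a full data synchronization is triggered.
def sync_mode(requested_lsn, buffer_start_lsn, file_start_lsn):
    if requested_lsn >= buffer_start_lsn:
        return "PEER"            # served straight from the in-memory log buffer
    if requested_lsn >= file_start_lsn:
        return "REMOTE_CATCHUP"  # served from the on-disk log files
    return "FULL_SYNC"           # log already truncated; fall back to full sync

print(sync_mode(900, buffer_start_lsn=800, file_start_lsn=100))  # PEER
print(sync_mode(300, buffer_start_lsn=800, file_start_lsn=100))  # REMOTE_CATCHUP
print(sync_mode(50,  buffer_start_lsn=800, file_start_lsn=100))  # FULL_SYNC
```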

Full data synchronization

Within a partition group, full data synchronization is needed when a new node joins the group, or when a failed node rejoins it (its log version differs from the other nodes' by more than the node's transaction log space can cover, or its data is found to be inconsistent on restart). Full synchronization guarantees that the new node's data is consistent with the existing nodes'.

Two nodes are involved in the full data synchronization:

Source node: a node that holds valid data. The master node is not necessarily the source of a synchronization; any slave node that is in sync with the master can serve as the source.

Target node: the newly joined node or the failed node rejoining the group. During synchronization, the existing data on this node is discarded.

During a full synchronization, the target node periodically requests data from the source node. The source node packages the data into large blocks and sends them to the target; when the target has replayed everything in a block, it requests the next one.

To keep the source node writable during the synchronization, any change made to a data page that has already been sent to the target is also forwarded to the target, so updates are not lost while the full synchronization is in progress.
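The following sketch mimics that flow with an in-memory stand-in for the target node; all names are illustrative.

```python
# A minimal sketch of the full-sync loop: the target pulls data block by block,
# and a source-side write to a page already shipped is forwarded so nothing is
# lost while the sync runs.
class TargetNode:
    def __init__(self):
        self.pages = {}
    def apply_block(self, block):        # block: {page_id: page_data}
        self.pages.update(block)
    def apply_change(self, page_id, data):
        self.pages[page_id] = data

def full_sync(source_blocks, target):
    shipped = set()
    for block in source_blocks:          # the source packages data in big blocks
        target.apply_block(block)
        shipped.update(block)
    return shipped

target = TargetNode()
shipped = full_sync([{1: "a", 2: "b"}, {3: "c"}], target)
# A concurrent write on the source to an already-shipped page is forwarded:
if 2 in shipped:
    target.apply_change(2, "b2")
print(target.pages)   # {1: 'a', 2: 'b2', 3: 'c'}
```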

Number of synchronized replicas

To suit different business scenarios, SequoiaDB lets you set the ReplSize parameter (the number of replicas a write must be synchronized to) separately for each collection.

The default ReplSize value is 1. The optional values are as follows:

-1: the write request must be synchronized to all active nodes of the replication group before the write operation returns to the client.

0: the write request must be synchronized to all nodes of the replication group before the write operation returns to the client.

1-7: the write request must be synchronized to the specified number of nodes in the replication group before the write operation returns to the client.

In a real project, to guarantee that data is not lost, ReplSize should be set to a value greater than 1, or to -1. For example, if a collection's ReplSize is 2, a write on the master node does not return success to the coordination node until one slave node has also confirmed it. This guarantees that if the master then goes down, at least one slave already holds the log of the successful write; during the re-election, the newest-LSN rule ensures that only such a slave can become a candidate, so the data is not lost.
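The effect of ReplSize on write acknowledgement can be sketched as follows; this is an illustration of the rule described above, not SequoiaDB's internal code.

```python
# A sketch of how ReplSize maps to the number of nodes that must hold a write
# (the writing master counts as one of them) before success is returned.
def acks_needed(repl_size: int, active_nodes: int, total_nodes: int) -> int:
    if repl_size == -1:
        return active_nodes              # all currently active nodes in the group
    if repl_size == 0:
        return total_nodes               # every node, active or not
    return min(repl_size, total_nodes)   # a fixed number of nodes, 1..7

# Example: ReplSize=2 on a 3-node group -> the master plus one slave must hold
# the log before success is returned, so a master crash cannot lose the write.
print(acks_needed(2, active_nodes=3, total_nodes=3))  # 2
```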

3.3 Safety

SequoiaDB's cluster consistency algorithm places certain restrictions on the election and replication rules to guarantee that data is not lost and remains safe.

Election restriction

As mentioned earlier, for a SequoiaDB node to become a candidate and initiate an election it must satisfy these conditions: it is not the master node, the number of nodes it can still reach by heartbeat is more than half of the group, and its own LSN (log sequence number) is at least as new as the LSNs of the other nodes. These rules guarantee that the new master's data is up to date and that it can communicate with more than half of the other nodes.

Log replication restriction

Within a replication group, SequoiaDB uses a two-phase commit algorithm to support transactions and uses the per-collection ReplSize parameter (number of write replicas) to dictate how many slave nodes the master must synchronize successfully. Through these means SequoiaDB guarantees the consistency and integrity of data and transactions within the replication group.

In terms of log replication, the master node of a replication group guarantees that every log record (each carrying a unique, monotonically increasing LSN) is replicated to the slave nodes according to the replication rules. For example, if a table's ReplSize is 2, the master does not return success to the coordination node until it has received confirmation that at least one slave node has replicated the log; otherwise it returns failure. If ReplSize is -1, the master returns success only after every active slave node has confirmed successful log replication; otherwise it returns failure.

In terms of transaction integrity, SequoiaDB's two-phase commit within a replication group proceeds as follows:

The client starts a transaction with transBegin() and sends it to the coordination node, which forwards it to the master node for execution. The master node generates a unique transaction ID.

When the master node receives the first write operation (insert, update, or delete), it generates a log record containing the transaction ID. This record is sent to the slave nodes according to the replication rules (the write replica count; even if ReplSize is 1, SequoiaDB ensures that at least one slave node holds a log record for this transaction ID while the transaction is executing) and their acknowledgement is confirmed.

The master node then carries out the remaining operations; each write generates a new log record, which is sent to the slave nodes according to the replication rules and confirmed.

When the master node receives the transCommit() instruction, it starts the two-phase commit.

The master node first executes phase one, the pre-commit. In this phase it generates a pre-commit log record, sends it to the slave nodes, and makes sure the slaves acknowledge receipt.

The master node then executes phase two, the actual commit. In this phase it generates a commit log record, sends it to the slave nodes for execution, and makes sure the slaves acknowledge success. When this phase succeeds, the whole transaction is confirmed as successful and success is returned to the coordination node.

During the transaction, if any replication rule is violated so that the transaction cannot proceed (for example, a table's ReplSize is 3 but for some reason the replication group, master included, has only 2 active nodes; the insert then fails and the transaction cannot continue), the transaction is rolled back and the operations already performed are undone.

If the master node goes down during the transaction, then after a new master is successfully elected SequoiaDB automatically decides, based on how far the transaction had progressed, whether to commit it or roll it back, resolving the in-doubt transaction.
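Putting the pieces together, here is a compact sketch of the two-phase commit flow inside a replication group; the classes and the replication-rule check are illustrative stand-ins, not SequoiaDB code, and repl_size here is the literal number of nodes that must hold each record (the -1/0 cases map to counts as in the earlier ReplSize sketch).

```python
# A sketch of the in-group two-phase commit: every write is logged and replicated,
# then a pre-commit record and a commit record are replicated in turn; any failure
# to satisfy the replication rule rolls the transaction back.
class Slave:
    def __init__(self):
        self.log = []
    def replicate(self, record):
        self.log.append(record)
        return True                          # acknowledgement

class Master:
    def __init__(self, slaves, repl_size):
        self.slaves = slaves
        self.repl_size = repl_size
        self.log = []

    def _replicate(self, record):
        self.log.append(record)
        acks = 1                             # the master itself holds the record
        for s in self.slaves:
            if acks >= self.repl_size:
                break
            if s.replicate(record):
                acks += 1
        return acks >= self.repl_size

    def commit(self, txn_id, writes):
        try:
            for w in writes:                                 # each write is logged
                if not self._replicate(("write", txn_id, w)):
                    raise RuntimeError("replication rule violated")
            if not self._replicate(("pre-commit", txn_id)):  # phase one
                raise RuntimeError("pre-commit not acknowledged")
            if not self._replicate(("commit", txn_id)):      # phase two
                raise RuntimeError("commit not acknowledged")
            return "committed"
        except RuntimeError:
            self._replicate(("rollback", txn_id))            # undo prior writes
            return "rolled back"

master = Master([Slave(), Slave()], repl_size=2)
print(master.commit("txn-1", ["insert a", "update b"]))      # committed
```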

That concludes the discussion of the principle behind SequoiaDB's high availability. I hope the content above is of some help. If you still have questions, follow the industry information channel to learn more.
