
Solving the High Availability Problem of Distributed Databases: Implementing the TDSQL High Availability Solution


Tencent Cloud Database's online technology salon on domestic databases is in full swing. Zhang Wen's session on March 12 has concluded. If you did not manage to attend, don't worry: the live video and a text recap are provided below.

Follow the "Tencent Cloud Database" official account and reply "0312" to download the live broadcast recording and the shared PPT.

(Video replay on Tencent Video: Solving the High Availability Problem of Distributed Databases: Implementing the TDSQL High Availability Solution)

Hello everyone. The topic I am sharing today is TDSQL's multi-region, multi-center high availability solution. TDSQL is a financial-grade distributed database launched by Tencent. For availability and data consistency, it relies on a self-developed strong synchronous replication protocol that guarantees two copies of data across IDCs while maintaining high performance. On top of strong synchronous replication, TDSQL implements an automatic disaster recovery switchover scheme that guarantees zero data loss before and after a switchover and provides 7 × 24 continuous high-availability service to the business.

In fact, it is not only databases: any system that needs high availability needs a highly available deployment architecture. Some technical terms will come up, such as remote multi-active, active-active, and leader election, in addition to familiar concepts like two regions and three centers, two regions and four centers, and same-city active-active. It is worth stressing that today's sharing applies not only to databases but to the deployment architecture of any high-availability system.

In this sharing we will introduce several typical TDSQL deployment architectures and their advantages and disadvantages. In real production there are all kinds of resource constraints: the disaster recovery effect of one data center versus two data centers in the same city is completely different; and even with two data centers, their specifications may differ greatly, with some having good network links and others poor ones. How to build a cost-effective TDSQL deployment under such resource constraints is the main content of this sharing.

Before we get to the main topic, let's review the core features and overall architecture of TDSQL covered in the previous session.

OK, let's take a look at the core features of TDSQL first.

Our focus today is "financial-grade high availability". How does TDSQL achieve availability above 99.999%? Five nines means the total unavailable time for the whole year must not exceed roughly 5 minutes. Failures are unavoidable, and they come in grades: from software failures and operating system failures to machine restarts and rack power outages, the disaster level rises step by step. A financial-grade database must also consider and handle higher-level failure scenarios, such as the whole data center losing power, or even natural disasters such as an earthquake or explosion in the city where the data center is located. When such a failure occurs, can the system first guarantee that no data is lost, and, on that premise, how long does it take to restore service? These are the questions a financial-grade high-availability database must answer.
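
To see where the "roughly 5 minutes" figure comes from, here is a quick back-of-the-envelope calculation (illustrative only):

# Allowed downtime per year at 99.999% availability.
minutes_per_year = 365 * 24 * 60
allowed_downtime_min = minutes_per_year * (1 - 0.99999)
print(f"{allowed_downtime_min:.2f} minutes")   # ~5.26 minutes per year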

I. TDSQL data consistency: the strong synchronization mechanism is the core guarantee

First, let's review a key feature of TDSQL: the strong synchronization mechanism. It is the key to ensuring that data in TDSQL is neither lost nor corrupted, and compared with MySQL's semi-synchronous replication, the performance of TDSQL's strong synchronous replication is close to that of asynchronous replication.

So how does this high-performance strong synchronization work? Strong synchronous replication requires that any request, before it is acknowledged to the business as successful, must be persisted not only on the primary but also on at least one standby. After a request arrives at the primary, it is immediately forwarded to the standbys, and the primary can reply success to the business only after at least one of the two standbys has acknowledged. In other words, any request successfully acknowledged to the front-end business has at least two copies: one on the primary node and one on a standby node. Strong synchronization is therefore the key guarantee of multiple data copies. By introducing a thread pool model and a flexible scheduling mechanism, TDSQL makes the worker threads asynchronous, which greatly improves the performance of strong synchronization; after this transformation, its throughput is close to that of asynchronous replication.
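
To make the rule concrete, here is a minimal conceptual sketch in Python of the "acknowledge only after at least one standby has the binlog" behavior. It is not TDSQL's actual implementation; the class, its parameters, and the standby callables are illustrative placeholders.

import threading

class StrongSyncPrimary:
    # Conceptual model only: "standbys" are callables that ship a binlog event to a
    # standby and return True once that standby has persisted (not replayed) it.
    def __init__(self, standbys, ack_quorum=1, timeout_s=5.0):
        self.standbys = standbys
        self.ack_quorum = ack_quorum      # at least one standby must confirm
        self.timeout_s = timeout_s

    def commit(self, binlog_event):
        done = threading.Event()
        acks = []
        lock = threading.Lock()

        def ship(standby):
            if standby(binlog_event):
                with lock:
                    acks.append(standby)
                    if len(acks) >= self.ack_quorum:
                        done.set()

        # Ship the event to all standbys in parallel; the worker thread is not
        # blocked while the binlog is on the wire (the "asynchronous worker" idea).
        for s in self.standbys:
            threading.Thread(target=ship, args=(s,), daemon=True).start()

        # Acknowledge success to the business only once the ack quorum is reached.
        return done.wait(self.timeout_s)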

(TDSQL Core Architecture)

After strong synchronization, let's review the core architecture of TDSQL. The load balancer is the entry point for the business; business requests reach the SQL engine module through the load balancer, and the SQL engine forwards the SQL to the back-end data nodes. The upper part of the figure is the cluster's management and scheduling module, the overall controller of the cluster, responsible for resource management, fault scheduling, and so on. TDSQL as a whole is therefore divided into a computing layer, a data storage layer, and cluster management nodes. As emphasized before, one set of management nodes is deployed per cluster, usually an odd number such as 3, 5, or 7. Why an odd number? Because when a disaster strikes, an odd number of nodes can hold an election. For example, if three nodes are deployed across three data centers and one data center fails while the other two can still reach each other but not the third, the two can reach a consensus that the third has failed and evict it. Think of the management modules in the three data centers as the three brains of the cluster: after one brain is lost, the remaining brains can keep providing service as long as they still constitute more than half of the original number, and otherwise they cannot.
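
The majority rule for the management "brains" can be stated in a couple of lines (an illustrative sketch, not TDSQL code):

def has_quorum(alive_nodes, total_nodes):
    # Service continues only while the reachable management nodes are a strict
    # majority of the original membership.
    return alive_nodes * 2 > total_nodes

print(has_quorum(2, 3))  # True: one of three IDCs lost, the other two keep serving
print(has_quorum(1, 3))  # False: only one brain left, no election is possible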

II. Deployment practice of highly available clusters

The above is a review of some of TDSQL's core features. Next, let's look at hardware selection for each module. For a distributed database with separated compute and storage, how should machines be chosen? To make an IT system deliver maximum value, its machine resources should be used as fully as possible so that they are cost-effective. If a machine's CPU is running at full load but its IO has no load, or it has 128 GB of memory but actually uses only 2 GB, that is a very inefficient deployment that cannot realize the overall performance of the machine and the system.

The first module is LVS. As the access layer, it is not an internal TDSQL component: TDSQL's SQL engine is compatible with different load balancers, such as the software load balancer LVS, the hardware load balancer L5, and so on. As the access layer, a load balancer is generally a CPU-intensive service, because it maintains and manages a large number of connections, which consumes CPU and memory. The recommended configuration here therefore has relatively high CPU: 16 vCPU, 32 GB of memory, and it must have a 10-gigabit network port. Network card costs are very low today, so 10-gigabit NICs should generally be used. "vCPU" emphasizes that there are 16 logical cores (perhaps only 8 physical cores), because most of our programs are multithreaded.

Next, the compute nodes. If the cluster is small and resources are tight, compute nodes can share machines with the storage nodes, because the storage-node specification basically covers the compute node's needs: 16 vCPU, 32 GB or more of memory, and a 10-gigabit network port.

After the access layer and compute layer, let's look at the data storage layer. Storage nodes are responsible for data access and are IO-intensive, so PCI-E SSDs are recommended, and a dedicated physical machine is required. For databases we recommend deployment on real physical machines, which are more stable than virtual machines. In addition, if conditions allow, add a layer of RAID 0 to make the data node's read and write capability stronger. Some of you will ask why the data node uses RAID 0 directly rather than RAID 5 or RAID 10. Because TDSQL itself is a one-primary, multi-standby architecture and more standbys can be added, there is no need for further redundancy in the disk array; unlimited redundancy only reduces cost-effectiveness. For a data node, the recommended configuration is 32 vCPU and at least 64 GB of memory. The data node uses the InnoDB engine, which relies heavily on caching, so large memory brings a significant performance improvement. Machines with large memory, 10-gigabit networking, and PCI-E SSDs are therefore recommended here.

Next is the management node: the recommended configuration is an 8-core CPU, 16 GB of memory, and a 10-gigabit network port. The management node's workload is relatively light, and a cluster needs only a small number of them; if physical machines are not available, virtual machines also work, and the configuration can be significantly lower than that of the compute and storage nodes above.

The backup node: the larger its storage, the better. It is mainly responsible for storing cold data, and ordinary SATA disks are sufficient.

The exact machine models above don't matter much; they are internal Tencent model numbers with no external significance. To summarize briefly: compute nodes rely on CPU and memory and don't require much disk, while storage nodes also need strong CPU and memory but place more emphasis on IO capability (PCI-E SSDs are required).
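
For quick reference, the figures quoted above can be collected into one structure. This is only an illustrative summary of the numbers mentioned in this talk, not an official sizing guide:

# Recommended per-module hardware, as quoted in this talk (illustrative summary).
recommended_specs = {
    "access_lvs":   {"cpu": "16 vCPU", "mem": "32 GB",  "nic": "10GbE"},
    "sql_engine":   {"cpu": "16 vCPU", "mem": "32 GB+", "nic": "10GbE"},
    "data_node":    {"cpu": "32 vCPU", "mem": "64 GB+", "nic": "10GbE",
                     "disk": "PCI-E SSD, RAID 0, physical machine"},
    "manager_node": {"cpu": "8 cores", "mem": "16 GB",  "nic": "10GbE"},
    "backup_node":  {"disk": "large-capacity SATA (cold data)"},
}
for module, spec in recommended_specs.items():
    print(f"{module:13s} {spec}")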

I hope this introduction helps deepen your understanding of TDSQL, and of compute-storage-separated databases in general. From this interpretation of the machine configurations, we can clearly see which configurations allow the system to perform at its best.

To sum up: if the machine types are matched correctly and the business uses the system according to the specifications, you can easily bring out the database's full performance, that is, obtain greater business-supporting capacity at lower operating cost. Large volumes of business come with cash flow, for example advertising, games, and e-commerce; if a low-cost system handles such business requests easily and efficiently at full load, the economic effect is considerable.

III. Deployment plans for cross-city, cross-data-center disaster recovery

The third part brings us to the main topic. In this chapter we will introduce several typical deployment plans and explain those familiar terms: three centers in the same city, two regions and three centers, remote multi-active. What does remote multi-active mean, and what effect does it bring? I will answer these questions one by one.

1. The structure of "three centers in the same city"

The first is the "three centers in the same city" architecture. As the name implies, there are three data centers A, B, and C in one city, and TDSQL still adopts a one-primary, two-standby structure. Obviously, the three data nodes are deployed in the three data centers respectively: the primary node in one data center and the two standby nodes in the other two.

Each IDC deploys its own pair of highly available load balancers (LVS or L5). Why does every IDC need one? Because each IDC has its own business and needs independent load-balanced access. Seen from the access layer, the three data centers form a parallel, peer-to-peer structure, and all three carry their own business: perhaps the first data center serves one geographic region and the second serves another. That is what peer-to-peer means. This structure is relatively simple and the whole picture is relatively clear.

Looking at the architecture diagram, as we just said, it is a symmetrical structure. Start with the business, from top to bottom: each data center may have its own business system, and the business in each data center accesses TDSQL's SQL engine through LVS load balancing. Because multiple SQL engines need to be deployed in the same data center for high availability, and the business prefers not to be aware of the multiple SQL engines at the back end, an LVS access layer is introduced: the business only needs to access the load balancer's VIP. When a request reaches the SQL engine, the SQL is sent to the primary or standby node according to routing information, and the result is returned to the business. Now look at the data nodes: one primary and two standbys are deployed across the three data centers. If any data center fails, the primary can be switched to one of the other two data centers. Under the three-centers-in-the-same-city architecture, there is no single point from the computing layer to the storage layer, so high availability and disaster recovery are achieved: the failure of any data center causes no data loss, and under TDSQL's consistency-preserving switchover mechanism, the failed node can be switched within 30 seconds.

The management node is not shown in this diagram. As we just said, the management nodes can be seen as the brain of the entire cluster, responsible for judging the current overall situation. Three data centers obviously need three brains; as mentioned earlier, when one of them has a problem, the other two form a majority and vote out the malfunctioning one.

2. The structure of "single center in the same city"

There are several scenarios for the "single center in the same city" architecture:

The first scenario is that IDC resources are tight and there is only one data center. In this scenario, cross-data-center deployment is not possible; deployment can only be done across racks. When the primary node fails, or the rack where the primary node is located fails, an automatic switchover can still be performed.

The second scenario is that the business pursues extreme performance and cannot even tolerate the latency of a cross-IDC network. Although today's data centers are connected by optical fiber and the network latency between two data centers 50 km apart is under 1 ms, some special businesses cannot tolerate even a millisecond of delay. In that case, the primary and standby can only be deployed in the same data center.
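
As a rough sanity check on the sub-millisecond figure, here is an illustrative propagation-delay estimate, assuming the signal travels through fiber at roughly 200,000 km/s (about two thirds of the speed of light); real links add switching and queuing delay on top:

# Rough propagation-delay estimate for a 50 km metro fiber link (floor, not a promise).
distance_km = 50
fiber_speed_km_per_s = 200_000
one_way_ms = distance_km / fiber_speed_km_per_s * 1000
round_trip_ms = 2 * one_way_ms
print(f"one-way ~{one_way_ms:.2f} ms, round trip ~{round_trip_ms:.2f} ms")
# -> one-way ~0.25 ms, round trip ~0.50 ms, consistent with "under 1 ms".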

The third scenario is using it as a remote disaster recovery data center. As disaster recovery storage it generally receives no real business traffic and is used more for backup and archiving, so the resource investment in it is relatively limited.

The fourth is as a test environment, so I won't say much about this.

3. The structure of "two regions and three centers"

Next, let's talk about the protagonist of this sharing: the "two regions and three centers" architecture. It is not only a common deployment for banks but also the baseline deployment architecture required by regulators. With two data centers in the same city plus one data center in a remote city, this architecture provides good availability and data consistency at relatively low cost. It can switch over automatically on node failures and IDC failures, is very suitable for financial scenarios, and is the deployment method TDSQL primarily recommends.

(Deployment architecture of a "two regions and three centers" database instance)

In terms of deployment, viewed from top to bottom, there are two data centers in the same city and one disaster recovery data center in a remote city. The top layer is the cluster's "brain", the management module, which is deployed across all three data centers.

The management module can be deployed in a "2-2-1" layout, or in other layouts that keep the total an odd number. With a "2-2-1" deployment, when the first data center fails, "2 + 1 = 3" brains remain; 3 is more than half of 5, so the remaining nodes form a majority, evict the failed nodes, and continue providing service.
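
Here is the same arithmetic as a small worked example of the "2-2-1" layout (illustrative only):

# "2-2-1": 5 management nodes over IDC1 (2), IDC2 (2), and the remote IDC3 (1).
layout = {"IDC1": 2, "IDC2": 2, "IDC3": 1}
total = sum(layout.values())   # 5

def survivors_after_losing(failed_idc):
    return sum(n for idc, n in layout.items() if idc != failed_idc)

for idc in layout:
    alive = survivors_after_losing(idc)
    verdict = "majority survives" if alive * 2 > total else "NO majority"
    print(f"{idc} down -> {alive} of {total} alive: {verdict}")
# Losing any single IDC leaves at least 3 of 5 nodes, so a majority always survives.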

Looking further down, the data nodes adopt a "one primary, three standbys" mode: strong synchronization across data centers, asynchronous within the same data center. Why must the same-data-center standby be asynchronous rather than strongly synchronous? If it were strongly synchronous, then because it sits closer to the primary than the two standbys in the other data center (IDC1 and IDC2 are generally at least 50 km apart), every request sent to the primary would be acknowledged first by the strong-sync standby in the same data center, and the latest data would always land on that same-data-center standby. What we want is for the two copies of the data to be located in different data centers 50 km apart, so that the data remains consistent when the primary is switched across data centers.

Someone may ask: then what is the difference between configuring this asynchronous node in IDC1 and not having it at all? Here is why the asynchronous node is worth having. Consider the situation where the standby data center IDC2 fails and both of its standby nodes go down, leaving the primary in IDC1 as a single point. If strong synchronization stayed enabled, the primary could no longer serve requests, because no standby can acknowledge them. If strong synchronization were simply turned off so service continues, the data would carry single-point risk: should a software or hardware failure then hit the primary, the data could be lost for good. A better solution is to add a cross-rack asynchronous node in IDC1 and promote it to strong synchronization when IDC2 is down. In this way, even with only one data center left, we can still keep a cross-rack copy and reduce the single-point risk of the primary.
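
The promote-on-degradation policy just described can be sketched as follows. This is a conceptual illustration, not TDSQL's controller logic; the replica names, IDC labels, and the function are made up for the example:

# Normally the commit ack must come from a standby in the other data center; if the
# whole standby IDC is lost, the cross-rack asynchronous replica in the primary's
# own IDC is promoted to strong sync so every commit still lands on a second copy.
replicas = [
    {"name": "standby-a", "idc": "IDC2", "rack": "r1", "mode": "strong_sync"},
    {"name": "standby-b", "idc": "IDC2", "rack": "r2", "mode": "strong_sync"},
    {"name": "standby-c", "idc": "IDC1", "rack": "r9", "mode": "async"},  # cross-rack
]

def strong_sync_candidates(replicas, primary_idc="IDC1", standby_idc_alive=True):
    if standby_idc_alive:
        # Normal state: only cross-IDC replicas may acknowledge strong-sync commits.
        return [r["name"] for r in replicas if r["idc"] != primary_idc]
    # Degraded state: the standby IDC is down, promote the same-IDC async replica.
    return [r["name"] for r in replicas if r["idc"] == primary_idc]

print(strong_sync_candidates(replicas))                           # ['standby-a', 'standby-b']
print(strong_sync_candidates(replicas, standby_idc_alive=False))  # ['standby-c']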

Having looked at the two data centers in the main city, let's look at the remote disaster recovery data center. It is usually more than 500 km away from the primary, with latency above 10 ms. Under such network conditions, data can only be synchronized from the main city to the disaster recovery node by asynchronous replication, so the remote disaster recovery node mainly carries backup responsibilities and normally does not receive much formal business traffic. It may look like a vase on the surface, but you cannot do without it: if a city-level disaster strikes one day, the disaster recovery instance can still recover more than 99% of the data for us. It is precisely because the relationship between the disaster recovery node and the primary is this weak, asynchronous one that the disaster recovery instance can be an independently deployed unit in the disaster recovery city.

Besides acting as an asynchronous data backup, the remote disaster recovery data center has another important responsibility: when one data center in the main city fails, it forms a majority together with the remaining healthy main-city data center, evicts the failed one, and completes the primary/standby switchover. The brain deployed in the remote city is normally not involved in the main city's affairs; it participates only when a main-city data center goes down. Under normal circumstances, the main city's modules access the main city's brains and the backup city's modules access the backup city's brain, so excessive latency is not a problem.

4. "two centers" structure

Having covered the "three centers" architectures, let's talk about the "two centers" structure, that is, only two data centers in the same city. Following the experience from the previous slide, deploying TDSQL across two data centers means asynchronous replication within a data center and strong synchronization across data centers; a four-node model is therefore adopted, distributed across the 2 IDCs.

However, the "two centers" architecture comes with a tradeoff: automatic cross-IDC disaster recovery works only for certain failure cases. With asynchronous replication inside a data center and strong synchronization across data centers, whichever data center the majority of nodes is placed in, there are failures for which majority election and automatic failover cannot complete: either the strong-sync node cannot be promoted because no majority can be formed, or the loss of the data center holding the majority requires manual intervention. Therefore, for scenarios with stringent availability requirements, a 7 × 24 high-availability deployment such as "two regions and three centers" is generally recommended.

IV. Summary

Finally, let's sum up today's sharing:

1. First, for cross-city disaster recovery, it is generally recommended to build an independent cluster in the remote city and keep it in sync through asynchronous replication. The main city and the backup city can use different deployments, for example one primary and three standbys in the main city and one primary and one standby in the backup city.

2. The best solution currently running in production is three centers in the same city plus a remote disaster recovery center, followed by the financial-industry-standard "two regions and three centers" architecture. Both architectures can easily achieve automatic switchover when a data center fails.

3. If there are only two data centers, making every data-center failure automatically recoverable requires some tradeoffs.

Not only database systems: any highly available system must be designed together with its deployment architecture. That is the main point of this sharing. Thank you.

V. Q&A

Q: How long does a primary/standby switchover take within the same city?

A: Within 30 seconds.

Q: In the "two regions and three centers" setup, is replication from the main city cascaded?

A: That is a very good question. From the main city's perspective it is indeed a cascading relationship: data is first synchronized from the main city's primary to the main city's standby, and then from that standby to the backup city's primary, passed down layer by layer.

Q: Does strong synchronization wait for SQL replay?

A: It does not wait; it only requires that the IO thread has pulled the binlog. Because row-format binlog is idempotent, this has been proven reliable through a large number of cases. In addition, waiting for apply (replay) would increase average latency and reduce throughput. Finally, if a problem occurs during apply, TDSQL's monitoring platform identifies it immediately and alerts the DBA to confirm and handle it.

Q: If the standby only has to store the binlog without replaying it, can it keep up with the primary in performance?

A: On the standby, pulling binlog and replaying binlog are handled by two separate groups of threads, the IO threads and the SQL threads, and the two groups do not interfere with each other. The IO thread is only responsible for downloading binlog from the primary, and the SQL thread only replays the locally pulled binlog. The previous answer meant that the strong-sync mechanism does not wait for replay, not that the standby's binlog is never replayed.
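
Because TDSQL data nodes are MySQL-compatible, the IO-thread/SQL-thread split can be observed with standard MySQL replication status fields. A minimal sketch, assuming a reachable standby, monitoring credentials, and the pymysql driver (host and credentials below are placeholders):

import pymysql

# Connect to a standby data node (hostname/credentials are placeholders).
conn = pymysql.connect(host="standby-host", user="monitor", password="***",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone() or {}

# The IO thread (receives binlog) and the SQL thread (replays it) are reported
# separately, which is exactly the split described in the answer above.
print("IO thread running: ", status.get("Slave_IO_Running"))
print("SQL thread running:", status.get("Slave_SQL_Running"))
print("Replay lag (s):    ", status.get("Seconds_Behind_Master"))
conn.close()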

Q: In the three-centers-in-the-same-city setup, the writable primary node sits in IDC1. Is the access latency for business in IDC2 very large?

A: Data centers in the same city are now connected by optical fiber, and the latency is basically under 1 millisecond, so there is no need to worry about this kind of access time. Of course, if the data center facilities are old, or the network links between the sites are extremely unstable, some disaster recovery capability may have to be sacrificed in pursuit of performance.

Q: With one primary and two standbys, is there a VIP mechanism for the SQL engine to fail over?

A: Of course. Multiple SQL engines are bound behind the load balancer, and the business accesses TDSQL through the VIP. When an SQL engine fails, the load balancer automatically kicks it out.

Q: Doesn't that mean each of the three businesses writes to its own database?

A: No, all three businesses write to the same primary. All SQL engines route writes to the primary of the one-primary-two-standby set. TDSQL requires that only one primary provides write service at any moment; the standbys provide read service only, not write service.

Q: What are the network requirements for same-city multi-replica, multi-SET deployments across IDCs?

A: Latency below 5 milliseconds.

Q: With two strong-sync standbys, can it be configured so that a reply from just one of them is sufficient?

A: That is TDSQL's default strong-sync mechanism: it waits for a reply from any one strong-sync standby.

Q: If the intermediate node in the cascade fails, will the remote node automatically reconnect to the primary?

A: Of course.

Q: What are the advantages of strong synchronous replication over semi-synchronous replication?

A: Compared with semi-synchronous replication, the most intuitive question about strong synchronization is: isn't it just semi-synchronous replication with the timeout set to infinity? In fact it is not. The key point of TDSQL's strong synchronization is not simply requiring a standby response, but ensuring high performance and high reliability once the wait-for-standby mechanism is added. In other words, if you take native semi-synchronous replication without any performance work and only change the timeout to infinity, the resulting performance does not even reach half that of asynchronous replication, which is unacceptable to us: it trades away a large part of performance for data consistency.

TDSQL's strong synchronous replication is optimized and improved on top of native semi-synchronous replication, so that its performance is basically close to asynchronous while still achieving zero data loss with multiple copies; that is the characteristic of TDSQL strong synchronization. In the previous live broadcast we introduced how TDSQL achieves high-performance strong synchronization, for example by making a series of threads asynchronous, introducing the thread pool model, and adding thread scheduling optimizations.

Q: Which arbitration protocol is used?

A: Majority election.

OK, if you have any other questions, you are welcome to discuss them in our technical exchange group. Today's live broadcast ends here. Goodbye, and thank you!

