The Tencent Cloud Database online technology salon on domestic databases is in full swing. Bi Hanbin's June 22 (0622) session has concluded. If you missed it, don't worry: here are the live video and a text review.
Follow the official account "Tencent Cloud Database" and reply "0622 Bi Hanbin" to download the live replay and the shared PPT.
Live Review: https://v.qq.com/x/page/v31023ovs5l.html
TDSQL global flexible deployment practice: full deployment in as little as 9 minutes

Preface
To help developers better understand and learn distributed database technology, in March 2020 Tencent Cloud Database, the Yunjia Community and the Tencent TEG Database Working Group launched a three-month online technology salon on domestic databases, "All the domestic database secrets you want to know are here!" Every Tuesday and Thursday evening, dozens of senior Tencent database experts give in-depth interpretations of the core architecture, technical implementation principles and best practices of Tencent's self-developed TDSQL, CDB/CynosDB and TBase databases. This article is a review of the seventh live session, which shares TDSQL deployment practices.
Hello, everyone. I'm Bi Hanbin, a TDSQL DBA at Tencent Cloud. Today's sharing covers three aspects of TDSQL delivery: the delivery requirements and challenges TDSQL has faced, the automated delivery solution we developed in response, and, most importantly, how our quality assurance system continues to provide users with all-round quality assurance throughout production after delivery.
1. TDSQL delivery requirements and challenges: fast, flexible and secure
First, let's talk about the delivery challenges of TDSQL, again in three aspects. The first challenge comes from the architecture of the TDSQL product itself. As the product has been continuously improved, it now consists of many components, including the database kernel, task scheduling, the cold backup center, platform alarming, performance diagnosis and more; in addition, the dependencies between these components are complex.

1.1 Delivery of complex product components
Let's first divide these components by role. Chitu (the management console), monitoring and collection, OSS, the metadata cluster, Bian Que, online DDL and so on can be grouped into one role, called the management nodes. Looking from the business layer, a request that accesses the database first passes through the load-balancing layer, which balances the load across our SQL engine layer, and the SQL engine layer in turn accesses the underlying DB; an agent is also deployed on every DB machine. Together these form the DB nodes in the left column of the figure. The components in the right column, such as the cold backup center, the message queue and multi-source synchronization, we generally group into the data nodes. The log analysis platform is a separate module, which we class among the other nodes.

The dependencies between these nodes are fairly complex. The management nodes are mainly responsible for metadata, such as the monitoring data collected by the monitoring module and the task data centered on the task distribution system. The DB nodes interact with the management nodes in both directions: every role, not just the DB nodes, reports its monitoring information to the management nodes, and the management nodes issue tasks down to the actual DB nodes, for example when a customer triggers a vertical scale-up, a horizontal scale-out or a master/slave switchover from the console. The data nodes also exchange traffic with the DB nodes; the most common case is backup and rollback of database data. The log analysis platform likewise touches the DB nodes: it analyzes the logs they generate, performs user log analysis and SQL analysis, even provides SQL audit functions to the user, and also reports its own monitoring information to the management nodes.

So a brief look at the dependencies between the various components shows that they are quite complex, and this complexity made delivery difficult. In the early days of TDSQL, our own product team delivered to customers by hand, which demanded a lot of delivery manpower: even when we went on site ourselves, deploying one delivery environment took more than two days.
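To make these roles and flows easier to see at a glance, here is a minimal Python sketch with purely illustrative names; it is a reading aid for the description above, not TDSQL's actual interfaces.

```python
# Minimal sketch of the four TDSQL delivery roles and their interactions.
# All names here are illustrative, not TDSQL's real interfaces.

ROLES = {
    "management": ["chitu", "monitor-collector", "oss", "meta-cluster", "bianque", "online-ddl"],
    "db":         ["mysql-kernel", "db-agent"],
    "data":       ["cold-backup-center", "message-queue", "multi-source-sync"],
    "other":      ["load-balancer", "log-analysis-platform"],
}

# Who sends what to whom, as described in the text above.
FLOWS = [
    ("db",    "management", "monitoring reports"),            # every role reports metrics upward
    ("data",  "management", "monitoring reports"),
    ("other", "management", "monitoring reports"),
    ("management", "db", "tasks: scale-up/out, switchover"),  # tasks issued downward
    ("data",  "db", "backup and rollback traffic"),
    ("other", "db", "log collection for SQL analysis/audit"),
]

def dependencies_of(role: str) -> list[str]:
    """Return the roles this role must be able to reach."""
    return sorted({dst for src, dst, _ in FLOWS if src == role})

for role in ROLES:
    print(f"{role:10s} -> {dependencies_of(role)}")
```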
1.2 Multi-scenario delivery
The second challenge comes from TDSQL's multiple usage scenarios.
These scenarios mainly come from the different parties that use TDSQL: individuals, enterprises and third-party platforms. Each brings different requirements. Take personal use as an example: an individual mostly wants to understand, learn and experience the product, so the barrier to getting started should be as low and as simple as possible. Enterprises have two main scenarios, POC testing and production. POC testing focuses on the performance and functionality of the whole product, including high availability, disaster recovery, adaptation to domestic platforms and so on. Production cares more about the entire delivered cluster: whether it has high availability and disaster recovery capability, and whether consistency is guaranteed. Third-party platform access brings still more challenges, since such projects often involve domestic (localized) environments, which raises compatibility problems as well as requirements for standardized interfaces and integration. So, with these different parties using TDSQL, how can the needs of the different scenarios be met efficiently? The question we faced was whether to maintain multiple delivery branches, one per scenario, or a single branch that adapts to all scenarios. Our answer was a single branch.
1.3 TDSQL delivery quality assurance: security, compliance, multi-level real-time scanning
The third challenge is that the people responsible for TDSQL delivery have changed over time. In the early days, TDSQL was delivered by our product R&D team together with our DBA colleagues, who went on site to deliver to customers. The R&D team and the DBA team worked as one team, and long-term cooperation had formed reliable standards, so delivery quality was guaranteed. As TDSQL grew and the scale of external users kept expanding, the delivery personnel diversified: some deliveries are still done directly by our product R&D team, some by Tencent's delivery team, some by third-party platforms, both inside Tencent and at external customers, that integrate TDSQL into their own platforms, and some by the customers themselves. Different implementers bring different operating habits, and if these are not standardized they can easily introduce hidden risks, mainly in the following aspects.

The first is the environment. We know that the database scenario places high demands on memory, CPU, disk and I/O. We once encountered a case where a customer, in a database scenario, had not disabled a certain setting (inaudible, 9:22), and under high pressure the resulting performance problems eventually created risk in production. And it is not only this kind of environment tuning: the database process opens a great many files, and its maximum open-file count inherits the limit of the system user; settings like this, along with tuning of the kernel's TCP parameters for database scenarios, are all potential risks if left unaddressed.

The second is monitoring, covering the whole cluster, the processes and the machines, plus automatic recovery: machine-level failures happen often, and after a failure the processes should recover quickly, so a complete automatic restart mechanism has to be in place. There are also scheduled tasks, such as regularly cleaning log files and historical data; otherwise the disk fills up, which is very risky in a production environment. And finally, how do we ensure the high availability, disaster tolerance and (inaudible, 10:56) capacity of the entire cluster?

Beyond the implementers, the released versions also need to be controlled. Sometimes we deliver the product ourselves, and sometimes an external customer's platform delivers it; whoever delivers, we must know whether the version being deployed is an old one and whether it carries historical problems and hidden dangers.
How to eliminate the hidden dangers of these potentially old versions and detect their vulnerabilities is another problem our delivery quality system has to solve. Our TDSQL delivery quality service and guarantee is built around the issues above, so that no matter which implementer delivers our TDSQL product, the quality of the TDSQL cluster put into production is guaranteed. This is what we have been working on.
2. TDSQL automated delivery solution: global flexible deployment, real-time inspection, as fast as 9 minutes
We have just covered the challenges encountered in TDSQL delivery. In response to them, TDSQL has distilled a set of automated delivery solutions.
2.1 Automated delivery planning
This is the architecture diagram of the TDSQL automated delivery solution:
We just said that TDSQL achieves automated delivery across multiple scenarios and complex dependencies from a single branch. Strictly speaking, it is built on three branches: the TDSQL kernel package is released per CPU architecture and currently supports x86, ARM and Power. The release package we ship to customers automatically integrates the kernel builds for the different CPU architectures and, on top of that, based on ansible, adds condition checking, operating-system tuning, resolution of environment dependencies, security compliance and compatibility handling. The result is the standard TDSQL private cloud release package, and this one package can adapt to the different customer scenarios and environments just mentioned.
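As an illustration of how a single release package might select the right kernel build for the machine's CPU architecture at install time, here is a minimal Python sketch; the package names and mapping are assumptions, not TDSQL's real layout.

```python
import platform

# Hypothetical mapping from machine architecture to the bundled kernel build;
# the real TDSQL package layout may differ.
KERNEL_BUILDS = {
    "x86_64":  "tdsql-kernel-x86_64.tar.gz",
    "aarch64": "tdsql-kernel-arm64.tar.gz",
    "ppc64le": "tdsql-kernel-power.tar.gz",
}

arch = platform.machine()
build = KERNEL_BUILDS.get(arch)
if build is None:
    raise SystemExit(f"unsupported CPU architecture: {arch}")
print(f"selected kernel build: {build}")
```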
We just divided the components of TDSQL into four roles. To deliver a TDSQL cluster quickly, you only need to figure out one thing: how to put different eggs into different baskets. The eggs are the components, divided into those four roles; the baskets are the machines we prepare, which can be virtual machines or physical machines.

First, the personal experience environment. As mentioned, this environment cares most about a low barrier to entry. Here a single virtual machine is enough: we can deploy the management node, DB node, data node and other nodes all on this one machine. Of course, in an experience environment the data node and other nodes are optional, depending on the machine's configuration.

Next, the test environment, which focuses on performance and function. The management nodes provide metadata management and task distribution, so their performance requirements are not high; what they really need is stability and disaster recovery capability, and in a test environment we can relax even that: one or three virtual machines with a 4C/8G configuration and ordinary disks will do. For the DB nodes, performance does matter: here we recommend a physical machine, and for performance testing it must have an SSD, otherwise the performance numbers have no reference value. This is determined by the nature of the database workload: SSDs and ordinary disks differ most in random read/write capability, and the gap is large. As for the data nodes and other nodes, if a customer does not need to test those functions, they need not be deployed; to experience the full functionality of TDSQL, prepare machines for them as well. For a data node we can choose one or three machines, virtual machines are fine, with larger-capacity disks. The other nodes include load balancing and the log analysis platform; the log analysis platform does SQL audit, DB log analysis and so on. TDSQL is flexible about the load balancer, which sits above the SQL engine layer: the open-source LVS is recommended, and of course many customers use F5. For these nodes we recommend deploying two, to gain disaster recovery capability. Generally speaking, to guarantee test performance, the DB module has the strictest requirements in a test environment.

Finally, the production environment, which we care about most. The management nodes in production can be deployed on three or five virtual machines, but whether three or five, preferably across three data centers, for example in "1-1-1" or "2-2-1" mode, because our metadata cluster guarantees high availability through majority election; with only two data centers the disaster recovery loses its meaning, so we suggest three data centers in production. For DB nodes in production we particularly recommend NVMe SSDs, since traditional SSDs and NVMe SSDs differ substantially in interface performance. As for quantity, the recommended number follows from the evaluated data volume of the production cluster, because TDSQL is a distributed database whose data capacity scales with the number of machines. For example, suppose a customer has 3 TB of data and a single physical machine holds 1 TB; a set consists of one master and two slaves, three nodes, so we need three sets; three sets can hold 3 TB of data with two-replica redundancy, which means nine such machines for the DB nodes, and these three sets form a group shard. (A small sketch of this arithmetic follows below.) Physical machines are also recommended for the data nodes; in production the data nodes must also consider disaster recovery, so we recommend three or more machines rather than one, plus high-performance disks to guarantee the efficiency of backup and rollback. Finally, the access layer on the link is a very important layer, and we strongly recommend physical machines there as well, for stability.
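Here is the capacity arithmetic from the example above as a small Python sketch; the function name and interface are ours for illustration, only the numbers come from the talk.

```python
import math

def db_machines_needed(data_tb: float, per_machine_tb: float, replicas: int = 3) -> tuple[int, int]:
    """Machines needed for the DB layer: each set is one master plus
    (replicas - 1) slaves, together holding a full copy of its shard."""
    sets = math.ceil(data_tb / per_machine_tb)   # shards needed to hold the data
    return sets, sets * replicas                 # (sets, total machines)

# The worked example from the talk: 3 TB of data, 1 TB per physical machine,
# one master + two slaves per set -> 3 sets, 9 machines.
sets, machines = db_machines_needed(data_tb=3, per_machine_tb=1, replicas=3)
print(f"{sets} sets, {machines} machines")       # -> 3 sets, 9 machines
```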
2.2 TDSQL automated delivery features and requirements
We covered the different components of TDSQL earlier, how they divide into levels, and how we manage those levels. In the real delivery process of TDSQL, to guarantee delivery quality, and taking into account the security compliance and high-availability disaster recovery demands of financial-grade scenarios, we have distilled some basic requirements and features:

1. Network: offline deployment with no external network dependence; the machines only need to be interconnected.
2. Storage: supports single disk, multiple disks and RAID.
3. Cold backup center: supports HDFS and mounted distributed storage (such as Ceph).
4. Machine distribution: supports placing servers across racks and data centers, with high-availability disaster recovery under a variety of distribution modes.
5. CPU: under the localization trend, the adapted machine CPUs include not only x86 but also ARM and Power.
6. Operating system: adapted to CentOS, Ubuntu and many other mainstream operating systems, including domestic operating systems.
When we actually deliver TDSQL with this delivery program, there are a few points worth noting. The first is that TDSQL has no external network dependence: many customers, such as finance and securities customers, cannot connect to the external network, and we have resolved this dependence inside the TDSQL release package. All we need is network interconnection between the machines; there are no other requirements on the network side. The second is storage: TDSQL supports reading a single disk on the physical machine, multiple disks, and RAID over multiple disks, reading the RAID path; all of these work. TDSQL supports two types of cold backup center: the first is HDFS, the second is remotely mounted distributed storage, such as a Ceph file system or mounted file storage like NAS and NFS. We recommend distributing TDSQL servers across racks and data centers: we do IDC management for TDSQL, and if our specifications are followed, the master and slave nodes of an instance will be placed across servers and racks. At present TDSQL supports three kinds of CPU: x86, the previous mainstream; ARM, the architecture used by many domestic manufacturers; and Power, currently mainly represented by Inspur. The main operating systems used by customers have all been adapted, such as CentOS, Ubuntu and Red Hat, as well as domestic operating systems. The figure on the right shows a simple distribution relationship. During delivery we plan exactly like this: figure out how to put the eggs into the corresponding baskets, and automated delivery follows. We first select a basket, say a group of physical machines, and put a group of components, say the DB nodes, into that basket; this completes the planning for automated delivery.
2.2.1 Flexible delivery
Of course there are many details here, and what customers care most about is how to deliver the product. What we have to do is planning, and planning is exactly what the customer fills in: freely deciding the distribution of modules across machines and the size of the cluster. TDSQL adapts automatically to the different numbers filled in for each module, producing either a single-node plan or a multi-node high-availability disaster recovery plan, and this process is transparent to the user. For example, TDSQL supports HDFS as its cold backup center. If we choose one node for HDFS, the system will produce a single-node HDFS plan (inaudible, 25:38). If we fill in a three-node plan, it automatically senses that we want a high-availability disaster recovery solution. The mainstream HDFS high-availability schemes at the time were QJM and a shared-storage (NFS) scheme; we currently use the QJM-based solution.
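A minimal sketch of the adaptive behavior just described, assuming a hypothetical planner function; the real decision logic inside the TDSQL installer is not shown here.

```python
def plan_hdfs(node_count: int) -> str:
    """Pick an HDFS layout from the number of machines the user filled in.
    This mirrors the adaptive behavior described above; it is an
    illustration, not TDSQL's actual planner."""
    if node_count == 1:
        return "single-node HDFS (no HA)"
    if node_count >= 3:
        # The talk names QJM (Quorum Journal Manager) as the HA scheme used.
        return "HA HDFS with QJM-based NameNode failover"
    raise ValueError("HDFS cold backup expects 1 node or 3+ nodes")

for n in (1, 3):
    print(n, "->", plan_hdfs(n))
```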
2.2.2 Simple and efficient: the whole deployment takes as little as 9 minutes
As just mentioned, after we finish the deployment plan, deciding which components go onto which machines, the second thing to solve is the relationships between the components, including compatibility issues. For example, if the TDSQL environment is to be deployed on ARM-based domestic servers, it needs an operating system build for the ARM platform. How does one delivered package adapt to these different environments? The secret is in the configuration files:

1. Users do not need to care about the interdependence and configuration management of TDSQL's complex modules; they only need to fill in the variable configuration file according to their actual situation.
2. Users fill in one machine-specification configuration file and one variable configuration file, and the package adapts to the operating system and CPU to achieve one-click automated delivery.
3. The operation is simple enough for users to complete independently, and the automated deployment command can be executed repeatedly (see the sketch below). In an on-site test of TDSQL production delivery at the Beijing Information and Communication Institute, the whole deployment took as little as 9 minutes.
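Point 3 above says the deployment command can be executed repeatedly. One common way to make re-runs safe is to record completed steps and skip them next time; the sketch below illustrates that idea with hypothetical step names and a local state file, and is not TDSQL's actual implementation.

```python
import json
import pathlib

STATE = pathlib.Path("deploy_state.json")   # hypothetical marker file

def load_state() -> dict:
    return json.loads(STATE.read_text()) if STATE.exists() else {}

def run_step(name: str, action, state: dict) -> None:
    """Run a deployment step only if it has not succeeded before,
    so the whole deploy command can be re-executed safely."""
    if state.get(name) == "done":
        print(f"skip {name} (already done)")
        return
    action()
    state[name] = "done"
    STATE.write_text(json.dumps(state))

state = load_state()
run_step("precheck",       lambda: print("checking machines..."), state)
run_step("install_kernel", lambda: print("installing kernel..."), state)
run_step("start_services", lambda: print("starting services..."), state)
```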
2.2.3 Adaptation and integration: localization, full-stack
Customers fill in our configuration files, behind which a number of adaptations have already been made. First, the kernel packages: the TDSQL release carries kernel packages for the different CPU architectures, and the delivery logic is compatible with each operating system and CPU. Customers do not need to care about the dependencies and configuration relationships between TDSQL's complex modules; they just fill in the variable configuration file according to their actual situation and then execute the launch command we deliver, and delivery proceeds automatically with one click. The whole delivery process is very simple. We tested the entire automated delivery process of TDSQL before, during a production delivery test with an organization in Beijing: building the core TDSQL delivery scenario took only 9 minutes. At this point our core delivery process has been fully introduced, and it really is simple. First, put different eggs into different baskets according to your own needs, placing components of different roles onto the machines you have prepared, by filling in the planning configuration file. Second, fill in the variable file for the environment, operating system and CPU, which helps us adaptively adjust the delivery logic to the current environment. Third, execute the delivery command, a one-click step.
As just mentioned, TDSQL has also done a lot of work on localization, which has become a trend. From the underlying servers to memory, operating systems, CPUs, industry software and database software, under the guidance of the relevant departments we have engaged and cooperated with various manufacturers to achieve all-round localization adaptation from the bottom of the stack to the top. In the localization wave, TDSQL, as Tencent's self-developed distributed database and an excellent domestic database, is duty-bound to take on this responsibility. On CPUs and operating systems, beyond the ones already mentioned: CentOS, Ubuntu and SUSE are the common mainstream operating systems; tlinux is Tencent's internal operating system; and NeoKylin, Galaxy Kylin and UOS are the common mainstream domestic operating systems. TDSQL has adapted to all of them. In addition to these CPUs and operating systems, TDSQL has completed compatibility adaptation with all domestic chips and a full range of domestic servers. While completing this adaptation work, Tencent also provides corresponding technical services to help industry users migrate to the domestic basic technology ecosystem. As mentioned, many server and CPU hardware manufacturers are doing localization; with Inspur, for example, we have done testing and certification and obtained their certification. Beyond that, we are carrying out localization work in parallel in many other customer projects, mostly related to government and state-owned enterprises, and have achieved solid results. That is our work on localization.

In terms of the technical service ecosystem, TDSQL is not only an independently released product: in the course of its development it has also been adopted by many other platform vendors and partners, including Tencent's internal TCE, TStack and MDB platforms. TCE is Tencent's financial-grade cloud platform, and TDSQL is deeply integrated with it across dimensions such as deployment solutions, alarming and user permissions, providing a full range of PaaS basic technical services for financial and government institutions and ensuring financial-grade stability and high availability while they complete the transformation and upgrade to a high-performance distributed architecture. TStack and MDB are other internal platforms of ours. Besides our internal platforms, there are many customers' own platforms: in addition to customers using TDSQL for their own business, many customer partners build industry solutions and integrate TDSQL into them, feeding TDSQL's capabilities into their own platforms.
2.2.4 Security: second-level monitoring
In its development TDSQL has made many optimizations for the delivery scenario:

1. Condition checking: all machines under the planned TDSQL cluster are automatically pre-checked, including machine time synchronization, time zone consistency, port occupancy, the system default shell, machine specifications and so on.
2. Environment optimization: for the relational database scenario, about 50 system parameters are tuned, and basic dependencies are resolved.
3. Second-level machine monitoring: most monitoring platforms work at minute granularity, which is not enough for sensitive scenarios such as financial databases.

To expand on these: first, we pre-check all machines in the planned TDSQL cluster, covering common items such as time synchronization, time zone, port occupancy, the default shell and machine specifications (a minimal sketch of such a precheck follows below). Second, we optimize the environment: as mentioned earlier, for relational database scenarios we tune operating-system kernel parameters, for example TCP parameters and memory parameters, and resolve some technical dependencies. Third, we add second-level monitoring. Most customers' own monitoring platforms, including the monitoring center we provide to customers, work at minute granularity, but the database scenario is special: many problems simply cannot be seen at minute granularity; the scene of the problem is lost and the problem never gets exposed. So for this scenario we provide second-level machine monitoring across several dimensions, including IO, CPU, network and memory.
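As referenced above, here is a minimal sketch of what such a machine precheck could look like; the specific checks, thresholds and the example port are illustrative assumptions, not TDSQL's actual checklist.

```python
import shutil
import socket
import time

def check_time_sync(reference_epoch: float, tolerance_s: float = 2.0) -> bool:
    """Compare the local clock against a reference timestamp (e.g. from the
    deploy host); real deployments would verify NTP instead."""
    return abs(time.time() - reference_epoch) <= tolerance_s

def check_port_free(port: int) -> bool:
    """A port is free if nothing is listening on it locally."""
    with socket.socket() as s:
        return s.connect_ex(("127.0.0.1", port)) != 0

def check_disk(path: str = "/", min_free_gb: int = 50) -> bool:
    return shutil.disk_usage(path).free >= min_free_gb * 2**30

checks = {
    "time_sync": check_time_sync(time.time()),
    "port_4002_free": check_port_free(4002),   # 4002 is just an example port
    "disk_space": check_disk(),
}
failed = [name for name, ok in checks.items() if not ok]
print("precheck passed" if not failed else f"precheck failed: {failed}")
```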
2.3 Automated delivery under multiple clusters
So far we have discussed TDSQL's delivery scenarios and details under a single cluster; we also introduced TDSQL's multi-cluster solutions in the architecture session. Next, let's look at how delivery is done under multiple clusters.
Next, the multi-center deployment systems. The architecture of "three centers in the same city", as the name implies: there are three data centers A, B and C in one city, and TDSQL keeps its one-master-two-slaves structure, so naturally we deploy the three data nodes across the three data centers, with the master node in one and the two slave nodes in the other two. Under the "dual centers in the same city" architecture we have two clusters: one cluster is delivered in Shekou and another in Guanlan, with asynchronous replication between the two. In the "three centers in the same city" deployment we are instead inside one large cluster: the database instance replicates asynchronously within an IDC and strongly synchronously across IDCs, and there can additionally be an instance in Shanghai, with DCN replication between the instances, to achieve financial-grade high-availability disaster recovery. The architecture of "two places and three centers", as its name implies: one city has two data centers A and B, and another city has data center C. In the first city, the TDSQL database instance replicates asynchronously within an IDC and strongly synchronously across IDCs: we deploy four data nodes across the two data centers there, with the master node and one slave in one data center and the other two slaves in the other. Between the database instances of the first and second cities, asynchronous replication is used to guarantee financial-grade, city-level disaster recovery.
"two places and four centers" deployment system
The last one is the "two places and four centers" architecture, a strong-synchronization architecture with automatic switchover. Again there are two instances. The first instance is in Shenzhen, spread across three IDCs, for example Futian, Shekou and Guanlan; across these three IDCs we use strong synchronization. The second instance is in Shanghai, and synchronization between the two instances is again done via DCN. Any single data center failure can be switched over within 30 seconds with zero data loss, giving stable, reliable performance and offering businesses and users higher availability at lower cost.
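To make the placement rules concrete, here is a small Python sketch that validates a same-city three-center placement (one master and two slaves, each in a different IDC); it is an illustration of the rule stated above, not a TDSQL tool.

```python
from collections import Counter

def validate_three_idc(placement: dict[str, str]) -> None:
    """placement maps node name -> data center (IDC).
    For same-city three-center mode: one master and two slaves,
    each placed in a different IDC."""
    idcs = Counter(placement.values())
    assert len(placement) == 3, "expect one master and two slaves"
    assert len(idcs) == 3, "the three nodes must span three IDCs"
    assert all(count == 1 for count in idcs.values())
    print("placement ok:", placement)

validate_three_idc({"master": "IDC-A", "slave1": "IDC-B", "slave2": "IDC-C"})
```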
3. TDSQL quality assurance service: automated inspection of the whole production process
We have just covered TDSQL's delivery scenarios, delivery requirements, and the considerations around localization and compatibility. But the most important part comes last: how to guarantee the quality of TDSQL, not only delivery quality but also service quality. I will introduce this in the final chapter.
First of all, the delivery quality of TDSQL is guaranteed by a program called automated inspection. The TDSQL automated inspection program controls and guarantees delivery quality along three dimensions.
1. Monitoring index analysis
The first dimension relies on TDSQL's existing monitoring center to analyze the relevant indicators from our existing monitoring system. This indicator analysis is itself split into two parts: analysis of the current moment and analysis of historical moments. What does that mean? When we try to verify whether a TDSQL cluster has problems, we naturally analyze whether there are anomalies at this moment: whether there are alarms, whether some resources are overloaded, and so on. But it is often also necessary to analyze historical problems, such as the curve of each indicator over the past seven days. Why analyze the indicator curves of the past seven days? Take a simple scenario: suppose the business peak runs from 3 p.m. to 5 p.m., and during this peak there may be many slow business queries, even performance problems caused by them. How does the system catch problems at some point in history? If we launch the automated inspection at, say, 8 a.m., that happens to coincide with the business trough, and nothing looks wrong at that moment; that is exactly why we must analyze the historical indicators.

Concretely, let's look at which indicators we analyze and from which dimensions. We check connectivity from the console; we confirm whether alarms have been correctly delivered to the customer; and we look at the replication mode of each instance. TDSQL can replicate in several modes, including strong synchronization, asynchronous, and asynchronous within an IDC combined with strong synchronization across IDCs, and there are many options within these, for example the degrade option for strong synchronization: when strong synchronization degrades, that is a potential risk, and we need to surface it. There are also instance switch-exempt nodes: when a master/slave switchover occurs, a switch-exempt node is generated, and if one exists we know a switchover happened in the past, which may block subsequent automatic switchovers and affect the high availability of the whole cluster. Slow queries are a common cause of many performance problems; we also look at network problems such as slave replication delay, HDFS utilization, and alarm policy comparison.

Monitoring itself has two sides. The first is the collection and reporting of monitoring indicators, which is the responsibility of our monitoring center. The second is analyzing the monitoring data: deciding, under some policy, which monitoring data counts as abnormal and deserves an alarm. TDSQL maintains a set of alarm templates for private clouds built from experience, and we also give customers configurable, customizable options so they can modify alarm policies to fit their actual situation.
At the same time, we provide alarm-policy comparison based on practical experience, to prevent users from making unreasonable changes and to expose the potential risks in an alarm policy. To guard against customer misoperation or unreasonable modification, we compare the alarm policies, expose changes that are obviously or extremely unreasonable, and prompt the customer when an alarm policy has been changed, telling them that the policy as changed is risky.
In addition, within this dimension we monitor TDSQL's synchronization: DCN synchronization and multi-source synchronization, the stability and performance of their current replication, as well as the other monitored alarm indicators of each module. That is the first dimension, analysis from the perspective of monitoring data.
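As a toy illustration of the historical-indicator analysis described above (catching a problem at the business peak even when the inspection runs at the trough), here is a sketch that flags hours deviating strongly from a seven-day baseline; the metric, threshold and method are simplified assumptions, not TDSQL's actual analysis.

```python
import statistics

def anomalous_hours(hourly_values: list[float], z_threshold: float = 3.0) -> list[int]:
    """Flag hours whose value deviates strongly from the series mean.
    'hourly_values' would be one metric (e.g. slow-query count) sampled
    hourly over the past seven days; the real inspection logic is richer."""
    mean = statistics.fmean(hourly_values)
    stdev = statistics.pstdev(hourly_values) or 1.0
    return [i for i, v in enumerate(hourly_values)
            if abs(v - mean) / stdev >= z_threshold]

# A quiet week with a slow-query spike during one afternoon business peak:
week = [5.0] * (7 * 24)
week[3 * 24 + 15] = 400.0        # day 4, 15:00
print(anomalous_hours(week))     # -> [87]
```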
two。 Cluster environment
The second dimension complements the first. Instead of going through collected monitoring data, we access the servers directly, starting at the machine level: we check machine-level IO, CPU, memory, disk and stability. Stability matters because some servers are old, for example machines that have been running for five years; we need to inform customers that such machines may be risky, and some machines show frequent restarts; from all this information we tell customers when the stability of a server itself is questionable. At the process level, we look at the state of the processes themselves. A component generally consists of daemon processes and worker processes: are the worker processes normal, are the daemons normal, and are the ports opened by the current process reachable? Beyond the processes themselves, we also examine the configuration files of key processes, because many configuration files bear directly on the availability of the entire TDSQL cluster.
We scan these key processes and configurations to prevent manual miscorrection or accidental deletion and modification of key configuration by customers (a minimal sketch of such a configuration scan follows below). Besides the machine level and the process level, we also do customized scanning at the instance level, which is embodied in the instance health-check module. We shared the Bian Que tool in an earlier course: the instance health check is an interface of the TDSQL intelligent diagnosis and analysis platform "Bian Que", which systematically analyzes an instance across indicators of operation, development, performance and so on. The fourth level, going from low to high, is the cluster level. At the cluster dimension we check whether the machines in the cluster have synchronized time, since TDSQL requires time synchronization across all machines; whether the metadata cluster under the instances has backups and whether those backups are normal; and we can manually trigger a backup of the metadata cluster at that moment. In this way the second dimension supplements the first, monitoring-based dimension with scans across these four levels.
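As referenced above, here is a minimal sketch of a key-configuration scan that detects drift from the state recorded at delivery time; the baseline storage and paths are illustrative assumptions.

```python
import hashlib
import pathlib

def file_digest(path: str) -> str:
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def scan_configs(baseline: dict[str, str]) -> list[str]:
    """Compare key config files against digests recorded at delivery time;
    a mismatch or a missing file suggests a manual modification."""
    drifted = []
    for path, expected in baseline.items():
        p = pathlib.Path(path)
        if not p.exists() or file_digest(path) != expected:
            drifted.append(path)
    return drifted

# Record a baseline at delivery time (example file; a real scan would cover
# the key configuration files of each TDSQL process).
paths = [p for p in ("/etc/hosts",) if pathlib.Path(p).exists()]
baseline = {p: file_digest(p) for p in paths}
print("drifted configs:", scan_configs(baseline))   # -> []
```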
3. Automated drill
Even when scanning across all these dimensions finds no problem with the current cluster, we still have to verify from the results side. TDSQL performs a P0-level (highest-priority) automated drill on the entire cluster, and the drill scenarios are the scenarios of normal operation and management. They include purchasing an instance, creating users, granting user permissions, creating databases and tables, and making table-structure changes on them. On this instance we do horizontal scale-out and vertical scale-up, expanding onto different machines; we rebuild slave replicas, simulating the scenarios of redoing a standby; we verify that slow queries land in storage and can be seen on the system's analysis page; and we test backup and rollback: we trigger a manual backup of the instance, verify that this backup can be rolled back to the point we backed up, and confirm that across the whole backup and rollback process the data stays consistent. Finally the system deletes the purchased instance, closing the loop: a closed-loop automated drill over the P0-level scenarios. (A minimal sketch of such a drill runner follows below.)
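As referenced above, a minimal sketch of such a closed-loop drill runner, with the P0 scenarios as placeholder steps; the real drill performs actual operations and data-consistency checks.

```python
import traceback
from typing import Callable

def run_drill(steps: list[tuple[str, Callable[[], None]]]) -> bool:
    """Run the P0 drill scenarios in order; stop and report on the first failure.
    The step bodies below are stand-ins for real operations."""
    for name, step in steps:
        try:
            step()
            print(f"[ok]   {name}")
        except Exception:
            print(f"[fail] {name}")
            traceback.print_exc()
            return False
    return True

drill = [
    ("purchase instance",                                   lambda: None),
    ("create user and grant permissions",                   lambda: None),
    ("create tables and run DDL changes",                   lambda: None),
    ("horizontal and vertical scaling",                     lambda: None),
    ("backup, then rollback and verify data consistency",   lambda: None),
    ("delete instance (close the loop)",                    lambda: None),
]
print("drill passed:", run_drill(drill))
```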
To sum up, the TDSQL automated inspection solution works from these three dimensions: indicator analysis, the supplementary scanning of the entire cluster environment, and the automated drill. Together they ensure that the delivered cluster is secure, stable, reliable and highly available, and a quality report is generated for our customers and for the TDSQL product development team to reference.
Beyond the technical guarantee schemes for TDSQL quality assurance, we have also productized this work and accumulated a lot of production materials to help users adopt the distributed database quickly and conveniently.
For example, a customer goes from 0 to 1 when the system is first delivered, and from 1 to more after delivery, in the process of operation and use. Delivery and operation raise many questions: how to deliver? We just covered the characteristics and concepts of delivery. How to operate? We also produce operational documents. First, most of the delivery and operation guidance lives in our TDSQL product documentation, which also covers inspection (the automated inspection scheme just described) and fault handling: what to do when an alarm or failure is encountered, how to interpret the fault, instructions for console operations, interpretation of alarm anomalies, daily changes and capacity expansion, and so on. If a customer wants to do POC testing and adapt certain scenarios, development questions on the business side have to be considered, so we have development guidelines that present TDSQL best practices. For standardized testing we publish our POC test cases, providing performance cases, functional cases and high-availability disaster recovery cases. We also maintain customer information on a regular basis: we run regular cluster inspections for the customer, and through these inspections we can confirm that the customer's environment has had no problems, both currently and over the recent history. As mentioned, inspection mainly exercises functionality and disaster tolerance. Through the automated regular inspections we collect the customer's environment and version information and update it into our customer management system, where it drives private cloud version pushes: the customer's current version is automatically scanned inside our management system, and if we find a version the customer is advised to upgrade from, we automatically push a notice to the customer representative, who then guides the customer through the upgrade.
Finally, in customers' daily operations and daily changes, the most common questions are how to scale, how to upgrade, and how to handle alarms. For scaling, TDSQL provides an automated expansion scheme for every node type, expandable with one click. Likewise, upgrades are a one-click console operation, supporting both point-to-point upgrades and batch upgrades of the whole cluster through the console upgrade tool.
On availability, TDSQL provides an automated alarm-handling solution. The availability of TDSQL rests on its own flexible architecture and disaster recovery capability, on its strong data consistency, and on our monitoring system. Alarms will inevitably occur, and whether and how they are handled in time directly affects the availability of the TDSQL cluster. We have explored this a great deal: we need to balance the workload of actual alarm handling and alarm interpretation while helping customers guarantee the quality of the entire cluster. So we propose automated alarm analysis, which handles some alarms automatically and reduces the customer's live-network operation workload.

To wrap up: taking delivery as the core, we introduced the delivery challenges TDSQL has met over its history; the automated delivery solution we proposed in response, its characteristics, how we complete a delivery, the features available during delivery, and its compatibility scenarios; and finally the series of mechanisms and capabilities we provide to improve the quality of TDSQL's standardized delivery and customer service. For more details about TDSQL, follow the TDSQL database official account, where we regularly push articles to share with you.
Q&A
Q: Does TDSQL support offline database backup?
A: TDSQL supports multiple backup methods; we can do physical backups (inaudible, 56:22) or logical backups. The backup medium is HDFS or mounted storage. The whole backup process runs on the slave, so backups affect neither normal business access nor its performance.
Q: How can TDSQL's alarm information be connected to SMS, voice and email alarm platforms?
A: TDSQL's alarm integration is quite flexible. First, a TDSQL alarm message is plain text, which can be sent to any platform, and our customers have integrated many kinds of alarm receivers, for example alarm platforms with HTTP interfaces as well as other interfaces. Following our integration guide, we send the alarm information to whatever interface the customer wants, for example over HTTP: TDSQL sends an HTTP packet containing the alarm information to your alarm-receiving platform. As for the alarm media, SMS, voice and email are ultimately determined by each customer's own alarm platform. For example, if a customer already has a WeChat alarm-receiving platform, TDSQL connects to that platform; for different receiving platforms, TDSQL sends the corresponding alarm messages for voice, SMS and email.
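As a toy example of the HTTP integration described in this answer, here is a sketch that posts a TDSQL alarm text as JSON to a customer webhook; the URL and payload shape are assumptions, and the real format follows the TDSQL integration guide.

```python
import json
import urllib.request

def send_alarm(webhook_url: str, message: str) -> int:
    """POST the plain-text TDSQL alarm as a small JSON packet to the
    customer's HTTP alarm-receiving platform. The payload shape here
    is an example, not the documented TDSQL format."""
    body = json.dumps({"source": "TDSQL", "alarm": message}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=body,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

# Example usage (hypothetical endpoint):
# send_alarm("http://alarm.example.com/receive", "set_1 slave delay > 60s")
```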