2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
The Tencent Cloud Database online technology salon on domestic databases is in full swing. Hao Zhigang's session concluded on March 26. If you did not have time to attend, don't worry: the live video and a text recap are provided here.
Follow the official "Tencent Cloud Database" WeChat account and reply "0326 Hao Zhigang" to download the live broadcast replay and the shared slides.
Instantly finding the problem among 100 million SQL statements: the TDSQL automated operations system (Tencent Video)
Preface
The "Red Rabbit" platform is one of the products and services provided with TDSQL. From the administrator's perspective it offers all of TDSQL's operation and maintenance functions and displays hundreds of database status monitoring indicators, so that more than 90% of a database administrator's daily operations can be completed through the interface, making it easier to locate and troubleshoot problems.
The "Bian Que" system is a product TDSQL launched for the cloud market that automatically analyzes database performance and failures and provides optimizations and solutions for users. It offers a collection of intelligent tools covering data acquisition, real-time detection, automatic processing, performance detection and health evaluation, SQL performance analysis, business diagnosis, and more.
In daily database operation we often face anomalies such as rapidly growing storage capacity and rising performance demands. For SQL problems that cause database anomalies, Bian Que's one-click diagnostic analysis helps users run intelligent detection and analysis, quickly locate the problem, and receive optimization suggestions at the same time. With Bian Que's help, DBAs are freed from complicated day-to-day database operation and maintenance work. When demand for Tencent Meeting skyrocketed and database performance problems emerged, Bian Que helped DBAs quickly locate the problem SQL among 100 million SQL statements and provided optimization suggestions, nipping database performance problems in the bud. The system shows that after optimization, 99% of the problem SQL no longer hit performance bottlenecks.
Through iterative evolution, Bian Que has distilled the massive knowledge base of operation and maintenance rules accumulated in Tencent's database practice, helping DBAs quickly detect common daily anomalies. Combined with Tencent Cloud's massive data and machine learning capabilities, Bian Que can proactively analyze and detect unknown problems and, as far as possible, warn customers about most anomalies before they occur, minimizing risk and improving the database's ability to serve a variety of business scenarios continuously and reliably.
Together, "Red Rabbit" and "Bian Que" can satisfy the fine-grained operation and maintenance needs of high-profile businesses while easily handling a large volume of ordinary database operation and maintenance demands, helping users reduce operating costs.
Hello, everyone. I'm Hao Zhigang, a senior DBA at Tencent Cloud TDSQL. Today's topic is the TDSQL automated operations system. Those who attended our previous series of courses should already have a fairly detailed understanding of TDSQL's architecture and principles, but back in real work we still have to immerse ourselves in solving all kinds of operational problems. So today I will introduce the TDSQL automated operations system and what it can bring to DBAs and operations staff. In fact, it can make a DBA's work more relaxed, rather than the hard, tiring job people imagine. I hope it will be helpful to everyone.
My speech is divided into four chapters:
The first chapter covers the TDSQL automated operations system, including its architecture and supported features. The second and third chapters address the two major categories of DBA work, daily tasks and fault-analysis tasks, and use typical cases to show how the system helps people handle them efficiently, conveniently, and automatically. The fourth chapter gives a brief summary.
The TDSQL automated operations system
The first chapter has three parts. First, it reviews the daily work of a DBA. Second, it introduces the TDSQL automated operations platform "Red Rabbit". Third, it explains how the database backend achieves automated operation, introducing the principles behind it to help you understand how TDSQL automates its operations.
1.1 The daily work of a DBA is complicated and tedious; Red Rabbit handles it in one click
This picture may make you uneasy: it shows the variety of problems a DBA deals with daily. For example, a business is going online and an instance needs to be created; a machine needs to expand its capacity and permissions must be applied for urgently; an accidental mistake requires a quick rollback; or a table structure needs to be changed. A DBA has to handle a large number of such problems every day, and two characteristics stand out. The first is that each DBA has their own logic for handling these tasks: one DDL may be run one way while another DDL needs a different tool. This is neither standard nor stable; the method that works today may not work tomorrow.
Second, these tasks require many background operations, which is very frequent for a DBA. We may be used to this kind of operation, but the risk it carries is high. Do we feel nervous when typing the next command, wondering what the consequences will be? In the past there was no unified platform to guarantee that these operations were safe.
So TDSQL is committed to solving these two difficulties: first, standardizing the process; second, improving efficiency. Instead of operating in the background, a DBA can complete most tasks with a few clicks in the foreground.
A DBA's daily work falls into two categories: daily tasks and fault tasks. Daily tasks are the usual business requirements, which a DBA can handle with scripts and automation tools; fault tasks arise when a problem occurs, such as "the database cannot be connected" or "the database is very slow", and we need to find out why.
Let's briefly look at daily tasks. Besides the instance creation, online DDL changes, and instance decommissioning mentioned just now, there are operations that, while not common, will be encountered over a long period of running a database. They include taking a business offline, wiping the database and storing other data; parameter adjustment and tuning, for example realizing a timeout is set improperly and telling the business side to change it; and capacity expansion, which is unavoidable when the volume of business data or requests grows beyond what an instance can hold. There are also read-write separation, redoing a standby, backups, manual switchover, and online SQL.
Fault tasks are about problem diagnosis: the business says the DB cannot be connected, so please help find out why. There is also monitoring and early warning: if there is an anomaly in the DB, we must detect it automatically, otherwise we will be very passive. System analysis covers cases where the database runs slowly without affecting availability, to see whether there is room for optimization. Automatic disaster-recovery switchover guarantees high availability; data rollback protects against accidental deletion; and routine inspection is targeted at finding potential problems.
The outline above introduces some of these tasks, but they do not fully represent the features the TDSQL automated operations platform supports; they are just some of the more important examples. Back to the point: how does TDSQL automate these things? TDSQL presents these functions on the database operations platform "Red Rabbit". All users, including DBAs, can complete all of the above operations on Red Rabbit, fully automated. Red Rabbit communicates with TDSQL by sending each process to the TDSQL cluster in the form of a task; the cluster automatically completes the task and reports back to Red Rabbit, where the user can see the result.
In short, the Red Rabbit platform encapsulates every daily task and fault-analysis task into an automated process, connecting users' needs with TDSQL's background operations, so that people can use the platform to solve everyday problems efficiently and safely.
1.2 TDSQL Red Rabbit: the automatic processing principle
Next, let's see how automated processing is completed in the TDSQL backend, including the processing flow and the core modules.
If you have heard the previous talks in this series, you should recognize this picture, but today's diagram differs slightly from the earlier architecture diagrams. Let's go through it in turn:
The user's request is sent from the Red Rabbit platform to the backend OSS in the form of a task (OSS is TDSQL's external interface). OSS dispatches the request and writes some tasks into MetaCluster. Once a task is written into MetaCluster, two modules on the left, scheduler and onlineddl, watch their respective task nodes and handle new tasks.
OSS sends problem-analysis tasks directly to Bian Que, the intelligent analysis platform within the TDSQL automated operations system. Bian Que is responsible for intelligent problem analysis. It uses the data collected by monitoring, such as DB status and active/standby switchover data, and will also visit the data nodes directly to collect whatever additional data it believes is needed to analyze an instance. Bian Que is a new module in the TDSQL framework. Another module is onlineddl, which handles DDL requests independently. Apart from these two request types, the remaining tasks are basically handled by the scheduler module.
There is a line in the lower right corner for the Agent: each node is monitored, and tasks on it completed, by the Agent process module. The scheduler cannot contact nodes directly, so the Agent assists in completing the processes just mentioned. It obtains the information it needs through MetaCluster and, after processing, responds back to scheduler and the other modules.
So this diagram has three core modules handling the daily processes: Bian Que, the problem-analysis module; onlineddl, which handles DDL requests; and scheduler, which handles the remaining roles and tasks behind each TDSQL module. That concludes the introduction to the TDSQL automated operations flow and framework.
Automation of TDSQL daily tasks
Let's move on to the second chapter: automation of TDSQL daily tasks.
As mentioned, we divide DBA tasks into two categories: daily tasks and fault tasks. Let's start with the automation of daily tasks. This chapter selects two very common scenarios and shares how TDSQL solves them.
The first is the redo-DB-node function, the second is the online DDL function, and the third is how TDSQL guarantees safety in these automated processes.
2.1 The redo-DB-node function
The first section covers redoing a DB node. You might wonder why a DB node needs to be redone. The scenario is quite common: it does not happen every day, but it happens from time to time, and it matters. For example, if a machine fails, repairs may take some time, and by the time it rejoins the cluster its data may be lost, or too stale to catch up, so the data node must be redone. Likewise, if a stalled node cannot be recovered, we redo the data node to restore its function.
How do you do this? We can see a picture like this:
First, the user initiates the redo process on Red Rabbit, which sends the task to the TDSQL cluster. The cluster executes the task in four steps:
1. Clean up leftovers: since some data may remain on the machine, it is deleted first, with a rate limit. 2. Pull an image, logical or physical, to the node to serve as the data baseline. 3. Load the image. 4. Restore synchronization.
This is probably the same process everyone follows in daily operations, so it matches existing habits. The difference is that what used to be done by hand can now be triggered on Red Rabbit with one click.
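The four steps above, plus the primary-node protection described below, can be sketched as a simple pipeline. This is an illustrative sketch only; the function and step names are assumptions, not TDSQL's actual API.

```python
# Hypothetical sketch of the one-click redo pipeline: clean up with a rate
# limit, pull an image as the data baseline, load it, restore replication.
# Redoing the primary directly is refused (primary-node protection).

def redo_db_node(node_role, steps_log=None):
    """Run the redo pipeline on a standby node; refuse to run on the primary."""
    log = steps_log if steps_log is not None else []
    if node_role == "primary":
        raise ValueError("primary node protection: cannot redo the primary directly")
    for step in ("rate_limited_cleanup",   # 1. delete leftover data with a speed limit
                 "pull_image",             # 2. pull a logical/physical image as the baseline
                 "load_image",             # 3. load the image onto the node
                 "restore_replication"):   # 4. re-establish synchronization
        log.append(step)                   # in reality each step reports progress to Red Rabbit
    return log
```

In the real platform each step would report status back to Red Rabbit so the user sees the result in the console.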
The whole process has been optimized very well:
Red Rabbit provides primary-node protection: in a one-primary-two-standby architecture, to prevent mistakes, the system forbids redoing the primary node directly. It also shows real-time node information, such as whether a node is in a failed state and how large its replication delay is. With this, we can judge in real time whether the right node has been selected.
Now look at the steps to reinstall the DB. The first step is rate-limited deletion: the data is often very large, hundreds of gigabytes or even terabytes, so the deletion speed must be controlled. Deleting too fast drives up IO, and on a multi-tenant machine that may affect the normal operation of other instances, hence the rate limit. After deletion the DB process is reinstalled, and during installation the DB parameters are pulled automatically. Many parameters may have been changed while the instance was running, and after reinstallation they must stay consistent with the originals, so the installation pulls them automatically. Moreover, the parameter list is sometimes very long, a dozen or more entries, so manual operation is error-prone and laborious.
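The arithmetic behind rate-limited deletion can be made concrete. The rate figure below is an assumption for illustration, not a TDSQL default:

```python
import math

def delete_duration_seconds(total_gb, rate_limit_mb_per_s):
    """Rough time to delete `total_gb` of data when deletion is capped at a
    fixed rate so IO on a multi-tenant machine is not saturated."""
    total_mb = total_gb * 1024
    return math.ceil(total_mb / rate_limit_mb_per_s)

# e.g. deleting 100 GB at a hypothetical 100 MB/s cap takes about 17 minutes,
# a deliberate trade of speed for stable IO on co-located instances.
```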
The image-pulling step is the most time-consuming and important, and three optimizations are made here. The first is selecting the optimal data source: with one primary and several standbys, each standby has some delay, so we choose the one with the least delay, whose data is the most up-to-date, and we prefer a standby over the primary because pulling from the primary could affect business reads and writes. The second concerns the pull process itself: many pulls cannot run at the same time, partly because NIC traffic would be saturated and partly because the heavy data writes create a large IO load, so pulls are made mutually exclusive to minimize the impact. The third is compression acceleration, applied at the data source: the system compresses the image before transmitting it to the target node. This reduces network pressure, with a compression ratio of roughly 1/3 to 1/4, and it also speeds transfer because less is sent: at a 1/4 ratio, transmitting 100 megabytes over the network delivers 400 megabytes of data, which helps efficiency considerably.
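The first and third optimizations can be sketched as follows. The replica record fields and numbers are illustrative assumptions:

```python
def pick_data_source(replicas):
    """Pick the replica with the smallest replication delay as the image
    source, preferring standbys over the primary so that pulling the image
    does not affect business reads and writes on the primary."""
    standbys = [r for r in replicas if r["role"] != "primary"]
    candidates = standbys or replicas  # fall back to the primary only if needed
    return min(candidates, key=lambda r: r["delay_s"])["name"]

def transfer_cost_mb(raw_mb, compression_ratio):
    """Megabytes actually sent over the network when the image is compressed
    at the source: at a 1/4 ratio, 400 MB of data travels as 100 MB."""
    return raw_mb * compression_ratio
```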
The final step is to establish synchronization, which mainly means confirming the synchronization point and restoring replication; TDSQL does this automatically.
As you can see, the user only needs to click "initiate redo" on Red Rabbit and the whole process completes automatically, without further intervention.
Let's look at an example of redoing a DB node.
The figure above shows a record: there are three DB nodes, and the redo entry is in the lower right corner. Clicking the "redo standby" button enters the process. The page displays the status of the two standby nodes in real time; we can see that the first standby's delay is very large, and that is the abnormal node we need to redo. This display prevents you from operating on the wrong node. Some time after submission you are told "redo successful", and the whole process is over.
2.2 online DDL
Let's take a look at the online DDL function.
Running DDL is a very common operation, especially when the business changes frequently and the table structure changes with it. Why bring up online DDL? For an ordinary small table you can run the DDL directly, but what about a table with a large amount of data, tens or even hundreds of gigabytes? Changing its structure directly is likely to affect business requests, so we do it as an online DDL.
On Red Rabbit, online DDL is also very simple: we submit the request on Red Rabbit, which passes it to the TDSQL module for execution. There are two steps, familiar to anyone who knows databases well: the first copies the data and keeps the new table synchronized, and the second is the cutover, switching the old and new tables.
What does TDSQL do in this process? Red Rabbit lets you customize the start time of the DDL. Why? Although the DDL is online, it still involves copying data, which matters most when the business load is high. We want it to run when the business is not busy; since the daily business cycle is generally fixed, the platform supports customizing the time, for example submitting the task during the day and running it at 01:00 when the business is at a low point.
Then comes copying the data. As mentioned, the copy can impact business resources, so TDSQL monitors two indicators: standby delay and active connections. If either exceeds its threshold, the copy pauses until the indicator returns to normal. Both thresholds can be customized in the console, but the system provides recommended defaults that generally do not need to be changed.
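The pause decision just described can be sketched as a simple threshold check. The default threshold values below are illustrative assumptions, not TDSQL's real recommended defaults:

```python
def should_pause_copy(standby_delay_s, active_connections,
                      max_delay_s=10, max_connections=200):
    """Pause the online-DDL data copy when either indicator exceeds its
    threshold; the copy resumes once both return to normal."""
    return standby_delay_s > max_delay_s or active_connections > max_connections
```

In practice the copy loop would call a check like this between chunks, sleeping while it returns True.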
The other step is the table cutover, which switches the new and old tables: the new table takes the name of the old one.
The cutover involves two features: lock detection and protection, and a free choice of cutover mode. A common daily scenario is that a large transaction is accessing the table just before the cutover, its query still running after half an hour. If you attempt the cutover then, you will fail to get the metadata lock, and worse, you will block subsequent business requests, which must wait behind the pending cutover. TDSQL therefore adds lock protection for this scenario: before cutting over, it checks whether a large query is running on the table; if so, it postpones the cutover and lets the query finish rather than killing it. Once the cutover starts it is very short, affecting users for at most about one second. If a competing request would keep the cutover from executing, the operation times out automatically without affecting subsequent business SQL; the tables return to the data-synchronization state, the cutover is attempted again after an interval, and the interval grows longer each time until the whole DDL operation completes.
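The retry-with-growing-interval behavior can be sketched as follows. The base interval, growth factor, and wait budget are illustrative assumptions:

```python
def try_cutover(longest_open_query_s, max_wait_s=1.0):
    """Before renaming tables, check for long-running queries on the table;
    if one is found, skip this round rather than blocking later requests."""
    return longest_open_query_s < max_wait_s

def cutover_retry_intervals(base_s=5, factor=2, attempts=5):
    """Retry schedule for the cutover: after a timed-out attempt the system
    backs off and retries, with the interval growing each time."""
    return [base_s * factor ** i for i in range(attempts)]
```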
The free-choice feature means the cutover mode can be automatic or manual: automatic cutover is still protected by the lock check, while manual cutover means the tables are not switched immediately after the data is copied; instead the DBA initiates the cutover from the console at a suitable time.
Let's look at an example of online DDL:
The picture on the left is the online DDL page, where you can see the table structure and click "Edit" to modify fields and indexes. On the page on the right you can see that the parameters just mentioned can be customized or left at their defaults; clicking "Confirm" completes the whole process automatically. If you choose manual cutover, you can pick a suitable time to complete the switch.
2.3 Security of the TDSQL automated operations system
Now let's look at the security mechanisms. The processes are automated, so we must make sure every one of them is safe and reasonable.
Security is not limited to these cases; the slides select some common scenarios.
Permission requests are very common. Suppose a user applied for an account with password A and a program is already running with that account. If someone re-applies for the same user with a different password, the system automatically checks and refuses to overwrite the old password.
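That check can be sketched in a few lines. The return strings and dictionary shape are illustrative assumptions:

```python
def apply_account_request(existing_accounts, user, password):
    """Reject a re-application that would silently overwrite the password of
    an account a program may already be running with."""
    if user in existing_accounts and existing_accounts[user] != password:
        return "rejected: account exists with a different password"
    existing_accounts[user] = password  # new account, or same password re-applied
    return "ok"
```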
The second is the automatic protection of online DDL, described above.
The third is instance decommissioning, for which we built an isolate-then-delete function. The instance is first isolated: requests to the database are rejected, but the isolation can be lifted at any time, so the instance is effectively in a recycle bin, with the data intact but inaccessible to the business. After a period of time it is cleaned up: for example, if there is no business feedback after 7 days (the length is customizable), the instance can be cleared and taken offline safely.
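The isolate-then-delete lifecycle can be sketched as a tiny state machine. Class and state names are illustrative assumptions:

```python
class IsolatedInstance:
    """Sketch of the recycle-bin flow: an instance is first isolated
    (requests rejected, data kept, restorable at any time), and purged only
    after a configurable grace period with no business complaints."""

    def __init__(self, grace_days=7):
        self.grace_days = grace_days
        self.state = "isolated"

    def restore(self):
        """Lift the isolation while the data still exists."""
        if self.state == "isolated":
            self.state = "online"
        return self.state

    def purge(self, days_isolated):
        """Delete for good, but only after the grace period has elapsed."""
        if self.state == "isolated" and days_isolated >= self.grace_days:
            self.state = "deleted"
        return self.state
```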
For redoing a DB node, TDSQL provides the primary-node protection mechanism. Capacity expansion involves migrating data to another data node. While the new node is being populated there is no business impact, because requests still go to the old instance node; the only step in the expansion that can affect the business is the final route switch, so TDSQL protects it. The switch mode can be chosen freely, manual or automatic; with a manual switch the business can be observed in real time and problems reported promptly. The route switch takes several steps, and if an intermediate step fails it is rolled back automatically without affecting the business, which is another protection for expansion.
Backup requires no intervention: TDSQL backs up automatically and enforces mutual exclusion during the process. The backup ends with a brief locking step; if business requests hold the tables longer, the lock attempt can fail, and if locking exceeds a set time the backup stops automatically and is relaunched later, without affecting the business. In other words, backup does not affect the normal operation of the standby machine.
To sum up, TDSQL provides many safety measures for automated operations, guarding every process at its key points, especially steps that could affect business requests and access. These measures distill long experience operating TDSQL and continuous optimization, so you can use the platform with confidence.
TDSQL intelligent fault analysis platform
Let's move on to the third chapter: automation of TDSQL fault analysis.
As analyzed above, the second kind of problem a DBA meets daily is how to locate and analyze a fault once it is found.
3.1 How TDSQL "Bian Que" helps DBAs improve fault-location ability
DBAs often find faults exasperating, and the problems come in many kinds. Locating them is hard for several reasons. First, a DBA's experience and ability strongly affect problem location; many excellent DBAs became excellent by accumulating fault experience over time. Second, to locate a problem we often have to log into the backend through several layers of authentication and comprehensively analyze many indicators, so efficiency is very low. Third, many problems recur: they are discovered repeatedly, yet we have to locate and solve them again each time.
We therefore hope to distill DBAs' fault-analysis experience into the "Bian Que" platform behind the Red Rabbit intelligent operations platform, so that it can analyze and find these problems automatically, bringing an efficient and convenient experience to database users and DBAs. As new ideas, new problems, and new root causes appear, we keep feeding them into the platform, and through this cycle the platform gradually becomes very powerful.
3.2 "Bian Que", TDSQL's automatic fault-analysis platform
Bian Que has four main functions: availability analysis, reliability analysis, performance analysis, and other analyses, with the last category still being strengthened. Availability analysis mainly revolves around the active/standby switchover scenario; reliability analysis produces a health-check report over a wide range of scenarios to reveal the database's current problems, letting you know the state of the DB like the back of your hand.
Performance analysis targets the scenario where the database runs slowly. We can roughly group the causes into categories such as hotspot tables, large transactions, lock waits, and long transactions. The layer below can analyze where SQL transactions spend their time, including SQL checks and optimization, to see whether anything is wrong with the SQL.
The lower layers form the data layer: the upper module is the data-collection method and the bottom is the data storage. For example, the audit log collects TDSQL data; the DB status includes DB snapshots, transaction information, and other status; and the synchronization information includes things like table structure information. If a table structure is unreasonable, it should be picked out first to see exactly what is unreasonable.
There is also the module responsible for monitoring the DB, covering the various indicators visible in the Red Rabbit console, where you can also see slow queries and the active/standby switchover process, including key indicators such as whether the switchover succeeded, the switchover point, and the switchover time.
The lower right corner is the operating system status, including IO, memory, CPU, and so on.
Reading the diagram from the top down: it first lays out the causes analyzed for availability, then a logical analysis layer, and finally the newly added data access layer. From this picture you can intuitively understand how Bian Que works and its internal logic.
Next, let's walk through examples of Bian Que's major functions to see how it performs fault analysis.
3.3 DB availability analysis on the Bian Que intelligent analysis platform
A TDSQL distributed database usually runs a one-primary-two-standby architecture, and TDSQL's Agent probes the DB periodically. Probing means simulating a user request: establishing a TCP connection and then executing a query and a write, such as a query against the monitoring table, to check whether the DB behaves normally. TDSQL's availability handling hinges on probe failures: if the DB shows an anomaly, the switchover process is initiated automatically.
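The probe-and-switch logic can be sketched as follows. The consecutive-failure threshold is an assumption for illustration; the real arbitration involves heartbeat reports to an arbiter, as described later:

```python
def probe(connect_ok, read_ok, write_ok):
    """One Agent probe cycle: simulate a user request by opening a
    connection and running a read and a write against the monitor table."""
    return connect_ok and read_ok and write_ok

def needs_failover(probe_results, threshold=3):
    """Trigger a switchover after `threshold` consecutive failed probes
    (the threshold value here is a hypothetical stand-in)."""
    streak = 0
    for ok in probe_results:
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return True
    return False
```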
That switchover is automatic, but afterwards we need to understand why it was triggered, which boils down to why the probe failed at the moment of the switch.
Availability problems, that is, failures of the Agent's DB probe, fall into three categories: disk failure, DB restart, and resource exhaustion. Let's look at the causes of each:
For disk failure, a long-lasting fault shows up in the logs, so we analyze the log information to see whether the disk was abnormal at the moment the primary/standby switch occurred. It can also be judged from IO performance: when a disk fails, especially when an SSD's endurance is exhausted, IO becomes anomalous; utilization is very high yet read/write performance is very poor, only a few operations per second, and service response times are very long. DB restart is detected from the DB start time, which is reported in real time: whenever the start time changes, we conclude the DB has restarted, and that can be given as the cause of the failure. For resource exhaustion, the disk IO analysis differs from the disk-failure case: there the IO anomaly comes from the hardware, while here normal requests, perhaps a large update or query, saturate the disk. This category also includes scenario analyses such as thread pool status and large-transaction status.
3.3.1 DB availability analysis: large-transaction problems
Next, let's look at availability problems caused by large transactions.
Consider the requests on the primary DB: as mentioned, an Agent writes a heartbeat periodically, and business requests arrive as well. Suppose a business deletes 10 million rows from a large table. TDSQL combines these writes into a group commit. As you can imagine, committing generates a huge amount of binlog; in cases we have seen it may produce 5 GB, 10 GB, or even tens of gigabytes, and because a transaction must stay within one binlog file, the result is a very large binlog. The whole commit takes a very long time, while the probe has a time and frequency limit beyond which it is considered failed. If the commit keeps the probe failing for a minute, a primary/standby switch inevitably occurs, because many heartbeats will already have reported to the arbiter that the DB has failed.
Analysis shows this type of fault has several features: heartbeat write timeout, generation of a large binlog file, a sudden spike in the number of rows affected in InnoDB, and a transaction stuck in the prepared state.
By extracting these four features, and only when all four are present and consistent, we conclude that the switchover was caused by a large transaction. The TDSQL automated operations platform analyzes the primary/standby switchover process and can generate an analysis report with one click.
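The all-four-features rule can be sketched directly. The feature names are illustrative labels for the four signals listed above:

```python
# The four signals from the text, under hypothetical names.
LARGE_TXN_FEATURES = {
    "heartbeat_write_timeout",
    "large_binlog_file",
    "innodb_rows_affected_spike",
    "txn_in_prepared_state",
}

def is_large_transaction_failover(observed_features):
    """Attribute the primary/standby switch to a large transaction only
    when all four characteristic features were observed together."""
    return LARGE_TXN_FEATURES.issubset(observed_features)
```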
3.4 DB performance analysis
Next, let's talk about DB performance analysis.
Slow execution is very intuitive. From experience we divide the causes into four categories: network problems, SQL problems, resource problems, and locks. Network problems are easy to understand: NIC traffic, network volatility. SQL problems include index analysis and SQL analysis; resource problems cover system resources such as CPU, IO, and swap activity; and then there is lock waiting. The analysis again relies on the collected data.
3.4.1 DB performance Intelligent Analysis: lock waiting
Next, let's take a look at the design idea of DB performance problems caused by lock waiting.
The example on the right shows two sessions, a typical lock-conflict scenario: begin opens a transaction; session 1 updates a row in the first second, session 2 updates the same row in the second second, and only much later does session 1 finish. Between those two points in time the DB is effectively in a locked state. Here there are just two sessions, but in production thousands of sessions can be stuck waiting while one session holds the lock. If you analyze the DB on the spot, you would go through a process like this:
Consider what happens around time point 2: session 1 finally commits, session 2's SQL succeeds at the instant of that commit, but session 2 as a whole has long since timed out. So we look back at what was happening during that window. Time point 1 is simple, and anyone familiar with DB troubleshooting knows the method: the tables below are the transaction table and the lock-wait table, and by joining them we can see that session 1 has not committed and session 2 is blocked by it. This is the easiest scenario to analyze: query the relationships across the three tables, see which session is the problem, kill session 1, and then check with the business side whether anything is wrong.
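The three-table lookup above can be sketched in miniature. On a live MySQL instance this is a join over `information_schema.innodb_trx` and the lock-wait view; here the rows are mocked as plain Python dicts so the join logic is visible on its own.

```python
# Mocked rows standing in for information_schema.innodb_trx and the
# lock-wait table; column names follow the MySQL views for readability.

innodb_trx = [  # trx_id -> owning session (processlist thread id)
    {"trx_id": "t1", "trx_mysql_thread_id": 101, "trx_state": "RUNNING"},
    {"trx_id": "t2", "trx_mysql_thread_id": 102, "trx_state": "LOCK WAIT"},
]
lock_waits = [  # which transaction waits on which
    {"requesting_trx_id": "t2", "blocking_trx_id": "t1"},
]

def find_blockers(trx_rows, wait_rows):
    """Map each waiting session id to the session id blocking it."""
    by_trx = {r["trx_id"]: r["trx_mysql_thread_id"] for r in trx_rows}
    return {by_trx[w["requesting_trx_id"]]: by_trx[w["blocking_trx_id"]]
            for w in wait_rows}

print(find_blockers(innodb_trx, lock_waits))  # {102: 101}: session 101 blocks session 102
```

The result tells the operator exactly which session to kill (here session 101, the uncommitted session 1) to unblock the waiter.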
On the Red Rabbit platform all of this takes one click: click "Real-time Analysis" on Red Rabbit to see the live situation. The platform suggests killing the blocking session to restore normal service. Of course, afterwards we still have to go to the business side and find out why the transaction was held open for so long.
Now the second scenario: session 1 has already committed, so there is no failure site left by the time we analyze. What are the fault characteristics here? Session 2's update on the table timed out, or its execution time was unusually long; the timeout may have occurred at time T1, or the statement may have been running across T1 for a long time.
Our goal is to find which session blocked session 2. First, some session X must have held a row lock on table T at that point in time; only then could session 2 be blocked. We can find that session through SQL-log analysis. As I just mentioned, the SQL log carries full information, including the client IP. But requests from many connections are interleaved in the log: one connection opens a transaction and executes a SQL statement, another connection executes its own, so it is hard to reconstruct any single transaction directly. The first thing to do, therefore, is to aggregate the log by client ip:port; once grouped by that dimension, we know everything each ip:port executed, and we can lay out each connection's SQL statements and commit times in order.
In addition, a transaction may contain all kinds of query and update operations in the middle, so we also extract which table locks each of those SQL statements holds; a statement that holds no lock certainly has no effect on session 2. From this we derive: first, the execution span of the transaction, when it started and when it ended; second, whether the transaction's lock list could have blocked session 2; and third, the intervals between SQL statements, which also helps the business side judge whether a long gap between two statements inside a transaction is reasonable.
There is also the per-statement timing. If one SQL statement ran for a very long time and session 2 was indeed blocked during it, we need to find out why that statement took so long.
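The aggregation step above can be sketched as follows. This is a hedged illustration of the idea, not the actual TDSQL tooling; the log record layout is an assumption made for the example.

```python
# Sketch of the post-mortem SQL-log aggregation: group log records by client
# (ip, port) to restore each connection's statement stream, then report each
# connection's start time, end time, and the longest gap between statements.

from collections import defaultdict

log = [  # (timestamp_sec, client_ip, client_port, sql_text) -- assumed layout
    (10.0, "10.0.0.5", 4444, "BEGIN"),
    (10.1, "10.0.0.5", 4444, "UPDATE t SET v = 1 WHERE id = 7"),
    (10.2, "10.0.0.9", 5555, "SELECT 1"),
    (70.0, "10.0.0.5", 4444, "COMMIT"),
]

def reconstruct(records):
    """Group records by (ip, port); summarize timing per connection."""
    streams = defaultdict(list)
    for ts, ip, port, sql in records:
        streams[(ip, port)].append((ts, sql))
    report = {}
    for conn, stmts in streams.items():
        times = [ts for ts, _ in stmts]
        gaps = [b - a for a, b in zip(times, times[1:])]
        report[conn] = {"start": times[0], "end": times[-1],
                        "max_gap": max(gaps) if gaps else 0.0}
    return report

r = reconstruct(log)
print(r[("10.0.0.5", 4444)]["max_gap"])  # longest gap ~= 59.9 s: idle between UPDATE and COMMIT
```

A connection with a large `max_gap` inside an open transaction is exactly the kind of "long interval between two statements" that the business side should be asked to explain.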
From these information tables we can find which session X held the lock that blocked session 2 at time point T1, and then examine why session X behaved unreasonably, or at least check with the business side whether its behavior was normal. Other cases work the same way: look at the lock at the point in time to see which session caused the lock wait, then use the intervals and related information to pinpoint the session responsible.
3.4.2 Intelligent Analysis of DB Reliability
DB reliability analysis means knowing the state of the database in advance, covering system status, tablespace distribution, indexes, deadlock diagnosis, lock-wait diagnosis, slow-query analysis, DB status checks, and so on. Consider a case: a database's health score is very low, and analysis shows where the problem lies. Is CPU usage too high? Are there too many tasks? The status information below helps you judge whether the DB has a problem and whether it can be found and fixed in advance.
Usually you can run such a diagnosis whenever you are not familiar with the operating status of an important DB. The system can also diagnose automatically, and it supports a daily report for important DBs. That is the introduction to DB reliability analysis.
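A health score like the one mentioned above can be sketched as a simple penalty model. The check names and penalty weights below are illustrative assumptions, not Bian Que's actual scoring formula.

```python
# Sketch: combine a handful of check results into a 0-100 health score so
# that a low score flags a DB needing attention before it actually fails.

CHECKS = {  # penalty deducted when a check fails (illustrative weights)
    "cpu_overload": 30,
    "tablespace_nearly_full": 25,
    "lock_wait_detected": 20,
    "slow_queries_spiking": 15,
    "missing_index": 10,
}

def health_score(failed_checks):
    """Start from 100 and deduct a penalty for every failed check."""
    score = 100
    for name in failed_checks:
        score -= CHECKS.get(name, 0)
    return max(score, 0)

print(health_score([]))                                        # healthy DB -> 100
print(health_score(["cpu_overload", "slow_queries_spiking"]))  # degraded -> 55
```

In practice the per-check results would come from the collected monitoring data, and the daily report would simply run every check and render the score alongside the failed items.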
Summary
Today I mainly shared the above. To sum up, the TDSQL automation operation system takes the cumbersome tasks that used to require manual operation and makes them standardized, automated, and intelligent, lightening your load. Because we have distilled these operations into standards, many requirements no longer need the DBA at all: based on the TDSQL operation platform, the business side can handle them directly, and some daily operations can be delegated to the business through Red Rabbit's permission management.
That's about all for today. Thank you.
Q&A
Q: Is there a multi-active mechanism?
A: We have a multi-active mechanism, and a strong synchronous replication mechanism guarantees data consistency.
Q: Does online DDL keep the two copies of data consistent?
A: Online DDL is implemented with tools. Simply put, update operations are written to the new table in real time, while the original data is copied into the new table in batches. Once the copy is finished the two tables hold the same data, and a trigger keeps syncing the business's writes in real time until the table switch completes.
Q: For a large transaction, if you only discover after ten minutes that it still has not finished, what can be done at that point?
A: If the transaction has already run for a long time and finishing it would take a long time still, I suggest killing it. Otherwise, as just mentioned, a large transaction is very likely to trigger a primary/standby switch, and we will force the switch to guarantee high availability.
Q: How is the deadlock detected and located?
A: It is not a deadlock but a lock wait; a deadlock is when neither transaction can proceed. In our example the blocking transaction could commit, it just had not committed for a long time. Session 2 kept waiting and timed out at some point; before that point, some session must have held a lock on the table. Our goal is to find whether some session held a lock on that table before session 2 timed out. Based on the table and the time point, we use the engine's SQL log, which records every user's SQL execution, to analyze what happened around the lock timeout, mainly whether an earlier session held the lock. Of course this is a screening process: there will be multiple candidate sessions at that time, so we screen them and then judge whether the session's behavior was reasonable. Since there is no live failure site, we can only work from the recorded transaction information.