Tencent database expert Lei Hailin shares intelligent operation and maintenance architecture 07/09 Update SLTechnology News&Howtos

Tencent database expert Lei Hailin shares intelligent operation and maintenance architecture

2025-07-09 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/02 Report--

At the DTCC2019 China Database Conference on May 8-10, 2019, Lei Hailin, an expert engineer of Tencent Cloud database, was invited to share the technology with the theme "TDSQL Intelligent Operation and maintenance platform-Bian Que Architecture and practice". The following is a live speech.

Lei Hailin was at the meeting.

First, the basic introduction of Bian Que

Bian Que system is a product launched by TDSQL for the cloud market to automate the analysis of database performance / failures and provide users with optimization / solutions.

1. Bian Que's demand background

TDSQL is a highly consistent and distributed database cluster solution launched by Tencent for financial scenarios. At present, it covers 90% of Tencent's payment business scenarios, and a large number of internal teams use it. At the same time, as a database product of Tencent Financial Cloud, it supports two cloud solutions: public cloud and private cloud. At present, it has a large number of government, banking, insurance, logistics, e-commerce and other customers, but with the continuous expansion of customers and clusters, it also brings great challenges in the operation of TDSQL.

To solve these problems, we think we need an automated fault / performance problem analysis system, which can reduce the repetitive work of DBA, precipitate our analysis experience, locate the problem quickly, bring the fastest response to our customers and improve the happiness index of DBA.

The reason why this module is named Bian Que is that it can "prescribe the right medicine" for the database and treat / repair / predict the known or potential risks of the database as the ancient Bian Que doctor diagnosed the cause of the disease.

two。 The role of Bian Que

When developing Bian Que system, with the continuous input of the expert knowledge base of DBA to Bian Que, most of the performance / fault problems of our current network can be analyzed by Bian Que, which greatly liberates the hands of DBA and greatly improves the operation efficiency. The following picture shows the basic functions and goals we set for Bian Que in the design stage. the core point is that Bian Que can achieve early warning of risks, accurately analyze the causes and solve problems in the event, and analyze historical events afterwards. Find the problem.

II. System architecture

Bian Que can be divided into six hierarchical structures in the following figure.

The resource layer mainly includes DB instances and hosts, providing all kinds of original information.

The acquisition layer will collect some necessary diagnostic information such as performance indicators, SQL logs, table structure and so on from the resource layer and send them to the storage layer.

The storage layer is responsible for persisting the information provided by the collection layer for subsequent analysis of historical data.

The index layer will extract the data from the storage layer, classify it again and form a programmable data structure, which is also the diagnostic unit needed by the analysis layer.

The analysis layer is the core logic of Bian Que, which is mainly responsible for root tone analysis and risk assessment of common database anomalies such as active / standby switching, active / standby delay, etc. by using the metadata information of the index layer and the knowledge base deposited by TDSQL itself.

The presentation layer will eventually visualize the results of the analysis layer, which can be divided into health reports and specific fault / performance / optimization recommendations.

The following figure shows Bian Que's more detailed structure. You can see what functions Bian Que has, what metadata is needed for these functions, from which levels the metadata is obtained, how the modules interact, and so on. If you want to do similar functions, you can make a good reference based on this.

Principle and practice of intelligent diagnosis

We classify the DB problems that customers often consult into three categories: usability issues, performance issues, and reliability issues.

Next, let's take a specific look at how Bian Que analyzes and solves these three types of problems.

1. Availability issu

The usability problem mainly means that the DB cannot respond to the user's request for a period of time.

As a financial-level database, TDSQL itself is highly available, and the new master switch will be automatically selected when the host is unable to continue to provide services. Here, we detect whether the master is alive by using an agent module to periodically connect DB and write data to a heartbeat table built by TDSQL, so that agent can accurately judge whether the disk is unavailable due to bad blocks, full disk or DB restart. When agent writes heartbeat failure or timeout for a continuous period of time, the switching logic will be triggered, during which time the DB will be in a short second unavailable state. You may receive DB read-only, disconnected, and other exceptions from the user side. In this case, the business often needs to know clearly what the reason for the handover is and how to avoid the handover happening again.

There are many reasons for handover. Here we list some common factors, such as triggering a bug in the kernel, causing DB to restart hung, and disk failure causing DB to fail to write. It is also possible that some SQL of users excessively occupy some CPU, IO and other resources, such as large transactions, slow query concurrently affect the user or heartbeat thread writing and so on.

In order to analyze the cause of the handover problem, the first thing we need to do is to retain the necessary on-site information to provide clues for our follow-up diagnosis.

Here, we realize the second collection of resource status information of host hosts such as top,iotop,iostat, and snapshot information such as internal processlist,innodbstatus of DB before switching. The abnormal reasons we listed above can basically be reflected in this information. below, we explain in detail how to use this information to analyze the reasons for switching, and what effect Bian Que's analysis of this problem has achieved.

From our own operation and maintenance experience, handover caused by DB failures is not common, and it is more likely due to some abnormal conditions caused by users' SQL taking up too much system resources, which can be divided into two categories: slow query concurrency and large transactions. Let's analyze the reasons for the handover one by one.

Active / standby switching caused by slow query concurrency

TDSQL uses innodb storage engine by default. In order to avoid additional performance overhead caused by too many threads running in innodb at the same time in innodb, innodb provides a parameter of innodb_concurrency to limit the maximum number of threads executed in innodb at the same time. If the customer executes slow queries with a large number of concurrent connections, these slow queries will continue to occupy the active threads of innodb. As a result, many operations related to innodb access, such as simple insert / update, are also easy to be blocked and wait for innodb processing, which will also cause agent heartbeat detection to fail constantly, thus triggering the master / slave switch.

When this happens, we can see that there are a large number of threads waiting in the innodb status message, and many slow queries are executed in processlist for a long time, so that we can analyze the pre-saved innodb status information to confirm this phenomenon, and combined with processlist to find out TOP slow SQL, we can know which slow queries concurrently cause this problem.

Reasons for active / standby handover caused by large transactions

TDSQL defaults to row format binlog in order to ensure the consistency of master and standby data. If a user performs an operation on a large delete table, a very large binlog write may be generated. Because binlog is written sequentially, before the binlog of a large transaction completes the write disk, some later small write operations such as TDSQL heartbeat writes will also be blocked at the stage of writing binlog, waiting for the completion of large transaction binlog writes. This wait time process can lead to frequent heartbeat write timeouts. This triggers the switching logic, in which case we will observe that there are a large number of completed transactions in the innodb status prepared at the innodb layer, waiting to be written to the binlog, and a large number of heartbeat writes are blocked in the processlist.

At present, TDSQL has been optimized for this situation, such as by default to limit the size of a binlog write to no more than 1.5G to prohibit the generation of large transactions.

The effect of Bian Que's automatic Analysis

Combined with the above analysis process, Bian Que will automatically analyze the reasons for the switch according to the monitoring, DB snapshots before switching and other information, and show the detailed analysis process.

1)。 The following figure shows that Bian Que analyzes that the active / standby switching is caused by the non-survival of DB.

2)。 This example shows that Bian Que automatically combines the full active threads of innodb status before switching and the excessive slow query of processlist to determine that the master / standby switch is triggered by the concurrency of slow query, and Bian Que aggregates the SQL of processlist according to SQL fingerprints to facilitate users to quickly locate which SQL is causing the problem.

3)。 Here, we can see that Bian Que located the active / standby switching caused by large transactions, and found the specific SQL that triggered large transactions.

two。 Performance problem

Let's take a look at the performance of DB and what causes performance problems.

The most intuitive feeling from the user side of the performance problem is that the execution of SQL takes too long. The common causes of this problem are

Network factors, such as delay, packet loss, etc.

SQL itself is slow to execute.

Resource saturation

Lock waiting

We will not delve into the network problems here for the time being, but here we will mainly analyze the latter three situations:

SQL itself is slow to execute.

The slow execution of SQL itself is usually due to the user's failure to establish a suitable index, or due to some reasons in SQL writing, the existing index is not used. Bian Que will automatically analyze this kind of SQL through syntax parsing, table structure accessed by SQL, data distribution and other information, and generate appropriate index optimization suggestions to feedback to users.

Resource saturation

For slow queries caused by resource saturation, such as the soaring resources such as the current CPU/IO, Bian Que's session analysis function automatically aggregates the current session according to SQL fingerprints, so as to quickly find the TOP SQL causing resource consumption and then automatically associate the SQL optimization module to get optimization suggestions, so that both ordinary users and DBA can quickly locate the culprit of resource consumption and know the optimization scheme at a glance.

Lock waiting

Another common factor causing the high time consumption of SQL requests is the lock waiting problem. For example, in transaction 1, a session updates a row but the transaction has not been committed. When a SQL in transaction 2 updates the same row, it needs to wait for transaction 1 to commit before it can be executed. The waiting time will also increase the time consumption of the entire request. In this case, users may find that some simple operations, such as primary key updates, are normally 0ms, and occasionally suddenly become tens of seconds. When the customer gives us feedback, we find that the time consumption of SQL execution may be normal again. Let's take a look at how Bian Que assists customers / DBA to analyze such problems.

In the example in the following figure, we can see that a line of session1 update T1 has not been committed since, and the lock of that line is never released, resulting in a lock timeout for operations on the same line of session2 update.

In this case, as long as the customer's session1 does not commit the transaction and does not disconnect from the DB, the lock held by the session will remain. There are three tables under MySQLinformation_schema that record the lock wait dependencies between transactions. For example, in the figure below, session4 is not blocked by other sessions, but the lock held by session4 blocks the session1,2,3. Here we call session4 the lead session holding the lock. In this case, because the lock waiting scene environment is still there, Bian Que can find the lead session holding the lock by analyzing the information of these three tables and suggest that the user kill session4 to release the lock waiting.

The following picture is the effect picture of Bian Que diagnosing this kind of lock waiting.

In addition to the uncommitted transaction, it is also possible that the user's business logic does not commit the transaction immediately after executing all the SQL in the transaction, resulting in a long lock holding time for the transaction. In the following figure, you can see that session1 did not commit the transaction until 50 seconds after the execution of update T1, resulting in more than 50 seconds for the same line of update operations in session2 and even a lock timeout error. If the user reports that there is such a problem at 12:00 at 15:00, we will find that there is no such lock waiting relationship when we check the lock information under information_schema. In this case, we can only use the SQL log executed by the user. To find out the historical conversation information of session1, then the problem we face is

Where do I extract SQL logs?

How to efficiently find out the historical session information of session1 through the slow query provided by the user?

Where do I extract SQL logs?

TDSQL has a proxy layer between the user and the DB connection, all user SQL execution will first go through proxy, and an efficient log module is implemented in proxy, which can save all the user's executed SQL, execution time consumption, client address and other information after desensitization, and has no impact on performance.

How to efficiently find out the historical session information of session1 through the slow query provided by the user?

Although we have all the historical SQL information of the user, it is still difficult to find directly from the log the session information in which session1 blocks session2 at a certain time, or the transaction information at a certain point in time is covered by the start and end time of the session1 transaction.

Here Bian Que implements a transaction simulator, which can extract all transaction information, such as the start and end time of the transaction, the tables accessed in the transaction, the number of rows affected by the transaction, the total time consumption of the transaction, etc., by grouping the recorded IP:PORT by the client and parsing the SQL executed by the user combined with syntax. In this way, we can find out the specific execution information of a transaction by setting filter conditions on a transaction basis.

Bian Que's diagnostic case

Next, let's take a look at a case where the user reported that there was a lock timeout on the update T01_NOR_CUST_INFO table at 22:00:37.

Bian Que can automatically find out the transaction execution time including 22:00:37 and have a update/delete/insert/selectfor update for T01_NOR_CUST_INFO and other transactions that may have row locks by setting the two conditions of T01 transactions no longer custom custom info, and automatically prompt the user that this transaction takes too long and that holding locks for too long may affect other conversations. With this feature, we can quickly find out the specific information of the affected session according to the time of the slow query provided by the user and SQL, and the user can troubleshoot the business logic repair problem according to the transaction information and time provided by Bian Que.

3. Reliability problem

The reliability problem of DB is that the business may not feel abnormal in database access at present, but want to do a physical examination for DB to determine whether there are potential risks or hidden dangers in DB that will lead to abnormal DB at some point in the future.

For the troubleshooting of potential risks of DB, we evaluate the health status of DB based on performance monitoring, table structure, historical session, slow query and other information combined with Tencent Cloud's massive data + machine learning capability system, detect possible anomalies and inform customers, as far as possible, to warn most of the anomalies before they occur to minimize the risk.

IV. Summary

Above, we introduce the demand background of Bian Que spawned by the pain points of TDSQL operation, as well as the hierarchical structure and constituent elements of Bian Que, as well as the diagnostic principle and practical experience of key technologies such as active / standby switching, lock waiting analysis and so on. After having Bian Que, the work order for consulting on public cloud performance has been basically reduced to 0. We can see that Bian Que's current function can already serve our customers well and improve the quality of life of our DBA classmates. In the future, we will continue to improve Bian Que's diagnosis and prediction ability, integrate our many years of DBA operation experience and AI, machine learning and other technologies, to cover more abnormal scenarios. Strive to predict all anomalies before they occur, so as to provide customers with a more secure operating environment.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.