Is the SQL running slowly? Check the middleware

Preface

Recently I ran into a strange problem online that took me a long time to pin down, and the troubleshooting turned out to be quite interesting. Since this blog has not been updated in a while, I wrote this article about it.

Bug scene

After a year of hardening, our sharding (sub-database/sub-table) middleware has reached a fairly stable stage, and online stress tests show that a single instance can execute 17,000 SQL statements per second. Yet there were still surprises in production. One line of business reported that a few SQL statements every day timed out at more than ten seconds, even though each one was a primary-key update or primary-key query. Stranger still, the timeout hit a different SQL each time, with no apparent regularity, as shown in the following figure:

It is worth noting that only part of this business's traffic goes through our middleware; the rest connects directly to the database, and the timeouts occur only on the middleware path, as shown below:

Obviously, the problem was introduced by the middleware.

Checking whether the SQL is really slow

Because the database middleware only sees SQL and does not record the application's traceId, it is hard to match a given request to its SQL. As a first pass, I checked whether the types of SQL that timed out on the application side were actually slow on the database side. Analyzing the logs showed that during that period every SQL executed against the back-end database in only about 0.5ms, which is very fast. As shown in the following figure:

The interaction between the middleware and the database looked normal, so I kept hunting for clues.

Looking for a pattern in the timeouts

Because it is hard to tie a specific request to the SQL the middleware executed, I listed all the anomalies to see whether their timestamps showed any regularity, for example some batch job degrading middleware performance. Here is the information the business side provided for some of the timed-out SQL:

Service start time     Application IP executing the SQL     Execution time (s)
2018-12-24 09:45:24    xx.xx.xx.247                         11.75
2018-12-24 12:06:10    xx.xx.xx.240                         10.77
2018-12-24 12:07:19    xx.xx.xx.138                         13.71
2018-12-24 22:43:07    xx.xx.xx.247                         10.77
2018-12-24 22:43:04    xx.xx.xx.245                         13.71

There seemed to be no pattern: the slow SQL came from different application IPs, ruling out a problem with any single application, and the timeouts were spread from 9 a.m. to 10 p.m., ruling out performance degradation at one particular time of day.

Noticing a slight pattern

After staring at the data for a while, I finally spotted a small pattern in the following two entries:

Service start time     Application IP executing the SQL     Execution time (s)
2018-12-24 22:43:07    xx.xx.xx.247                         10.77
2018-12-24 22:43:04    xx.xx.xx.245                         13.71

The start times of these two timed-out SQL are quite close: one at 22:43:07 and the other at 22:43:04, only 3 seconds apart. Adding each execution time to its start time brings them even closer. Let's reorganize:

Service start time     Application IP     Execution time (s)     Completion time
2018-12-24 22:43:07    xx.xx.xx.247       10.77                  22:43:17.77
2018-12-24 22:43:04    xx.xx.xx.245       13.71                  22:43:17.71
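
Just to make the arithmetic explicit (a throwaway check with java.time, nothing middleware-specific): although the two transactions started 3 seconds apart, their completion times differ by only 60 milliseconds.

import java.time.Duration;
import java.time.LocalTime;

public class CompletionTimes {
    public static void main(String[] args) {
        // completion time = service start time + execution time
        LocalTime first  = LocalTime.parse("22:43:07").plus(Duration.ofMillis(10_770));
        LocalTime second = LocalTime.parse("22:43:04").plus(Duration.ofMillis(13_710));
        System.out.println(first);  // 22:43:17.770
        System.out.println(second); // 22:43:17.710
        System.out.println(Duration.between(second, first)); // PT0.06S -- only 60ms apart
    }
}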

Coincidence or bug, the two transactions started at different times but completed at almost exactly the same moment. So I asked the business side to keep providing their latest slow SQL, and found that although this phenomenon is rare, it has happened more than once. I suddenly felt this was definitely not a coincidence.

Thoughts prompted by this pattern

We run several middleware instances. Suppose one instance got stuck for a while: if two SQL statements happened to land on that same instance during the stall, then at the moment it recovered, both would execute at the usual 0.5ms and return together, producing exactly this phenomenon. As shown in the following figure:

Of course, there is another possibility: the SQL executes at the usual 0.5ms first, and then the middleware gets stuck before returning the result. The only difference is where in the pipeline the stall occurs. This alternative is shown in the following figure:

Did they land on the same middleware instance?

There are 4 middleware instances online. After a good deal of painstaking log analysis, I confirmed that the two SQL statements really did land on the same instance. To validate the conjecture, I found two more SQL statements fitting the same pattern; they also landed on a single instance, but a different one from the first pair, which rules out one particular machine being at fault. As shown in the following figure:

Comparing the business log with the middleware log

Building on these findings, and after matching several kinds of logs against each other, I finally tied the slow SQL in the business log to the corresponding middleware log entries, and found a key piece of information: the middleware prints a log when it receives a SQL, and there was a gap of about 10 seconds between the application sending the SQL and the middleware receiving it and getting around to the routing logic. After that, the SQL executed and returned very quickly, as shown in the following figure:

Checking whether other SQL on that middleware was anomalous at the same moment

I pulled that middleware's log for the same point in time and found that, apart from these two SQL statements, everything else was perfectly normal, with overall latency around 1ms. That left me puzzled.

Digging further into the logs

After more thinking and analysis of the middleware logs, another oddity surfaced: within the second in question, the NIO thread handling the slow SQL processed far fewer SQL statements than the other NIO threads. Going further, that NIO thread printed no logs at all before a certain point within that second, as shown in the following figure:

Meanwhile, both slow SQL statements turned out to belong to the same thread, Reactor-Thread-2, and looking back through the log, that thread had printed nothing for nearly 10 seconds, as if it had processed nothing at all. As shown in the following figure:
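
A stalled reactor thread shows up as an unusually large gap between consecutive log lines from that thread. Here is a minimal sketch of that kind of scan, assuming a hypothetical log line shape of "HH:mm:ss.SSS [thread-name] message" (the middleware's real log format is not shown in this article):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalTime;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreadGapFinder {
    // assumed line shape: "22:43:07.123 [Reactor-Thread-2] ..."
    private static final Pattern LINE =
            Pattern.compile("^(\\d{2}:\\d{2}:\\d{2}\\.\\d{3}) \\[([^\\]]+)\\]");

    public static void main(String[] args) throws IOException {
        Map<String, LocalTime> lastSeen = new HashMap<>();
        Map<String, Duration> maxGap = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) continue;
            LocalTime t = LocalTime.parse(m.group(1));
            String thread = m.group(2);
            LocalTime prev = lastSeen.put(thread, t);
            if (prev != null) {
                Duration gap = Duration.between(prev, t);
                maxGap.merge(thread, gap, (a, b) -> a.compareTo(b) >= 0 ? a : b);
            }
        }
        // a stuck reactor thread stands out as the one thread with a huge max gap
        maxGap.forEach((th, gap) -> System.out.println(th + " max gap: " + gap));
    }
}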

It felt like the truth was getting close. Clearly, the reactor thread was stuck!

Finding out why the reactor thread was stuck

Following the trail, I compared the last logs printed by each of the stuck reactor threads before their stalls: the SQL in those entries had all returned quickly, with nothing abnormal. Then I compared the first SQL each stuck thread printed after recovering: their route parsing was uniformly slow, around 1ms against the normal 0.01ms. Pulling up those SQL statements showed that every one of them was about 150K in size. The natural inference was that the missing 10 seconds were spent processing the 150K SQL, as shown in the following figure:

Why does it take 10 seconds to process a 150K SQL? First, checking whether it is a network problem

First, this SQL also existed before the middleware was introduced, and it took only about 0.5ms then; the middleware likewise forwards the SQL to the database in roughly the same time. As a back-of-envelope check, 150KB over a gigabit LAN is about 150,000 × 8 / 10^9 ≈ 1.2ms of transfer time, nowhere near 10 seconds. If the network were at fault, both legs should have been slow, so this possibility was set aside for the time being.

Checking whether the NIO network-handling code is at fault

Suspecting the middleware's NIO handling code, I constructed the same SQL offline to reproduce the problem, and it executed quickly with no pressure at all. I then wondered whether the production environment itself was the difference, but traceroute showed the network was essentially the same as offline: every application machine connects directly to the middleware machines, the MTU is 1500, and there is no routing in between. The trail went cold again.

Light at the end of the tunnel

After a long, fruitless stretch of thinking, I suspected the reproduction itself was wrong, and suddenly realized the difference: production uses prepareStatement, while my command-line test used a plain statement. With prepare, the SQL is sent in the form select ? with the parameters shipped separately; with statement, the values are inlined directly, as in select 1, 2, 3.
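
In standard JDBC the difference looks like this (a minimal illustration; the connection URL, table, and column are made up for the example):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class PrepareVsStatement {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://db-host:3306/test", "user", "pass")) {

            // Statement: the literal value is part of the SQL text the server sees
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("select name from t where id = 1")) {
                while (rs.next()) System.out.println(rs.getString(1));
            }

            // PreparedStatement: the SQL keeps its '?' placeholder and the
            // parameter ships separately, letting the server reuse the plan
            try (PreparedStatement ps =
                         conn.prepareStatement("select name from t where id = ?")) {
                ps.setLong(1, 1L);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) System.out.println(rs.getString(1));
                }
            }
        }
    }
}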

Our middleware also supports prepareStatement, because the back-end database gets a significant performance boost from prepared SQL. To reuse the original SQL-parsing code, the middleware, upon receiving the '?' SQL and its parameters, first renders the parameters back into the SQL, restoring a statement without placeholders; that rendered SQL is what the parser uses to compute the route to the database node. It then prepares the original '?' SQL, together with the parameters, against that database, as shown in the following figure:
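
Put as code, the flow is roughly the following. This is a hedged sketch: renderSql, routeToNode, and Node are all names I made up, since the article describes the steps rather than the middleware's real code.

import java.util.List;

public class PrepareFlowSketch {
    static class Node {
        final String name;
        Node(String name) { this.name = name; }
        void prepare(String sqlWithPlaceholders, List<String> params) {
            // stand-in for sending the ORIGINAL '?' SQL plus parameters downstream
            System.out.println("prepare on " + name + ": " + sqlWithPlaceholders + " " + params);
        }
    }

    static String renderSql(String sql, List<String> params) {
        // restore a parameterless SQL from the '?' SQL plus parameters
        String out = sql;
        for (String p : params) {
            out = out.replaceFirst("\\?", p);
        }
        return out;
    }

    static Node routeToNode(String renderedSql) {
        // stand-in for the real parse + route step, which needs the full SQL text
        return new Node((renderedSql.hashCode() & 1) == 0 ? "db-0" : "db-1");
    }

    public static void main(String[] args) {
        String sql = "select * from t where id = ?";
        List<String> params = List.of("42");
        // 1. render the parameters in, so the existing parser can compute the route
        Node node = routeToNode(renderSql(sql, params));
        // 2. prepare the original '?' SQL on the routed node
        node.prepare(sql, params);
    }
}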

Reconstructing the prepareStatement scene

After rebuilding the reproduction with prepareStatement, the 150K SQL did indeed take 10 seconds, and I finally felt I was seeing daylight.

Final cause: string concatenation

Since the problem was now reproducible offline, I added logging in various places and finally located the cost: it is in the step that renders the '?' SQL into the SQL without question marks. The code looked like this:

String sql = "select ..."; // the incoming '?' SQL
for (int i = 0; i < paramCount; i++) {
    sql = sql.replaceFirst("\\?", param[i]);
}
return sql;

This replaceFirst call burns a lot of CPU when the string is very large and there are many placeholders to replace: each call recompiles the regex, scans from the start of the string, and copies the entire string to build the result, so rendering k parameters into an n-character statement costs roughly O(n × k). With n around 150,000 characters, that adds up to billions of character operations. We had also previously found that replaceFirst's regex handling mangles special symbols (and a parameter that itself contains '?' renders incorrectly), so a newer version renders the SQL by splitting on '?' instead. However, that version had not yet been rolled out to this application's cluster. The extra regex machinery may also be part of why replaceFirst performs so badly in this case.
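
A small benchmark makes the cost concrete. This is my own sketch, not the middleware's code: it builds a statement with 10,000 placeholders whose rendered form is roughly 150K characters, then renders it with replaceFirst. On typical hardware this takes on the order of seconds (the exact figure depends on the machine and the real statement's parameter mix), the same ballpark as the missing 10 seconds:

public class RenderBenchmark {
    public static void main(String[] args) {
        int paramCount = 10_000;
        String param = "12345678901234"; // 14-char dummy parameter value

        // build "insert into t values (?,?,...,?)" with 10,000 placeholders
        StringBuilder sb = new StringBuilder("insert into t values (");
        for (int i = 0; i < paramCount; i++) {
            sb.append(i == 0 ? "?" : ",?");
        }
        sb.append(")");
        String sql = sb.toString();

        long start = System.currentTimeMillis();
        for (int i = 0; i < paramCount; i++) {
            sql = sql.replaceFirst("\\?", param); // one regex scan + full copy per parameter
        }
        System.out.println("rendered length: " + sql.length()); // ~150,000 chars
        System.out.println("took " + (System.currentTimeMillis() - start) + " ms");
    }
}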

Corresponding optimization

After upgrading to the new version, the rendering code looks like this:

String[] splits = sql.split("\\?");
StringBuilder result = new StringBuilder();
for (int i = 0; i < splits.length; i++) {
    result.append(splits[i]);
    if (i < param.length) {
        result.append(param[i]);
    }
}
return result.toString();
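
The article's snippet breaks off at the loop; the body above is my reconstruction, using a StringBuilder because appending to a plain String would itself copy quadratically. Now the '?' scan happens exactly once and each character is copied once. Wrapped in a helper (the name renderSql is mine), it behaves like this:

public class RenderSqlExample {
    static String renderSql(String sql, String[] param) {
        String[] splits = sql.split("\\?");
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < splits.length; i++) {
            result.append(splits[i]);
            if (i < param.length) {
                result.append(param[i]);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // a parameter containing '?' is appended verbatim, never re-matched
        System.out.println(renderSql(
                "select * from t where a = ? and b = ?",
                new String[]{"1", "'x?y'"}));
        // prints: select * from t where a = 1 and b = 'x?y'
    }
}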
