Why the master-slave switch is not successful

2025-07-01 Update From: SLTechnology News&Howtos, Shulou (Shulou.com) 06/03 Report

Fault scenario

Recently we performed a master-slave switchover in production. Most applications cut over cleanly, but some applications still had connections to the old master (now the new slave).

This puzzled the developers of the affected application, so they asked the author to investigate.

How was it discovered?

The application developers received a Cat monitoring alarm and found that requests in application A had been failing steadily on several machines. Remembering that a database master-slave switchover drill had been carried out the night before, they ran netstat -anp on one of the machines and found that it was still connected to the old database!

    netstat -anp | grep 1521
    tcp 0 0 1.2.3.4:54100 1.1.1.1:1521 ESTABLISHED

The developers concluded that the requests must be failing because the connection had never switched over to the new master. At first glance, that seems very reasonable.

Initiate an investigation

What is going on? Eight hours have passed since the switchover succeeded, so why is the connection still pointing at the old database? The author pinged the domain name of the database:

    ping db.prd
    64 bytes from db.prd (2.2.2.2): icmp_seq=1 ttl=64 time=0.02 ms

Strange. DNS has already switched over, so why is the application still connected to the old database?

First conjecture: DNS delay

The first thing that comes to mind is a delay between the master-slave switchover and the DNS update. For example, if the DNS change takes effect only 2 minutes after the switchover, connections created during that window still go to the old master (now the slave).

This situation is normal, which is why the DBAs are required to kill all connections on the old master. When asked, the DBAs reported that they had killed every connection. They also ran the database's connection-count SQL in front of me, and it really showed no connection from the machine in question. That is odd, because the connection on the application machine is in the ESTABLISHED state!

Most of the machines are connected to the old database!

At this point the developers told the author that most of the machines running this application were connected to the old database! If DNS delay were the cause, it could not be such a coincidence across more than 40 machines.

And DNS on all of these machines resolves to the new database.

Did the DBAs skip the kill step?

Is it possible that the DBAs missed the step of killing connections? That would contradict the database statistics they had shown me. So the author asked the DBAs to run netstat on the machine hosting the old database, and the connection really was there!

    netstat -anp | grep 1.2.3.4
    tcp 0 0 1.1.1.1:1521 1.2.3.4:54100 ESTABLISHED

Is there really something wrong with the statistics?

Get the connection creation time

To test the author's DNS-delay conjecture, we used a small trick to get the creation time of this connection. First, find the connection with netstat -anp. Since it obviously belongs to the application's java process, we look up that process's pid directly: 8299.

    netstat -anp | grep 1521
    tcp 0 0 1.2.3.4:54100 1.1.1.1:1521 ESTABLISHED
    netstat -anp | grep java
    abc 8299 java

With the pid in hand, we cat /proc/8299/net/tcp to list its connection information, then grep for 05F1, the hexadecimal form of port 1521 (there is only one connection to port 1521 on this machine).

    ... local_address rem_address ... inode ...
    ... xxx:D354 xxx:05F1 ... 23456789 ...
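The 05F1 lookup can be reproduced in a few lines. The sketch below is a minimal illustration in Python, not the author's tooling: the function names and the sample rows are made up for this example, the column layout follows the Linux /proc/net/tcp format, and the address decoding assumes a little-endian host such as x86. Note that local port 54100 is D354 in hexadecimal.

```python
import socket
import struct

# Hypothetical sample in /proc/net/tcp layout (values invented for illustration).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 11111111 1 0 100 0 0 10 0
   1: 04030201:D354 01010101:05F1 01 00000000:00000000 00:00000000 00000000  1000        0 23456789 1 0 20 4 30 10 -1
"""


def port_to_hex(port):
    """Format a port the way /proc/net/tcp prints it: 4 uppercase hex digits."""
    return f"{port:04X}"


def decode_addr(hex_addr):
    """Decode 'IPHEX:PORTHEX' into (dotted_ip, port). Assumes a little-endian host."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def find_connections(proc_net_tcp_text, remote_port):
    """Return the rows whose remote port matches, with their socket inode."""
    wanted = port_to_hex(remote_port)
    matches = []
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        if len(fields) < 10 or fields[0] == "sl":
            continue  # skip the header and malformed lines
        if fields[2].split(":")[1].upper() == wanted:
            matches.append({
                "local": decode_addr(fields[1]),
                "remote": decode_addr(fields[2]),
                "inode": int(fields[9]),  # 10th column is the socket inode
            })
    return matches


if __name__ == "__main__":
    print(port_to_hex(1521))
    for row in find_connections(SAMPLE, 1521):
        print(row)
```

On a real machine the text would come from reading /proc/<pid>/net/tcp instead of SAMPLE.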

The matching row gives the inode number of this socket (1.2.3.4:54100 to 1.1.1.1:1521). With the inode number the rest is easy. We run:

    ls -all -h /proc/8299/fd | grep 23456789    # 23456789 is the inode number
    ... Jan 29 17:43 222 -> socket:[23456789]

Judging by this, the connection was created on January 29. But the master-slave switchover took place on March 19, so this connection had already existed for almost 2 months! It therefore cannot be the DNS delay problem the author suspected, because the connection was never re-established.
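The inode-to-descriptor lookup above can also be scripted. This is a minimal sketch under stated assumptions: find_socket_fd is a hypothetical helper, and the timestamp it returns is simply the modification time of the /proc/<pid>/fd symlink, which is what ls displays. Treat that time as an approximation of the connection's age rather than an exact creation time.

```python
import os
import socket
import time


def find_socket_fd(fd_dir, inode):
    """Return (fd_name, link_mtime) for the entry pointing at socket:[inode], or None."""
    target = f"socket:[{inode}]"
    for name in os.listdir(fd_dir):
        path = os.path.join(fd_dir, name)
        try:
            if os.readlink(path) == target:
                # lstat reads the symlink itself, not the socket it points to.
                return name, os.lstat(path).st_mtime
        except OSError:
            continue  # descriptor closed while scanning, or entry is not a symlink
    return None


if __name__ == "__main__":
    # Demonstrate on our own process (Linux only): open a socket, then find it again.
    sock = socket.socket()
    inode = os.fstat(sock.fileno()).st_ino
    hit = find_socket_fd("/proc/self/fd", inode)
    if hit is not None:
        fd_name, mtime = hit
        print(fd_name, time.strftime("%b %d %H:%M", time.localtime(mtime)))
    sock.close()
```

For the case in this article the call would be find_socket_fd("/proc/8299/fd", 23456789).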

The DB was restarted, so how can the old connection survive?

Seeing the creation time of this connection, the author's first reaction was: did the DBAs really kill the connections? I asked them whether it might be a statistics problem. The DBAs replied that they had restarted the database entirely, so how could any connection remain? So we looked at the start time of the DB process.

    ps -eo lstart,cmd | grep <db process name>
    Mar 19 17:52:32 2021 <db process name>

Judging from the start time, the process really did start on March 19. And this weird connection does belong to the process that started on March 19. However you look at it, it does not make sense.

However, since these are the statistics Linux reports (which we treat as reliable for now), there must be some other odd logic behind them.

The child process inherited the connection of the parent process

After thinking about it for a while, the author found one possibility. The parent process first accepts a new connection. When it forks a child process, the child inherits the parent's connection (its open file descriptors). The parent then exits, leaving only the child. The result is the strange phenomenon of a connection that existed before the process that owns it started.

To verify this, the author wrote a simple C program and ran it:

    /* main.c */
    ...
    int main(int argc, char *argv[]) {
        ...
        if ((client_fd = accept(sockfd, (struct sockaddr *)&remote_addr, &sin_size)) == -1) {
            printf("accept error!\n");
            return 1;
        }
        printf("Received a connection\n");
        // Wait two minutes so the connection clearly predates the child process
        sleep(2 * 60);
        if (!fork()) {
            // Child process: stays alive and keeps the inherited connection open
            while (1) {
                sleep(100000);
            }
        } else {
            // Parent process: closes its copy of the connection and exits
            close(client_fd);
        }
        return 0;
    }

The DBAs confirmed that they follow the standard database restart procedure and do not kill -9 all processes (kill -9 on every process would also close the connections those processes own, but nobody dares to use such a brute-force operation on a DB).
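Both effects, a child keeping an inherited connection alive and a kill -9 closing it, can be reproduced without a database. This is a minimal sketch in Python rather than C, assuming a Unix-like system with fork: socket.socketpair() stands in for the application-to-DB connection, and the names app_side and db_side are invented for the illustration.

```python
import os
import signal
import socket
import time


def demo_inherited_connection():
    """Return (open_after_parent_close, eof_after_child_killed)."""
    app_side, db_side = socket.socketpair()

    pid = os.fork()
    if pid == 0:
        # Child: does nothing except hold the descriptors it inherited.
        try:
            time.sleep(60)
        finally:
            os._exit(0)

    # Parent closes its own copy, like the parent process in the C example.
    db_side.close()

    app_side.setblocking(False)
    try:
        app_side.recv(1)
        open_after_parent_close = False  # got EOF: nobody holds the other end
    except BlockingIOError:
        open_after_parent_close = True   # no EOF: the child still holds it

    # SIGKILL (kill -9) makes the kernel close everything the child owned.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    eof_after_child_killed = app_side.recv(1) == b""

    app_side.close()
    return open_after_parent_close, eof_after_child_killed


if __name__ == "__main__":
    print(demo_inherited_connection())
```

Both values should come back True: the connection outlives the parent's close() because the child inherited it, and it reaches end-of-file only once the child is killed.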

If the commercial database we use works in the way shown above, it would produce exactly this phenomenon. However, since the session maintained by the DB itself is gone, the connection must already be dead from the database's point of view (which is also why the database statistics do not count it). Since the connection still exists on the application side, it must never have processed another request. Otherwise an error would certainly have been reported.

Business code logic

If that conclusion holds, then as long as no request is executed on such a connection, no error is reported? By this logic, a machine gets a new, healthy connection only after its business requests hit an error. The author checked the machines that had reported errors. Since they had errored, they must have executed SQL, which triggered Druid to discard the broken connection and create a new one.

Sure enough, the machines that had been reporting errors were connected to the new database (the application developers had noticed that the other machines were still connected to the old database, which is why they turned to me for help), and those connections were created on March 29, while the machines that reported no errors still had their connections on the old database. Spot-checking a few of them, the connections to the old database had all been created on January 29.

But why are errors still being reported?

Now that those connections are healthy (pointing at the new database), why do the errors continue? Could the business code be written in such a way that once it fails, it fails forever? So the author pulled up the application's source code. The application uses its connection to this database to fetch sequence numbers. A detailed analysis of the source showed that the error from the database was not handled properly: the code took a faulty branch and never fetched a sequence from the database again (the business code is not posted here).

Why are there only a few machines reporting errors?

Because each machine caches a large range of sequence numbers in memory, SQL is executed only when that range is exhausted. So only the machines that ran out of in-memory sequence numbers went down that faulty code branch.
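The real business code is not shown in the article, so the following is only a hypothetical sketch of the pattern described: a range of sequence numbers cached in memory, with the database contacted only when the range runs out. SequenceCache, fetch_range and step are invented names. The sketch also shows one way to avoid the "fails once, fails forever" trap: a failed fetch leaves the state untouched, so the next call simply tries the database again.

```python
class SequenceCache:
    """Hand out sequence numbers from an in-memory range fetched from the DB."""

    def __init__(self, fetch_range, step=1000):
        # fetch_range(step) is assumed to reserve `step` numbers in the database
        # and return the first one; it may raise if the connection is broken.
        self._fetch_range = fetch_range
        self._step = step
        self._next = 0
        self._end = 0  # exclusive upper bound; an empty range forces the first fetch

    def next_id(self):
        if self._next >= self._end:
            # Only here is SQL executed. If this raises, _next and _end are
            # unchanged, so the following call retries instead of giving up.
            start = self._fetch_range(self._step)
            self._next, self._end = start, start + self._step
        value = self._next
        self._next += 1
        return value


if __name__ == "__main__":
    counter = {"start": 0}

    def fake_fetch(step):
        # Stand-in for the real SQL that reserves a range.
        counter["start"] += step
        return counter["start"]

    cache = SequenceCache(fake_fetch, step=3)
    print([cache.next_id() for _ in range(5)])
```

A machine with plenty of cached numbers never calls fetch_range, which matches why only some machines hit the broken connection.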

Why did heartbeat detection not catch this?

At this point you may wonder: is there no heartbeat check? There is not. The applications use the Druid data source, and the version of Druid they use has no periodic heartbeat detection.
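A periodic heartbeat could have caught these half-dead connections before a business request ran into them. The sketch below is a generic illustration of the idea, not Druid's API: evict_dead_connections and the ping callable are hypothetical, and a real pool would run this on a timer and give the validation query a timeout.

```python
def evict_dead_connections(pool, ping):
    """Split idle connections into (alive, dead) using a cheap validation call.

    ping(conn) is assumed to run a trivial query and return True on success;
    an exception or a falsy result marks the connection as dead.
    """
    alive, dead = [], []
    for conn in pool:
        try:
            ok = bool(ping(conn))
        except Exception:
            ok = False
        (alive if ok else dead).append(conn)
    return alive, dead


if __name__ == "__main__":
    def fake_ping(conn):
        # Stand-in for a validation query: connections to the old DB fail.
        if conn.startswith("old"):
            raise ConnectionError("session no longer exists")
        return True

    print(evict_dead_connections(["old-1", "new-1", "old-2"], fake_ping))
```

The dead list is what the pool would close and replace with fresh connections, which would then resolve to the new database.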

Is the master-slave switch successful or not?

The master-slave switchover was of course successful, as shown by the fact that the other applications ran fine after cutting over. Some loss of database traffic during a master-slave switchover is normal and expected. But when an application cannot recover after the switchover, take a close look at what is wrong in the application code itself.
