How to troubleshoot probabilistic failures when calling public network services in Java

2025-07-09 update, from SLTechnology News&Howtos. Shulou (Shulou.com), 06/02 report.

This article explains in detail how to troubleshoot a probabilistic failure when a Java application calls a public network service. It is shared here as a practical reference.

Background

A new system was about to go live and needed a PE (operations engineer) to perform the release. But the PE in charge was tied up with another developer over a different problem, which kept the author waiting for half an hour. To speed up the launch, the author wondered whether helping them resolve the problem quickly would be the fastest way to get back to coding. On inquiry, it turned out that the problem had been going on for three months. The symptoms were as follows:

Each client call failed with a probability of nearly 1/2, and the error reported was a read timeout (SocketTimeoutException).

Starting the investigation

Talking to the appserver developers and the PE established that the appserver uses short connections to nginx. Because the error was a SocketTimeoutException raised while reading the response, a problem establishing the connection between the appserver and nginx could be ruled out. Checking the nginx logs revealed a strange pattern:

Calls from all appservers through one nginx always succeeded, while calls through the other nginx failed, even though the two nginx machines were configured identically. The other strange thing was that the failures occurred only when calling the one peer server that had the problem; calls to other services were unaffected.

These two observations led to a dispute between the developers and the PE. Based on the first observation (one nginx works and the other reports errors), it was reasonable to infer that the second nginx was faulty, so the developers asked for that nginx to be replaced. Based on the second observation (only calls to this one service fail and other services are fine), the problem had to be on the peer's service, so the PE felt nginx should not take the blame. After a long argument, the preliminary plan was to scale out nginx and see what happened. The author considered this plan unreliable, since blind scaling could even make things worse. Better to capture packets and see what was actually happening.

Packet capture

In fact, the author felt that a component as widely used as nginx was unlikely to be at fault, and that the problem was more likely on the peer server. According to the peer's developer (who, because of the stalemate, had been sent to our company to help troubleshoot), calling the service with curl himself worked fine, and so did running curl N times on his own server. So a network engineer captured packets outside the firewall, with the following result:

Time                 Source IP  Destination IP  Protocol  Info
2019-07-25 16:45:41  20.1.1.1   30.1.1.1        tcp       58850 -> 443 [SYN]
2019-07-25 16:45:42  20.1.1.1   30.1.1.1        tcp       [TCP Retransmission] 58850 -> 443 [SYN]
2019-07-25 16:45:44  20.1.1.1   30.1.1.1        tcp       [TCP Retransmission] 58850 -> 443 [SYN]

Since the read timeout (ReadTimeOut) configured on the appserver is 3 s, the appserver had already reported an error by the time the SYN had been retransmitted twice.

(Note: tcp_syn_retries on the Linux server where nginx runs is set to 2.)
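
The timing in the capture can be reproduced with a small sketch. It assumes an initial retransmission timeout of 1 s that doubles after every retry, which matches the 16:45:41 / 16:45:42 / 16:45:44 spacing in the capture; the function names are made up for this illustration and the kernel's real timers are more involved.

```python
def syn_send_offsets(initial_rto: float, syn_retries: int) -> list:
    """Offsets (seconds after the first SYN) at which each SYN is sent."""
    offsets = [0.0]
    rto = initial_rto
    for _ in range(syn_retries):
        offsets.append(offsets[-1] + rto)
        rto *= 2  # exponential backoff between retransmissions
    return offsets


def connect_gives_up_at(initial_rto: float, syn_retries: int) -> float:
    """Offset at which the connect attempt fails: one more (doubled)
    timeout after the last retransmission."""
    return initial_rto * (2 ** (syn_retries + 1) - 1)


# Assumed values: 1 s initial timeout, tcp_syn_retries = 2 as in the note above.
print(syn_send_offsets(1.0, 2))
print(connect_gives_up_at(1.0, 2))
```

Under these assumptions nginx would only give up on the connection several seconds after the first SYN, so the appserver's 3 s read timeout fires first, which is why the appserver sees a read timeout rather than a connection error.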

Analysis of the packet capture results

The capture shows that the second nginx (nginx2) sent SYN packets to the peer service but never received a response. As a result, nginx2 timed out while creating the connection, which in turn caused the read timeout on the appserver side (the appserver uses short connections to nginx).

The natural inference is that the SYN was lost somewhere between the outside of the firewall and the peer service. Aliyun is a very stable service provider, so packet loss at such a high rate there is very unlikely. The peer service is built on Spring Boot, which is very mature, so a bug of this kind in the application is unlikely as well. That leaves the configuration of the peer server itself as the most likely cause.

Since the peer's developer was on site, the author used his computer to log in to the Aliyun server hosting the service. The first step was to look at dmesg, which showed a pile of errors.

These looked somewhat relevant, but were not enough on their own to locate the problem. Next, the author ran netstat -s.

The output contained a crucial line, which says that 16990 passive connections were rejected because of a timestamp! Some research showed that this happens when tcp_timestamps and tcp_tw_recycle are both enabled on the server: when clients arrive through NAT, connections get passively rejected. The fix suggested online for the dmesg errors above is to set tcp_tw_recycle = 1, tcp_timestamps is 1 by default, and our client calls do come through NAT, so every characteristic of this problem matched. The author therefore tried setting tcp_timestamps to 0 on their server.

Dozens of calls later, not a single error had been reported!
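
Watching that counter before and after a change is an easy way to confirm the diagnosis. The sketch below pulls it out of netstat -s text. The exact wording of the line is an assumption based on the message quoted above and should be checked against real output; every other line in the sample is invented for illustration.

```python
import re
from typing import Optional

# Assumed wording of the counter line; verify against your own `netstat -s` output.
_PAWS_PASSIVE = re.compile(
    r"^\s*(\d+) passive connections rejected because of time ?stamp",
    re.MULTILINE,
)


def paws_passive_rejected(netstat_s_output: str) -> Optional[int]:
    """Return the passive-rejection counter, or None if the line is absent."""
    match = _PAWS_PASSIVE.search(netstat_s_output)
    return int(match.group(1)) if match else None


# Hypothetical excerpt; only the 16990 figure comes from the article.
SAMPLE = """\
TcpExt:
    1205 TCP sockets finished time wait in fast timer
    16990 passive connections rejected because of time stamp
    42 delayed acks sent
"""

print(paws_passive_rejected(SAMPLE))
```

If the number keeps growing while clients behind NAT see timeouts, the timestamp check is a likely suspect; if it stops growing after the sysctl change, the fix has taken effect.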

Linux source code analysis

Although the problem was solved, the author wanted to understand it at the source code level, and so began studying the relevant kernel source (based on linux-2.6.32). Since the problem occurs at the first step of the handshake between nginx and the peer server (that is, when the first SYN is sent), the focus is on the source code that handles that step.

The code that deals with tcp_timestamps is in tcp_v4_conn_request, so we continue tracing there (the code below omits unrelated logic):

    int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
    {
        ......
        /* VJ's idea. We save last timestamp seen
         * from the destination in peer table, when entering
         * state TIME-WAIT, and check against it before
         * accepting new connection request.
         *
         * The gist of this comment: when a connection enters TIME_WAIT,
         * the last timestamp seen is recorded in the peer table, and it
         * is checked when a new connection request comes in.
         */
        /* Only when tcp_timestamps and tcp_tw_recycle are both enabled */
        if (tmp_opt.saw_tstamp &&
            tcp_death_row.sysctl_tw_recycle &&
            (dst = inet_csk_route_req(sk, req)) != NULL &&
            (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
            peer->v4daddr == saddr) {
            /* TCP_PAWS_MSL == 60 */
            /* TCP_PAWS_WINDOW == 1 */
            /* Everything below concerns time_wait connections from the same peer IP. */
            /* peer->tcp_ts_stamp: local time recorded when the connection
             * to this peer IP entered time_wait.
             * Condition 1: the current time is within one minute of that record. */
            if (get_seconds() < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
                /* peer->tcp_ts: timestamp (sent by the peer) of the most recent
                 * packet, recorded on entering time_wait.
                 * Condition 2: the timestamp carried by the current request is
                 * smaller than that recorded peer timestamp. */
                (s32)(peer->tcp_ts - req->ts_recent) > TCP_PAWS_WINDOW) {
                /* Increment the passive-connection-rejected counter */
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
                /* Enter the drop-and-release path */
                goto drop_and_release;
            }
        }
        ......
    }

The core meaning of the code above is this: when tcp_timestamps and tcp_tw_recycle are both enabled, if a new connection request arrives within one minute of the previous connection from the same IP entering the time_wait state, and the timestamp carried by the new request is smaller than the timestamp of the last packet recorded when entering time_wait, the SYN is dropped and execution jumps to drop_and_release. Tracing drop_and_release further, the request is released and tcp_v4_conn_request returns 0.

Next, let's look at how the system behaves when tcp_v4_conn_request returns 0.

Tracing the source shows that in this case the SYN packet is simply discarded and the sender gets no response at all, which matches the SYN retransmissions seen in the capture.
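
The condition in tcp_v4_conn_request can be restated as a small Python predicate to make the two comparisons easier to experiment with. This is only a model of the excerpt above, not kernel code: the function and parameter names are chosen for this illustration, and the caller is assumed to have already looked up the record for the same source IP (the rt_get_peer / v4daddr part of the condition).

```python
TCP_PAWS_MSL = 60     # seconds, as in the excerpt
TCP_PAWS_WINDOW = 1


def to_s32(value: int) -> int:
    """Interpret the low 32 bits as a signed integer, like the (s32) cast."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def syn_is_dropped(now: int, saw_tstamp: bool, tw_recycle: bool,
                   peer_ts_stamp: int, peer_ts: int, req_ts: int) -> bool:
    """True when the modelled check would jump to drop_and_release.

    now            current local time in seconds (get_seconds())
    peer_ts_stamp  local time recorded when the peer IP entered time_wait
    peer_ts        last timestamp seen from that peer IP
    req_ts         timestamp carried by the incoming SYN
    """
    if not (saw_tstamp and tw_recycle):
        return False
    within_msl = now < peer_ts_stamp + TCP_PAWS_MSL
    ts_went_backwards = to_s32(peer_ts - req_ts) > TCP_PAWS_WINDOW
    return within_msl and ts_went_backwards


# Made-up numbers: a SYN with an older timestamp arrives 10 s after time_wait.
print(syn_is_dropped(now=1010, saw_tstamp=True, tw_recycle=True,
                     peer_ts_stamp=1000, peer_ts=500_000, req_ts=90_000))
```

Swapping the two timestamps, moving now past the 60 second window, or turning either option off makes the predicate return False, which corresponds to the SYN being accepted.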

Checking the explanation against each symptom

Why is one nginx always OK while the other fails?

Because a TCP timestamp is not the wall-clock time that the date command shows on the machine. The rules for computing it are not covered here; it is enough to know that every machine's timestamp is different (and the difference may be large). Because our calls reach the peer through NAT, the two nginx machines look like the same IP to the peer server, so their timestamps get mixed together on arrival. nginx1's timestamp is larger than nginx2's, so as long as nginx1 has made a connection request (short connection) within the last minute, subsequent connection requests from nginx2 are always dropped.
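
A toy simulation makes this interleaving concrete. Everything in it is a simplification for illustration: the class and function names are invented, the two timestamp clocks and the 1000 ticks per second rate are arbitrary assumptions, each accepted short connection is assumed to enter time_wait immediately, and 32-bit wraparound is ignored.

```python
TCP_PAWS_MSL = 60     # seconds
TCP_PAWS_WINDOW = 1


class PeerServer:
    """Toy model of a server with tcp_timestamps=1 and tcp_tw_recycle=1."""

    def __init__(self):
        # source IP -> (local time of last time_wait, last timestamp seen)
        self.peers = {}

    def handle_syn(self, now, src_ip, tsval):
        """Return True if the SYN is accepted, False if silently dropped."""
        record = self.peers.get(src_ip)
        if record is not None:
            stamp, last_ts = record
            if now < stamp + TCP_PAWS_MSL and last_ts - tsval > TCP_PAWS_WINDOW:
                return False
        # Assumption: the short connection closes at once and enters time_wait.
        self.peers[src_ip] = (now, tsval)
        return True


def simulate(seconds=10):
    """Both nginx machines call once per second through the same NAT IP."""
    server = PeerServer()
    nat_ip = "20.1.1.1"  # source IP seen in the capture
    clocks = {"nginx1": 5_000_000, "nginx2": 1_000_000}  # hypothetical clocks
    results = {"nginx1": [], "nginx2": []}
    for t in range(seconds):
        for name in ("nginx1", "nginx2"):
            tsval = clocks[name] + t * 1000  # assumed tick rate
            results[name].append(server.handle_syn(t, nat_ip, tsval))
    return results


outcome = simulate()
print({name: sum(accepted) for name, accepted in outcome.items()})
```

In this model every SYN from nginx1 is accepted and every SYN from nginx2 is dropped, because nginx1 keeps refreshing the record for the shared IP with a larger timestamp before the one-minute window expires.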

Why did the peer's own tests always succeed?

Because the local calls all come from the same machine, their timestamps come from a single clock and never get mixed up.

Why can nginx2 call other services normally?

Because tcp_tw_recycle is not enabled on the servers hosting the other external services. In fact, this problem can also be solved by setting tcp_tw_recycle to 0. In addition, the tcp_tw_recycle parameter has been removed in newer versions of the Linux kernel (4.12 and later).
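
A quick way to check whether a server is exposed to this problem is to read the two settings from /proc/sys. The sketch below is a minimal illustration with helper names invented for this article. It treats a missing file as "not enabled", which is what one would see on kernels where tcp_tw_recycle no longer exists.

```python
from pathlib import Path


def read_sysctl(name, root=Path("/proc/sys/net/ipv4")):
    """Return the integer value of a net.ipv4 sysctl, or None if unavailable."""
    try:
        return int((Path(root) / name).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def risky_behind_nat(tcp_timestamps, tcp_tw_recycle):
    """True when both options are enabled, the combination described above."""
    return bool(tcp_timestamps) and bool(tcp_tw_recycle)


if __name__ == "__main__":
    ts = read_sysctl("tcp_timestamps")
    recycle = read_sysctl("tcp_tw_recycle")
    print("tcp_timestamps =", ts, "tcp_tw_recycle =", recycle)
    print("may drop SYNs from NAT clients:", risky_behind_nat(ts, recycle))
```

The root parameter exists only so the reader can point the helper at a copy of the files for testing; on a real server the default path is the one to use.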

Summary

Because IPv4 addresses are in short supply, most network architectures use NAT to talk to the outside world, so setting tcp_tw_recycle to 1 will almost always cause problems. In general, getting to the root cause of this kind of problem requires some understanding of the TCP protocol.
