How to solve the problem of TIME_WAIT accumulation caused by Tengine health checks

2025-04-05 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

Many newcomers are unclear about how to solve the problem of TIME_WAIT accumulation caused by Tengine health checks. To help with this, the following article explains it in detail. Readers facing this issue are welcome to follow along; hopefully you will get something out of it.

1. Problem background

"After the service moved to the cloud, our TCP ports are almost all in the TIME_WAIT state" and "this problem never occurred in our on-premises data center" -- this is how the customer described the issue they submitted.

The customer environment is a self-built Tengine acting as a layer-7 reverse proxy, with about 18,000 NGINX servers at the back end. After Tengine was moved to the cloud, a large number of TCP sockets in the TIME_WAIT state were found on the server. Given the large number of backends, this could potentially affect business availability. Since this had never happened before, the customer worried it might be caused by connecting through Alibaba Cloud, and hoped we could analyze it in detail.

Note: the real harm of TIME_WAIT accumulation is that the host can no longer allocate dynamic ports for outbound connection requests. The net.ipv4.ip_local_port_range setting can be enlarged (for example to 5000-65535) to widen the port selection range, but even then the range may still be exhausted within 2 MSL.
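To see how quickly even a widened range can be exhausted, here is a back-of-the-envelope calculation. The backend count (18,000) and the 2 MSL window (60 s on Linux by default) come from the article; the 5 s check interval matches the configuration recommended later; the calculation deliberately ignores four-tuple reuse, which in practice lets the kernel reuse a local port toward different destinations.

```python
# Rough estimate of TIME_WAIT port pressure from health checks alone.
# Assumptions (not from kernel measurements): 18,000 backends, one check per
# backend every 5 s, each closed check socket lingers in TIME_WAIT for
# 2*MSL = 60 s, and every socket consumes a distinct local port.
backends = 18_000
check_interval_s = 5
time_wait_window_s = 60          # 2 * MSL on Linux by default

checks_per_window = backends * (time_wait_window_s // check_interval_s)
available_ports = 65_535 - 5_000 + 1   # ip_local_port_range = 5000-65535

print(f"sockets in TIME_WAIT per window: {checks_per_window}")
print(f"dynamic ports available:        {available_ports}")
print(f"range exhausted: {checks_per_window > available_ports}")
```

Under these assumptions roughly 216,000 sockets sit in TIME_WAIT at any moment, far more than the ~60,000 dynamic ports available, which matches the customer's symptom.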

2. Cause analysis of TIME_WAIT

First, reviewing the TCP state machine, we can see that the TIME_WAIT state appears only on the side that actively closes the connection (regardless of whether that side is the client or the server). When the TCP stack processes a connection close, only the party that closes actively enters TIME_WAIT.

This is exactly where the customer's doubt lay.

On the one hand, the health check uses HTTP/1.0, a short connection; logically, the back-end NGINX server should actively close the connection, so most TIME_WAIT sockets should appear on the NGINX side.

On the other hand, packet captures confirmed that for most connections the first FIN was indeed sent by the back-end NGINX server. In theory, the sockets on the Tengine server should then go straight to the CLOSED state rather than leaving so many in TIME_WAIT.

The packet capture is shown below, filtered by the port numbers of the TIME_WAIT sockets on the Tengine side.

Figure 1: an HTTP request interaction process

Although the capture above makes the current Tengine behavior look strange, analysis shows that such a situation is in fact logically possible. To explain it, we first need to understand that the packets captured by tcpdump are only the "result" of what the host sent and received. Even though the Tengine side looks like the passive party in the capture, what determines whether a socket is actively or passively closed is how the TCP stack inside the operating system handles that socket.

Our conclusion from this packet capture analysis is that a race condition may exist. If the application closes the socket at roughly the same moment the operating system receives the peer's FIN, then whether the socket ends up in TIME_WAIT or CLOSED depends on which happens first: the active close (the Tengine process calling the close() system function on the socket) or the passive close (the kernel thread invoking the tcp_v4_do_rcv handler after receiving the FIN).

In many cases, environmental factors such as network latency and CPU processing capacity can tip the outcome either way. For example, in the on-premises environment the latency was low, so the passive close tended to happen first; after the service moved to the cloud, the round-trip time between Tengine and the back-end NGINX grew with distance, so Tengine's active close happened earlier, and so on. This explains the inconsistency between the on-premises and cloud environments.
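The race described above can be illustrated with a toy model. This is not the kernel's actual implementation, only a simplified sketch of the relevant TCP close transitions: if the application's close() runs first, the socket walks the active-close path and ends in TIME_WAIT; if the peer's FIN is processed first, it walks the passive-close path and ends in CLOSED.

```python
# Toy model of the close-ordering race (simplified: real stacks pass through
# intermediate states such as FIN_WAIT_2, CLOSING and LAST_ACK with ACKs).
def final_state(events):
    """events: sequence of 'app_close' / 'peer_fin' in arrival order."""
    state = "ESTABLISHED"
    for ev in events:
        if state == "ESTABLISHED" and ev == "app_close":
            state = "FIN_WAIT_1"       # active close: our FIN goes out first
        elif state == "ESTABLISHED" and ev == "peer_fin":
            state = "CLOSE_WAIT"       # passive close: peer's FIN arrived first
        elif state == "FIN_WAIT_1" and ev == "peer_fin":
            state = "TIME_WAIT"        # the active closer always ends here
        elif state == "CLOSE_WAIT" and ev == "app_close":
            state = "CLOSED"           # LAST_ACK -> CLOSED, collapsed for brevity
    return state

# Cloud scenario: higher latency, Tengine's close() wins the race.
print(final_state(["app_close", "peer_fin"]))   # TIME_WAIT
# On-premises scenario: low latency, NGINX's FIN is processed first.
print(final_state(["peer_fin", "app_close"]))   # CLOSED
```

The same two events in different orders produce different terminal states, which is exactly why the behavior differed between the two environments.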

However, if the current behavior conforms to the protocol standard, then solving the problem head-on becomes harder. We can neither slow down the Tengine host to delay its active close requests, nor shrink the latency imposed by physical distance to deliver FIN packets faster. In such cases, we would normally recommend adjusting system configuration to alleviate the problem.

Note: there are many ways to quickly alleviate this problem in current Linux systems, such as:

a) Enable tw_reuse together with timestamps:

net.ipv4.tcp_tw_reuse = 1

net.ipv4.tcp_timestamps = 1

b) Configure max_tw_buckets:

net.ipv4.tcp_max_tw_buckets = 5000

The drawback is that syslog will record: time wait bucket table overflow.

Since the customer runs a self-built Tengine and was unwilling to forcibly clean up TIME_WAIT sockets, we turned to the Tengine code to see whether its behavior could be changed without modifying the source, so that the sockets would not be actively closed by Tengine.

Tengine version: Tengine/2.3.1

NGINX version: nginx/1.16.0

2.1 Tengine code analysis

From the earlier packet capture, we can see that most of the TIME_WAIT sockets were created for back-end health checks, so we focus on Tengine's health-check behavior. The following is an excerpt from the open-source code of ngx_http_upstream_check_module concerning socket cleanup.

Figure 2: cleaning up the socket after a Tengine health check completes

From this logic, we can see that Tengine closes the connection directly after receiving the response packet if any of the following conditions is met:

c->error != 0

cf->need_keepalive == false

c->requests > ucscf->check_keepalive_requests

Figure 3: the function in Tengine that actually closes the socket

Here, if we arrange for none of the above conditions to be met, it becomes possible for the operating system hosting Tengine to process the passive close first, clean up the socket, and enter the CLOSED state, because under the HTTP/1.0 protocol the NGINX server will actively close the connection.
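The three conditions can be collapsed into a single predicate. This is a Python paraphrase of the excerpted logic, not the module's actual C code; the parameter names mirror the fields quoted above.

```python
# Sketch of the Figure-3 close decision: Tengine actively closes the
# health-check connection if ANY of the three conditions holds.
def tengine_actively_closes(error, need_keepalive, requests,
                            check_keepalive_requests):
    return (error != 0) or (not need_keepalive) \
        or (requests > check_keepalive_requests)

# Keepalive disabled: Tengine closes first -> TIME_WAIT risk.
print(tengine_actively_closes(0, False, 1, 1))   # True
# No error, keepalive on, check_keepalive_requests=2, one request so far:
# Tengine does not close first, so the backend's FIN can win the race.
print(tengine_actively_closes(0, True, 1, 2))    # False
```

The recommended configuration below works precisely by steering every health-check connection into the second case.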

2.2 solution

In general, there is no need to worry much about connections in TIME_WAIT: the system releases them automatically after 2 MSL (60s by default). If you need to reduce them, consider keep-alive (long-lived) connections, or tune the kernel parameters.

In this case, the customer understood the protocol well but was still wary of forcibly releasing TIME_WAIT sockets. At the same time, with 18,000 hosts at the back end, the overhead of persistent connections to all of them would be even harder to bear.

Therefore, based on the preceding code analysis, we recommend the following health-check configuration:

check interval=5000 rise=2 fall=2 timeout=3000 type=http default_down=false;

check_http_send "HEAD / HTTP/1.0\r\n\r\n";

check_keepalive_requests 2;

check_http_expect_alive http_2xx http_3xx;

The reasoning is simple: we need to make none of the three conditions above hold. We do not consider the error case, and need_keepalive is enabled by default in the code (if not, it can be adjusted through configuration), so we only need check_keepalive_requests to be greater than 1 for the request to enter Tengine's KEEPALIVE logic, preventing Tengine from actively closing the connection.

Figure 4: Tengine health-check reference configuration

Because the HEAD method of HTTP/1.0 is used, the back-end server actively closes the connection after responding, so the socket created by Tengine enters the CLOSED state instead of TIME_WAIT and no longer ties up dynamic port resources.

© 2024 shulou.com SLNews company. All rights reserved.
