
What is the impact of TCP Keepalive on system performance?


This article explains what impact TCP Keepalive has on system performance, using a real production incident as the example. The write-up is short and practical, so if you are interested, read on.

How the incident unfolded

At about 15:30 that day we received an Nginx alarm (thanks to the efforts of our ops colleagues, we can monitor the system's health in real time) indicating that the number of Nginx connections had exceeded the configured threshold. As a site with some reputation in the industry, the OSCHINA community website often sees sudden traffic surges. Usually, after the Nginx connection count passes our alarm threshold, it falls back on its own after a while (a crawler may be hitting the site, or some well-meaning person may be sending test requests), so at first I did not pay special attention to the alert and simply waited. To make sure the site's services were still healthy, I carefully checked the applications in the cluster: everything looked fine. I was also prepared, in case traffic kept rising and service was affected, to bring several other applications in the cluster into rotation via upstream so that the whole site would keep running normally.

Number of Nginx Connections Soars

Just as I was hesitating over whether to bring up those backup applications, MySQL connection alarms suddenly started arriving, and that is when I realized how serious the problem was. Normally, the MySQL connection limit we configure is enough for every application in the cluster to read and write data. If the number of MySQL connections soars, however, some nodes in the cluster may fail to obtain a database connection and some user requests will block. Within just a few minutes our configured connection pool was exhausted: a large number of user requests slowed down because they could not get a database connection, and some users began to see pages that were slow to open or simply inaccessible.

MySQL Connection Number Soars

Analysis of the incident

The OSCHINA website makes heavy use of caching, so the pressure on the MySQL database is normally low and QPS never gets very high. At this point, however, MySQL had more than 3k connections and the number kept climbing, which piqued my curiosity: what exactly was MySQL doing, and why were there so many connections? If a single very complex query gets stuck and produces a pile of slow queries, other normal requests gradually fail to obtain connections and the application stops responding altogether. I immediately ssh'd into the MySQL machine to check its state and the slow query log, and found that many SQL queries that used to finish in tens of milliseconds were now stuck with no response, or were taking a very long time. So it was not that our application had kicked off complex queries that dragged down MySQL's query efficiency and "jammed" it; the root cause was simply the traffic surge hitting the front-end Nginx.

After checking the basic status of each application, I found that some of them were responding slowly or not at all, and several colleagues in the QQ group were reporting to me that the community site could not be opened. The urgent question was: what was causing the traffic surge? I logged into the front-end Nginx machine and went through the access log, and found a handful of IPs issuing a huge number of requests (don't ask how I found them; there are many ways to do this), so I added deny rules for those IPs in the front-end Nginx configuration. After watching quietly for a short while, the database connection count began to drop, the applications gradually recovered, and the MySQL alarms cleared. I then checked the status of the applications in the cluster and, to guard against further surprises, restarted and reconfigured several of them as standbys. By this point almost ten minutes had passed since the failure began (since the first alarm), our applications had recovered, MySQL connections had fallen back into an acceptable range, and the website could be accessed normally. Strangely, though, the Nginx connection alarm kept firing. Could there be other offending IPs I had not discovered? Logging back into the machine running the front-end Nginx, the system was visibly sluggish; a glance at top showed that CPU and memory usage had inexplicably shot up. I had assumed that once traffic returned to normal, those mysteriously open Nginx connections would be released automatically, but nothing changed.

That made the picture clear: a sudden spike in traffic caused the number of Nginx connections to jump, which in turn drove up the number of MySQL connections held by the applications (the order in which the alarms arrived confirms this). Once the offending traffic was blocked and things returned to normal, the MySQL connections at the application layer were gradually released. The Nginx connections left behind by the failed network requests, however, were never released, so the Nginx connection alarm never stopped.

Even with all the applications restored, the Nginx connection alarm kept firing, which was maddening. With no better option, I asked @atompi for help. After I explained the situation, he logged into the front-end machine and inspected Nginx. Before long he reported that a large number of underlying TCP connections could not be released normally, which was keeping the Nginx connection count high. We quickly tested this idea and found that the default TCP Keepalive expiration time was as long as two hours (probably an oversight when the system parameters were first configured). We immediately modified the relevant settings, stopped Nginx completely and restarted it, and everything returned to normal: the Nginx connection count dropped back to its usual level.
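As an aside, here is a minimal sketch (Python, Linux only) of one way to confirm this kind of suspicion on the affected host: read the kernel's current TCP Keepalive settings straight from /proc. The paths are the standard Linux ones; the 7200-second value most distributions ship by default is exactly the "two hours" we ran into.

import os

# Print the kernel's current TCP Keepalive parameters (Linux).
for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
    path = os.path.join("/proc/sys/net/ipv4", name)
    with open(path) as f:
        print(name, "=", f.read().strip())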

(When you run into a problem you cannot solve, ask an expert for help. Lesson learned?)

Why Keepalive?

Everyone presumably knows HTTP's stateless Request -> Response model. In the early HTTP 1.0 era, the underlying TCP connection (the transport layer in the OSI seven-layer model) was torn down after each request completed, so every single request paid for a full TCP three-way handshake and four-way teardown. This approach guarantees the accuracy and integrity of the transfer, but it is not efficient. To improve efficiency, the later HTTP 1.1 specification wrote the Connection header into the standard and enabled persistent connections by default, so the underlying TCP connection is not released immediately and can be reused for subsequent requests, improving network transmission efficiency. Most modern browsers send Connection: keep-alive by default to speed up access. (Modern web servers have their own keepalive_timeout or similar configuration; the exact parameter names vary, but the effect is roughly the same.) For an analysis of HTTP Keepalive and web performance, read HTTP Keepalive Connections and Web Performance.
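To make the reuse concrete, here is a minimal Python sketch using the standard http.client module: two HTTP/1.1 requests are issued through one connection object, so they share a single underlying TCP connection instead of paying for a new handshake each time. The host name is purely illustrative.

import http.client

# HTTP/1.1 keeps the connection open by default (Connection: keep-alive),
# so both requests below reuse the same underlying TCP connection.
conn = http.client.HTTPSConnection("www.example.com", timeout=10)  # illustrative host

for path in ("/", "/robots.txt"):
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()  # drain the body so the connection can be reused
    print(path, resp.status, len(body), resp.getheader("Connection"))

conn.close()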

TCP Keepalive

HTTP Keepalive aside, TCP Keepalive is not a mandatory part of the TCP standard, but it is widely implemented. Once a connection has been established, should it be kept open if the application layer has not transmitted any data for a long time, or if something unexpected happens? TCP Keepalive is the mechanism for deciding whether such a connection should be kept or torn down: after a period of idleness it sends a few probe packets carrying no data and uses the responses to decide whether the connection is still alive. On CentOS, the kernel exposes several important TCP Keepalive parameters (tunable via /etc/sysctl.conf or under /proc/sys/net/ipv4/):

tcp_keepalive_time = 7200 (seconds)
tcp_keepalive_intvl = 75 (seconds)
tcp_keepalive_probes = 9 (number of probes)

With these values, TCP Keepalive waits 7200 seconds of idleness before sending the first probe packet to check whether the connection should be kept, then probes again every 75 seconds, up to 9 times, before declaring the connection dead. These numbers are the defaults; adjust them to your actual situation to get better network throughput.
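These kernel-wide defaults can also be overridden per socket. Below is a minimal Python sketch (Linux-specific option names, illustrative values and host) showing an application enabling keepalive on a single connection and shortening the probe timings, rather than editing /etc/sysctl.conf for the whole machine.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)       # enable keepalive probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 600)    # first probe after 600 s idle
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)    # then probe every 30 s
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)       # drop after 5 failed probes

sock.connect(("www.example.com", 80))  # illustrative host
# ... use the socket as usual; the kernel now probes idle periods automatically
sock.close()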

At this point, I believe you have a deeper understanding of what impact TCP Keepalive has on system performance. Now go and try it out yourself!
