How to solve time-wait accidents caused by Redis bottlenecks in highly concurrent services
Abstract
During New Year's Day the order business line reported that the push system could not send and receive messages normally. As the maintainer of the push system I was away and could not get back right away, so I simply asked ops to restart the service, and everything was fine again; a restart really is the big hammer. Since the push system is deployed as a distributed service and messages go through various reliability strategies, restarting does not lose message events.
Log analysis then showed a large number of redis errors, about 160,000 of them within ten minutes. The error in the log is connect: cannot assign requested address. It is not an error returned by the push service or the redis client library, but an errno returned by the operating system.
This error means that no available local address could be obtained, that is, no local port was available for a new socket.
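For context, here is a minimal Go sketch (not taken from the push system) of how this errno surfaces on the client side: the dial fails with syscall.EADDRNOTAVAIL, which the net package renders as "connect: cannot assign requested address"; the address used here is purely illustrative.

package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

func main() {
	// Illustrative redis address; under local port exhaustion connect(2)
	// fails because no ephemeral port can be allocated for the new socket.
	conn, err := net.Dial("tcp", "10.0.0.10:6379")
	if errors.Is(err, syscall.EADDRNOTAVAIL) {
		// The errno behind "connect: cannot assign requested address":
		// every port in ip_local_port_range is in use or stuck in TIME_WAIT.
		fmt.Println("local port exhaustion:", err)
		return
	}
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected")
}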
To be fair, the number of online users and the order volume on New Year's Day really did grow a lot. The push system normally holds about 350,000 persistent client connections, but at this peak it soared to about 500,000. The cluster has 6 nodes in total, 4 of which each carry more than 90,000 long-lived connections. On top of that, the number of pushed messages doubled.
Analysis
Statistics from the kibana logs put the number of redis errors within the error time window at nearly 160,000.
The TCP connection state of the affected node showed about 60,000 connections in ESTABLISHED, while TIME_WAIT had climbed past 20,000.
Why was there so much TIME_WAIT? Whichever side actively closes a connection ends up holding the TIME_WAIT, but the push system never actively closes clients except on protocol parsing failure, not even on authentication failure or when a weak-network client's write buffer is full. The logs also confirmed that the TIME_WAIT was not produced by the push system itself.
Besides, the linux hosts are supposed to have their kernels tuned by ops before delivery. With the tw_reuse parameter enabled, sockets in TIME_WAIT can be reused. Had reuse not been turned on?
Checking the kernel parameters in sysctl.conf confirmed it: tcp_tw_reuse was not enabled, so addresses still in the TIME_WAIT state could not be reused quickly; we could only wait for the TIME_WAIT timeout to close them, which per the RFC is about 2 minutes. With tw_reuse enabled, an address can be reused after about 1 second. In addition, the ip_local_port_range was not wide, which further shrinks the pool of available local ports.
sysctl -a | egrep "tw_reuse|timestamp|local_port"
net.ipv4.ip_local_port_range = 35768 60999
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_tw_reuse = 0
This is why the connect: cannot assign requested address error surfaced: there simply was no available local address.
Internal problem investigation
The above is only the surface of the problem; we still needed to find out why there was so much TIME_WAIT. Again, whichever end actively closes the fd is the end that accumulates TIME_WAIT. Afterwards, netstat showed that the TIME_WAIT connections were essentially all connections to the redis hosts.
Below is the connection pool configuration in the push code. The idle pool holds only 50 connections, while up to 500 connections may be created in total. That means that under a flood of requests, the code first tries to take a connection from the 50-slot idle pool; if none is available, it dials a new one. When a connection is finished with, it has to be returned to the pool, and if the idle pool is already full at that moment the connection is actively closed with close().
MaxIdle = 50
MaxActive = 500
Wait = false
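For illustration, a minimal sketch of such a pool, assuming the redigo client (github.com/gomodule/redigo/redis), whose redis.Pool fields match the MaxIdle / MaxActive / Wait names above; the address and key are made up. The comments mark where the client-side TIME_WAIT comes from.

package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

// newPool mirrors the original settings.
func newPool(addr string) *redis.Pool {
	return &redis.Pool{
		MaxIdle:   50,  // only 50 connections are kept idle
		MaxActive: 500, // up to 500 connections may exist at once
		Wait:      false,
		Dial:      func() (redis.Conn, error) { return redis.Dial("tcp", addr) },
	}
}

func setOnline(p *redis.Pool, uid string) error {
	// Under load the idle pool is often empty, so Get dials a brand-new
	// TCP connection instead of reusing one.
	c := p.Get()
	// Close returns the connection to the idle list; if the 50 idle slots
	// are already taken it closes the TCP connection instead, and this host,
	// being the active closer, is left with the TIME_WAIT socket.
	defer c.Close()
	_, err := c.Do("SET", "online:"+uid, 1)
	return err
}

func main() {
	p := newPool("127.0.0.1:6379") // illustrative address
	if err := setOnline(p, "u1"); err != nil {
		log.Println("redis error:", err)
	}
}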
Another problem turned up: in several places the redis handling was asynchronous. For example, every heartbeat packet spawned a goroutine to update redis, which made the scramble for pool connections even worse. This was changed to synchronous code, so that within one connection's context only one redis operation runs at a time.
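A simplified illustration of that change, with hypothetical names (client, refreshHeartbeat) standing in for the project's real types: the old path forked a goroutine per heartbeat, so every packet competed for a pool connection concurrently, while the new path runs the update inline in the connection's own handler.

package push

import "github.com/gomodule/redigo/redis"

// client is an illustrative stand-in for a push connection context.
type client struct {
	id   string
	pool *redis.Pool
}

// refreshHeartbeat is a hypothetical liveness update against redis.
func (c *client) refreshHeartbeat() error {
	conn := c.pool.Get()
	defer conn.Close()
	_, err := conn.Do("EXPIRE", "conn:"+c.id, 120)
	return err
}

// Before: every heartbeat packet forked a goroutine, multiplying pool contention.
func (c *client) onHeartbeatAsync() {
	go func() { _ = c.refreshHeartbeat() }()
}

// After: one redis operation at a time within this connection's context.
func (c *client) onHeartbeatSync() {
	_ = c.refreshHeartbeat()
}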
Solution
Increase the MaxIdle size of the golang redis client's connection pool, to avoid the awkward situation where, with no idle connections available, new connections keep being dialed and then cannot be put back because the idle pool is full. When the pool's Wait is true, if there are no available connections in the idle pool and the number of currently established connections has reached MaxActive, the caller blocks and waits for someone else to return a connection, instead of immediately being handed a "connection pool exhausted" error.
MaxIdle = 300
MaxActive = 400
Wait = true
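A sketch of the adjusted pool, again under the redigo assumption: with Wait set to true, Pool.Get blocks until a connection is handed back instead of failing, so under load callers queue up rather than piling new TCP connections onto the redis node.

package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

func main() {
	pool := &redis.Pool{
		MaxIdle:   300,  // keep plenty of idle connections so hot paths rarely dial new ones
		MaxActive: 400,  // hard cap on connections to a single redis node
		Wait:      true, // block for a returned connection instead of erroring out
		Dial:      func() (redis.Conn, error) { return redis.Dial("tcp", "127.0.0.1:6379") }, // illustrative address
	}

	conn := pool.Get() // with Wait=true this waits its turn under load
	defer conn.Close()

	if _, err := conn.Do("PING"); err != nil {
		log.Println("redis ping failed:", err)
	}
}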
Redis qps performance bottleneck
Redis performance has always been well regarded. Even without the multi-threaded io of redis 6.0, a single instance can usually do about 130,000 QPS; with multi-key commands and pipelining it can push through roughly 400,000 commands per second, although the QPS itself stays around 120,000-130,000.
How much QPS a Redis instance can reach is tied to the redis version, the cpu clock speed, and the cpu cache.
In my experience, on an intranet and with the connection object already instantiated, a single redis command usually takes about 0.2ms, and 200us is plenty fast. So why were so many new connections being created because the redis client's connection pool had no idle connections?
Grafana monitoring of the redis cluster showed that several nodes had hit the QPS bottleneck of a single Redis instance, with QPS pushed to nearly 150,000. No wonder the business's redis requests could not be handled promptly. Such a bottleneck inevitably drives up request latency, and with high latency connections are not returned to the pool in time, which produces the problem described at the beginning of this article. In short, the surge in business traffic set off a chain of problems.
Having found the problem, we had to solve it. The redis qps optimization came in two steps:
Expand the redis cluster with more nodes and migrate slots to spread the traffic.
Change the redis requests in the program to batch mode wherever possible.
Adding nodes is easy, and batching is easy too. When we first optimized the push system we had already switched the redis operations inside the same logic block to batch mode. The trouble is that many redis operations sit in different logical blocks and cannot be combined into a single pipeline.
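As an illustration of batching inside one logic block, here is a minimal redigo pipelining sketch (keys and address are made up): several commands are queued with Send, flushed as a single write, and their replies read back in order, so three commands cost one network round trip instead of three.

package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

func main() {
	conn, err := redis.Dial("tcp", "127.0.0.1:6379") // illustrative address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Queue the commands locally; nothing is sent yet.
	conn.Send("HSET", "conn:u1", "node", "push-01")
	conn.Send("EXPIRE", "conn:u1", 120)
	conn.Send("SADD", "online", "u1")

	// One write, one round trip for all three commands.
	if err := conn.Flush(); err != nil {
		log.Fatal(err)
	}

	// Read the three replies in the order they were sent.
	for i := 0; i < 3; i++ {
		if _, err := conn.Receive(); err != nil {
			log.Println("pipeline reply error:", err)
		}
	}
}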
So the next optimization was to merge redis requests from different logic into shared pipelines. The benefits are higher redis throughput and fewer socket system calls and network interrupts; the costs are extra logic complexity, and using channels for queuing and notification adds some runtime scheduling overhead. The pipeline worker fires when either condition is met: 3 commands have accumulated or a 5ms timeout expires; the timeout is driven by a segmented timing wheel.
Compared with before the change, cpu overhead dropped by roughly 3%, and under load testing the redis QPS fell by about 30,000 on average, coming down to about 70,000 at best. The trade-off is that message latency is, with some probability, a few ms higher.
The flow works as follows: the caller pushes the redis command, together with the chan that will receive its result, into a task queue, which is consumed by a single worker. The worker assembles several redis cmds into a pipeline, sends the request to redis, and reads the results back. After splitting the result set apart, it pushes each result into the result chan of the corresponding command. The caller, having pushed its task into the queue, simply blocks listening on its result chan.
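Below is a rough sketch of that aggregator, again assuming redigo; the Aggregator and request names are hypothetical, and a plain time.Timer stands in for the segmented timing wheel. Callers push a command plus a reply chan into the queue; the worker flushes a pipeline once it holds 3 commands or 5ms have passed, then fans the replies back out.

package redisbatch

import (
	"time"

	"github.com/gomodule/redigo/redis"
)

// request carries one redis command and the chan that receives its result.
type request struct {
	cmd   string
	args  []interface{}
	reply chan result
}

type result struct {
	val interface{}
	err error
}

// Aggregator merges redis commands from different code paths into pipelines.
type Aggregator struct {
	pool  *redis.Pool
	queue chan *request
}

func NewAggregator(pool *redis.Pool) *Aggregator {
	a := &Aggregator{pool: pool, queue: make(chan *request, 1024)}
	go a.worker()
	return a
}

// Do enqueues a command and blocks until its reply comes back.
func (a *Aggregator) Do(cmd string, args ...interface{}) (interface{}, error) {
	req := &request{cmd: cmd, args: args, reply: make(chan result, 1)}
	a.queue <- req
	r := <-req.reply
	return r.val, r.err
}

// worker flushes a batch once it holds 3 commands or 5ms have passed
// (timer handling is simplified; the real system used a segmented timing wheel).
func (a *Aggregator) worker() {
	const maxBatch = 3
	const maxWait = 5 * time.Millisecond

	batch := make([]*request, 0, maxBatch)
	timer := time.NewTimer(maxWait)
	timer.Stop()

	flush := func() {
		if len(batch) == 0 {
			return
		}
		conn := a.pool.Get()
		for _, req := range batch {
			conn.Send(req.cmd, req.args...)
		}
		err := conn.Flush()
		for _, req := range batch {
			if err != nil {
				req.reply <- result{err: err}
				continue
			}
			val, rerr := conn.Receive()
			req.reply <- result{val: val, err: rerr}
		}
		conn.Close()
		batch = batch[:0]
	}

	for {
		select {
		case req := <-a.queue:
			if len(batch) == 0 {
				timer.Reset(maxWait)
			}
			batch = append(batch, req)
			if len(batch) >= maxBatch {
				timer.Stop()
				flush()
			}
		case <-timer.C:
			flush()
		}
	}
}

A caller then swaps a direct conn.Do("EXPIRE", key, 120) for agg.Do("EXPIRE", key, 120) and lets the worker decide whether the command travels alone or inside a pipeline.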