Case description: one of our partner customers (a well-known domestic electronic payment company) reported that four of their machines were calling our interface service, but strangely, two of the four could connect while the other two could not, and the failing machines would occasionally get through. The problem had been troubling them on and off for a long time. At first we assumed their system parameters were simply misconfigured and did not pay much attention; after all, we have plenty of other customers and none of them had this problem. Eventually they could bear it no longer and, unable to find the cause themselves, asked our technicians to investigate on site. So began a cold case.
After taking over the problem that day, we asked the other side's network engineers to cooperate in a joint investigation, capturing packets simultaneously on a normal machine, an abnormal machine, their egress network device, and our receiving machine. The captures did show retransmissions: of the machines in question, one occasionally connected while the other two could not at all. That day we also drew the network topology of both sides by hand, and learned that their outbound traffic went through S-NAT. At the start of the problem they had already tried one workaround: adding an extra public IP, so that the four machines no longer shared one external IP but two. After that, two machines worked while the other two were still abnormal. A little strange.
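For readers less familiar with S-NAT, the following is a minimal sketch, using iptables and made-up RFC 5737 addresses, of the kind of source-NAT setup described above: several internal machines leave through one public IP, and the workaround simply split them across a second one.

    # Sketch only: the subnet and public addresses are illustrative (RFC 5737) values.
    # Workaround rule first: two machines get their own public IP ...
    iptables -t nat -A POSTROUTING -s 10.0.0.3,10.0.0.4 -o eth0 -j SNAT --to-source 203.0.113.11
    # ... everything else still leaves through the original shared public IP.
    iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j SNAT --to-source 203.0.113.10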
We will skip the various suspicions along the way. When we looked at the captures again, we did find the problem: when the other side sent the SYN packet of the three-way handshake (not explained in detail here), we never answered with SYN+ACK, so the other side kept retransmitting at the TCP level. What we could confirm was that their SYN packets really did reach us, yet our side did not reply with SYN+ACK. We therefore turned our attention to our own network.
Screenshot of the three-way handshake:
Screenshot at the other party's egress:
Screenshot at our ingress:
Judging from source port 26414, the same connection request was indeed sent again and again from the other party's egress, but we never replied with SYN+ACK, hence the constant retransmissions and the eventual connection timeout.
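On the receiving side, a capture like this can be narrowed down to just those retransmitted SYNs. A hedged example follows; only the source port comes from the case above, the interface name is an assumption:

    # Show only SYN segments arriving from the suspect source port (no reply means no handshake).
    tcpdump -nn -i eth0 'src port 26414 and tcp[tcpflags] & tcp-syn != 0'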
At first we suspected nginx, keepalived, CentOS, and even the underlying virtualization (KVM/ESXi); the endless rounds of testing and troubleshooting are not described here, and none of them produced a breakthrough on our side alone. There was, however, one ray of hope: we had an Ubuntu machine on ESXi that behaved perfectly, and all of the other party's machines could connect to it. So we began to compare the kernel parameters of our two machines with sysctl -a > sysctl.txt.
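A small sketch of how the two parameter sets can be collected and compared; the host names below are placeholders:

    # Dump the kernel parameters of the healthy and the unhealthy machine, then diff them.
    ssh good-host 'sysctl -a' | sort > sysctl_good.txt
    ssh bad-host  'sysctl -a' | sort > sysctl_bad.txt
    diff sysctl_good.txt sysctl_bad.txt | grep -i tcp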
The comparison pointed mainly to the following suspect parameters (their exact roles can be looked up online):
tcp_sack
tcp_fack
tcp_syncookies
tcp_tw_recycle
tcp_retries1
tcp_timestamps
The comparison showed that tcp_tw_recycle was not the same on the two machines, so we set the abnormal machine to match the normal one. The other party tested, and everything returned to normal. We then switched the parameter back, they tested again, and the abnormality returned. The culprit was basically locked down to this parameter.
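The A/B test itself needs nothing more than sysctl (note that net.ipv4.tcp_tw_recycle only exists on kernels older than 4.12):

    # Read the current value, then flip it at runtime for the test.
    sysctl net.ipv4.tcp_tw_recycle
    sysctl -w net.ipv4.tcp_tw_recycle=0   # match the healthy machine: connections recover
    sysctl -w net.ipv4.tcp_tw_recycle=1   # switch it back: the failures reappear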
tcp_tw_recycle is off by default, but many servers enable it to improve performance; in particular, some high-concurrency web servers turn on net.ipv4.tcp_tw_recycle to recycle ports quickly. Fast recycling had been enabled here too, and that is exactly when the other side's connections became abnormal. A careful comparison of the packets leaked the clue.
Indeed, looking closely at the capture, every abnormal SYN that received no reply carries an out-of-order timestamp; normally the timestamp should keep increasing for the connection to be accepted. Combine that with the fact that the other party's network shares a single IP through S-NAT, and the cause of the problem is found.
When our server has tcp_tw_recycle enabled, it checks the TCP timestamp. Unfortunately, the timestamps of the packets sent by the other party jump around randomly (see the time field in the screenshot; strictly speaking, some of their timestamps go backwards, so we will certainly not reply to such packets). The server treats a packet whose timestamp has "gone backwards" as a retransmission belonging to a recycled TIME_WAIT connection rather than a new request, and silently discards it without a reply, causing heavy packet loss.
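One way to watch this from our side is to print only the TS val carried by segments arriving from the NAT address; the address and interface below are placeholders. Behind a single S-NAT IP the values mix the clocks of several hosts, so from our point of view they jump back and forth:

    # Print the TCP timestamp value of every segment from the shared NAT address.
    tcpdump -l -nn -i eth0 'tcp and src host 203.0.113.10' | grep -o 'TS val [0-9]*'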
Question: why do the timestamps sent by the other party jump around randomly?
Comparing the other party's sysctl -a output, we found that the peer has tcp_timestamps enabled. When our tcp_tw_recycle is turned on, the kernel assumes the peer uses tcp_timestamps and compares the timestamp of each new SYN with the last one seen from that IP: if the timestamp has grown, the port can be reused. But if the peer sits behind NAT (for example, a whole company reaching the public network through a single IP), or the peer IP is reused by another host, things get complicated: the SYN that should create the connection may simply be discarded (and you see errors such as connection timed out). If you want to read the Linux kernel code, see tcp_timewait_state_process in the source.
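On kernels that still implement tcp_tw_recycle, these silent drops are counted, so the diagnosis can be confirmed without a capture (the exact counter wording varies between versions):

    # SYNs rejected by the timestamp/PAWS check show up here.
    netstat -s | grep -i stamp
    # Or, with iproute2:
    nstat -az | grep -i paws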
At this point the problem might seem solved, but not quite; this is not yet the best solution, because we had come up with two options:
1. We turn off tcp_tw_recycle on our side. Without the recycling function the kernel no longer checks the timestamps, and these packets will definitely get a response.
2. The other party turns off tcp_timestamps, so that their segments are sent without the timestamp option. Both methods have their own advantages and disadvantages.
1. We turn off tcp_tw_recycle. Benefit: similar problems are avoided even though we cannot guarantee that every customer has tcp_timestamps turned off or is not behind NAT. Drawback: under high load, the number of TIME_WAIT sockets on the web server will soar. The official recommendation is to turn this option off.
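A sketch of option 1 on our servers; for what it is worth, tcp_tw_recycle was later removed from Linux entirely (in 4.12), which is in line with this recommendation:

    # Option 1: keep the recycle switch off and make the setting persistent.
    sysctl -w net.ipv4.tcp_tw_recycle=0
    echo 'net.ipv4.tcp_tw_recycle = 0' >> /etc/sysctl.conf
    sysctl -p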
2. The other party turns off tcp_timestamps. Benefit: the problem is solved on their side. According to the RFC, the timestamp is an optional TCP field, so it may be omitted, although doing so does have some impact (the option is used for PAWS and for round-trip-time measurement).
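And a sketch of option 2, applied on the other party's machines behind the NAT:

    # Option 2: stop sending the TCP timestamp option at all.
    sysctl -w net.ipv4.tcp_timestamps=0
    echo 'net.ipv4.tcp_timestamps = 0' >> /etc/sysctl.conf
    sysctl -p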
Finally, a segment without the timestamp option; the field test passed and the service is now working normally:
For comparison, a segment that does carry the timestamp option:
This ends the cold case triggered by tcp_timestamps. If there is anything to take away from it: seek truth from facts, capture and analyze the packets carefully, read the documentation closely, and let the data speak. Guessing all day instead of doing field research only wastes more time. Of course, this article still has plenty of flaws, since it touches on a great many technical points; corrections are welcome.