How do you solve the problem of dubbo traffic not coming online smoothly? Many people are unclear about this, so this article walks through it in detail; readers facing the same issue should find something useful here.
Preface
The author recently solved a problem that had plagued a business system for a long time. It only occurs at release time and affects only one or two calls each time, so compared with other issues it tends to be underestimated. For various reasons it escalated to the point where the business side insisted it be fixed, and so it landed in the author's hands.
Problem scenario
Our services use dubbo, a stable and mature RPC framework. However, in some applications, request timeouts occur as soon as the application is released (or restarted), as shown in the following figure:
Moreover, it is only the first request that fails; everything after that is fine.
Troubleshooting the logs
With the phenomenon clear, we started checking the logs around that point in time. There was nothing on the Server side, while the Client (App1) side reported a timeout error as follows:
2019-08-22 20:33:50.798 com.alibaba.dubbo.rpc.RpcException: Failed to invoke the method set in the service XXXFacade, tries 1 times.
    ......
    start time: 2019-08-22 20:32:50.474, end time: 2019-08-22 20:33:..., timeout=60000, channel: /1.1.1.1:44502 -> /2.2.2.2:20880
Reading the log, the error is tied to this TCP five-tuple (1.1.1.1:44502 -> 2.2.2.2:20880). So the author ran netstat to check the current state of this connection:
netstat -anp | grep 44502
tcp ... 1.1.1.1:44502  2.2.2.2:20880  ESTABLISHED ...
The connection is in the normal ESTABLISHED state, and it is the only connection from this machine to the 2.2.2.2 server, so all subsequent calls to that server must go over this one connection. Checking the logs showed that all the other calls were normal.
Line of thought 1: server-side processing timed out
From the error above, it is certain that something is wrong with this connection. The natural first guess: does initialization work (including JIT warm-up) during the first call make the server process that request too slowly? As shown in the following figure:
But this idea was quickly ruled out by the second request, which came only one second later. The first request took 60s and timed out, while the second request, made one second after the first, succeeded (and in milliseconds). If the guess above were right, the second request would have arrived while all that initialization was still going on and should also have been stuck for quite a while. Besides, the server side logged nothing at all, as if it had never received the request.
Line of thought 2: the client side did not send the request successfully
So we moved on to line of thought 2: the client side never actually sent the request. Since Dubbo is based on Netty, its request transmission is asynchronous NIO (it is only made synchronous to the business code through the future mechanism). First, let's take a look at dubbo's timeout-detection mechanism, as shown in the following figure:
There is, of course, also the future's own timeout mechanism, which is not listed here. From this mechanism we can see that even when the client "sends" a request (the write goes into the socket's write buffer, which will not be full, so it does not fail), no error is raised at the moment of sending; the failure only surfaces when the periodic scan runs. To verify this conjecture, the author started digging through the connection logs.
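Before moving on to the connection logs, here is a minimal, self-contained sketch of that scan-based detection idea (hypothetical class and field names, not dubbo's actual DefaultFuture): the write itself reports no error, and the caller only sees a failure when a background timer scans the table of pending futures and expires the ones that have waited longer than their timeout.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PendingFutureScanner {

    // One pending request: created when the request is handed to the IO layer,
    // completed either by a response or by the timeout scan below.
    static final class PendingFuture {
        final long id;
        final long createdAt = System.currentTimeMillis();
        final long timeoutMs;
        final CompletableFuture<String> result = new CompletableFuture<>();
        PendingFuture(long id, long timeoutMs) { this.id = id; this.timeoutMs = timeoutMs; }
        boolean expired(long now) { return now - createdAt > timeoutMs; }
    }

    static final Map<Long, PendingFuture> PENDING = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        // Periodic scan: this is the only place where a "written but never answered"
        // request turns into an error the caller can see.
        ScheduledExecutorService scanner = Executors.newSingleThreadScheduledExecutor();
        scanner.scheduleWithFixedDelay(() -> {
            long now = System.currentTimeMillis();
            PENDING.values().removeIf(f -> {
                if (f.expired(now)) {
                    f.result.completeExceptionally(
                            new TimeoutException("request " + f.id + " timed out after " + f.timeoutMs + " ms"));
                    return true;
                }
                return false;
            });
        }, 30, 30, TimeUnit.MILLISECONDS);

        // "Send" a request whose response never arrives: the write reports no error,
        // and the caller only fails when the scan fires.
        PendingFuture f = new PendingFuture(1L, 100);
        PENDING.put(f.id, f);
        try {
            f.result.get();
        } catch (ExecutionException e) {
            System.out.println("caller sees: " + e.getCause());
        } finally {
            scanner.shutdownNow();
        }
    }
}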
Connection log
Since the pattern is that the first call fails, the author first suspected a problem with connection creation; but the fact that later requests succeed shows the connection was created fine. So, was the first request sent before the connection was created? With that doubt in mind, the author found the following log:
2019-08-22 20:32:51.876 (DubboClientReconnectTimer-thread-1) Successed connect to server /2.2.2.2:20880 ... channel is NettyChannel /1.1.1.1:44502 => /2.2.2.2:20880
As the log shows, the connection 1.1.1.1:44502 was successfully established at 2019-08-22 20:32:51.876, while the start time recorded in the error above is earlier:

Request sent time:                    2019-08-22 20:32:50.474
Connection creation start time:       2019-08-22 20:32:51.876
Connection successfully established:  2019-08-22 20:32:51.876
So the request was sent before the connection was even established, which seemed to fit the conjecture above nicely. On second thought, however, it cannot be right. The log shows the connection was created and successfully established within a millisecond at 2019-08-22 20:32:51, while the request was sent at 20:32:50. Dubbo's code cannot start sending a request before the connect action has even begun (port 44502 is assigned by the kernel, so that port number cannot be known until the connection is actually created, yet it appears in the error log), as shown in the following figure:
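As a side note, the port-assignment point is easy to verify with a few lines of plain java.net code (a throwaway local listener is used here instead of the addresses from the logs): a client socket has no local port at all until connect actually runs.

import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class LocalPortDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0);   // throwaway local listener
             Socket s = new Socket()) {                   // not yet bound or connected
            System.out.println("before connect: " + s.getLocalPort());  // -1: no local port exists yet
            s.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort()), 3000);
            System.out.println("after connect:  " + s.getLocalPort());  // kernel-assigned ephemeral port
        }
    }
}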
After thinking it over for a while, the author concluded that the log is probably just misleading: it only prints the information held in the current connection wrapper, and the earlier, broken connection has been overwritten (in the log) by the new one. Let's look at the corresponding logging code in dubbo:
private String getTimeoutMessage(boolean scan) {
    long nowTimestamp = System.currentTimeMillis();
    return (sent > 0 ? "Waiting server-side response timeout" : "Sending request timeout in client-side")
            + (scan ? " by scan timer" : "") + ". start time: "
            + (new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(new Date(start))) + ", end time: "
            + (new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(new Date())) + ","
            + (sent > 0 ? " client elapsed: " + (sent - start)
                    + " ms, server elapsed: " + (nowTimestamp - sent)
                    : " elapsed: " + (nowTimestamp - start)) + " ms, timeout: "
            + timeout + " ms, request: " + request + ", channel: " + channel.getLocalAddress()
            + " -> " + channel.getRemoteAddress();
}
The 1.1.1.1:44502 here comes from channel.getLocalAddress(). Let's look at the call chain:

channel.getLocalAddress()
	|-> nettyClient.channel.getLocalAddress()
The author looked at the code: dubbo replaces nettyClient.channel on reconnect, so the client port printed in the log can be different from the port actually used when the request was sent. Combined with the thread name printed in the log, DubboClientReconnectTimer-thread-1, we can be almost certain that there was an earlier connection, that this earlier connection went bad, and that 1.1.1.1:44502 is the new one.
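To make that concrete, here is a small illustrative sketch (simplified stand-in classes, not dubbo's actual NettyClient/NettyChannel): a wrapper whose inner channel is swapped on reconnect will report the new connection's local port when the timeout message is built afterwards.

import java.net.InetSocketAddress;

public class ReconnectLoggingSketch {

    // Simplified stand-in for a netty channel: it only remembers its local address.
    static final class FakeChannel {
        final InetSocketAddress local;
        FakeChannel(int localPort) { this.local = new InetSocketAddress("1.1.1.1", localPort); }
    }

    // Simplified stand-in for the client wrapper: the timeout message reads the local
    // address through this wrapper, and reconnect() swaps the wrapped channel.
    static final class ClientWrapper {
        volatile FakeChannel channel;
        ClientWrapper(FakeChannel initial) { this.channel = initial; }
        InetSocketAddress getLocalAddress() { return channel.local; }
        void reconnect(FakeChannel fresh) { this.channel = fresh; }
    }

    public static void main(String[] args) {
        ClientWrapper client = new ClientWrapper(new FakeChannel(44471)); // connection 1
        // ... a request goes out on 44471 and is never answered ...
        client.reconnect(new FakeChannel(44502));                         // connection 2 replaces it
        // When the timeout error message is built later, it sees port 44502, not 44471:
        System.out.println("timeout on channel with local port "
                + client.getLocalAddress().getPort());
    }
}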
Searching the logs carefully did indeed turn up the following:

2019-08-22 20:32:51.876 (DubboClientReconnectTimer-thread-1) Close old netty channel /1.1.1.1:44471 => /2.2.2.2:20880 on create new netty channel /1.1.1.1:44502 => /2.2.2.2:20880
---------------------- log divider ----------------------
2019-08-22 20:32:51.876 (DubboClientReconnectTimer-thread-1) Successed connect to server /2.2.2.2:20880 ... channel is NettyChannel /1.1.1.1:44502 => /2.2.2.2:20880
In other words, dubbo shut down 1.1.1.1:44471 and immediately created 1.1.1.1:44502 to 2.2.2.2:20880, so the whole process looks like this:
Let's check when 1.1.1.1:44471 was created and keep searching the logs:
zgrep '1.1.1.1:44471' dubbo.log.1.zip
2019-08-22 20:31:59.871 (ZkClient-EventThread-23) Successed connect to server /2.2.2.2:20880 ... channel is NettyChannel /1.1.1.1:44471 => /2.2.2.2:20880
---------------------- log divider ----------------------
2019-08-22 20:32:51.876 (DubboClientReconnectTimer-thread-1) Close old netty channel /1.1.1.1:44471 => /2.2.2.2:20880 on create new netty channel /1.1.1.1:44502 => /2.2.2.2:20880
This gives the following timeline:

Connection 1 created:  2019-08-22 20:31:59.871
Request sent:          2019-08-22 20:32:50.474
Connection 1 closed:   2019-08-22 20:32:51.876
Moreover, going through the business logs, the author found that connection 1 carried exactly one request in its whole lifetime, and that request was the failed one. In other words, connection 1 was never usable from the moment it was "successfully" established.
Why a successfully established connection could not be used
First of all, dubbo is based on netty, a mature and widely used framework, so connection creation itself should not be the problem; the author also went through the netty source carefully and it does follow the standard NIO patterns. The problem, then, was probably in the environment. Since the client/server application logs offered nothing more useful, the next step was to see whether the operating system held any clues. The author first used dmesg to check the kernel log. Nothing useful turned up on the client side, but the dmesg output on the server side caught the author's attention:
possible SYN flooding on port 20880. Sending cookies.
possible SYN flooding on port 20880. Sending cookies.
possible SYN flooding on port 20880. Sending cookies.
......
Literally, the kernel suspects that port 20880 is being SYN flooded and is sending SYN cookies. The kernel prints this when the number of concurrent connection attempts is too high. Running netstat on the machines where the problem occurs frequently showed that they have a particularly large number of connections (up to thousands or even tens of thousands). Dubbo's go-live mechanism is that as soon as a provider registers with zookeeper, zookeeper pushes it to all consumer machines, and they all connect immediately, which produces a burst of highly concurrent connect attempts, as shown in the following figure:
Suspecting this burst of concurrent connects, the author immediately used netstat to look at the kernel's tcp statistics:
netstat -s
......
TcpExt:
    ......
    1607 times the listen queue of a socket overflowed
    ......
Sure enough, there are statistics caused by this burst of connections: "the listen queue of a socket overflowed" means the tcp backlog queue on the server side has overflowed. At this point the author was 80% sure the backlog setting was too small. What follows is the analysis of why a backlog overflow leads to this phenomenon of a connection that is "successfully" established yet fails on its first request, so that the guess can be confirmed against the logs.
Analysis of the tcp backlog overflow
The tcp three-way handshake
Let's start from tcp's three-way handshake, beginning with the tcp_v4_do_rcv function, because that is where the tcp protocol layer receives packets. When the first SYN of the three-way handshake arrives, it goes through this function:
// handling the first SYN of the three-way handshake
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
	......
	if (sk->sk_state == TCP_LISTEN) {
		// for the first SYN, the server-side socket is in the LISTEN state
		struct sock *nsk = tcp_v4_hnd_req(sk, skb);
		......
	}
	......
}

// the path taken inside the kernel for the first SYN of the three-way handshake
tcp_v4_do_rcv
	|-> tcp_v4_hnd_req	/* finds no connection for this five-tuple, so it does nothing */
	|-> tcp_rcv_state_process
		/* case TCP_LISTEN && th->syn */
		|-> conn_request (tcp_v4_conn_request)
We continue with the conn_request function; since it contains a lot of logic, most of the details are filtered out:
int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{
	......
	// if the csk's reqsk (SYN) queue is full, set the want_cookie flag
	if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
		if (sysctl_tcp_syncookies) {
			want_cookie = 1;
		}
		......
	}
	if (want_cookie) {
		// the next line is where "possible SYN flooding" is printed
		syn_flood_warning(skb);
		......
	}
	......
	// send the syn+ack, i.e. the second step of the three-way handshake
	// note: when want_cookie is set, the corresponding req is NOT added to the reqsk hash list
	if (__tcp_v4_send_synack(sk, req, dst) || want_cookie)
		goto drop_and_free;
	// if the syn+ack was sent successfully, add the req to the hash list
	inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
	......
drop_and_free:
	reqsk_free(req);
	return 0;
}
The above covers the first two steps of the three-way handshake, as shown in the following figure:
Now let's look at the most critical third step (the client sending its ACK back to the server), which again starts from tcp_v4_do_rcv:
tcp_v4_do_rcv
	|-> tcp_v4_hnd_req

// tcp_v4_hnd_req handles two cases:
static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
{
	// first check whether a req for this five-tuple exists in the reqsk queue
	struct request_sock *req = inet_csk_search_req(sk, &prev, th->source,
						       iph->saddr, iph->daddr);
	// the tcp_check_req under this branch is the key point, analysed below
	if (req)
		return tcp_check_req(sk, skb, req, prev);
	......
	// handle the ESTABLISHED case
	......
	// if no req was found by inet_csk_search_req, a syn cookie check may be needed
	sk = cookie_v4_check(sk, skb, &(IPCB(skb)->opt));
	// on success the new sock is added to the accept queue via inet_csk_reqsk_queue_add
	......
}
The code above mainly shows that if the three-way handshake completes successfully, the newly created server-side sock is put onto the server's inet_csk_reqsk (accept) queue and is then handled by tcp_child_process.
tcp_check_req
Having gone through the general three-way-handshake logic above, let's look at tcp_check_req, the core branch that causes the problem in this article. It is only reached when the corresponding req has been placed into inet_csk_reqsk (that is, when no cookie is needed or the cookie check has been handled). The source is as follows:
// the following runs after the client's ack comes back
// for simplicity, the cookie-check logic is omitted
struct sock *tcp_check_req(......)
{
	/* various processing, various checks */
	child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
	// if syn_recv_sock returns NULL, jump to listen_overflow
	if (child == NULL)
		goto listen_overflow;
	......
listen_overflow:
	// if tcp_abort_on_overflow is not set, just set acked = 1
	if (!sysctl_tcp_abort_on_overflow) {
		inet_rsk(req)->acked = 1;
		return NULL;
	}
	// bump the kernel statistics counter
	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
	// if tcp_abort_on_overflow is set, send a reset
	if (!(flg & TCP_FLAG_RST))
		req->rsk_ops->send_reset(sk, skb);
	// and drop the current req from csk_reqsk, so that when the client's next
	// packet arrives, no matching csk_reqsk entry can be found
	inet_csk_reqsk_queue_drop(sk, req, prev);
	return NULL;
}
If the listen_overflow branch is taken, the server behaves in one of two ways: it either just sets acked to 1 (only a flag is set; nothing is sent back), or it sends a reset. Both branches remove the corresponding req from the reqsk_queue. The following figures show what the three-way handshake looks like after listen_overflow:
Without tcp_abort_on_overflow:
With tcp_abort_on_overflow set:
As can be seen above, with tcp_abort_on_overflow set, the connection is closed (reset) immediately after the three-way handshake; without tcp_abort_on_overflow, the client does not discover that the connection has actually been discarded by the server until it sends its first request.
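To make the two behaviours tangible, here is a rough reproduction sketch (Linux-specific; what you actually observe depends on the sysctls discussed above, including tcp_syncookies and tcp_abort_on_overflow): a listener with a tiny backlog that never calls accept(), plus a handful of clients. Depending on the kernel's queue-pressure heuristics, the extra clients either fail at connect(), or, as in the production case here, see connect() succeed and only fail on their first request, either by hanging until the client-side timeout or by an immediate reset.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class ListenOverflowRepro {
    public static void main(String[] args) throws Exception {
        // Listener with a tiny backlog that never calls accept(), standing in for a
        // provider whose accept queue has overflowed during a release.
        ServerSocket server = new ServerSocket();
        server.bind(new InetSocketAddress("127.0.0.1", 20880), 1);

        for (int i = 0; i < 10; i++) {
            try (Socket c = new Socket()) {
                c.connect(new InetSocketAddress("127.0.0.1", 20880), 1000);
                c.setSoTimeout(2000); // stand-in for the RPC timeout
                System.out.println("client " + i + " connected from local port " + c.getLocalPort());
                c.getOutputStream().write("ping".getBytes()); // lands in the send buffer, no error here
                int b = c.getInputStream().read();            // failure (timeout or reset) surfaces here
                System.out.println("client " + i + " got reply byte " + b);
            } catch (IOException e) {
                System.out.println("client " + i + " failed: " + e);
            }
        }
        server.close();
    }
}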
When does listen_overflow happen?
Looking at the code above, listen_overflow is reached when syn_recv_sock returns NULL; for tcp over ipv4 the corresponding function is:
struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
				  struct request_sock *req, struct dst_entry *dst)
{
	......
	if (sk_acceptq_is_full(sk))
		goto exit_overflow;
	......
}
In sk_acceptq_is_full, the queue is judged full against a limit that is the minimum of the following three parameters:

/proc/sys/net/ipv4/tcp_max_syn_backlog
/proc/sys/net/core/somaxconn
the backlog passed in when the tcp listen function is called

For reasons of space, the source code that performs this full-queue check is not shown here.
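For reference, a minimal sketch of the application-side knob (plain java.net purely for illustration; dubbo's own path goes through Netty, as shown later): the backlog argument passed when binding a listening socket is only the application's request, and the kernel caps the effective queue using the parameters listed above.

import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class BacklogExample {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket()) {
            // 511 is only the application-requested backlog; per the three parameters
            // above, the effective queue is also limited by net.core.somaxconn, and the
            // SYN side by net.ipv4.tcp_max_syn_backlog, so raising this alone is not enough.
            server.bind(new InetSocketAddress(20880), 511);
            System.out.println("listening on " + server.getLocalSocketAddress());
        }
    }
}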
Fix scheme: set the kernel parameters
At first, the author only set the following three parameters on the server side (naively assuming that dubbo had taken the backlog into account and set a larger value, when in fact it used jdk's default of 50):

Note: 511 follows the value recommended by redis.
/proc/sys/net/ipv4/tcp_max_syn_backlog   511
/proc/sys/net/core/somaxconn             511
/proc/sys/net/ipv4/tcp_abort_on_overflow 1
It turned out that "Close old netty channel" error logs still appeared, but the business side confirmed that no online traffic was lost. Observing further, the author found that no traffic was lost because, as described in the figure above, a reset is now sent back immediately, which shrinks the window in which an error can occur to about 2s, instead of "whenever the first request happens to be sent" as before. Since no request arrived within those 2s, the probability of an error dropped sharply. As shown in the following figure:
The interval between a connection being created and its "Close old netty channel" log line is at most two seconds, which also confirms that tcp_abort_on_overflow took effect.
Making the backlog configurable in the dubbo code
Since dubbo uses jdk's default backlog (50), the author adjusted the corresponding part of the dubbo source, as shown below:
public class NettyServer extends AbstractServer implements Server {
    ......
    protected void doOpen() throws Throwable {
        ......
        // the backlog can be adjusted dynamically through a jvm parameter
        String backlog = System.getProperty(BACK_LOG_KEY);
        if (StringUtils.isEmpty(backlog)) {
            // defaults to 2048; no harm in turning it up here, since the kernel
            // parameter can be changed with immediate effect anyway
            backlog = DEFAULT_BACK_LOG;
        }
        bootstrap.setOption("backlog", backlog);
        ......
    }
}
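With this change the backlog becomes tunable at deploy time through a JVM system property (whichever key BACK_LOG_KEY resolves to in this patched build), so it can be raised per environment without touching the application code again.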
After the patched dubbo build was handed to the business side (on the server side), the corresponding error logs no longer appeared and the dubbo services went online smoothly.
In fact, it did not take long to go from analyzing the problem to guessing that the backlog needed adjusting. But the author prefers to confirm every detail and every conjecture one by one, against the source code and the logs.