How to Troubleshoot a Goroutine Leak
Creating a goroutine in Go is easy, but improper use can leave large numbers of goroutines unable to exit and release their resources, which over time turns into a memory leak. The key to avoiding goroutine leaks is to manage goroutine lifecycles properly. Leaks can be found and diagnosed by exporting runtime metrics and using pprof.
I maintain a service that connects to target machines over SSH and executes commands. It is a small internal service that normally gets no attention as long as nothing breaks. About four months ago, in the last update, a task counter was added and exported to Prometheus for monitoring. Recently I noticed that this counter was climbing steadily.
My first reaction was: great, traffic is growing steadily! But on reflection that made no sense: how could this small internal service be getting so many calls? So I looked at the goroutine monitoring (this data comes from runtime.NumGoroutine()):
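For reference, here is a minimal sketch of how such a gauge can be exported with prometheus/client_golang; the metric name and listen address are made up for illustration and are not taken from the original service.

package main

import (
	"log"
	"net/http"
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Gauge that reports runtime.NumGoroutine() each time Prometheus scrapes.
	goroutines := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "service_goroutines", // hypothetical metric name
		Help: "Current number of goroutines, from runtime.NumGoroutine().",
	}, func() float64 { return float64(runtime.NumGoroutine()) })
	prometheus.MustRegister(goroutines)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil)) // illustrative port
}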
The number of goroutines was also climbing steadily. If requests per unit time go up, the goroutine count goes up with them, so nothing wrong there. But again: this is a small internal service, that cannot be the explanation.
So I took another look at all the requests recorded in the mm system:
The request count is about 2000 every 5 minutes, roughly 400 per minute. In the Prometheus chart above, each curve is one instance, and four instances are deployed, so a single instance (curve) should be handling around 100 requests per minute. When the service had just started, the counter was indeed around 100/min, and it drifted upward over time as the leak accumulated.
Goroutine leak
Although I was 99% sure this was a leak, I still wanted to see the details. net/http/pprof had already been enabled in the service, so I could query the HTTP endpoints exposed by pprof directly.
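As a reference, a minimal sketch of exposing net/http/pprof on a side port follows; the address is an assumption, not the one used by the service in this article.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Goroutine summary:     /debug/pprof/goroutine?debug=1
	// Full goroutine stacks: /debug/pprof/goroutine?debug=2
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}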
First take a look at the exported goroutine summary:
There are 1000+ goroutines in the same state, all simply waiting to read data. Next, the exported goroutine details:
One look was enough of a shock: the details show goroutines that have been blocked for more than 200,000 minutes (about 4 months).
There is no doubt that goroutines were leaking. Why? The only lead is the goroutine information exported by pprof. I dumped the call stacks of the 1000+ goroutines in the IO wait state (the largest group by far). From the call stack I located the corresponding code: the calls start at ssh.Dial and end up blocked in an io.ReadFull.
This service uses the golang.org/x/crypto/ssh package for SSH connections. First, the place where ssh.Dial is called in the service:
clientConfig := &ssh.ClientConfig{
	...
	Timeout: 3 * time.Second,
	...
}
// connect to ssh
client, err := ssh.Dial("tcp", fmt.Sprintf("%s:%d", s.Host, 36000), clientConfig)
This looks fine: a Timeout parameter is passed in, so it should not block. Reading further, though, the problem shows up. Jumping straight to the blocked frame in the call stack (skipping the runtime and standard-library frames, which are generally not the culprit), it is the first step of the SSH handshake, exchanging SSH versions.
// Sends and receives a version line. The versionLine string should
// be US ASCII, start with "SSH-2.0-", and should not include a
// newline. exchangeVersions returns the other side's version line.
func exchangeVersions(rw io.ReadWriter, versionLine []byte) (them []byte, err error) {
	...
	if _, err = rw.Write(append(versionLine, '\r', '\n')); err != nil {
		return
	}
	them, err = readVersion(rw)
	return them, err
}

// maxVersionStringBytes is the maximum number of bytes that we'll
// accept as a version string. RFC 4253 section 4.2 limits this at 255
// chars
const maxVersionStringBytes = 255

// Read version string as specified by RFC 4253, section 4.2.
func readVersion(r io.Reader) ([]byte, error) {
	versionString := make([]byte, 0, 64)
	var ok bool
	var buf [1]byte

	for length := 0; length < maxVersionStringBytes; length++ {
		_, err := io.ReadFull(r, buf[:]) // blocked here
		if err != nil {
			return nil, err
		}
		...
	}
	...
	return versionString, nil
}

The logic is: after sending our own version line to the peer, we wait for the peer's reply, but the reply never arrives. Why is there no reply, and why is there no timeout? At first I was baffled. The only thing I could think of was that since these goroutines are all blocked waiting for the peer's reply, the corresponding connections must exist, so let me first look at what is wrong with the connections on the machine.

The TCP half-open state

I ran netstat on the machine to look at the connection counts:

# netstat -anp | grep :36000 | awk '{print $6}' | sort | uniq -c
   2110 ESTABLISHED
      1 LISTEN
     41 TIME_WAIT

There are a large number of connections in the ESTABLISHED state, and the count roughly matches the number of goroutines. Focus on these connections first and pick one or two to examine.

I then found that for some peers the local machine holds 6 connections:

But the peer side has none at all (the single connection in the screenshot is my own ssh login to the target machine):

A bit of googling showed that this is the TCP half-open state: after the connection was established, the peer crashed or became otherwise unreachable, and because KeepAlive was not enabled on the connection, the local end never notices and keeps showing the connection as ESTABLISHED, while the peer no longer has the connection after it crashed and restarted. Now check whether KeepAlive really is not enabled:
# ss -aeon | grep :36000 | grep -v time | wc -l
2110
None of them have KeepAlive enabled. The number of connections without KeepAlive here matches the number of ESTABLISHED connections shown by netstat above. Strictly speaking, new requests certainly arrived between running the two commands, so the two numbers cannot match exactly, but the point is that the vast majority do not have KeepAlive enabled, and that is what matters here.
Looking at the logic of ssh.Dial, it uses net.DialTimeout to establish the connection, and the leaked build running in production was compiled with go1.9.2. In that version, KeepAlive on the net.Conn returned by net.DialTimeout is off by default (in go1.9.2, a zero Dialer.KeepAlive means keep-alive probes are not enabled).
The golang.org/x/crypto/ssh package does not explicitly enable KeepAlive when it calls net.DialTimeout; it relies entirely on the default behavior of the Go version it is built with. In recent Go versions, enabling KeepAlive when establishing a TCP connection has become the default, so I migrated the code to go1.13.3, recompiled, and shipped it to production, thinking the problem was settled.
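Alternatively, instead of depending on the Go version's defaults, the dial step could enable keep-alive explicitly. The sketch below is an assumption about how that might look, not the fix the service actually shipped; the helper name and the 30-second probe period are illustrative.

package sshutil

import (
	"fmt"
	"net"
	"time"
)

// dialWithKeepAlive is a hypothetical helper that establishes the TCP
// connection with keep-alive probes enabled, so a half-open peer is
// eventually detected regardless of which Go version built the binary.
func dialWithKeepAlive(host string, port int) (net.Conn, error) {
	d := &net.Dialer{
		Timeout:   3 * time.Second,  // same connect timeout as the original clientConfig
		KeepAlive: 30 * time.Second, // illustrative probe period
	}
	return d.Dial("tcp", fmt.Sprintf("%s:%d", host, port))
}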
SSH handshake hang
It was not. After the go1.13.3 build had run for a while, pprof again showed many goroutines in the IO wait state, and the call stack was exactly the same as before (exchanging version information during the SSH handshake). Another look at the connections on the machine:
# netstat -anp | grep :36000 | awk '{print $6}' | sort | uniq -c
     81 ESTABLISHED
      1 LISTEN
      1 SYN_SENT
     23 TIME_WAIT
# ss -aeon | grep :36000 | grep time | wc -l
110
# ss -aeon | grep :36000 | grep -v time | wc -l
1
# ss -aeon | grep :36000 | grep -v time
LISTEN     0      128    100.107.1x.x6:36000    *:*    ino:2508898466 s
The one connection without KeepAlive is the local sshd listening on port 36000; all the others have KeepAlive enabled. So this time the blocking is not caused by half-open TCP connections. Pick one of the peer IPs and take a closer look.
telnet can connect to it, but the session cannot be disconnected. So the TCP connection can be established, but for some unknown reason the peer never responds. Next, how long has the connection to this IP existed?
# netstat -atnp | grep 10.100.7x.x9
tcp    0    0    100.107.1x.x6:8851    10.100.7x.x9:36000    ESTABLISHED    33027/ssh_tunnel_se
# lsof -p 33027 | grep 10.100.7x.x9
ssh_tunne 33027 mqq 16u IPv4 3069826111 0t0 TCP 100-107-1x-x6:8851->10.100.7x.x9:36000 (ESTABLISHED)
# ls -l /proc/33027/fd/16
lrwx------ 1 mqq mqq 64 Dec 23 15:44 /proc/33027/fd/16 -> socket:[3069826111]
The command was run at 17:25 on the 24th, so this connection had been blocked for more than a day. The problem, then, is the lack of timeout control at the application layer. Looking back at the logic of ssh.Dial, the Timeout parameter is not applied during the SSH handshake. On Linux, IO waits on a net.Conn are implemented with non-blocking IO and epoll_pwait: the waiting goroutine is parked until an event arrives, and timeouts work by setting a timer that wakes the goroutine, exposed through the SetDeadline method of net.Conn. So I rewrote the ssh.Dial logic to add a timeout covering the handshake phase:
// DialTimeout starts a client connection to the given SSH server. It differs
// from the ssh.Dial function in that it also times out during the SSH handshake.
// total timeout = (1 + timeFactor) * config.Timeout
// refs: https://github.com/cybozu-go/cke/pull/81/files
func DialTimeout(network, addr string, config *ssh.ClientConfig) (*ssh.Client, error) {
	conn, err := net.DialTimeout(network, addr, config.Timeout)
	if err != nil {
		return nil, err
	}

	// set a read/write deadline covering the SSH handshake
	timeFactor := time.Duration(3)
	err = conn.SetDeadline(time.Now().Add(config.Timeout * timeFactor))
	if err != nil {
		conn.Close()
		return nil, err
	}

	// do the SSH handshake
	c, chans, reqs, err := ssh.NewClientConn(conn, addr, config)
	if err != nil {
		return nil, err
	}

	// clear the read/write deadline for normal use of the connection
	err = conn.SetDeadline(time.Time{})
	if err != nil {
		conn.Close()
		return nil, err
	}
	return ssh.NewClient(c, chans, reqs), nil
}
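For completeness, a sketch of how the call site shown earlier might switch over to this function; everything other than the Timeout field is a placeholder, since the original configuration is elided.

package sshutil

import (
	"fmt"
	"time"

	"golang.org/x/crypto/ssh"
)

// connect is a hypothetical call site mirroring the snippet earlier in the
// article, but using DialTimeout (defined above) instead of ssh.Dial.
func connect(host, user string, auth ssh.AuthMethod) (*ssh.Client, error) {
	clientConfig := &ssh.ClientConfig{
		User:            user,
		Auth:            []ssh.AuthMethod{auth},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // placeholder; choose a real host key policy
		Timeout:         3 * time.Second,
	}
	return DialTimeout("tcp", fmt.Sprintf("%s:%d", host, 36000), clientConfig)
}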
After replacing ssh.Dial with this function, compiling, and deploying to production, the connections returned to normal, back to what a small service should look like:
# netstat -anp | grep :36000 | awk '{print $6}' | sort | uniq -c
      3 ESTABLISHED
      1 LISTEN
     86 TIME_WAIT
Notice that what this article actually solves is how to close the connection promptly when the peer misbehaves in various ways, not how to fix the peer itself. After all, if SSH on the peer is broken, who is going to log in and debug it? Production has a large number of servers run in different ways, so anomalies are to be expected, and fixing them one by one is not realistic.
End
When I first found the leak, I ran top on the machine and was scared by the 50G VIRT. After asking the group leader (zorro), it turns out that most of the time this value does not matter; what matters is RES, because RES is the physical memory actually in use.
Looking at VIRT and RES at a single point in time cannot tell you how much of it is leaked; you can only compare memory usage across points in time, for example against the fixed version. After running for three or four days, the fixed build's VIRT is 3.9G while RES is only 16M, so a considerable amount of memory was freed, roughly the memory that had been held by the leaked goroutines.
Creating a goroutine in Go is easy, but improper use can leave large numbers of goroutines unable to exit and release their resources, which over time turns into a memory leak.
The key to avoiding goroutine leaks is to manage goroutine lifecycles properly. Leaks can be found and fixed by exporting runtime metrics through prometheus/client_golang and by using net/http/pprof.