TCP socket file handle leak

2025-01-30 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Today a socket-count alarm fired on one of our redis machines, which was very strange: only a handful of redis instances are deployed on a single redis server, so the number of open ports should be limited.

1. The number of tcp connections displayed by netstat is normal.

netstat -n | awk '/^tcp/ {++state[$NF]} END {for (key in state) print key, "\t", state[key]}'
TIME_WAIT 221
ESTABLISHED 103

netstat -nat | wc -l
368

The number of tcp connections established is not very large.
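The per-state count done by the awk one-liner can also be sketched in Python, which is handy when the check has to live inside a monitoring script. This is only an illustration: the function name and the sample lines (addresses, ports) are made up, not taken from the affected machine.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state, mirroring the awk one-liner."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only lines whose first column starts with "tcp" (tcp, tcp6) are connections.
        if fields and fields[0].startswith("tcp"):
            states[fields[-1]] += 1  # last column is the state ($NF in awk)
    return states


# Hypothetical sample in `netstat -n` format, for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:6379           10.0.0.2:51234          ESTABLISHED
tcp        0      0 10.0.0.1:6379           10.0.0.3:40112          ESTABLISHED
tcp        0      0 10.0.0.1:6379           10.0.0.4:33990          TIME_WAIT
"""

if __name__ == "__main__":
    for state, n in count_tcp_states(SAMPLE).items():
        print(state, n)
```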

2. ss -s shows a large number of closed connections.

ss -s
Total: 158211 (kernel 158355)
TCP:   157740 (estab 158211, closed 157624, orphaned 0, synrecv 0, timewait 173/0), ports 203

Transport Total     IP        IPv6
*         158355    -         -
RAW       0         0         0
UDP       9         6         3
TCP       116       80        36
INET      125       86        39
FRAG      0         0         0

Note the closed count: 157624.

My monitoring system obtains its socket count like this:

cat /proc/net/sockstat | grep sockets | awk '{print $3}'
158391

cat /proc/net/sockstat
sockets: used 158400
TCP: inuse 89 orphan 2 tw 197 alloc 157760 mem 16
UDP: inuse 6 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

Many sockets are in the alloc state: they have been assigned an sk_buffer but are closed.

File descriptors on the redis machine are being leaked and have not been reclaimed by the kernel.
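To make the gap between allocated and in-use sockets explicit, the sockstat text can be parsed and the two TCP counters compared. The sketch below is a rough heuristic, not a kernel-defined metric; the function names are invented for illustration, and the sample text is the output shown above.

```python
def parse_sockstat(text: str) -> dict:
    """Parse /proc/net/sockstat text into {section: {counter: int}}."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        section, rest = line.split(":", 1)
        tokens = rest.split()
        # Counters come as alternating "name value" pairs, e.g. "inuse 89 orphan 2".
        stats[section.strip()] = {
            tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens) - 1, 2)
        }
    return stats


def tcp_alloc_gap(stats: dict) -> int:
    """TCP sockets that are allocated but not in use (rough leak indicator)."""
    tcp = stats["TCP"]
    return tcp["alloc"] - tcp["inuse"]


# The /proc/net/sockstat output from the affected machine.
SAMPLE = """\
sockets: used 158400
TCP: inuse 89 orphan 2 tw 197 alloc 157760 mem 16
UDP: inuse 6 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
"""

if __name__ == "__main__":
    print("allocated but not in use:", tcp_alloc_gap(parse_sockstat(SAMPLE)))
```

On a healthy machine alloc and inuse stay close together, so alerting on this gap may catch the leak earlier than alerting on the total socket count.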

3. Track down the real culprit

The information above points to a socket fd leak, so the next step is to use lsof to inspect the system's socket file handles.

lsof | grep sock
java 4684 apps *280u sock 0,6 0t0 675441359 can't identify protocol
java 4684 apps *281u sock 0,6 0t0 675441393 can't identify protocol
java 4684 apps *282u sock 0,6 0t0 675441405 can't identify protocol
java 4684 apps *283u sock 0,6 0t0 675441523 can't identify protocol
java 4684 apps *284u sock 0,6 0t0 675441532 can't identify protocol
java 4684 apps *285u sock 0,6 0t0 675441566 can't identify protocol

The NAME column reads "can't identify protocol", meaning lsof cannot associate these sockets with an open connection.

This shows that the java process (pid=4684) is leaking socket fds.
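When many processes are involved, grouping the "can't identify protocol" entries by process makes the offender stand out. A minimal sketch, assuming lsof's default column order (COMMAND first, PID second); the function name is made up, and the sample reuses three of the lines above.

```python
from collections import Counter

MARKER = "can't identify protocol"


def unidentified_sockets_by_process(lsof_output: str) -> Counter:
    """Count sockets lsof could not identify, keyed by (command, pid)."""
    counts = Counter()
    for line in lsof_output.splitlines():
        if MARKER not in line:
            continue
        fields = line.split()
        # Default lsof layout: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
        if len(fields) >= 2 and fields[1].isdigit():
            counts[(fields[0], int(fields[1]))] += 1
    return counts


SAMPLE = """\
java 4684 apps *280u sock 0,6 0t0 675441359 can't identify protocol
java 4684 apps *281u sock 0,6 0t0 675441393 can't identify protocol
java 4684 apps *282u sock 0,6 0t0 675441405 can't identify protocol
"""

if __name__ == "__main__":
    for (command, pid), n in unidentified_sockets_by_process(SAMPLE).most_common():
        print(command, pid, n)
```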

ps auxww | grep 4684

The process turns out to be flume, the log collection tool running on the redis machine.

4. Solution

Even after restarting the flume agent, a large number of closed sockets still accumulated.

Running strace on the flume process showed that it was blocked, waiting on a futex:

sudo strace -p 36111

Process 36111 attached - interrupt to quit

futex(0x2b80e2c2e9d0, FUTEX_WAIT, 36120, NULL

At first I suspected that the file handle limit was too low, because the results I found on Google also described raising the fd limit to fix this kind of problem.

On my machine, however, the maximum number of open files is 131072, and a large share of the available fds was still unused.

lsof | wc -l
10201

ulimit -a
ulimit -n
131072
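One caveat with this comparison: lsof | wc -l counts entries across the whole system, while ulimit -n is a per-process limit, so it can be more telling to compare a single process's open fds with its own limit. The sketch below does this for the current process only; the function names are made up, and it assumes a platform that exposes /proc/self/fd or /dev/fd.

```python
import os
import resource


def count_open_fds() -> int:
    """Number of fds open in this process (includes the one listdir uses)."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            return len(os.listdir(fd_dir))
    raise OSError("no fd directory available on this platform")


def fd_headroom() -> float:
    """Fraction of the soft RLIMIT_NOFILE limit that is still unused."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return 1.0
    return 1.0 - count_open_fds() / soft


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open fds:", count_open_fds(), "soft limit:", soft, "hard limit:", hard)
    print("headroom: {:.1%}".format(fd_headroom()))
```

For another process, the equivalent figures are available under /proc/<pid>/fd and /proc/<pid>/limits on Linux.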

At this point a colleague told me that a large number of other machines had the same problem (flume had been online for three months and had been fine until then).

That is when I remembered that flume also has a log that can be checked. The flume log shows that flume cannot find broker 5.

What? Doesn't the kafka cluster have only 4 brokers (nodes)? Then I remembered that a colleague working on spark had expanded the kafka cluster a few days earlier.

Port 9092 on the new cluster nodes had not been opened to the data center where the redis machines are located.

[SinkRunner-PollingRunner-DefaultSinkProcessor] (kafka.utils.Logging$class.warn:89)-Failed to send producer request with correlation id 63687539 to broker 5 with data for partitions [titan,4]
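A quick reachability check from the affected machine would have exposed the blocked port directly. The following is a minimal sketch; the helper name and the broker hostname are hypothetical, and only port 9092 comes from the text above. Note that the socket is closed by the with-block whether or not the connection succeeds, which is exactly what a leaking client fails to do.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection returns a connected socket; the with-block closes it.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical broker hostname, for illustration only.
    for broker in ["kafka-broker-5.example.internal"]:
        status = "reachable" if can_connect(broker, 9092) else "NOT reachable"
        print(broker, 9092, status)
```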

5. Reproducing the problem

The article "lsof: can't identify protocol" reproduces this situation in Python code.
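The code from that article is not reproduced here. As a rough illustration of the same idea, the sketch below creates sockets that are never connected and never closed, then counts the socket fds held by the process. The function names are made up, the fd counting assumes Linux /proc, and whether lsof prints "can't identify protocol" for such sockets may depend on the kernel and lsof version.

```python
import os
import socket
import time


def leak_sockets(n: int) -> list:
    """Create n TCP sockets that are never connected and never closed."""
    return [socket.socket(socket.AF_INET, socket.SOCK_STREAM) for _ in range(n)]


def count_socket_fds() -> int:
    """Count this process's fds that point at sockets (Linux /proc only)."""
    count = 0
    for name in os.listdir("/proc/self/fd"):
        try:
            target = os.readlink(os.path.join("/proc/self/fd", name))
        except OSError:
            continue  # fd was closed between listdir and readlink
        if target.startswith("socket:"):
            count += 1
    return count


if __name__ == "__main__":
    before = count_socket_fds()
    leaked = leak_sockets(100)  # keep references so the sockets stay open
    print("pid", os.getpid(), "socket fds:", before, "->", count_socket_fds())
    # While the process sleeps, inspect it with: lsof -p <pid> | grep sock
    time.sleep(60)
    for s in leaked:
        s.close()
```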

:)

Searching Google is a quick way to solve problems, but sometimes the search results steer the troubleshooting in the wrong direction.

After seeing the Google results, my first instinct was that the operating system's max open files parameter was too small. Even after finding that this was not the cause, my thinking stayed fixed on whether the kernel parameters were configured correctly. Only when I learned that flume on other machines was in the same state did I realize that something was wrong with flume itself, so I ran strace on the flume process and checked the flume log.
