
How to troubleshoot Kubernetes machine kernel problems


This article describes how to troubleshoot Kubernetes machine kernel problems. The approach is simple and practical, and the sections below walk through the investigation step by step.

Concrete phenomenon

An application in the production environment has an interface that responds slowly.

Countless causes could explain this phenomenon. This post mainly describes the investigation process and how the cause was finally determined; it may not apply to other clusters, so treat it as a reference. The investigation was quite lengthy and happened a while ago, so I am unlikely to recall every detail; please bear with me.

Network topology

When a network request flows into the cluster, it follows this path through our setup:

User request => Nginx => Ingress => uwsgi

Do not ask why there are both Ingress and Nginx; for historical reasons, some of the work currently has to be handled by Nginx.

Initial positioning

When requests slow down, the first suspicion is that the program itself has slowed down. So after spotting the problem, I first added a trivial mini interface to the uwsgi application: it does almost no processing and returns data immediately. I then requested that interface at regular intervals. After a few days of running, access to this interface was confirmed to be slow as well, which ruled out the application code and pointed the investigation at the link.
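For reference, the mini interface is essentially a handler that does no work and returns immediately. A minimal sketch, assuming the service is a Flask application served behind uwsgi (the framework is an assumption; the route reuses the /app/v1/user/ping path that the statistics below filter on):

```python
# Minimal "ping" endpoint: no database access, no business logic, so any
# latency observed on it must come from the link rather than the application.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/app/v1/user/ping")
def ping():
    return jsonify({"status": "ok"})
```

A small script or cron job then requested this path at fixed intervals and recorded the response time.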

Repositioning: simple full-link data statistics

Since our setup has two Nginx layers, we need to measure each of them to see which layer is slow. The request volume is fairly large; inspecting individual requests is inefficient and might mask the real cause, so this step uses statistics instead. The statistics are compiled from the logs of both layers. Since the logs are already shipped to ELK, the ELK filter used to select the data is as follows:

{"bool": {"must": [{"match_all": {}}, {"match_phrase": {"app_name": {"query": "xxxx"} {"match_phrase": {"path": {"query": "/ app/v1/user/ping"}}, {"range": {"request_time": {"gte": 1, "lt": 10} {"range": {"@ timestamp": {"gt": "2020-11-09 00:00:00", "lte": "2020-11-12 00:00:00", "format": "yyyy-MM-dd HH:mm:ss" "time_zone": "+ 08:00"}]}} data processing scheme

The Nginx log and the Ingress log for a given request can both be located by its trace_id, which can be retrieved through the ELK API.
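The helper that pulls the two log entries is not shown in the original script (it appears below as get_igor_trace). As a rough sketch, assuming direct access to the Elasticsearch HTTP search API (the endpoint, index pattern, and field layout are assumptions), it could look like this:

```python
import requests

ES_URL = "http://elk.example.com:9200"   # hypothetical ELK endpoint
INDEX = "nginx-*"                        # hypothetical index pattern

def get_logs_by_trace_id(trace_id):
    """Return all log documents (Nginx access log and Ingress log) for one trace_id."""
    query = {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"trace_id": {"query": trace_id}}}]
            }
        },
        "size": 10,
    }
    resp = requests.post("{}/{}/_search".format(ES_URL, INDEX), json=query, timeout=10)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```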

```python
import copy

# trace_id_file and get_igor_trace are defined elsewhere in the original script.

# This data structure is used to record the statistical results.
# [[0, 0.1], 3] means that three records fall into the 0-0.1s interval.
# Because comparing decimal intervals is troublesome, integers are used:
# the range 0-35 actually represents the interval 0-3.5s.
# ingress_cal_map = [
#     [[0, 0.1], 0],
#     [[0.1, 0.2], 0],
#     [[0.2, 0.3], 0],
#     [[0.3, 0.4], 0],
#     [[0.4, 0.5], 0],
#     [[0.5, 1], 0],
# ]
ingress_cal_map = []
for x in range(0, 35, 1):
    ingress_cal_map.append([[x, (x + 1)], 0])
nginx_cal_map = copy.deepcopy(ingress_cal_map)
nginx_ingress_gap = copy.deepcopy(ingress_cal_map)
ingress_upstream_gap = copy.deepcopy(ingress_cal_map)


def trace_statisics():
    trace_ids = []
    # The trace_ids here were looked up in advance, for requests with long response times.
    with open(trace_id_file) as f:
        data = f.readlines()
    for d in data:
        trace_ids.append(d.strip())

    cnt = 0
    for trace_id in trace_ids:
        try:
            access_data, ingress_data = get_igor_trace(trace_id)
        except TypeError as e:
            # try again
            try:
                access_data, ingress_data = get_igor_trace.force_refresh(trace_id)
            except TypeError as e:
                print("Can't process trace {}: {}".format(trace_id, e))
                continue

        if access_data['path'] != "/app/v1/user/ping":  # filter out dirty data
            continue
        if 'request_time' not in ingress_data:
            continue

        def get_int_num(data):  # data processing: multiply by 10 and truncate to int
            return int(float(data) * 10)

        # Per-interval statistics. It may be a bit wordy and repetitive,
        # but this is how the statistics were done at the time.
        ingress_req_time = get_int_num(ingress_data['request_time'])
        ingress_upstream_time = get_int_num(ingress_data['upstream_response_time'])
        for cal in ingress_cal_map:
            if ingress_req_time >= cal[0][0] and ingress_req_time < cal[0][1]:
                cal[1] += 1
                break

        nginx_req_time = get_int_num(access_data['request_time'])
        for cal in nginx_cal_map:
            if nginx_req_time >= cal[0][0] and nginx_req_time < cal[0][1]:
                cal[1] += 1
                break

        gap = nginx_req_time - ingress_req_time
        for cal in nginx_ingress_gap:
            if gap >= cal[0][0] and gap < cal[0][1]:
                cal[1] += 1
                break

        gap = ingress_req_time - ingress_upstream_time
        for cal in ingress_upstream_gap:
            if gap >= cal[0][0] and gap < cal[0][1]:
                cal[1] += 1
                break
```

Statistical results

Both Nginx => Ingress and Ingress => upstream show varying degrees of delay. For requests taking more than 1 s, roughly 2/3 of the delay comes from Ingress => upstream and 1/3 from Nginx => Ingress.
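To read those numbers off, a small helper along these lines (not part of the original script, shown only for illustration) can dump the non-empty buckets of each map, converting the integer bounds back to seconds:

```python
def print_buckets(name, cal_map):
    print(name)
    for (low, high), count in cal_map:
        if count:
            print("  %.1fs - %.1fs: %d requests" % (low / 10.0, high / 10.0, count))

print_buckets("nginx request_time", nginx_cal_map)
print_buckets("ingress request_time", ingress_cal_map)
print_buckets("nginx -> ingress gap", nginx_ingress_gap)
print_buckets("ingress -> upstream gap", ingress_upstream_gap)
```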

Further investigation: packet capture

The packet capture focused on the Ingress => uwsgi leg. Since the delay is only sporadic, all packets had to be captured and then filtered. Below is the log entry for one such request, which took a long time even though the interface itself should return very quickly.

{"_ source": {"INDEX": "51", "path": "/ app/v1/media/", "referer": "", "user_agent": "okhttp/4.8.1", "upstream_connect_time": "1.288", "upstream_response_time": "1.400", "TIMESTAMP": "1605776490465", "request": "POST / app/v1/media/ HTTP/1.0" "status": "200,200", "proxy_upstream_name": "default-prod-XXX-80", "response_size": "68", "client_ip": "XXXXX", "upstream_addr": "172.32.18.194client_ip", "request_size": "1661", "@ source": "XXXX", "domain": "XXX", "upstream_status": "200", "@ version": "1" "request_time": "1.403", "protocol": "HTTP/1.0", "tags": ["_ dateparsefailure"], "@ timestamp": "2020-11-19T09:01:29.000Z", "request_method": "POST", "trace_id": "87bad3cf9d184df0:87bad3cf9d184df0:0:1"}}

Ingress side packet

Uwsgi side packet

Packet flow

Review the TCP three-way handshake: the client sends a SYN, the server replies with SYN+ACK, and the client completes the connection with an ACK.

First, looking from the Ingress side: the connection starts at 21.585446, and at 22.588023 the packet is retransmitted.

From the Node side, the Node received the SYN shortly after the Ingress sent it and immediately returned the SYN+ACK, yet for some reason that reply only showed up at the Ingress about one second later.

It's not just ACK packets that are delayed.

Judging from randomly sampled captures, it is not only SYN ACK packets that get retransmitted:

some FIN ACK packets are retransmitted as well. The packet delay is probabilistic behavior.
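Since the capture has to cover all traffic and be filtered afterwards, a rough filter like the following can flag these retransmissions in a saved capture. This is only a sketch: it assumes scapy is installed and that tcpdump has already written a capture file named ingress.pcap; a SYN or FIN seen more than once with the same sequence number is treated as a retransmission.

```python
from collections import Counter
from scapy.all import rdpcap, IP, TCP

packets = rdpcap("ingress.pcap")  # hypothetical capture file written by tcpdump
seen = Counter()

for pkt in packets:
    if IP in pkt and TCP in pkt:
        flags = pkt[TCP].flags
        if "S" in flags or "F" in flags:  # only look at SYN / FIN segments
            key = (pkt[IP].src, pkt[IP].dst,
                   pkt[TCP].sport, pkt[TCP].dport,
                   pkt[TCP].seq, str(flags))
            seen[key] += 1

# The same SYN/FIN appearing more than once means it was retransmitted.
for key, count in seen.items():
    if count > 1:
        print(count, key)
```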

Summary

Looking at the packet capture alone may only confirm that packet loss occurred. But combined with the Ingress and Nginx request logs: if the loss happens during the TCP connection phase, then on the Ingress side we can check the value of upstream_connect_time to roughly estimate the time spent on the connection. This is how the records were organized at the time:

I initially guessed that this part of the time was mainly spent establishing the TCP connection, because a connection is established at both Nginx hops and all of our links use short connections. As a next step I plan to add the $upstream_connect_time variable to the log format to record how long connection establishment takes. http://nginx.org/en/docs/http/... .html

Follow-up work

Since we now know that TCP connection establishment can take a long time, we can use connection time as a metric. I also modified wrk to add a measurement of connection time (for the specific PR, see here); this indicator can be used to measure the back-end service.
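As a rough illustration of the metric (not the wrk patch itself; the host and port below are placeholders), measuring TCP connection establishment time only takes a few lines:

```python
import socket
import time

def connect_time(host, port, timeout=5.0):
    """Time how long it takes to establish (and immediately close) a TCP connection."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.monotonic() - start

if __name__ == "__main__":
    for _ in range(10):
        print("connect took %.3f s" % connect_time("192.0.2.10", 80))  # placeholder backend
```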

Asking the experts whether they had seen a similar problem

After going through the work above several times without finding a clue, I went to ask the other Kubernetes experts in the company, and one of them offered an idea:

If latency on the host itself is also high, the host-to-container path can be temporarily ruled out. We investigated a latency problem before: Kubernetes's monitoring tool periodically collected cgroup statistics by cat-ing files under /proc. Because Docker containers are frequently destroyed and recreated, combined with the kernel's cache mechanism, each cat occupied the kernel for a long time, which caused network latency. Check whether there is a similar situation on your hosts. It does not have to be cgroup; any operation that frequently traps into the kernel can lead to high latency.

This is very similar to the cgroup case they investigated. There are periodic tasks on our hosts, and as the number of executions grows they take up more and more kernel resources, which affects network latency to some degree.

The experts also recommended a kernel inspection tool that can trace and locate how long interrupts or soft interrupts are disabled: https://github.com/bytedance/trace-irqoff

The problem Ingress machines showed a lot of this kind of latency, and many of them repeatedly reported this kind of error in their logs, while the other machines did not.

I then ran a trace against the kubelet on the machine, and the flame graph confirmed that a large amount of its time was spent reading kernel information.

Summary

Following the direction the experts gave, we could basically pin down the real cause: too many scheduled tasks run on the machine, the kernel caches keep growing, and the kernel slows down. The slower kernel makes the TCP handshake take longer, which degrades the user experience. Once the problem is found, the fix is straightforward: add a task that checks whether the kernel has slowed down and, if it has, cleans the caches once:

sync && echo 3 > /proc/sys/vm/drop_caches
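Put together, the periodic task can look something like the sketch below. The probed file and the threshold are assumptions; the cleanup command is the one above and requires root:

```python
import subprocess
import time

PROBE_PATH = "/proc/meminfo"   # any cheap read that has to go through the kernel
THRESHOLD_SECONDS = 0.1        # what counts as "the kernel has become slow"

def kernel_read_latency(path=PROBE_PATH):
    start = time.monotonic()
    with open(path) as f:
        f.read()
    return time.monotonic() - start

if kernel_read_latency() > THRESHOLD_SECONDS:
    # Flush dirty pages, then drop the page cache, dentries and inodes.
    subprocess.run("sync && echo 3 > /proc/sys/vm/drop_caches",
                   shell=True, check=True)
```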

