
How to use the two sharp weapons of K8s to get rid of the predicament of operation and maintenance

2025-01-18 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 report

Today, let's talk about how to use two sharp tools of K8s to get out of common operations predicaments. Many readers may not know much about them, so the following summary is meant to help you understand; we hope you take something away from this article.

Overview

K8s users have most likely run into problems like these in daily cluster operations:

An application in the cluster was deleted. Who did it?

The apiserver load suddenly spikes and a large number of requests fail. What is happening in the cluster?

What caused a cluster node to become NotReady?

The cluster's nodes were automatically scaled out. What triggered it, and when?

In the past, it was not easy to troubleshoot these problems. A production Kubernetes cluster is usually a very complex system: at the bottom sits heterogeneous cloud infrastructure (hosts, networks, storage); on top runs a large amount of application load; and in between, a variety of native components (e.g. the scheduler, the kubelet) and third-party components (e.g. various Operators) manage and schedule the infrastructure and applications. On top of that, people in different roles frequently deploy applications, add nodes, and perform other operations on the cluster. To know as much as possible about what is happening while the cluster runs, we usually observe it from multiple dimensions.

Logs, one of the three pillars of software observability, provide key clues for understanding how a system is running and for troubleshooting faults, and they play a vital role in operations management. Kubernetes provides two native forms of logging, audits and events, which record, respectively, access to cluster resources and events occurring in the cluster. In the Tencent Cloud container team's long experience operating K8s clusters, audits and events are anything but dispensable: used well, they greatly improve cluster observability and bring great convenience to operations. Let's take a brief look at each.

What is Kubernetes audit?

The Kubernetes audit log is a structured log generated by kube-apiserver according to a configurable policy; it records access events against the apiserver. Audit logs provide a cluster observation dimension beyond metrics. By viewing and analyzing them, you can trace changes to cluster state, understand the health of the cluster, troubleshoot anomalies, and uncover potential security and performance risks.
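What gets recorded is governed by an audit policy file passed to kube-apiserver through the --audit-policy-file flag, where the first matching rule decides the level of detail. The sketch below models that first-match-wins logic in Python; the rule contents are illustrative, and a real policy is a YAML file with the same shape.

```python
# A minimal audit policy, written here as the Python equivalent of the YAML
# passed to kube-apiserver via --audit-policy-file. Levels range from None
# (do not log) through Metadata and Request up to RequestResponse.
policy = {
    "apiVersion": "audit.k8s.io/v1",
    "kind": "Policy",
    "rules": [
        # Skip noisy watch traffic from kube-proxy entirely.
        {"level": "None", "users": ["system:kube-proxy"], "verbs": ["watch"]},
        # For Secrets, record only request metadata so sensitive payloads
        # never land in the audit log.
        {"level": "Metadata", "resources": [{"group": "", "resources": ["secrets"]}]},
        # Everything else: metadata plus the request body.
        {"level": "Request"},
    ],
}

def level_for(user, verb, resource):
    """Return the audit level of the first matching rule (first match wins)."""
    for rule in policy["rules"]:
        if "users" in rule and user not in rule["users"]:
            continue
        if "verbs" in rule and verb not in rule["verbs"]:
            continue
        if "resources" in rule and not any(
            resource in r["resources"] for r in rule["resources"]
        ):
            continue
        return rule["level"]
    return "Metadata"

print(level_for("alice", "delete", "secrets"))      # Metadata
print(level_for("alice", "delete", "deployments"))  # Request
```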

Audit source

In Kubernetes, all queries and modifications of cluster state go through requests to the apiserver, and the sources of those requests fall into four categories:

Control-plane components, such as the scheduler, the various controllers, and the apiserver itself

Agents on the nodes, such as the kubelet, kube-proxy, etc.

Other services running in the cluster, such as CoreDNS, ingress controllers, and various third-party Operators

External users, such as cluster operators working through kubectl

What is recorded in an audit log?

Each audit entry is a structured JSON record with three parts: metadata, request content (requestObject), and response content (responseObject). The metadata is always present; whether the request and response bodies are present depends on the audit level. The metadata carries the context of the request, such as who initiated it, where it came from, and the URI accessed.
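As a concrete illustration, here is a hypothetical, abridged Metadata-level audit record and how the context fields answer the basic questions; the field values are invented for the example.

```python
import json

# A hypothetical, abridged audit record at the Metadata level: the request
# context is present, requestObject/responseObject bodies are omitted.
raw = """
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "stage": "ResponseComplete",
  "verb": "delete",
  "user": {"username": "10001", "groups": ["system:authenticated"]},
  "sourceIPs": ["172.16.18.5"],
  "requestURI": "/apis/apps/v1/namespaces/default/deployments/nginx",
  "objectRef": {"resource": "deployments", "namespace": "default", "name": "nginx"},
  "responseStatus": {"code": 200}
}
"""

entry = json.loads(raw)

# The basic questions map directly onto metadata fields:
who = entry["user"]["username"]    # who initiated the request
where = entry["sourceIPs"][0]      # where it came from
what = f'{entry["verb"]} {entry["objectRef"]["resource"]}/{entry["objectRef"]["name"]}'
print(who, where, what)            # 10001 172.16.18.5 delete deployments/nginx
```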

What's the use of auditing?

As the only entry point for querying and changing resources in a Kubernetes cluster, the apiserver effectively sees every access to the cluster. Through audit logs you can understand how the whole cluster is operating, both macroscopically and microscopically. For example:

A resource was deleted. When was it deleted, and who deleted it?

A service has a problem. When was its last version change made?

What caused apiserver response latency to grow, a large number of 5XX response status codes, or a rise in apiserver load?

The apiserver is returning 401 or 403 responses. Is it an expired certificate, unauthorized access, or a misconfigured RBAC rule?

The apiserver receives a large number of requests for sensitive resources from public-network IPs. Are those requests legitimate, and do they pose a security risk?
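Once audit records are parsed, the questions above reduce to simple filters over the record stream. A minimal sketch on a few invented records:

```python
# Hypothetical stream of parsed audit records (metadata fields only).
records = [
    {"verb": "delete", "user": {"username": "10001"},
     "objectRef": {"resource": "deployments", "name": "nginx"},
     "responseStatus": {"code": 200}},
    {"verb": "list", "user": {"username": "tke-kube-state-metrics"},
     "objectRef": {"resource": "volumeattachments", "name": ""},
     "responseStatus": {"code": 403}},
    {"verb": "get", "user": {"username": "kubelet"},
     "objectRef": {"resource": "pods", "name": "web-0"},
     "responseStatus": {"code": 200}},
]

# Who deleted what?
deletions = [(r["user"]["username"], r["objectRef"]["resource"], r["objectRef"]["name"])
             for r in records if r["verb"] == "delete"]

# Which requests were denied (401/403) and might point at an expired
# certificate or a misconfigured RBAC rule?
denied = [r for r in records if r["responseStatus"]["code"] in (401, 403)]

print(deletions)
print([d["user"]["username"] for d in denied])
```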

What is a Kubernetes event?

Event is one of the many resource objects in Kubernetes. It records state changes in the cluster, from node exceptions to Pod startup, successful scheduling, and so on. The familiar kubectl describe command shows the events associated with a resource.

What is recorded in an event?

Level (Type): currently only "Normal" and "Warning", though custom types can be used if needed.

Involved object (InvolvedObject): the object the event concerns, such as a Pod, Deployment, or Node.

Event source (Source): the component that reported the event, such as the scheduler or the kubelet.

Reason (Reason): a short description of the event, usually an enumerated value, mainly for programmatic use.

Message (Message): a detailed, human-readable description of the event.

Count (Count): how many times the event has occurred.
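Put together, these fields appear roughly as follows on a v1 Event object (the shape that kubectl get events -o json returns per item); the values below are illustrative:

```python
# A hypothetical Event showing how the fields listed above appear on a
# v1 Event object; the values are invented for the example.
event = {
    "type": "Warning",                       # level: Normal or Warning
    "involvedObject": {"kind": "Node", "name": "172.16.18.13"},
    "source": {"component": "kubelet", "host": "172.16.18.13"},
    "reason": "FreeDiskSpaceFailed",         # short, enum-like machine string
    "message": "failed to garbage collect required amount of images",
    "count": 8,                              # how many times it occurred
}

# Warnings are the events worth alerting on first.
if event["type"] == "Warning":
    summary = (f'{event["involvedObject"]["kind"]}/{event["involvedObject"]["name"]}: '
               f'{event["reason"]} (x{event["count"]})')
    print(summary)  # Node/172.16.18.13: FreeDiskSpaceFailed (x8)
```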

What's the use of events?

Chaos inside the cluster, calm outside: this is a situation we often meet in daily cluster operations. If we do not perceive what is happening in the cluster through events, we are likely to miss the best time to handle a problem; by the time the problem grows large enough to affect the business, it is often too late. Besides early detection, events are also the best helper for troubleshooting. Because events record comprehensive cluster state-change information, most cluster problems can be traced through them. In short, events play two important roles in the cluster:

"Whistleblower": when something abnormal happens in the cluster, users can learn of it immediately through events.

"Witness": large and small events in the cluster will be recorded through Event. If unexpected events occur in the cluster, such as abnormal node status and Pod restart, you can find the time point and cause of the occurrence through the event.

How TKE taps the value of audits and events

The traditional way of using audits and events, typing query statements to retrieve logs, offers high flexibility but also a high barrier to entry: users must be familiar both with the data structures of the logs and with Lucene or SQL syntax. This often leads to inefficient use and leaves much of the data's value unexplored.

Tencent Kubernetes Engine (TKE) and Tencent Cloud Log Service (CLS) together provide a one-stop, product-level service for collecting, storing, retrieving, and analyzing Kubernetes audits and events. It offers one-click enable/disable and removes all tedious configuration. Drawing on long experience operating a massive number of clusters, the container team has also distilled best practices for using audits and events, presenting audit logs and cluster events in multidimensional visual charts. Users only need basic K8s concepts to perform retrieval and analysis intuitively on the TKE console, which covers most common cluster operations scenarios. This makes both discovering and locating problems far more efficient and truly maximizes the value of audit and event data.

How do I use the TKE audit / event service to troubleshoot problems?

For an overview of TKE's cluster auditing and events and their basic operations, please refer to the official documentation on cluster auditing and event storage.

Example scenario:

Let's take a look at some typical real-world scenarios.

Example 1: troubleshooting a deleted workload

On the audit retrieval page, click the "K8s Object Operation Overview" tab and specify the operation type and resource object.

The query results are shown in the following figure:

As can be seen from the figure, account 10001 deleted the application "nginx". You can look up more information about this account by its ID under **Access Management** -> **User List**.

Example 2: troubleshooting a cordoned node

On the audit retrieval page, click the "Overview of Node Operations" tab and fill in the blocked node name

The query results are shown in the following figure:

As can be seen from the figure, account 10001 cordoned node 172.16.18.13 at 2020-01-30T06:22:18.

Example 3: troubleshooting slow apiserver responses

In the "aggregation Retrieval" tab of audit retrieval, the Apiserver access aggregation trend chart is provided from multiple dimensions, such as user, operation type, return status code, and so on.

As can be seen from the figures, the access volume of user tke-kube-state-metrics is far higher than that of any other user; the "Operation Type Distribution Trend" chart shows that most of its requests are list operations, and the "Status Code Distribution Trend" chart shows that most of the status codes are 403. Combined with the component's own logs, it turns out that due to an RBAC authentication problem, the tke-kube-state-metrics component kept retrying requests against the apiserver, causing a sharp rise in apiserver traffic. The log reads as follows:

E1130 06:19:37.368981       1 reflector.go:156] pkg/mod/k8s.io/client-go@v0.0.0-20191109102209-3c0d1af94be5/tools/cache/reflector.go:108: Failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:tke-kube-state-metrics" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
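A 403 like this is typically resolved by granting the component's ServiceAccount the missing permission. A hedged sketch: the dict below mirrors a ClusterRole manifest that would allow the denied request (the role name and rule set are illustrative, not TKE's actual manifest), together with a rough permission check:

```python
# Mirrors a ClusterRole manifest granting the denied permission; the role
# name and rule set are illustrative, not TKE's actual manifest.
cluster_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "tke-kube-state-metrics"},
    "rules": [
        {"apiGroups": ["storage.k8s.io"],
         "resources": ["volumeattachments"],
         "verbs": ["list", "watch"]},
    ],
}

def allows(role, group, resource, verb):
    """Rough check: does any rule in the role permit this request?"""
    return any(
        group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in role["rules"]
    )

# The exact request the audit log showed being denied with 403:
print(allows(cluster_role, "storage.k8s.io", "volumeattachments", "list"))  # True
```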

Example 4: troubleshooting node exceptions

A node in the cluster behaves abnormally. On the event retrieval page, click the "Event Overview" tab and enter the name of the abnormal node in the filter.

The query results show an event recording insufficient disk space on the node, as shown below:

Taking a closer look at the trend of the abnormal events:

It can be seen that starting from 2020-11-25, node 172.16.18.13 went abnormal due to insufficient disk space, after which the kubelet began trying to evict pods on the node to reclaim disk space.
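The timeline reading above (first the warning, then eviction activity) can be reproduced from raw events by sorting on the timestamp. A small sketch with invented timestamps and kubelet-style reason strings:

```python
from datetime import datetime

# Invented events for node 172.16.18.13; the reasons follow kubelet's
# naming style, the timestamps are made up for the sketch.
events = [
    {"time": "2020-11-25T08:02:10Z", "type": "Warning", "reason": "EvictionThresholdMet"},
    {"time": "2020-11-25T08:01:00Z", "type": "Warning", "reason": "FreeDiskSpaceFailed"},
    {"time": "2020-11-25T08:02:30Z", "type": "Normal",  "reason": "NodeHasDiskPressure"},
]

def ts(event):
    """Parse the RFC 3339-style timestamp for sorting."""
    return datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")

# Sort into a timeline and find where the trouble started.
timeline = sorted(events, key=ts)
first_warning = next(e for e in timeline if e["type"] == "Warning")
print(first_warning["reason"], first_warning["time"])
# FreeDiskSpaceFailed 2020-11-25T08:01:00Z
```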

Example 5: finding what triggered a node scale-up

If "auto scaling" is enabled for a node pool, the cluster-autoscaler (CA) component automatically increases or decreases the number of nodes in the cluster according to load. If nodes in the cluster are automatically scaled out (or in), users can trace the whole process through event retrieval.

On the event retrieval page, click "Global search" and enter the following search command:

event.source.component : "cluster-autoscaler"

In the hidden fields on the left, select event.reason, event.message, and event.involvedObject.name for display, and sort the results in descending order of log time. The results are shown below:

From the event trail in the figure above, you can see that a node scale-up took place around 2020-11-25 20:35:45, triggered by three nginx Pods (nginx-5dbf784b68-tq8rd, nginx-5dbf784b68-fpvbx, nginx-5dbf784b68-v9jv5). Three nodes were eventually added, and no further scale-up was triggered because the node pool had reached its maximum node count.
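The same trail can be reconstructed programmatically by applying the filter from the query above to a list of events; the events below are invented stand-ins for what the search returns:

```python
# Invented stand-ins for what the "Global Search" query returns. The
# cluster-autoscaler reason strings (TriggeredScaleUp, ScaledUpGroup)
# follow the component's real naming; the object names are illustrative.
events = [
    {"source": {"component": "cluster-autoscaler"},
     "reason": "TriggeredScaleUp",
     "involvedObject": {"kind": "Pod", "name": "nginx-5dbf784b68-tq8rd"}},
    {"source": {"component": "kubelet"},
     "reason": "Pulling",
     "involvedObject": {"kind": "Pod", "name": "nginx-5dbf784b68-tq8rd"}},
    {"source": {"component": "cluster-autoscaler"},
     "reason": "ScaledUpGroup",
     "involvedObject": {"kind": "ConfigMap", "name": "cluster-autoscaler-status"}},
]

# Same filter as the query: event.source.component : "cluster-autoscaler"
ca_events = [e for e in events if e["source"]["component"] == "cluster-autoscaler"]
print([e["reason"] for e in ca_events])  # ['TriggeredScaleUp', 'ScaledUpGroup']
```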

After reading the above, do you have a better understanding of how to use these two sharp tools of K8s to get out of operations predicaments? If you want to learn more, please follow the industry information channel. Thank you for your support.
