How to use TKE cluster audit to troubleshoot problems
This article explains how to use TKE cluster audit to troubleshoot problems, a topic many people may not know well. The key points are summarized below; hopefully you will get something out of it.
Overview
Sometimes cluster resources are deleted or modified inexplicably. The cause may be human misoperation, or an application bug or malicious program calling the apiserver API, so it is necessary to find the "real culprit". In that case we need to enable auditing for the cluster to record all apiserver API calls, and then search and analyze the audit logs by condition to find the cause.
For a brief introduction to TKE cluster audit and its basic operations, refer to the official cluster audit documentation. Because audit data is stored in the log service (CLS), we retrieve and analyze the audit records in the log service console. For the retrieval syntax, refer to the log search syntax and rules; for analysis, you also need the SQL statements supported by the log service, as described in the introduction to log service analysis.
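As a minimal illustration of the retrieval syntax (key:value conditions combined with AND, OR, and NOT), a hypothetical query for ConfigMap updates might look like this; the field names come from the Kubernetes audit event schema used throughout this article:
objectRef.resource:configmaps AND verb:update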
Note: only applicable to TKE clusters
Scenario examples
Here are some examples of cluster audit usage scenarios and queries.
Find out who did it
If a node is cordoned (marked unschedulable) and you don't know whether an application or a human operator did it, you can find out. With cluster audit enabled, retrieve the events with the following statement:
objectRef.resource:nodes AND requestObject:unschedulable
In the layout settings, display three fields: user.username, requestObject, and objectRef.name, which show the operating user, the request content, and the node name respectively.
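For reference, a matching audit event looks roughly like the following. This is a trimmed, hypothetical record based on the audit.k8s.io/v1 Event schema, showing only the fields discussed here; a cordon operation arrives as a patch that sets spec.unschedulable:
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "verb": "patch",
  "user": { "username": "10001users" },
  "objectRef": { "resource": "nodes", "name": "main.63u5qua9.0" },
  "requestObject": { "spec": { "unschedulable": true } }
}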
The query results show that the sub-account 10001users cordoned the node main.63u5qua9.0 at 16:13:22 on 2020-10-09. We can find more information about this sub-account by its account ID under Access Management > User > User List.
If a workload is deleted and you want to know who deleted it, take the Deployment nginx as an example and query:
objectRef.resource:deployments AND objectRef.name:"nginx" AND verb:"delete"
Query results:
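If you are not sure which resource type was removed, a broader hedged variant is to search by namespace and verb only; objectRef.namespace is part of the same audit event schema, and "default" here is just a placeholder:
objectRef.namespace:"default" AND verb:"delete"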
Find the real culprit behind apiserver rate limiting
The apiserver is protected by default request rate limits, which prevent malicious programs or buggy clients from sending requests so frequently that the apiserver and etcd become overloaded and normal requests are affected. If rate limiting occurs, we can use the audit logs to find out who is sending large numbers of requests.
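For context, in upstream Kubernetes this default protection comes from kube-apiserver's inflight request limits; the values below are the upstream defaults, and a managed TKE control plane may configure them differently:
--max-requests-inflight=400
--max-mutating-requests-inflight=200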
If we want to use userAgent to aggregate and count requests by client, we first need to modify the key-value index of the log topic to enable statistics for the userAgent field:
Use the following SQL statement to count the QPS of each client's requests to the apiserver:
* | SELECT CAST((__TIMESTAMP_US__ / 1000 - __TIMESTAMP_US__ / 1000 % 1000) AS TIMESTAMP) AS time, COUNT(1) AS qps, userAgent GROUP BY time, userAgent ORDER BY time
Switch to chart analysis, select a line chart, use time for the X axis, qps for the Y axis, and userAgent as the aggregation column:
You can see that the data is there, but there may be too many series for the small panel to display clearly. Click "Add to dashboard" and enlarge the display:
In this example, kube-state-metrics requests the apiserver far more frequently than any other client, so the "real culprit" is kube-state-metrics. Checking its logs confirms that kube-state-metrics keeps retrying requests to the apiserver because of an RBAC permissions issue, which triggers the apiserver rate limit:
I1009 13:13:09.760767       1 request.go:538] Throttling request took 1.393921018s, request: GET:https://172.16.252.1:443/api/v1/endpoints?limit=500&resourceVersion=1029843735
E1009 13:13:09.766106       1 reflector.go:156] pkg/mod/k8s.io/client-go@v0.0.0-20191109102209-3c0d1af94be5/tools/cache/reflector.go:108: Failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:monitoring:kube-state-metrics" cannot list resource "endpoints" in API group "" at the cluster scope
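You could also confirm this from the audit side by retrieving the forbidden requests directly. responseStatus.code is part of the same audit event schema, and RBAC denials surface as HTTP 403 (a hedged example):
user.username:"system:serviceaccount:monitoring:kube-state-metrics" AND responseStatus.code:403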
Similarly, if you want to distinguish clients by other fields, you can adapt the SQL as needed, for example using user.username instead. The SQL reads as follows:
* | SELECT CAST((__TIMESTAMP_US__ / 1000 - __TIMESTAMP_US__ / 1000 % 1000) AS TIMESTAMP) AS time, COUNT(1) AS qps, user.username GROUP BY time, user.username ORDER BY time
Display effect:
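If you only need an overall ranking rather than a time series, a simpler hedged variant aggregates the total request count per user and keeps the top entries:
* | SELECT user.username, COUNT(1) AS total GROUP BY user.username ORDER BY total DESC LIMIT 10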
After reading the above, you should have a better understanding of how to use TKE cluster audit to troubleshoot problems.