Author | Mo Yuan Alibaba technical expert
I. sources of demand
First, let's look at where this requirement comes from: after migrating an application to Kubernetes, how do we keep it healthy and stable? It is actually quite simple and can be strengthened in two ways:
The first is to improve the application's observability; the second is to improve the application's recoverability.
In terms of observability, it can be enhanced in three ways:
First, the application's health status can be observed in real time; second, the application's resource usage can be obtained; third, the application's real-time logs can be obtained for problem diagnosis and analysis.
When a problem occurs, the first priority is to reduce its scope of impact while debugging and diagnosing it. Ideally, when something goes wrong, the application can fully recover through a self-healing mechanism integrated with Kubernetes.
II. Liveness and Readiness
This section introduces the Liveness probe and the Readiness probe.
Application health status: a first look at Liveness and Readiness
The Readiness probe, also called the readiness indicator, is used to determine whether a pod is ready. When a pod is ready, it can provide its service, meaning traffic from the access layer can reach the pod. When the pod is not ready, the access layer removes it from the traffic path.
Let's look at a simple example:
The following figure shows an example of the Readiness probe:
[figure: Readiness probe example]
While the probe keeps judging the pod as failed, access-layer traffic will not reach it. When the pod's state changes from failed to success, it can begin carrying traffic.
Similarly, the Liveness probe is a liveness indicator used to determine whether a pod is alive. What happens when a pod is judged not alive?
At that point, the upper-level mechanism decides whether the pod needs to be restarted. If the configured restart policy is Always, the pod will be restarted directly.
Application health status: how to use the probes
Next, let's look at how the Liveness and Readiness probes are used in practice.
Detection mode
The Liveness and Readiness probes support three different detection methods:
The first is httpGet: it sends an HTTP GET request and considers the application healthy when the returned status code is between 200 and 399. The second is exec: it executes a command inside the container, and an exit code of 0 indicates the container is healthy. The third is tcpSocket: it performs a TCP health check against the container's IP and port, and if the TCP connection can be established normally, the container is considered healthy.
Detection result
In terms of detection results, there are three main types:
The first is Success: the container passed the health check, i.e. the Liveness or Readiness probe is in a normal state. The second is Failure: the container failed the health check, and a corresponding action is taken. For Readiness, the handling happens through the service: the pod is removed from the service's endpoints. For Liveness, the pod is restarted or deleted. The third is Unknown: the probing did not complete, perhaps because of a timeout or a script that did not return in time; in that case neither probe takes any action, and the check is retried in the next cycle.
Inside kubelet there is a component called ProbeManager, which contains the Liveness and Readiness probes; these two probes apply the corresponding Liveness and Readiness diagnostics to the pod to make the actual judgments.
Application health status: Pod probe spec
The following shows how to configure the three detection methods in a YAML file.
First, let's look at exec, which is very easy to use. As shown in the figure below, this is a Liveness probe configured with an exec diagnostic. It configures a command field that cats a specific file to determine the probe's status. When the command exits with 0, the pod is considered to be in a healthy state.
Next, httpGet. It has three fields: the first is path, the second is port, and the third is headers. The headers field is needed only when the health judgment requires a specific HTTP header; usually, path and port are enough.
The third, tcpSocket, is actually quite simple to use: you only need to set the port to probe, for example port 8080 in this example. When a TCP connection to port 8080 can be established normally, the tcpSocket probe considers the container healthy.
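The three detection methods above can be sketched in a single Pod spec. This is a minimal illustration, not the original example from the course; the pod name, image, file path, and ports are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25             # placeholder image
    livenessProbe:
      exec:                       # healthy when the command exits with 0
        command: ["cat", "/tmp/healthy"]
    readinessProbe:
      httpGet:                    # healthy when the status code is 200-399
        path: /health
        port: 8080
    # A tcpSocket probe can be used in either slot instead:
    # tcpSocket:
    #   port: 8080
```

Each container can carry at most one livenessProbe and one readinessProbe, so the three methods are alternatives within a probe stanza, not siblings.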
In addition, there are the following five global parameters:
The first is initialDelaySeconds, which indicates how long probing is delayed after the pod starts. For example, a Java application may take a long time to start because the JVM has to start and the application's own jars have to load. Since this warm-up time is predictable, initialDelaySeconds should be set accordingly.
The second is periodSeconds, the probe interval; the default is 10 seconds.
The third is timeoutSeconds, the probe timeout; when the probe does not succeed within this period, it is considered a failure.
The fourth is successThreshold, the number of consecutive successes required for the probe to be considered successful again after having failed. The default is 1, meaning that after a failure, a single successful probe moves the pod back to a healthy probe state.
The last is failureThreshold, the number of failed retries for the probe. The default is 3, meaning that when probing fails three times in a row from a healthy state, the pod is judged to be in a failed state.
Application health status: Liveness and Readiness summary
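The five parameters described above sit alongside the detection method inside the probe stanza. A minimal sketch with illustrative values (the path, port, and delays are assumptions, not values from the original example):

```yaml
livenessProbe:
  httpGet:
    path: /health             # placeholder path
    port: 8080                # placeholder port
  initialDelaySeconds: 30     # wait for e.g. a JVM to start before probing
  periodSeconds: 10           # probe every 10 seconds (the default)
  timeoutSeconds: 3           # a probe taking longer than this counts as a failure
  successThreshold: 1         # one success flips the probe back to healthy
  failureThreshold: 3         # three consecutive failures mark the pod as failed
```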
Next, a brief summary of the Liveness and Readiness probes.
Introduction
The Liveness probe is a liveness indicator used to determine whether the container is alive, i.e. whether the pod is running. If the Liveness probe judges the container unhealthy, kubelet kills the corresponding pod and restarts the container according to the restart policy. If no Liveness probe is configured, the probe is considered successful by default.
The Readiness probe is used to determine whether the service in the container is ready, i.e. whether the pod's condition is Ready. If a probe result is unsuccessful, the pod is removed from the Endpoint, that is, removed from the access layer, and it is not attached to the corresponding endpoint again until the next check succeeds.
On detection failure
On detection failure, the Liveness probe kills the pod directly, while the Readiness probe cuts the association between the endpoint and the pod, that is, it cuts off traffic to the pod.
Applicable scenario
The Liveness probe suits applications that can recover by being restarted, while the Readiness probe mainly handles applications that cannot provide service immediately after startup.
Matters needing attention
There are some things to watch when using the Liveness and Readiness probes, because both need an appropriately configured detection method to avoid misfiring.
The first is to increase the timeout threshold. A shell script may take a long time to execute inside a container: a script that returns in 3 seconds on an ECS instance or a VM may take 30 seconds in the container. This time therefore needs to be measured in the container beforehand, and raising the timeout threshold prevents occasional timeouts when the container is under heavy load.
The second is to adjust the number of retries. The default of 3 is not necessarily best practice under a relatively short probe period; tuning the retry count appropriately is a better approach.
The third concerns exec: a shell-script check takes relatively long to invoke. It is suggested to use a compiled binary, for example one written in Golang, C, or C++, to perform the check; this is usually 30 to 50 percent more efficient than a shell script.
The fourth is that when using tcpSocket against a TLS service, there may be many half-open TCP connections left behind at the TLS layer; you need to judge whether such connections will affect the business.
III. Problem diagnosis
Next, let's talk about the diagnosis of common problems in K8s.
Application troubleshooting: understand the state mechanism
First, we need to understand a design concept in K8s: the state mechanism. K8s is designed around a state machine: a YAML defines a desired state, and various controllers are responsible for transitioning the actual state toward it.
For example, the figure above shows the life cycle of a Pod. It starts in Pending, then may switch to Running, to Unknown, or even to Failed. After Running for a while, it can switch to Succeeded or Failed, and from Unknown it may revert to Running, Succeeded, or Failed as state is recovered.
The overall state of K8s is based on this state-machine-like mechanism, and transitions between states are represented by fields such as Status or Conditions on the corresponding K8s objects.
The figure below shows how some status bits are presented on a Pod.
For example, a Pod has a field called Status, which represents the Pod's aggregate state; in this example the aggregate state is Pending.
Looking further down: because a pod contains multiple containers, each container has a State field whose status represents that container's aggregate state. In this example the container's aggregate state is Waiting, and the specific reason is that its image has not been pulled down yet: it is waiting for the image pull. The Ready condition is currently false because the image has not been pulled, so the pod cannot serve normally; and if the upper-level endpoint finds that Ready is not true, the service cannot provide external service.
Next, the Conditions mechanism: K8s has many finer-grained states, and their aggregation becomes the upper-level Status. How many conditions are there in this example? The first is Initialized, indicating whether initialization has finished; here it has, so the pod moved on to the second stage, the Ready condition. Because the container above has not pulled its image, Ready is false.
Then comes ContainersReady, indicating whether the containers are ready, which is also false here; and PodScheduled, indicating whether the pod has been scheduled. It is already bound to the current node, so this condition is true.
We can judge whether the overall state is normal by checking whether the corresponding conditions are true or false. In K8s, transitions between states emit events, of which there are two kinds: Normal events and Warning events. The first item shown is a Normal event whose reason is Scheduled, meaning the default scheduler dispatched the pod to the node cn-beijing192.168.3.167.
Next there is another Normal event indicating that the corresponding image is being pulled, followed by a Warning event indicating that the image pull failed.
By analogy, each transition in the K8s state machine produces a corresponding event, which is exposed as Normal or Warning. Through this event mechanism, together with the corresponding Conditions and Status fields, developers can judge the application's concrete state and carry out a series of diagnoses.
Application troubleshooting: common application exceptions
This section describes some common application exceptions, starting with several common states a pod may get stuck in.
Pod stays in Pending
The first is the Pending state, which indicates that the scheduler has not placed the pod. You can view the corresponding events with kubectl describe pod. If the pod cannot be scheduled because of resources, ports, or a node selector, you can see the result in the events: how many nodes were unsatisfied, how many because of insufficient CPU, how many because of node conditions, and how many because of label constraints.
Pod stays in waiting
The second state is that the pod may stay in Waiting. When the pod is in Waiting, it usually means its image cannot be pulled normally. The reason may be that the image is private but the pod's pull secret is not configured; or the image address does not exist, so the pull fails; or the image is on the public network and the pull fails.
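For the private-image case mentioned above, a minimal sketch of wiring a pull secret into a pod; the secret name and image address here are hypothetical, and the secret itself would be created beforehand (for example with kubectl create secret docker-registry):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo                      # hypothetical name
spec:
  imagePullSecrets:
  - name: my-registry-secret                    # hypothetical, pre-created secret
  containers:
  - name: app
    image: registry.example.com/team/app:v1     # placeholder private image
```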
Pod is repeatedly restarted with a backoff
The third is that the pod is repeatedly restarted and you can see something like CrashLoopBackOff. This usually means the pod has been scheduled but fails to start. At this point you should focus on the state of the application itself rather than on whether configuration and permissions are correct, and look at the pod's specific logs.
Pod is Running but not working properly
The fourth is a pod in the Running state that nevertheless does not serve externally. A common cause is some very detailed misconfiguration, for example a misspelled field: the YAML is applied successfully, but a section of it does not take effect, leaving the pod Running without serving. At this point you can check whether the YAML is valid with kubectl apply --validate -f pod.yaml. If the YAML is fine, then diagnose whether the configured ports are correct and whether Liveness or Readiness is configured correctly.
Service does not work properly
The last question is how to judge when a service does not work properly. A common cause is a problem with the service's own configuration. A service is associated with the underlying pods through a selector: labels are configured on the pods, and the service matches those labels. If the labels are misconfigured, the service cannot find the backing endpoints, and so it cannot provide service. When a service misbehaves, first check whether the service actually has endpoints behind it, then check whether those endpoints can serve normally.
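A minimal sketch of the selector/label pairing described above; the names, label, and ports are illustrative placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-svc              # hypothetical name
spec:
  selector:
    app: demo                 # must match the pod's labels exactly
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  labels:
    app: demo                 # a mismatch here leaves the service with no endpoints
spec:
  containers:
  - name: app
    image: nginx:1.25         # placeholder image
```

Checking `kubectl get endpoints demo-svc` quickly reveals whether the selector found any pods.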
IV. Application of remote debugging
This section explains how to debug applications in K8s. Remote debugging falls mainly into pod remote debugging and service remote debugging, plus remote debugging for performance optimization.
Application remote debugging: Pod remote debugging
First, after deploying an application to a cluster, when a problem is found you may need to verify it quickly, or, after a change, log in to the container for some diagnosis.
For example, you can enter a pod with exec. With a command like kubectl exec -it pod-name -- /bin/bash, you enter an interactive bash inside the pod, where you can run commands, for example modifying some configuration and restarting the application through supervisor.
What if the pod contains multiple containers? How do you specify which one to enter? There is a parameter -c for this, as shown in the command at the bottom of the figure above: -c is followed by a container name, e.g. kubectl exec -it pod-name -c container-name -- /bin/bash, followed by the specific command to run. This achieves entry into, and thus remote debugging of, a specific container in a multi-container pod.
Application remote debugging: Service remote debugging
What about remote debugging of a service? It divides into two directions:
The first is exposing a local service into a remote cluster, allowing applications in the remote cluster to invoke it; this is the reverse direction. The other is having the local service call services in the remote cluster; this is the forward direction.
For the reverse direction, there is an open source component called Telepresence that proxies a local application as a service in a remote cluster in a very simple way.
First, deploy Telepresence's proxy application to the remote K8s cluster. Then swap a remote deployment for the local application, using a command like telepresence --swap-deployment followed by the remote DEPLOYMENT_NAME. In this way the local application is proxied as the remote service, and the application can be debugged locally against the remote cluster. Interested readers can look at GitHub to see how this plugin is used.
The second direction is when the local application needs to call a service in the remote cluster: the remote application can be mapped to a local port with port-forward. For example, there is an API server in the remote cluster exposing some port, and while debugging code locally you want to call that API server directly; a relatively simple way is port-forward.
It is used as kubectl port-forward, then service/ plus the remote service name, plus the corresponding namespace, followed by additional parameters such as the port mapping. Through this mechanism a remote application is proxied to a local port, and you can reach the remote service by accessing the local port.
An open source debugging tool: kubectl-debug
Finally, an introduction to an open source debugging tool, also a kubectl plugin, called kubectl-debug. We know that in K8s the common container runtimes, such as docker or containerd, both use Linux-namespace-based mechanisms for virtualization and isolation.
Usually an image does not contain many debugging tools, such as netstat or telnet, because they would bloat the application image. So what do you do when you want to debug? You can rely on a tool like kubectl-debug. It works through Linux namespaces: it attaches an extra container to the target's Linux namespaces, and any debug action performed in that container is effectively the same as debugging the target namespaces directly. Here is a simple demonstration:
Here kubectl-debug has been installed as a kubectl plugin, so a remote pod can be diagnosed directly through the kubectl-debug command. In this example, when debug is executed it first pulls an image that ships some diagnostic tools by default; once the image starts, it launches the debug container and joins it to the namespaces of the container you want to diagnose. That is, the debug container shares the same namespaces as the target, so things like the network stack or kernel parameters can be viewed in real time from inside the debug container.
In this example, things like hostname, processes, and netstat are all in the same environment as the pod being debugged, so the three preceding commands show the relevant information.
If you log out at this point, the debug pod is killed and you exit, with no effect on the application. In this way a diagnosis is achieved without intruding into the target container.
In addition, it supports some extra mechanisms: for example, you can provide a custom image with tools like htop installed, so developers can define the command-line tools they need, bake them into an image, and then debug a remote pod with it.
Summary of this section
Liveness and Readiness probes: the Liveness probe is the liveness indicator, used to see whether the pod is alive; the Readiness probe is the readiness indicator, which determines whether the pod is ready and, if so, able to serve externally. This is the key distinction to remember about Liveness and Readiness.
Application diagnosis has three steps: first, describe the corresponding object to see its Status; then use the Status to narrow down a specific diagnosis direction; finally, look at the object's Events for more detailed information.
Check the pod's logs to locate the state of the application itself.
As a remote-debugging strategy: to proxy a local application into a remote cluster, use a tool like Telepresence; to proxy a remote application locally and then call or debug it, use a mechanism like port-forward.
The Alibaba Cloud Native WeChat official account (ID: Alicloudnative) focuses on microservices, Serverless, containers, Service Mesh and other technology areas, on popular cloud-native technology trends, and on large-scale cloud-native landing practices.