Strong self-healing is an important feature of container orchestration engines such as Kubernetes, and its default form is to automatically restart failed containers. Beyond that, users can configure finer-grained health checks with the Liveness and Readiness probe mechanisms to achieve the following:
Zero-downtime deployments; avoiding the rollout of broken images; safer rolling upgrades.
Let's learn the Health Check function of Kubernetes through practice.
Default health check
Let's first learn the default health check mechanism of Kubernetes:
Each container starts with the process specified by the CMD or ENTRYPOINT of its Dockerfile. If that process exits with a non-zero return code, the container is considered to have failed, and Kubernetes restarts it according to restartPolicy.
Let's simulate a container failure scenario. The Pod configuration file is as follows:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: healthcheck
  name: healthcheck
spec:
  restartPolicy: OnFailure
  containers:
  - name: healthcheck
    image: busybox
    args:
    - /bin/sh
    - -c
    - sleep 10; exit 1
The restartPolicy of the Pod is set to OnFailure; the default is Always.
sleep 10; exit 1 simulates the container failing 10 seconds after it starts.
Execute kubectl apply to create a Pod, named healthcheck.
# kubectl apply -f healthcheck.yml
pod/healthcheck created
Check the status of the Pod a few minutes later:
# kubectl get pod healthcheck
NAME          READY   STATUS             RESTARTS   AGE
healthcheck   0/1     CrashLoopBackOff   4          3m39s
You can see that the container has been restarted 4 times.
In the example above, the container process returned a non-zero exit code, so Kubernetes considered the container failed and restarted it. There are, however, many failures in which the process does not exit. For example, a web server may return 500 internal errors because of system overload or a resource deadlock, yet the httpd process never exits abnormally. In such cases, restarting the container may still be the most direct and effective remedy. How can the Health Check mechanism handle scenarios like this?
The answer is Liveness detection, which we'll learn in the next section.
Liveness probe
The Liveness probe allows users to customize the conditions that determine whether the container is healthy or not. If the probe fails, Kubernetes restarts the container.
As an example, create the following Pod:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness
spec:
  restartPolicy: OnFailure
  containers:
  - name: liveness
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 10
      periodSeconds: 5
The startup process first creates the file /tmp/healthy and deletes it 30 seconds later. In our setup, the container is considered healthy as long as /tmp/healthy exists; once the file is gone, a failure has occurred.
The livenessProbe section defines how to perform Liveness probes:
The probe checks whether the file /tmp/healthy exists by running the cat command. If the command succeeds (returns zero), Kubernetes considers the Liveness probe successful; if it returns a non-zero value, the probe fails.
initialDelaySeconds: 10 specifies that the first Liveness probe runs 10 seconds after the container starts. It is usually set according to the application's startup time; for example, if an application normally needs 30 seconds to start, initialDelaySeconds should be greater than 30.
periodSeconds: 5 specifies that the Liveness probe runs every 5 seconds. If three consecutive Liveness probes fail (three is the default value of the probe's failureThreshold), Kubernetes kills and restarts the container.
Create the Pod liveness below:
# kubectl apply -f liveness.yaml
pod/liveness created
As the configuration file shows, /tmp/healthy exists during the first 30 seconds, so cat returns 0 and the Liveness probe succeeds. During this period, the Events section of kubectl describe pod liveness shows normal events:
# kubectl describe pod liveness
Events:
  Type    Reason     Age   From                Message
  ----    ------     ----  ----                -------
  Normal  Scheduled  31s   default-scheduler   Successfully assigned default/liveness to k8s-node2
  Normal  Pulling    30s   kubelet, k8s-node2  Pulling image "busybox"
  Normal  Pulled     30s   kubelet, k8s-node2  Successfully pulled image "busybox"
  Normal  Created    30s   kubelet, k8s-node2  Created container liveness
  Normal  Started    29s   kubelet, k8s-node2  Started container liveness
After about 35 seconds, the events show that /tmp/healthy no longer exists and the Liveness probe fails. A few tens of seconds and several failed probes later, the container is restarted.
# kubectl describe pod liveness
Events:
  Type     Reason     Age               From                Message
  ----     ------     ----              ----                -------
  Normal   Scheduled  47s               default-scheduler   Successfully assigned default/liveness to k8s-node2
  Normal   Pulling    46s               kubelet, k8s-node2  Pulling image "busybox"
  Normal   Pulled     46s               kubelet, k8s-node2  Successfully pulled image "busybox"
  Normal   Created    46s               kubelet, k8s-node2  Created container liveness
  Normal   Started    45s               kubelet, k8s-node2  Started container liveness
  Warning  Unhealthy  3s (x3 over 13s)  kubelet, k8s-node2  Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
  Normal   Killing    3s                kubelet, k8s-node2  Container liveness failed liveness probe, will be restarted

# kubectl get pod liveness
NAME       READY   STATUS    RESTARTS   AGE
liveness   1/1     Running   1          76s
In addition to Liveness probes, the Kubernetes Health Check mechanism also includes Readiness probes.
Readiness probe
A Liveness probe tells Kubernetes when to restart a container to achieve self-healing, while a Readiness probe tells Kubernetes when a container is ready to be added to the Service load-balancing pool and serve requests.
The configuration syntax of Readiness probe is exactly the same as that of Liveness probe. Here is an example:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: readiness
  name: readiness
spec:
  restartPolicy: OnFailure
  containers:
  - name: readiness
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    readinessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 10
      periodSeconds: 5
This configuration file simply replaces liveness in the previous example with readiness. Let's see what the difference is.
[root@k8s-master ~]# kubectl get pod readiness
NAME        READY   STATUS    RESTARTS   AGE
readiness   0/1     Running   0          10s
[root@k8s-master ~]# kubectl get pod readiness
NAME        READY   STATUS    RESTARTS   AGE
readiness   1/1     Running   0          20s
[root@k8s-master ~]# kubectl get pod readiness
NAME        READY   STATUS    RESTARTS   AGE
readiness   1/1     Running   0          35s
[root@k8s-master ~]# kubectl get pod readiness
NAME        READY   STATUS    RESTARTS   AGE
readiness   0/1     Running   0          61s
The READY state of Pod readiness has undergone the following changes:
When the Pod was first created, READY was unavailable (0/1). 15 seconds later (initialDelaySeconds + periodSeconds), the Readiness probe ran for the first time and succeeded, so READY became available (1/1). After 30 seconds /tmp/healthy was deleted, and once three consecutive Readiness probes had failed, READY was set back to unavailable (0/1).
You can also see the log of failed Readiness probes through kubectl describe pod readiness.
Events:
  Type     Reason     Age                From                Message
  ----     ------     ----               ----                -------
  Normal   Scheduled  95s                default-scheduler   Successfully assigned default/readiness to k8s-node2
  Normal   Pulling    94s                kubelet, k8s-node2  Pulling image "busybox"
  Normal   Pulled     94s                kubelet, k8s-node2  Successfully pulled image "busybox"
  Normal   Created    93s                kubelet, k8s-node2  Created container readiness
  Normal   Started    93s                kubelet, k8s-node2  Started container readiness
  Warning  Unhealthy  4s (x12 over 59s)  kubelet, k8s-node2  Readiness probe failed: cat: can't open '/tmp/healthy': No such file or directory
Here's a comparison between Liveness probe and Readiness probe:
Liveness and Readiness are two different Health Check mechanisms. If neither is explicitly configured, Kubernetes falls back to the same default behavior for both: judging whether the container's startup process exits with a return value of zero.
The two probes are configured in exactly the same way and support the same parameters. The difference is what happens after a failed probe: a failed Liveness probe restarts the container, while a failed Readiness probe marks the container unavailable so that it no longer receives requests forwarded by the Service.
Liveness and Readiness probes run independently of each other, with no dependency between them, so they can be used alone or together. Use a Liveness probe to decide whether a container needs to be restarted for self-healing; use a Readiness probe to decide whether a container is ready to serve requests.
Next, let's see how to use Health Check in real business scenarios.
Using Health Check in Scale Up
For a multi-replica application, when a Scale Up operation is performed, the new replicas are added to the Service load balancer as backends and handle client requests together with the existing replicas. Since application startup usually needs a preparation phase, such as loading cached data or connecting to the database, it takes some time from container start until the container can actually serve requests. We can use a Readiness probe to determine whether a container is ready and avoid sending requests to backends that are not yet ready.
The following is the configuration file for the sample application.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  template:
    metadata:
      labels:
        run: web
    spec:
      containers:
      - name: web
        image: myhttpd
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            scheme: HTTP
            path: /healthy
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    run: web
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80
Focus on the readinessProbe section. Here we use a probe method different from exec: httpGet. Kubernetes considers this probe successful when the return code of the HTTP request is between 200 and 400.
scheme specifies the protocol and supports HTTP (the default) and HTTPS.
path specifies the access path.
port specifies the port.
The purpose of the above configuration is:
The probe begins 10 seconds after the container starts.
If the return code of http://[container_ip]:8080/healthy is not in the 200-400 range, the container is not ready and does not receive requests from the Service web-svc.
The probe is repeated every 5 seconds.
Once the return code falls in the 200-400 range, the container is considered ready and is added to the load-balancing pool of web-svc, where it starts handling client requests.
The probe keeps running at 5-second intervals. If three consecutive probes fail, the container is removed from the load balancer and is only added back once a later probe succeeds.
For http://[container_ip]:8080/healthy, the application can implement its own judgment logic, such as checking whether the dependent database is ready. The sample code is as follows:
① Define the handler for /healthy.
② Connect to the database and execute a test SQL statement.
③ If the test succeeds, return normally with code 200.
④ If the test fails, return error code 503.
⑤ Listen on port 8080.
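The sample code itself is not reproduced in this text, so the following is only a minimal sketch of such a handler, assuming a Go application backed by a SQL database; the mysql driver import, the connection string, and the SELECT 1 test query are illustrative placeholders rather than details from the original example. The numbered comments correspond to the annotations above.

package main

import (
	"database/sql"
	"net/http"

	_ "github.com/go-sql-driver/mysql" // hypothetical driver; use whatever matches the real backend
)

func main() {
	// ① define the handler for /healthy
	http.HandleFunc("/healthy", func(w http.ResponseWriter, r *http.Request) {
		// ② connect to the database and execute a test SQL statement
		// (the DSN below is a placeholder, not taken from the original article)
		db, err := sql.Open("mysql", "user:password@tcp(db:3306)/app")
		if err == nil {
			defer db.Close()
			var one int
			err = db.QueryRow("SELECT 1").Scan(&one)
		}
		if err == nil {
			// ③ test succeeded: return normally with code 200
			w.WriteHeader(http.StatusOK)
			w.Write([]byte("OK"))
			return
		}
		// ④ test failed: return error code 503
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	// ⑤ listen on port 8080
	http.ListenAndServe(":8080", nil)
}

With a handler like this in place, the readinessProbe httpGet in the Deployment above simply requests /healthy on port 8080 and acts on the resulting 200 or 503.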
For important applications in production, it is recommended to configure Health Check so that only containers that are truly ready act as Service backends and handle client requests.
Using Health Check in Rolling Update
The previous section discussed the application of Health Check in Scale Up, and another important application scenario of Health Check is Rolling Update. Consider the following situation:
Suppose you have a running multi-replica application and you update it (for example, with a newer image version). Kubernetes starts new replicas, and the following happens:
Normally a new replica needs 10 seconds of preparation and cannot respond to business requests before that. However, because of a human configuration error, the replicas can never finish their preparation (for example, they cannot connect to the backend database).
Question to consider: what happens if Health Check is not configured?
Because the new replicas do not exit abnormally, the default health check mechanism assumes the containers are ready, and the existing replicas are gradually replaced by new ones. The result is that once all the old replicas have been replaced, the whole application can no longer process requests or provide service. If this happened in an important production system, the consequences would be very serious.
If Health Check is configured correctly, a new replica is added to the Service only after it passes the Readiness probe. If it never passes, the existing replicas are not all replaced and the business continues to run normally.
The following is an example to practice the application of Health Check in Rolling Update.
Simulate a 10-replica application with the following configuration file app.v1.yml:
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 10
  template:
    metadata:
      labels:
        run: app
    spec:
      containers:
      - name: app
        image: busybox
        args:
        - /bin/sh
        - -c
        - sleep 10; touch /tmp/healthy; sleep 30000
        readinessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
          initialDelaySeconds: 10
          periodSeconds: 5
After 10 seconds, the replicas can pass the Readiness probe.
# kubectl get deployment app
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
app    0/10    10           0           8s

# kubectl get pod
NAME                   READY   STATUS    RESTARTS   AGE
app-6dd7f876c4-575v5   1/1     Running   0          25s
app-6dd7f876c4-9kwk9   1/1     Running   0          25s
app-6dd7f876c4-bx4pf   1/1     Running   0          25s
app-6dd7f876c4-f6qf2   1/1     Running   0          25s
app-6dd7f876c4-fxp2m   1/1     Running   0          25s
app-6dd7f876c4-k76mr   1/1     Running   0          25s
app-6dd7f876c4-mfqsq   1/1     Running   0          25s
app-6dd7f876c4-whkc7   1/1     Running   0          25s
app-6dd7f876c4-x9q87   1/1     Running   0          25s
app-6dd7f876c4-xf8dv   1/1     Running   0          25s
Next, perform a rolling update of the application. The configuration file app.v2.yml is as follows:
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 10
  template:
    metadata:
      labels:
        run: app
    spec:
      containers:
      - name: app
        image: busybox
        args:
        - /bin/sh
        - -c
        - sleep 3000
        readinessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
          initialDelaySeconds: 10
          periodSeconds: 5
Obviously, because /tmp/healthy does not exist in the new replicas, they can never pass the Readiness probe. Verify as follows:
# kubectl apply -f app.yml --record
deployment.apps/app configured

# kubectl get deployment app
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
app    8/10    5            8           2m3s

# kubectl get pod
NAME                   READY   STATUS    RESTARTS   AGE
app-6dd7f876c4-575v5   1/1     Running   0          2m3s
app-6dd7f876c4-9kwk9   1/1     Running   0          2m3s
app-6dd7f876c4-f6qf2   1/1     Running   0          2m3s
app-6dd7f876c4-fxp2m   1/1     Running   0          2m3s
app-6dd7f876c4-k76mr   1/1     Running   0          2m3s
app-6dd7f876c4-whkc7   1/1     Running   0          2m3s
...
app-7d7559dd99-n59vq   0/1     Running   0          49s
app-7d7559dd99-t49cp   0/1     Running   0          49s
...
This output contains a lot of information and deserves a detailed analysis.
Take a look at the kubectl get pod output first:
Judging from the AGE column, the newest Pods are the five new replicas, and they are currently NOT READY. The number of old replicas has dropped from the initial 10 to 8.
Let's look at the output of kubectl get deployment app:
DESIRED 10 indicates the desired state of 10 READY replicas.
CURRENT 13 is the total number of current replicas: 8 old replicas + 5 new replicas.
UP-TO-DATE 5 is the number of replicas that have been updated so far: the 5 new replicas.
AVAILABLE 8 is the number of replicas currently in the READY state: the 8 old replicas.
In our setup, the new replicas can never pass the Readiness probe, so this state will persist.
Above we simulated a failed rolling update. Fortunately, Health Check kept the defective new replicas out of service while retaining most of the old ones, so the business was not affected by the failed update.
Next we need to answer a question: why were 5 new replicas created while only 2 old replicas were destroyed?
The reason is that the rolling update controls replica replacement through the parameters maxSurge and maxUnavailable.
maxSurge
This parameter sets the upper limit by which the total number of replicas during a rolling update may exceed DESIRED. maxSurge can be a specific integer (such as 3) or a percentage, rounded up. The default value of maxSurge is 25%.
In the example above, DESIRED is 10, so the maximum total number of replicas is:
roundUp(10 + 10 * 25%) = 13
So we see that CURRENT is 13.
maxUnavailable
This parameter sets the maximum number of replicas that may be unavailable during a rolling update, relative to DESIRED. maxUnavailable can be a specific integer (such as 3) or a percentage, rounded down. The default value of maxUnavailable is 25%.
In the example above, DESIRED is 10, so the number of available replicas must be at least:
10 - roundDown(10 * 25%) = 8
So we see that AVAILABLE is 8.
The higher maxSurge is, the more new replicas are created initially; the higher maxUnavailable is, the more old replicas are destroyed initially.
Ideally, the rolling update process for our case would look like this:
1. First create 3 new replicas, bringing the total number of replicas to 13.
2. Then destroy 2 old replicas, reducing the number of available replicas to 8.
3. When the 2 old replicas have been destroyed, 2 more new replicas can be created, keeping the total at 13.
4. When a new replica passes the Readiness probe, the number of available replicas rises above 8.
5. That in turn allows more old replicas to be destroyed, bringing the number of available replicas back down to 8.
6. Destroying old replicas makes the total number of replicas fall below 13, which allows more new replicas to be created.
7. This process repeats until all the old replicas have been replaced by new ones and the rolling update completes.
In our actual situation, the process is stuck at step 4: the new replicas can never pass the Readiness probe. This can be seen in the Events section of kubectl describe deployment app:

Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  11m   deployment-controller  Scaled up replica set app-6dd7f876c4 to 10
  Normal  ScalingReplicaSet  10m   deployment-controller  Scaled up replica set app-7d7559dd99 to 3
  Normal  ScalingReplicaSet  10m   deployment-controller  Scaled down replica set app-6dd7f876c4 to 8
  Normal  ScalingReplicaSet  10m   deployment-controller  Scaled up replica set app-7d7559dd99 to 5
If the rolling update fails, you can roll back to the previous version with kubectl rollout undo.
# kubectl rollout history deployment app
deployment.extensions/app
REVISION  CHANGE-CAUSE
1         kubectl apply --filename=app.yml --record=true
2         kubectl apply --filename=app.yml --record=true

# kubectl get deployment app
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
app    8/10    5            8           14m

# kubectl get pod
NAME                   READY   STATUS    RESTARTS   AGE
app-6dd7f876c4-575v5   1/1     Running   0          14m
app-6dd7f876c4-9kwk9   1/1     Running   0          14m
app-6dd7f876c4-f6qf2   1/1     Running   0          14m
app-6dd7f876c4-fxp2m   1/1     Running   0          14m
app-6dd7f876c4-k76mr   1/1     Running   0          14m
app-6dd7f876c4-whkc7   1/1     Running   0          14m
...
app-7d7559dd99-n59vq   0/1     Running   0          13m
app-7d7559dd99-t49cp   0/1     Running   0          13m
...
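For example, assuming the version to return to is revision 1 listed in the history above, a command along the lines of kubectl rollout undo deployment app --to-revision=1 rolls the Deployment back to that revision, scaling the old ReplicaSet back up and removing the defective new replicas.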
If you want to customize maxSurge and maxUnavailable, you can configure them as follows:
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: app
spec:
  strategy:
    rollingUpdate:
      maxSurge: 35%
      maxUnavailable: 35%
  replicas: 10
  template:
    metadata:
      labels:
        run: app
    spec:
      containers:
      - name: app
        image: busybox
        args:
        - /bin/sh
        - -c
        - sleep 3000
        readinessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
          initialDelaySeconds: 10
          periodSeconds: 5
Summary
This chapter discusses two mechanisms of Kubernetes health check: Liveness probe and Readiness probe, and practices the application of health check in Scale Up and Rolling Update scenarios.