First, where the requirement comes from:
First, let's look at where the whole requirement comes from: after migrating an application to Kubernetes, how do we keep it healthy and stable? In fact it is quite simple and can be strengthened in two ways:
1. The first is to improve the observability of the application.
2. The second is to improve the resilience of the application.
In terms of observability, it can be enhanced in three ways:
1. First, the health status of the application can be observed in real time.
2. Second, the resource usage of the application can be monitored.
3. Third, the real-time logs of the application can be used to diagnose and analyze problems.
When a problem occurs, the first thing to do is to reduce its impact and then debug and diagnose it. Ideally, the application eventually recovers completely through a self-healing mechanism integrated with K8s.
Second, two probe mechanisms: livenessProbe and readinessProbe
livenessProbe (liveness probe): determines whether the container is healthy according to user-defined rules. If the liveness probe decides that the container is unhealthy, the kubelet kills the container and decides whether to restart it according to the restart policy. If no liveness probe is configured, the probe is considered successful by default.
readinessProbe (readiness probe): determines whether the container is ready to serve requests, that is, whether the Pod's status is Ready. If a probe fails, the Pod is removed from the Service's Endpoints, i.e. taken out of the access layer (marked as unavailable), and it is not added back to the corresponding Endpoints until a later probe succeeds.
What is Endpoint?
Endpoint is a resource object in the K8s cluster, stored in etcd, that records the access addresses of all the Pods behind a Service.
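To see this object directly, you can list the Endpoints behind a Service. For example, for a Service named web-svc like the one created later in this article, the command and an illustrative output would look roughly like this:
[root@sqm-master yaml]# kubectl get endpoints web-svc
NAME      ENDPOINTS        AGE
web-svc   10.244.1.11:80   5m
Each address in the ENDPOINTS column is the IP:port of a Pod that currently passes its readiness check.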
(2) Usage scenarios for the two probe mechanisms, liveness and readiness:
Liveness probes suit applications that can simply be restarted when they become unhealthy, while readiness probes mainly handle applications that cannot serve traffic immediately after they start.
(3) Similarities and differences between the liveness and readiness probes:
Both check the Pod's health by probing an application or a file inside it. The difference is that when a liveness probe fails the container is restarted, whereas when a readiness probe fails three times in a row the Pod is marked unavailable but is not restarted.
(4) Both the liveness and readiness probes support three detection methods:
1. httpGet: sends an HTTP GET request; the application is considered healthy when the returned status code is between 200 and 399.
2. exec: executes a command inside the container; the container is considered healthy when the command's exit code is 0.
3. tcpSocket: performs a TCP check against the container's IP and port; the container is considered healthy if the TCP connection can be established.
The first and third methods are quite similar; in practice the first and second methods are the most commonly used.
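Since the tcpSocket method is not demonstrated later in this article, here is a minimal sketch, assuming an application that listens on port 80 inside the container:
    livenessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 5
If a TCP connection to port 80 can be established, the container is considered healthy; otherwise the liveness probe fails and the restart policy applies.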
Third, examples of using the probe mechanisms: 1) LivenessProbe:
Method 1: use the exec probe to check whether a specified file exists in the Pod. If it exists, the Pod is considered healthy; otherwise the Pod is restarted according to the configured restart policy.
# configuration file of pod:
[root@sqm-master yaml]# vim livenss.yaml
kind: Pod
apiVersion: v1
metadata:
  name: liveness
  labels:
    name: liveness
spec:
  restartPolicy: OnFailure        # restart the container only if the Pod exits with an error
  containers:
  - name: liveness
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/test; sleep 30; rm -rf /tmp/test; sleep 300    # create a file and delete it after 30 seconds
    livenessProbe:                # liveness probe
      exec:
        command:                  # probe for the /tmp/test file: if it exists the container is healthy, otherwise the restart policy is applied
        - cat
        - /tmp/test
      initialDelaySeconds: 10     # how long (in seconds) to wait after the container starts before probing
      periodSeconds: 5            # probe frequency (in seconds): once every 5 seconds
Other optional fields in the probe mechanism:
initialDelaySeconds: how many seconds to wait after the container starts before performing the first probe.
periodSeconds: how often to perform the probe. Defaults to 10 seconds; the minimum is 1 second.
timeoutSeconds: probe timeout. Defaults to 1 second; the minimum is 1 second.
successThreshold: after a failure, the minimum number of consecutive successful probes for the probe to be considered successful again. Defaults to 1; must be 1 for liveness. The minimum is 1.
failureThreshold: after a success, the minimum number of consecutive failed probes for the probe to be considered failed. Defaults to 3; the minimum is 1.
// run the pod to test:
[root@sqm-master yaml]# kubectl apply -f livenss.yaml
pod/liveness created
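For reference, a probe that sets all of the optional fields above might look like this (the extra values are illustrative and are not used in the liveness example of this article):
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/test
      initialDelaySeconds: 10    # wait 10 seconds after the container starts
      periodSeconds: 5           # probe every 5 seconds
      timeoutSeconds: 2          # each probe may take at most 2 seconds
      successThreshold: 1        # must be 1 for a liveness probe
      failureThreshold: 3        # 3 consecutive failures count as a failure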
// Monitoring the status of pod:
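The Pod's status can be watched continuously with, for example:
[root@sqm-master yaml]# kubectl get pod liveness -w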
The probe begins 10 seconds after the container starts and runs every 5 seconds.
We can see that the Pod keeps restarting; the RESTARTS count has already reached seven. This happens because the following command is executed when the Pod starts:
/bin/sh -c "touch /tmp/test; sleep 30; rm -rf /tmp/test; sleep 300"
The /tmp/test file exists only for the first 30 seconds of the container's life, during which cat /tmp/test returns a success code. After 30 seconds the file is removed, cat /tmp/test returns a failure code, and the Pod's restart policy is triggered.
// Let's take a look at the Pod's Events:
[root@sqm-master ~]# kubectl describe pod liveness
As the events above show, the probe failed because the file was not found in the specified directory, so the container was restarted.
Method 2: use the httpGet probe against a running web service to check whether a specified file exists under the web root, which is roughly equivalent to running "curl -I <container IP>/healthy". (The "/" here refers to the home directory from which the web service inside the container serves pages.)
// yaml file of pod:
[root@sqm-master yaml]# vim http-livenss.yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    name: mynginx
spec:
  restartPolicy: OnFailure        # define the Pod restart policy
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:                # define the probe mechanism
      httpGet:                    # the probe method is httpGet
        scheme: HTTP              # specify the protocol
        path: /healthy            # the file under the specified path; if it does not exist, the probe fails
        port: 80
      initialDelaySeconds: 10     # how long (in seconds) to wait after the container starts before probing
      periodSeconds: 5            # probe frequency (in seconds): once every 5 seconds
---
apiVersion: v1                    # an associated Service object
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    name: mynginx                 # select the Pod by its name=mynginx label
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
The httpGet probe method has the following optional control fields:
host: the host name to connect to; by default the Pod's IP is used. You may want to set "Host" in httpHeaders instead of using the IP.
scheme: the scheme used for the connection. Defaults to HTTP.
path: the path on the HTTP server to access.
httpHeaders: headers of the custom request. HTTP allows repeated headers.
port: the name or number of the container port to access. The port number must be between 1 and 65535.
// run the pod:
[root@sqm-master yaml]# kubectl apply -f http-livenss.yaml
pod/web created
service/web-svc created
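As an illustration of the host-related fields (not part of the example above), a probe that sends a custom Host header might look like this:
    livenessProbe:
      httpGet:
        scheme: HTTP
        path: /healthy
        port: 80
        httpHeaders:
        - name: Host
          value: myapp.example.com    # hypothetical host name, for illustration only
      initialDelaySeconds: 10
      periodSeconds: 5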
## Check the Pod before it has been running for 10 seconds:
## The container is alive for the first 10 seconds, and the returned status code is 200.
# Check the Pod again after 10 seconds, when the probe mechanism starts probing:
// events of pod viewed:
[root@sqm-master yaml]# kubectl describe pod web
You can see that the returned status code is 404: the specified file was not found under the web root, so the probe failed. The Pod has already been restarted 4 times and its status is Completed, which indicates that something is wrong with it.
2) Next, let's adjust the configuration so that the probe finally succeeds:
Modify the configuration file for pod:
[root@sqm-master yaml]# vim http-livenss.yaml
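One change that makes the probe succeed is to point it at a page nginx actually serves; for example, the livenessProbe section could be modified like this (this is one possible fix, using nginx's default index page):
    livenessProbe:
      httpGet:
        scheme: HTTP
        path: /index.html        # nginx serves this page by default, so the probe now succeeds
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 5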
// rerun pod:
[root@sqm-master yaml]# kubectl delete -f http-livenss.yaml
pod "web" deleted
service "web-svc" deleted
[root@sqm-master yaml]# kubectl apply -f http-livenss.yaml
pod/web created
service/web-svc created
// finally we check the status of pod and its Events information:
[root@sqm-master yaml]# kubectl get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
web    1/1     Running   0          5s    10.244.1.11   node01   <none>           <none>
[root@sqm-master yaml]# kubectl describe pod web
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  71s   default-scheduler  Successfully assigned default/web to node01
  Normal  Pulling    71s   kubelet, node01    Pulling image "nginx"
  Normal  Pulled     70s   kubelet, node01    Successfully pulled image "nginx"
  Normal  Created    70s   kubelet, node01    Created container nginx
  Normal  Started    70s   kubelet, node01    Started container nginx
From the Pod's status we can see that it is running normally.
## Test access to the page header information:
[root@sqm-master yaml]# curl -I 10.244.1.11
The returned status code is 200, indicating that the pod is in a healthy condition.
2) ReadinessProbe:
Method 1: use the exec probe, in the same way as with liveness, to check whether a file exists.
// the configuration file of pod is as follows:
[root@sqm-master yaml]# vim readiness.yaml
kind: Pod
apiVersion: v1
metadata:
  name: readiness
  labels:
    name: readiness
spec:
  restartPolicy: OnFailure
  containers:
  - name: readiness
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/test; sleep 30; rm -rf /tmp/test; sleep 300
    readinessProbe:               # define the readiness probe
      exec:
        command:
        - cat
        - /tmp/test
      initialDelaySeconds: 10
      periodSeconds: 5
// run the pod:
[root@sqm-master yaml]# kubectl apply -f readiness.yaml
pod/readiness created
// check the status of pod:
// check the Events of pod:
[root@sqm-master yaml]# kubectl describe pod readiness
You can see that the file cannot be found, so the probe fails. Unlike the liveness mechanism, readiness does not restart the Pod; after three consecutive probe failures it simply marks the container as unavailable.
Method 2: httpGet method.
[root@sqm-master yaml]# vim http-readiness.yaml
apiVersion: v1
kind: Pod
metadata:
  name: web2
  labels:
    name: web2
spec:
  containers:
  - name: web2
    image: nginx
    ports:
    - containerPort: 81
    readinessProbe:
      httpGet:
        scheme: HTTP              # specify the protocol
        path: /healthy            # the specified path; if it does not exist it needs to be created, otherwise the probe fails
        port: 81
      initialDelaySeconds: 10
      periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    name: web2
  ports:
  - protocol: TCP
    port: 81
    targetPort: 81
// run pod:
[root@sqm-master yaml]# kubectl apply -f http-readiness.yaml
pod/web2 created
service/web-svc created
// check the status of pod:
[root@sqm-master yaml]# kubectl get pod -o wide
NAME        READY   STATUS      RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
readiness   0/1     Completed   0          37m     10.244.2.12   node02   <none>           <none>
web         1/1     Running     0          50m     10.244.1.11   node01   <none>           <none>
web2        0/1     Running     0          2m31s   10.244.1.14   node01   <none>           <none>
Looking at the Pod's Events, the probe shows that the Pod is unhealthy and the HTTP access failed.
The Pod is not restarted; it is simply marked as unavailable.
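Because a failed readiness probe removes the Pod from the Service's Endpoints, this can also be confirmed on the Service side; an illustrative check would be:
[root@sqm-master yaml]# kubectl get endpoints web-svc
NAME      ENDPOINTS   AGE
web-svc   <none>      3m
An empty ENDPOINTS column means the web2 Pod has been taken out of the access layer until a readiness probe succeeds.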
Fourth, health probes during rolling updates:
First, let's look at the fields involved in the update using the kubectl explain tool:
[root@sqm-master ~]# kubectl explain deploy.spec.strategy.rollingUpdate
You can see that two parameters are available during a rolling update:
maxSurge: controls how far the total number of Pods may exceed the desired replica count during a rolling update. It can be a percentage or an absolute value and defaults to 1. If it is set to 3, up to three extra Pods are created for the update (the probe mechanism still verifies whether the update succeeds). The higher the value, the faster the upgrade, but the more system resources are consumed.
maxUnavailable: controls how many Pods may be unavailable during a rolling update. Note that this is counted against the original Pods and does not include the extra maxSurge Pods. If it is set to 3 and the probe fails, up to three Pods become unavailable during the upgrade. The higher the value, the faster the upgrade, but the less stable the service is during the update.
Applicable scenarios for maxSurge and maxUnavailable:
1. If you want to upgrade as quickly as possible while ensuring system availability and stability, you can set maxUnavailable to 0 and give maxSurge a large value.
2. If system resources are tight and the Pod load is low, you can speed up the upgrade by setting maxSurge to 0 and giving maxUnavailable a larger value. Note that if maxSurge is 0 and maxUnavailable equals the DESIRED replica count, the entire service may become unavailable and the RollingUpdate degenerates into a downtime release. Both settings are sketched below.
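As a rough sketch, the two scenarios above map onto the Deployment's strategy field like this (the numbers are illustrative):
# Scenario 1: keep full capacity during the upgrade; create up to 3 extra Pods, take none down early
  strategy:
    rollingUpdate:
      maxSurge: 3
      maxUnavailable: 0
# Scenario 2: create no extra Pods; upgrade by taking down up to 3 existing Pods at a time
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 3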
1) first, let's create a deployment resource object:
[root@sqm-master ~]# vim app.v1.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-web
spec:
  replicas: 10                   # define the number of replicas as 10
  template:
    metadata:
      labels:
        name: my-web
    spec:
      containers:
      - name: my-web
        image: nginx
        args:
        - /bin/sh
        - -c
        - touch /usr/share/nginx/html/test.html; sleep 300000   # create a file so the Pod is considered healthy when probed
        ports:
        - containerPort: 80
        readinessProbe:          # use the readiness mechanism
          exec:
            command:
            - cat
            - /usr/share/nginx/html/test.html
          initialDelaySeconds: 10
          periodSeconds: 10
// after running the pod, check the number of pods (10):
[root@sqm-master yaml]# kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
my-web-7bbd55db99-2g6tp   1/1     Running   0          2m11s   10.244.2.44   node02   <none>           <none>
my-web-7bbd55db99-2jdbz   1/1     Running   0          118s    10.244.2.45   node02   <none>           <none>
my-web-7bbd55db99-5mhcv   1/1     Running   0          2m53s   10.244.1.40   node01   <none>           <none>
my-web-7bbd55db99-77b4v   1/1     Running   0          2m      10.244.1.44   node01   <none>           <none>
my-web-7bbd55db99-h888n   1/1     Running   0          2m53s   10.244.2.41   node02   <none>           <none>
my-web-7bbd55db99-j5tgz   1/1     Running   0          2m38s   10.244.2.42   node02   <none>           <none>
my-web-7bbd55db99-kjgm2   1/1     Running   0          2m25s   10.244.1.42   node01   <none>           <none>
my-web-7bbd55db99-kkmh3   1/1     Running   0          2m38s   10.244.1.41   node01   <none>           <none>
my-web-7bbd55db99-lr896   1/1     Running   0          2m13s   10.244.1.43   node01   <none>           <none>
my-web-7bbd55db99-rpd8v   1/1     Running   0          2m23s   10.244.2.43   node02   <none>           <none>
The probe succeeds and all 10 replicas are running.
2) first update:
Update the nginx image version and set the rolling update policy:
[root@sqm-master yaml]# vim app.v1.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-web
spec:
  strategy:                      # set the rolling update policy via the rollingUpdate sub-field
    rollingUpdate:
      maxSurge: 3                # at most 3 additional Pods may be created during the rolling update
      maxUnavailable: 3          # at most 3 Pods may be unavailable during the rolling update
  replicas: 10
  template:
    metadata:
      labels:
        name: my-web
    spec:
      containers:
      - name: my-web
        image: 172.16.1.30:5000/nginx:v2.0    # the updated image is nginx:v2.0 from the private repository
        args:
        - /bin/sh
        - -c
        - touch /usr/share/nginx/html/test.html; sleep 300000
        ports:
        - containerPort: 80
        readinessProbe:
          exec:
            command:
            - cat
            - /usr/share/nginx/html/test.html
          initialDelaySeconds: 10
          periodSeconds: 10
// after executing the yaml file, check the number of pods:
[root@sqm-master yaml]# kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
my-web-7db8b88b94-468zv   1/1     Running   0          3m38s   10.244.2.57   node02   <none>           <none>
my-web-7db8b88b94-bvszs   1/1     Running   0          3m24s   10.244.1.60   node01   <none>           <none>
my-web-7db8b88b94-c4xvv   1/1     Running   0          3m38s   10.244.2.55   node02   <none>           <none>
my-web-7db8b88b94-d5fvc   1/1     Running   0          3m38s   10.244.1.58   node01   <none>           <none>
my-web-7db8b88b94-lw6nh   1/1     Running   0          3m21s   10.244.2.59   node02   <none>           <none>
my-web-7db8b88b94-m9gbh   1/1     Running   0          3m38s   10.244.1.57   node01   <none>           <none>
my-web-7db8b88b94-q5dqc   1/1     Running   0          3m38s   10.244.1.59   node01   <none>           <none>
my-web-7db8b88b94-tsbmm   1/1     Running   0          3m38s   10.244.2.56   node02   <none>           <none>
my-web-7db8b88b94-v5q2s   1/1     Running   0          3m21s   10.244.1.61   node01   <none>           <none>
my-web-7db8b88b94-wlgwb   1/1     Running   0          3m25s   10.244.2.58   node02   <none>           <none>
// View pod version information:
[root@sqm-master yaml]# kubectl get deployments. -o wide
NAME     READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                        SELECTOR
my-web   10/10   10           10          49m   my-web       172.16.1.30:5000/nginx:v2.0   name=my-web
The probe succeeded and all 10 Pods were updated to the new version.
3) second update:
Update the image to version 3.0 and keep the rolling update policy. (This time the probe will fail.)
The configuration file for pod is as follows:
[root@sqm-master yaml]# vim app.v1.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-web
spec:
  strategy:
    rollingUpdate:
      maxSurge: 3                # the update policy stays the same: at most 3 extra Pods
      maxUnavailable: 3
  replicas: 10                   # still 10 Pods
  template:
    metadata:
      labels:
        name: my-web
    spec:
      containers:
      - name: my-web
        image: 172.16.1.30:5000/nginx:v3.0    # test image version updated to 3.0
        args:
        - /bin/sh
        - -c
        - sleep 300000           # the specified file is no longer created, so the probe will fail
        ports:
        - containerPort: 80
        readinessProbe:
          exec:
            command:
            - cat
            - /usr/share/nginx/html/test.html
          initialDelaySeconds: 10
          periodSeconds: 5
// rerun pod configuration file:
[root@sqm-master yaml]# kubectl apply -f app.v1.yaml
deployment.extensions/my-web configured
// check the number of pod updates:
[root@sqm-master yaml]# kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE    IP            NODE     NOMINATED NODE   READINESS GATES
my-web-7db8b88b94-468zv   1/1     Running   0          12m    10.244.2.57   node02   <none>           <none>
my-web-7db8b88b94-c4xvv   1/1     Running   0          12m    10.244.2.55   node02   <none>           <none>
my-web-7db8b88b94-d5fvc   1/1     Running   0          12m    10.244.1.58   node01   <none>           <none>
my-web-7db8b88b94-m9gbh   1/1     Running   0          12m    10.244.1.57   node01   <none>           <none>
my-web-7db8b88b94-q5dqc   1/1     Running   0          12m    10.244.1.59   node01   <none>           <none>
my-web-7db8b88b94-tsbmm   1/1     Running   0          12m    10.244.2.56   node02   <none>           <none>
my-web-7db8b88b94-wlgwb   1/1     Running   0          12m    10.244.2.58   node02   <none>           <none>
my-web-849cc47979-2g59w   0/1     Running   0          3m9s   10.244.1.63   node01   <none>           <none>
my-web-849cc47979-2lkb6   0/1     Running   0          3m9s   10.244.1.64   node01   <none>           <none>
my-web-849cc47979-762vb   0/1     Running   0          3m9s   10.244.1.62   node01   <none>           <none>
my-web-849cc47979-dv7x8   0/1     Running   0          3m9s   10.244.2.61   node02   <none>           <none>
my-web-849cc47979-j6nwz   0/1     Running   0          3m9s   10.244.2.60   node02   <none>           <none>
my-web-849cc47979-v5h7h   0/1     Running   0          3m9s   10.244.2.62   node02   <none>           <none>
We can see that the total number of Pods is now 13 (the 10 replicas plus the 3 extra Pods allowed by maxSurge). Because the probe fails, the 6 new Pods (the 3 surge Pods plus the 3 replacements for the old Pods that were taken down) never become ready and are marked unavailable, so the rollout stalls. 7 old Pods remain available (because maxUnavailable is set to 3), but note that those 7 Pods have not been updated; they still run the previous version.
// View the updated version information of pod:
[root@sqm-master yaml]# kubectl get deployments. -o wide
NAME     READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                        SELECTOR
my-web   7/10    6            7           58m   my-web       172.16.1.30:5000/nginx:v3.0   name=my-web
Parameter explanation:
READY: number of ready Pods / desired number of Pods
UP-TO-DATE: number of Pods that have been updated to the latest version
AVAILABLE: number of Pods that are currently available
We can see that 6 Pods have been updated to the new image version (including the 3 extra maxSurge Pods), but none of them is available; 7 Pods are still available, but they have not been updated to the new version.
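Although the walkthrough above leaves the rollout stalled, in practice a common next step is to roll the Deployment back to the previous, working version and watch the rollout, for example:
[root@sqm-master yaml]# kubectl rollout undo deployment my-web
[root@sqm-master yaml]# kubectl rollout status deployment my-web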
Summary:
What role does the probe mechanism play during a rolling update?
Suppose you need to update the Pods of one of your company's applications. Without a probe mechanism, every Pod in the application is replaced regardless of whether the new version is actually ready to serve, which can have serious consequences: after the update the Pods may look normal, and to satisfy the controller's expectations READY still shows 1/1, but they are newly created Pods, so any data that lived only inside the old Pods is lost.
With the probe mechanism added, the update checks whether the file or application you specified exists in the new container. If the condition you specified is met, the probe succeeds and the Pod is updated; if the probe fails, the new Pod (container) is marked unavailable, but the remaining Pods of the previous version stay available and keep the service running normally. This shows how important the probe mechanism is.