
How to achieve graceful shutdown and zero-downtime deployments in Kubernetes


In this issue, we look at how to achieve graceful shutdown and zero-downtime deployments in Kubernetes. The article is rich in content and approaches the topic from a practical point of view; we hope you get something out of reading it.

Creating and deleting Pods is one of the most common tasks in Kubernetes. The following walks through what happens internally when a Pod is created or deleted, discusses how to prevent dropped connections while a Pod is starting up or shutting down, and shows how to shut down long-running tasks gracefully.

In Kubernetes, creating and deleting Pods is arguably one of the most common tasks. Pods are created when we perform rolling updates, scale deployments, and so on. In addition, when we mark a node as unschedulable, its Pods are evicted, deleted, and recreated elsewhere. The lifetime of these Pods can be very short. What if a Pod is shut down while it is still responding to requests? Are in-flight requests completed before it shuts down? What about subsequent requests? Before discussing what happens when a Pod is deleted, we need to understand what happens when one is created. Suppose we want to create the following Pod in the cluster:
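The manifest below is a minimal sketch of such a Pod; the name, label, and image are illustrative, and the label and port are chosen to line up with the Service used later in this article:

    apiVersion: v1
    kind: Pod
    metadata:
      name: app
      labels:
        name: app                        # matched later by the Service selector
    spec:
      containers:
        - name: app
          image: example.com/app:1.0.0   # illustrative image
          ports:
            - containerPort: 3000        # the port the Service will target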

We submit the Pod YAML definition to the cluster:
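Assuming the sketch above is saved as pod.yaml, the command would be along these lines:

    kubectl apply -f pod.yaml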

After we enter the command, kubectl submits the Pod definition to the Kubernetes API server.


Saving the cluster state in the database

The API server receives and validates the Pod definition, then stores it in the etcd database. The Pod is also added to the scheduler's queue.

The scheduler examines the Pod definition, collects details about the workload such as CPU and memory requests, and decides which node is best suited to run it. At the end of the scheduling process, the Pod is marked as Scheduled in etcd and assigned to a node, and that state is stored in etcd. However, the Pod still does not exist at this point: everything so far happened in the control plane, and the Pod's state exists only in the database. So how is the Pod actually created on the node?

Kubelet

The kubelet's job is to poll the control plane for updates. Rather than creating the Pod itself, the kubelet delegates the work to three other components:

Container Runtime Interface (CRI): the component that creates the containers for the Pod.

Container Network Interface (CNI): the component that connects containers to the cluster network and assigns IP addresses.

Container Storage Interface (CSI): the component that mounts volumes in containers.

In most cases, the container runtime interface (CRI) does a job similar to launching the container image directly with the container runtime.
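As a rough mental model only, and assuming a Docker-compatible runtime, the CRI arranges something equivalent to the following for each container in the Pod (the image is the illustrative one from above):

    docker run -d example.com/app:1.0.0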

The container network interface (CNI) is responsible for generating a valid IP address for the Pod and connecting the container to the network. There are several ways to do this: we can choose between IPv4 and IPv6, or even assign multiple IP addresses. When the container network interface finishes its work, the Pod is connected to the network and has a valid IP address. There is just one problem: the kubelet knows the IP address, because it invoked the container network interface, but the control plane does not. Nothing has told the master node that the Pod has an IP address and is ready to receive traffic; from the control plane's point of view, the Pod is still being created. The kubelet's job is to collect all the details of the Pod, such as the IP address, and report them back to the control plane. At that point, inspecting etcd reveals not only where the Pod is running, but also its IP address.
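Once the kubelet has reported back, the assigned node and the IP address are visible through the API server; for example, for the illustrative Pod from above:

    # -o wide adds the NODE and IP columns to the output
    kubectl get pod app -o wide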

If the Pod is not part of any Service, that's it: the Pod has been created and is ready to use. If the Pod is part of a Service, there are a few more steps to perform.


Pods and Services

When creating a Service, we need to pay attention to two pieces of information:

Selector: specifies which Pods receive the traffic.

TargetPort: the port on the Pods through which traffic is received.

The YAML definition for the Service is as follows:
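A minimal sketch of such a Service, matching the selector (name: app) and targetPort (3000) discussed below; the Service name and port 80 are illustrative:

    apiVersion: v1
    kind: Service
    metadata:
      name: app              # illustrative name
    spec:
      selector:
        name: app            # routes traffic to Pods labeled name: app
      ports:
        - port: 80           # the port the Service exposes
          targetPort: 3000   # the port the Pods listen on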

When we use kubectl apply to submit the Service to the cluster, Kubernetes finds all the Pods carrying the same label as the selector (name: app) and collects their IP addresses, provided they have passed the Readiness probe, and then appends the port to each IP address. If the IP address is 10.0.0.3 and the targetPort is 3000, Kubernetes concatenates the two values into an endpoint: 10.0.0.3:3000.

Endpoints are stored in etcd in an object called Endpoint. One thing to note here: endpoint (lowercase e) = IP address + port (10.0.0.3:3000), while Endpoint (uppercase E) is a collection of endpoints. The Endpoint object is a real object in Kubernetes, and Kubernetes automatically creates one for every Service. We can verify this with kubectl:
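For example, for the illustrative Service above (Kubernetes names the Endpoint object after the Service):

    # lists the IP:port pairs collected for the Service
    kubectl get endpoints app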

The Endpoint object collects all IP addresses and ports from the Pods, and not just once: it is refreshed with a new list of endpoints when a Pod is created, when a Pod is deleted, and when a label on a Pod is modified. So, every time a Pod is created and its IP address is reported by the kubelet to the control plane, Kubernetes updates all the endpoints:

The endpoint is stored in the control plane, and the Endpoint object is updated.


Using endpoints in Kubernetes

Endpoints are consumed by several components in Kubernetes. kube-proxy uses them to set up iptables rules on the nodes: every time the Endpoint object changes, kube-proxy retrieves the new list of IP addresses and ports and writes new iptables rules.

The same endpoint list is used by the Ingress controller, the component in the cluster that routes external traffic into the cluster. When setting up an Ingress, we usually specify the Service as the target:
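A minimal sketch of such an Ingress, pointing at the illustrative Service from above:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: app               # illustrative name
    spec:
      rules:
        - http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: app       # the Service defined earlier
                    port:
                      number: 80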

In fact, traffic is not routed to the Service at all. Instead, the Ingress controller subscribes to the Service's endpoints: whenever the endpoints of the Service change, the Ingress controller routes traffic directly to the Pods, skipping the Service. As you can imagine, every time the Endpoint object changes, the Ingress controller retrieves the new list of IP addresses and ports and reconfigures itself.

Let's take a quick look back at what happens when a Pod is created:

1. The Pod is stored in etcd.

2. The scheduler assigns it a node and writes the assignment to etcd.

3. The kubelet is notified of the new Pod.

4. The kubelet delegates creating the container to the CRI.

5. The kubelet delegates attaching the container to the network to the CNI.

6. The kubelet delegates mounting volumes in the container to the CSI.

7. The CNI assigns an IP address.

8. The kubelet reports the IP address to the control plane.

9. The IP address is stored in etcd.

If the Pod belongs to a Service:

1. The kubelet waits for the Readiness probe to succeed.

2. All relevant Endpoint objects are notified of the change.

3. The Endpoint object adds the new endpoint (IP address + port) to its list.

4. kube-proxy is notified of the Endpoint change and updates the iptables rules on every node.

5. The Ingress controller is notified of the Endpoint change and routes traffic to the new IP address.

6. CoreDNS is notified of the Endpoint change; if the Service is of type Headless, the DNS entry is updated.

7. The cloud provider is notified of the Endpoint change; if the Service is of type: LoadBalancer, the new endpoint becomes part of the load-balancer pool.

8. Any service mesh installed in the cluster is also notified of the Endpoint change.

9. Any other operator subscribing to Endpoint changes is notified too.

Although the list is long, it describes a very common task: creating a Pod. The Pod is now running, so let's discuss what happens when it is deleted.

Deleting a Pod

When deleting a Pod, the same steps happen in reverse. First, the endpoint is removed from the Endpoint object, but this time the Readiness probe is ignored and the endpoint is removed from the control plane immediately. That, in turn, triggers events to kube-proxy, the Ingress controller, DNS, the service mesh, and so on. These components update their internal state and stop routing traffic to the IP address.

Because a component may be busy with other operations, there is no guarantee of how long it will take to remove the IP address from its internal state; sometimes it takes less than a second, sometimes more. Meanwhile, the status of the Pod in etcd changes to Terminating, and the kubelet is notified of the change and delegates:

1. Unmounting the volumes from the container to the CSI.

2. Detaching the container from the network and releasing the IP address to the CNI.

3. Destroying the container to the CRI.

In other words, Kubernetes follows exactly the same steps as when creating a Pod, but in reverse. There is, however, a subtle but essential difference: when a Pod is terminated, removing the endpoint and sending the signal to the kubelet happen at the same time.

When a Pod is created, Kubernetes waits for the kubelet to report the IP address and only then broadcasts the endpoint. When a Pod is deleted, however, these events start in parallel, which can lead to race conditions. What if the Pod is deleted before the endpoint removal has been propagated?


Graceful shutdown

If the Pod terminates before its endpoint is removed from kube-proxy or the Ingress controller, we may experience downtime: Kubernetes still routes traffic to the IP address, but the Pod no longer exists. The Ingress controller, kube-proxy, CoreDNS, and so on did not have enough time to remove the IP address from their internal state.

Ideally, Kubernetes would wait for every component in the cluster to have an updated list of endpoints before deleting the Pod, but Kubernetes does not work that way. Kubernetes offers primitives for distributing endpoints (the Endpoint object and more advanced abstractions such as EndpointSlices), but it does not verify that the components subscribing to endpoint changes are up to date with the cluster state. So how do we avoid this race condition and make sure the Pod is deleted only after the endpoint removal has been broadcast? We wait. When a Pod is about to be deleted, it receives a SIGTERM signal. Our application can catch that signal and begin shutting down:

1. Wait a little while before exiting.

2. Keep processing incoming traffic despite the SIGTERM.

3. Then close any existing long-lived connections.

4. Finally, terminate the process.

How long should we wait? By default, Kubernetes sends the SIGTERM signal and waits 30 seconds before forcibly terminating the process. We can use the first 15 seconds to continue operating as usual; this interval should be enough for the endpoint removal to propagate to kube-proxy, the Ingress controller, CoreDNS, and so on, so less and less traffic will reach the Pod until it stops entirely. After 15 seconds, we can safely close the connection to the database and terminate the process. If we think we need more time, we can stop the process at 20 or 25 seconds instead. Keep in mind, though, that Kubernetes forcibly terminates the process after 30 seconds, unless we change terminationGracePeriodSeconds in the Pod definition. What if we cannot change the code to wait longer? We can invoke a script that waits a fixed amount of time and then lets the application exit. Before SIGTERM is sent, Kubernetes exposes a preStop hook in the Pod; we can set the preStop hook to wait for 15 seconds. Here is an example:
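This is a sketch based on the values discussed above; the Pod and image are the illustrative ones from earlier:

    apiVersion: v1
    kind: Pod
    metadata:
      name: app
      labels:
        name: app
    spec:
      terminationGracePeriodSeconds: 30   # the default; raise it if shutdown needs longer
      containers:
        - name: app
          image: example.com/app:1.0.0    # illustrative image
          ports:
            - containerPort: 3000
          lifecycle:
            preStop:
              exec:
                # Runs before SIGTERM is sent to the container, giving
                # kube-proxy, the Ingress controller, CoreDNS, and others
                # time to drop the endpoint from their internal state.
                command: ["sleep", "15"]

Note that the preStop hook counts against terminationGracePeriodSeconds: the hook plus the application's own shutdown must complete within the grace period, or the process is killed with SIGKILL.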

The preStop hook is one of the Pod lifecycle hooks.

K8sMeetup

Grace period and rolling updates

Graceful shutdown applies when a Pod is deleted, but what if we never delete Pods ourselves? In fact, even if we don't, Kubernetes deletes Pods all the time: it creates and deletes them every time a newer version of the application is deployed.

When we change the image in a Deployment, Kubernetes rolls out the change step by step.

If we have three replicas and submit the new YAML resource, Kubernetes will:

1. Create a Pod with the new container image.

2. Destroy an existing Pod.

3. Wait for the new Pod to be ready.

It repeats these steps until all the Pods have been migrated to the newer version, and it only starts each cycle after the new Pod is ready to receive traffic. However, Kubernetes does not wait for the deleted Pod to finish terminating before moving on. If we have 10 Pods, and a Pod takes 2 seconds to become ready but 20 seconds to shut down, the following occurs:

1. A new Pod is created, and a previous Pod is terminated.

2. The new Pod takes 2 seconds to become ready, after which Kubernetes creates the next one.

3. Meanwhile, each terminated Pod keeps shutting down for 20 seconds.

After 20 seconds, all the new Pods are live, but the 10 previous Pods are still terminating: for a short time, we have doubled the number of Pods (10 running, 10 terminating). The longer the grace period, the more Pods will be "Running" and "Terminating" at the same time. The Deployment fields that pace this rollout are shown in the sketch below.
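For reference, these are the standard Deployment fields that pace a rollout; the Deployment itself is a sketch with illustrative names. Note that Pods already in their termination grace period no longer count toward the surge, which is why the temporary doubling described above can still occur:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: app
    spec:
      replicas: 10
      selector:
        matchLabels:
          name: app
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1          # create at most one extra Pod above the desired count
          maxUnavailable: 0    # never drop below the desired count of ready Pods
      template:
        metadata:
          labels:
            name: app
        spec:
          containers:
            - name: app
              image: example.com/app:1.0.1   # the newer release
              ports:
                - containerPort: 3000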

Terminating long-running tasks

If a Pod is transcoding a large video, is there a way to delay stopping it?

Suppose we have a Deployment with three replicas, and each replica is assigned a video-transcoding task that may take several hours to complete. When we trigger a rolling update, each Pod gets 30 seconds to finish its task before it is killed. How can we avoid killing a Pod mid-task? We could increase terminationGracePeriodSeconds to a couple of hours, but then the Pod's endpoint would be unreachable for that whole period: if we expose metrics to monitor the Pod, our instrumentation would not be able to reach it. Tools such as Prometheus rely on Endpoints to scrape Pods in the cluster, and as soon as the Pod is deleted, the endpoint removal is propagated across the cluster, including to Prometheus. Instead of increasing the grace period, we should create a new Deployment for each release. When we create a brand-new Deployment, the existing Deployment is left untouched: the long-running jobs keep processing videos as usual, and we can delete the old Deployment manually once they are finished, as in the sketch below.
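A sketch of that per-release approach, with hypothetical names; each release gets its own Deployment and version label instead of updating the existing Deployment in place:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: transcoder-v2         # new release; transcoder-v1 keeps running
      labels:
        app: transcoder
        version: v2
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: transcoder
          version: v2
      template:
        metadata:
          labels:
            app: transcoder
            version: v2
        spec:
          containers:
            - name: transcoder
              image: example.com/transcoder:v2   # illustrative image

A Service selecting only app: transcoder would keep sending traffic to both releases, while adding version: v2 to its selector cuts traffic over to the new release immediately; the old transcoder-v1 Deployment is deleted by hand once its jobs finish.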

If we want the cleanup to happen automatically, we can set up an autoscaler that scales the Deployment down to zero replicas once its tasks are complete.


We should keep in mind that after a Pod is deleted, its IP address may still be used to route traffic for a while. Instead of shutting the Pod down immediately, we should wait in the application or set up a preStop hook. The Pod should only go away after the endpoint removal has been propagated across the cluster and the IP address has been dropped by kube-proxy, the Ingress controller, CoreDNS, and so on.

If our Pods run long-lived tasks such as video transcoding, we can consider rainbow deployments: create a new Deployment for each release and delete the previous one once its tasks are complete.
