Production environment experience
1. Why do containers with resource limits keep getting killed?
2. The importance of health checks during rolling updates.
3. Traffic loss during rolling updates.
Let's start with the first question: why does a container that has resource limits keep getting killed?
In this scenario, a deployed Java application restarts soon after it comes up. The "restart" is really a rebuild: the pod is judged unhealthy and Kubernetes pulls it up again, so you have to track down the cause. Put bluntly, the container was killed. You can check the events with kubectl describe; typically you will see that the health check failed and the container was restarted. Because it is a Java application, the heap overflowed and the process was killed; a kill message appears in the last lines of the log, which tells you why it restarted. What I ran into earlier was that the heap memory had not been limited. The JVM heap is where most of the application's data lives, so its usage can easily grow past the container limit, and once it does, Kubernetes is likely to kill the container. Why does Kubernetes kill it? Because it exceeded the limit. By default a container can use all of the host's resources; without a limit it can affect the whole host, workloads drift to other nodes, and one exception can snowball into an avalanche. So in general we must set resource limits. But can this limit actually constrain a Java application? Kubernetes makes deployment convenient, but Java applications do not cooperate well with it.
Specifically, the JVM does not recognize the limit set on its container: whatever you specify under limits in the YAML, the Java heap itself is not bounded by it and will grow past it. Once the container exceeds its limits, Kubernetes kills it; that is built-in behavior, and after killing the container it pulls it up again.
Java heap usage is bursty: when data volume rises, memory usage can swing over a wide range, so the container gets killed and pulled up again, and this cycle can repeat hundreds of times a day.
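For reference, the limit being exceeded here is the one declared in the pod's container spec; a minimal sketch (the values are only illustrative) looks like this:

    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"     # the container is killed if it exceeds this
        cpu: "500m"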
So how is this resource limit actually enforced? The limit is ultimately applied by Docker; Kubernetes simply translates the values in the spec and calls the Docker interface to set them. The real problem to solve is how to make the Java heap respect the container's memory limit.
There are a couple of ways to solve this; the approach described here is to explicitly configure the Java heap size.
Configuring the Java heap memory
The java -Xmx flag sets the maximum heap size, and -Xms sets the initial heap size.
Usually you set a maximum heap size. If no maximum is set, the JVM keeps consuming the host's memory until physical memory runs short and the heap overflows, which is very common. With -Xmx set, the JVM triggers garbage collection as the heap approaches the limit and reclaims memory, which keeps the Java application running stably. Configuring resource limits in the YAML alone is not enough; we must also set the heap size for the JVM. Hard-coding the value in the Dockerfile is impractical, so the usual pattern is to reference a variable in the Dockerfile and set that variable in the YAML file:
    env:
      - name: JAVA_OPTS
        value: "-Xmx1g"
This sits alongside the container resource limits we configured earlier. The variable is passed into the pod's container, and the CMD in the Dockerfile references $JAVA_OPTS, so the command picks up the value from the environment and the application starts with that heap size. It is recommended to set this value roughly 10% lower than the memory value in limits, because exceeding limits gets the container killed and restarted again.
Once this is set, rebuild the image, exec into the container, and check the running process; you will see the heap options applied. Others set this up the same way.
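Putting the pieces together, the container section of the Deployment might look roughly like this (a sketch: the name, image, and sizes are illustrative, and the -Xmx value is kept about 10% below the memory limit; the Dockerfile is assumed to reference the variable in shell form, e.g. CMD java $JAVA_OPTS -jar app.jar):

    containers:
      - name: java-app                                  # illustrative name
        image: registry.example.com/java-app:latest     # illustrative image
        env:
          - name: JAVA_OPTS
            value: "-Xmx1g"       # heap set ~10% below the memory limit
        resources:
          limits:
            memory: "1152Mi"      # heap plus headroom for the JVM itself
            cpu: "1"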
The second question: the importance of health checks during rolling updates.
Rolling update is the default deployment strategy in Kubernetes, and it is usually the first thing you rely on after deploying. When a health check is configured, the rolling update decides whether to continue updating and whether to admit traffic based on the probe status, that is, on whether the current application is actually serving. That way, throughout the rollout, you always keep replicas available and the upgrade stays smooth. This is why, from the moment you set up rolling updates, the health check matters so much.
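For reference, the rolling update strategy is declared on the Deployment; a minimal sketch (the surge and unavailable values are illustrative and depend on how much disruption you can tolerate) looks like this:

    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1          # at most one extra pod above the desired replica count
        maxUnavailable: 0    # never drop below the desired replica count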
What is the role of health checks when rolling updates are initiated?
Consider a replica that needs a minute after startup before it can serve, for example a slow-starting Java application. Without a health check to confirm it is ready, Kubernetes treats it as ready immediately, yet for that minute it cannot actually serve, so new traffic sent to it fails. That is the first case. The second case is human configuration error: the application cannot reach the database, cannot reach some other dependency, or the config file is wrong, and a rolling update is triggered anyway. The rollout completes, every new replica has the same problem, and all of the old replicas have been replaced. In production the consequences are severe: many services simply stop working.
So rolling updates must always be paired with health checks. With a health check in place, no new traffic is forwarded to a new replica until it passes the check, and if it does not pass, the old replicas are not all replaced, that is, the rollout does not continue, because the update respects a minimum number of available replicas and pauses when that number is not met.
There are two kinds of health checks; readinessProbe is the readiness check. Each probe supports three methods: http, which probes a URL and judges the returned status code; tcpSocket, which probes a port; and exec, which runs a shell command and judges its return value. For the readiness check, if the probe fails, for instance the HTTP page returns a bad status code or the local port cannot be reached, the pod is not added behind the Service. The Service is the unified access entry for the application, so a pod that has not passed the check receives no new traffic. There is also initialDelaySeconds: 60, which waits 60 seconds before the first check because a Java application usually takes about a minute to start, and periodSeconds: 10, which retries the check 10 seconds after a failure. The other kind is livenessProbe, the liveness check.
If the liveness check fails, the container is killed and, according to the restart policy (usually a restart), a new container is pulled up, after which readiness is evaluated again. It can likewise probe a port, or use the other two methods, http and exec. In general you should configure both: the readiness check keeps new traffic away from a pod that is not ready, and the liveness check restarts a pod that has stopped working.
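Put together, the two probes in a container spec might look roughly like this (a sketch assuming the application listens on port 8080; the port and timings should be adjusted to the application):

    readinessProbe:
      tcpSocket:
        port: 8080            # pod receives Service traffic only after this passes
      initialDelaySeconds: 60
      periodSeconds: 10
    livenessProbe:
      tcpSocket:
        port: 8080            # container is restarted if this keeps failing
      initialDelaySeconds: 60
      periodSeconds: 10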
The last question: traffic loss during rolling updates.
The symptoms are usually rejected connections, error responses, or callers finding the service unavailable.
A rolling update works by shutting down existing pods and starting new ones. Shutting down an existing pod means deleting it: the apiserver notifies the kubelet, the kubelet stops the container, the pod is removed from the Service backend so no new traffic is distributed to it, and kube-proxy is told to refresh the forwarding rules on each node. This is the pod's offline cycle.
The problem is that these steps do not complete at the same instant. After the pod starts shutting down there is a window during which new traffic may still be routed to it, but the application is no longer handling new requests, so those connections are rejected. How do we solve this? The readiness probe, which played a key role earlier, does not help here: once the endpoints controller receives the pod's deletion event, the outcome of the readiness check no longer matters.
So how do you make sure the pod shuts down gracefully?
The fix is to add a short sleep when the pod shuts down. Kubernetes provides lifecycle hooks for both startup and shutdown, so you can run a hook just before the container is closed. The hook can execute a shell command or issue an HTTP request, the two supported types, and it is defined in the container spec at the same level as env.
Sleeping for 5 seconds means the container being shut down does not exit immediately: it waits 5 seconds and only then stops the application. Those 5 seconds are enough for kube-proxy to refresh its forwarding rules, so new traffic is no longer forwarded to the pod that is about to close. The hook delays the pod's shutdown just long enough to give kube-proxy time to update the rules.
Add the following to the container spec:

    lifecycle:
      preStop:
        exec:
          command:
            - sh
            - -c
            - "sleep 5"
This way you do not need to modify the application code: during a rolling update, traffic is no longer forwarded to a pod that is about to close, which solves the traffic-loss problem.
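As noted above, the hook can also issue an HTTP request instead of running a command; a sketch of that variant (the path and port are illustrative and assume the application actually exposes such an endpoint):

    lifecycle:
      preStop:
        httpGet:
          path: /shutdown-prep   # hypothetical endpoint in the application
          port: 8080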