What mistakes do people run into when using Kubernetes? This article shares the ten most common mistakes the authors have seen over years of working with Kubernetes.
Over years of working with Kubernetes, we have seen countless clusters (both managed and unmanaged, on GCP, AWS, and Azure) and the same mistakes coming up again and again. We have made most of these mistakes ourselves, and there is no shame in that!
This article will show you some of the problems we often encounter and talk about ways to fix them.
1. Resources: requests and limits
This one deserves the most attention, which is why it is first on the list.
People often either do not set CPU requests at all or set them very low (so that many Pods can fit on each node), which leads to overcommitted nodes. When demand is high, the node's CPUs run at full capacity and each workload only gets "what it asked for", so the CPU gets throttled, which shows up as increased application latency, timeouts, and so on.
BestEffort (don't do this):
resources: {}
Very low cpu (don't do this):
resources:
  requests:
    cpu: "1m"
On the other hand, setting a CPU limit can unnecessarily throttle Pods even while the node's CPU is not fully utilized, which also increases latency. There is also an ongoing discussion about the CPU CFS quota in the Linux kernel, the throttling caused by setting CPU limits, and turning the CFS quota off instead. CPU limits can cause more problems than they solve. See the link below for more information.
Overcommitting memory gets us into more trouble. Hitting the CPU limit results in throttling; hitting the memory limit gets the Pod killed. Ever seen an OOMkill (a process killed for running out of memory)? That is what we are talking about. Want to minimize how often it happens? Then don't overcommit memory, and use Guaranteed QoS (Quality of Service) by setting the memory request equal to the limit, as in the example below. For more information, see the presentation by Henning Jacobs (Zalando).
https://www.slideshare.net/try_except_/optimizing-kubernetes-resource-requestslimits-for-costefficiency-and-latency-highload
Burstable (more likely to get OOMkilled):
resources:
  requests:
    memory: "128Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"
    cpu: 2
Guaranteed:
resources:
  requests:
    memory: "128Mi"
    cpu: 2
  limits:
    memory: "128Mi"
    cpu: 2
So what are our tricks when setting up resources?
We can use metrics-server to see the current CPU and memory usage of Pods (and the containers in them). Chances are you already have it enabled. Simply run the following commands:
kubectl top pods
kubectl top pods --containers
kubectl top nodes
However, these only show current usage. That is enough to get a rough idea, but ultimately we want to see these usage metrics over time (to answer questions like: what was the CPU usage peak yesterday morning?). Tools such as Prometheus and DataDog can do this: they scrape the metrics from metrics-server, store them, and let us query and graph them.
VerticalPodAutoscaler can help automate this manual process: looking at CPU and memory usage over time and setting new requests and limits based on that data.
https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler
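As a minimal sketch (it assumes the VPA custom resources and controller are installed in the cluster; "my-app" is a hypothetical Deployment name), a VerticalPodAutoscaler in recommendation-only mode looks like this:
# Hypothetical VPA that only produces recommendations; it does not evict or
# resize Pods, so it is a safe way to start gathering request/limit suggestions.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"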
Making effective use of compute is not easy; it is like playing Tetris all the time. If we find ourselves spending a lot on compute while average utilization is low (say, around 10%), we may want to look at AWS Fargate or Virtual Kubelet-based products. They rely on a serverless, pay-per-use billing model that may turn out cheaper for us.
2. Liveness and readiness probes
By default, Kubernetes does not specify any liveness or readiness probes. And sometimes it stays that way...
But if there is an unrecoverable error, how will our service be restarted? How does the load balancer know that a particular Pod can start processing traffic or handle more traffic?
People usually don't know the difference between the two.
If the liveness probe fails, the Pod gets restarted.
When the readiness probe fails, the Pod is disconnected from the Kubernetes Service (we can check this with kubectl get endpoints), and no traffic is sent to it until the probe succeeds again.
Both of them run throughout the Pod lifecycle. This is very important.
It is commonly assumed that the readiness probe only runs at startup, to determine when the Pod is Ready and can start handling traffic. But that is just one of its use cases.
Another use case is for a Pod to signal, at some point during its lifetime, that it is too hot handling too much traffic (or an expensive computation), so that we stop sending it more work and let it cool down; once the readiness probe succeeds again, we resume sending it traffic. In this situation, having the liveness probe fail as well would be very counterproductive: why restart a healthy Pod that is busy doing lots of work?
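As a minimal sketch (the container name, image, port, and the /healthz and /ready paths are hypothetical; wire them to whatever your application actually exposes), the two probes can point at different endpoints:
# Hypothetical example: /healthz only answers "is this process broken and in
# need of a restart?", while /ready answers "can this Pod take traffic right now?"
containers:
  - name: my-app
    image: my-app:1.2.3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5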
Sometimes not specifying either probe is better than specifying the wrong one. As mentioned above, if the liveness probe is identical to the readiness probe, we are in big trouble. We may want to start with only a readiness probe, because liveness probes are dangerous.
https://twitter.com/sszuecs/status/1175803113204269059
https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html
Don't let either probe fail when one of your shared dependencies is down, as that causes a cascading failure of all the Pods. We would be shooting ourselves in the foot.
https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/
3. Using a LoadBalancer for every HTTP service
There may be many HTTP services in our cluster, and we want to expose these services to the outside world.
If we expose a Kubernetes Service with type: LoadBalancer, its controller (vendor-specific) provisions and reconciles an external load balancer (not necessarily an L7 load balancer, more likely an L4 one). When we create many of these resources, they can get expensive (external static IPv4 addresses, compute, per-second pricing...).
In that case it may be better to share a single external load balancer: expose the services with type: NodePort, or, better yet, deploy something like nginx-ingress-controller (or traefik) as the single NodePort endpoint behind that external load balancer and route traffic inside the cluster based on Kubernetes Ingress resources.
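As a sketch of that setup (the host name and Service name are hypothetical), a single ingress controller then routes to plain ClusterIP Services:
# Hypothetical Ingress: one nginx-ingress-controller, sitting behind a single
# external load balancer, routes by host/path to ordinary ClusterIP Services.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app    # an ordinary ClusterIP Service, not a LoadBalancer
                port:
                  number: 80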
Other intra-cluster (micro)services that talk to each other can do so through ClusterIP Services and the out-of-the-box DNS service discovery. Be careful not to use their public DNS/IPs, as that can affect latency and cloud costs.
4. Cluster autoscaling that is not Kubernetes-aware
When adding nodes to or removing nodes from a cluster, we should not rely on simple metrics such as the CPU utilization of those nodes. When scheduling Pods, the decision is made according to many scheduling constraints, such as Pod and node affinities, taints and tolerations, resource requests, QoS, and so on. Having an external autoscaler that does not understand these constraints handle the scaling can be troublesome.
Suppose a new Pod needs to be scheduled, but all of the available CPU is already requested and the Pod is stuck in the Pending state. An external autoscaler that looks at the current average CPU usage (rather than what has been requested) decides not to scale out (not to add a new node), and the Pod never gets scheduled.
Scaling in (removing nodes from the cluster) is always harder. Suppose we have a stateful Pod with a persistent volume attached. Since persistent volumes usually belong to a specific availability zone and are not replicated across zones, if a custom autoscaler deletes the node carrying this Pod, the scheduler cannot place it on another node, because the Pod can only stay in the availability zone where the persistent disk lives. The Pod is stuck in the Pending state again.
The community widely uses cluster-autoscaler, which runs inside the cluster and integrates with the APIs of most major public cloud providers. It understands all of these constraints and would scale out in the situations above. It can also figure out whether it can scale in gracefully without violating any of the constraints we have set, which saves us compute costs.
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
5. Not using the power of IAM/RBAC
Don't use IAM Users with permanent secrets for machines and applications; use temporary credentials generated through roles and service accounts instead.
We often see access keys and secrets hard-coded in application configuration, and keys that are never rotated while Cloud IAM is in use. Use IAM roles and service accounts instead of Users wherever possible.
Skip kube2iam and use IAM Roles for Service Accounts instead, as described in this blog post by Štěpán Vraný.
Https://blog.pipetail.io/posts/2020-04-13-more-eks-tips/
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-app-role
  name: my-serviceaccount
  namespace: default
There is only one annotation. It's not that hard to do.
In addition, don't give admin or cluster-admin permissions to service accounts or instance profiles when they don't need them. That is a bit harder, especially with K8s RBAC, but still worth the effort.
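As a minimal least-privilege sketch (the role and namespace names are placeholders), a service account can be limited to reading Pods in its own namespace instead of holding cluster-admin:
# Hypothetical least-privilege RBAC: the service account may only read Pods in
# its namespace, instead of holding cluster-admin.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-serviceaccount-pod-reader
  namespace: default
subjects:
  - kind: ServiceAccount
    name: my-serviceaccount
    namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io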
6. Self anti-affinities for Pods
A Deployment had three Pod replicas running, then a node went down, and all the replicas went down with it. What? All the replicas were running on one node? Wasn't Kubernetes supposed to be magical and provide high availability?!
We cannot expect the Kubernetes scheduler to enforce anti-affinities for our Pods. We have to define them explicitly.
# omitted for brevity
labels:
  app: zk
# omitted for brevity
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: "app"
              operator: In
              values:
                - zk
        topologyKey: "kubernetes.io/hostname"
That's it. This ensures the Pods are scheduled onto different nodes (this is checked only at scheduling time, not at execution time, hence the requiredDuringSchedulingIgnoredDuringExecution).
Here we are talking about podAntiAffinity across different node names (topologyKey: "kubernetes.io/hostname"), not across different availability zones. If you really need a higher level of availability, dig deeper into this topic.
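For example, spreading the replicas across zones is mostly a matter of changing the topologyKey; this is a sketch that assumes your nodes carry the standard zone label (older clusters may expose the zone under a different label):
# Sketch: spread the replicas across availability zones rather than just across
# nodes. topology.kubernetes.io/zone is the standard zone label on recent
# clusters; older ones may use failure-domain.beta.kubernetes.io/zone.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: "app"
              operator: In
              values:
                - zk
        topologyKey: "topology.kubernetes.io/zone"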
7. No PodDisruptionBudget
We run production workloads on Kubernetes. Our nodes and clusters have to be upgraded or decommissioned from time to time. PodDisruptionBudget (pdb) is an API that provides a service guarantee between cluster administrators and cluster users.
Make sure a pdb is created, to avoid unnecessary service outages caused by draining nodes.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper
As a cluster user, we can tell the cluster administrator: "Hey, I have a zookeeper service here, and no matter what, I want at least two replicas to be available at all times."
I discussed this topic in more depth in this blog post.
https://blog.marekbartik.com/posts/2018-06-29_kubernetes-in-production-poddisruptionbudget/
8. More than one tenant or environment in a shared cluster
Kubernetes namespaces do not provide any strong isolation.
People seem to expect that if they put non-production workloads in one namespace and production workloads in the production namespace, the two will never affect each other. To some extent we can achieve fair sharing (with resource requests and limits, quotas, and priorities) and isolation (with affinities, tolerations, taints, or nodeSelectors), and thus "physically" separate the workloads on the data plane, but this separation is quite complex.
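A per-namespace ResourceQuota is one of the fair-sharing tools mentioned above; as a sketch (the namespace name and the numbers are hypothetical), it caps what one tenant can request:
# Hypothetical quota: the "staging" namespace cannot request or consume more
# than this, so it cannot starve the production namespace on shared nodes.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"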
If we need both kinds of workloads in the same cluster, we have to bear that complexity. If we are not forced into a single cluster, and adding another cluster is cheap (for example, on a public cloud), we should put them in separate clusters to get a much stronger level of isolation.
9. externalTrafficPolicy: Cluster
It is common to see all traffic routed into the cluster through a NodePort Service, which uses externalTrafficPolicy: Cluster by default. This means the NodePort is opened on every node in the cluster, so any of them can be used to reach the desired service (a set of Pods).
Usually the Pods targeted by the NodePort Service actually run on only a subset of those nodes. That means if I talk to a node that is not running one of the Pods, it forwards the traffic to another node, adding an extra network hop and increasing latency (if the nodes sit in different AZs or data centers, the latency can be significant, and there are additional egress costs).
Setting externalTrafficPolicy: Local on the Kubernetes Service opens the NodePort only on the nodes where the Pods are actually running, not on every node. If an external load balancer health-checks its endpoints (as AWS ELB does), it will send traffic only to the nodes that are supposed to receive it, improving latency, reducing compute overhead and egress costs, and keeping things saner.
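A minimal sketch of such a Service (the name, selector, and ports are hypothetical):
# Hypothetical NodePort Service with externalTrafficPolicy: Local. Only nodes
# that actually run a matching Pod answer on the NodePort, so a health-checking
# external load balancer sends traffic straight to them (no extra hop).
apiVersion: v1
kind: Service
metadata:
  name: my-ingress-controller
spec:
  type: NodePort
  externalTrafficPolicy: Local
  selector:
    app: my-ingress-controller
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP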
We may have something like traefik or nginx-ingress-controller exposed as a NodePort (or behind a LoadBalancer that also uses NodePorts) to handle ingress HTTP traffic routing, and this setting can greatly reduce the latency of those requests.
Here is a great blog post that goes deeper into externalTrafficPolicy and its trade-offs.
https://www.asykim.com/blog/deep-dive-into-kubernetes-external-traffic-policies
10. Treating the cluster as a pet + putting too much pressure on the control plane
Have you ever named servers Anton, HAL9000, or Colossus, generated random IDs for your nodes, but then given the cluster itself a meaningful name?
You may have had this experience too: at first, we used Kubernetes for a proof of concept and named the cluster "testing"; as a result, nobody dared to touch it, and it was never renamed even once it was serving production. (True story.)
Treating clusters as pets is no joke. We may need to delete clusters from time to time, practice disaster recovery, and take care of our control plane. Being afraid to touch the control plane is not a good sign. Is etcd dead? Well, then we have a real problem.
On the other hand, the control plane should not be overused either. Perhaps over time the control plane gets slow. This is likely because we create lots of objects without ever rotating them (very common with helm, whose default settings do not rotate the state it stores in configmaps/secrets, leaving us with thousands of objects in the control plane), or because we constantly scrape and edit large amounts of data through kube-api (for autoscaling, CI/CD, monitoring, event logs, controllers, etc.).
Also, check the "SLAs"/SLOs and guarantees offered by your managed Kubernetes. A vendor may guarantee the availability of the control plane (or its subcomponents), but not the p99 latency of the requests you send to it. In other words, kubectl get nodes taking 10 minutes to return the correct answer may not violate the service guarantee at all.
11. A bonus: using latest tags
This one is a classic. I don't think it is as common these days, because many of us got burned too many times and have stopped using :latest and started pinning versions. Finally, some peace!
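As a small sketch (the registry, image name, and tag are hypothetical), pin an explicit tag, or better, an immutable digest, in your Pod spec:
# Instead of "image: my-app:latest", pin an explicit tag (or a digest) so that a
# redeploy or rollback always pulls exactly the same bits. Names are hypothetical.
containers:
  - name: my-app
    image: registry.example.com/my-app:1.4.2
    # stricter still: pin by digest
    # image: registry.example.com/my-app@sha256:<digest>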
ECR has a powerful feature, tag immutability, that is definitely worth checking out.
https://aws.amazon.com/about-aws/whats-new/2019/07/amazon-ecr-now-supports-immutable-image-tags/
12. Summary
Don't expect everything to work automatically: Kubernetes is not a silver bullet. A bad application will still be a bad application on Kubernetes (in fact, it may even get worse). If we are careless, we end up with a pile of problems: too much complexity, too much stress, a slow control plane, and no disaster recovery strategy. Don't expect multi-tenancy and high availability out of the box. Take the time to make your application cloud-native.
Original address: https://www.linuxprobe.com/kubernetes-10-error.html