
Problems and solutions encountered in scaling Kubernetes to 2,500 nodes


This article describes the problems encountered in scaling Kubernetes to 2,500 nodes and how they were solved. The content is quite detailed; interested readers may find it a useful reference, and I hope it helps you.

Kubernetes has claimed to support more than 5,000 nodes since version 1.6, but on the way from a few dozen nodes to that scale, problems are hard to avoid.

Problems encountered and how they were solved

Problem 1: after reaching 1 ~ 500 nodes

Problem:

kubectl sometimes times out (P.S. kubectl -v=6 displays the details of every API request).
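
For reference, the verbosity flag is used like this (the specific subcommand below is only an illustration):

# Print the API requests kubectl makes, including URLs and response codes
kubectl get nodes -v=6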

Attempted fix:

At first, we thought it was load on the kube-apiserver, so we tried adding replicas behind a proxy to help with load balancing.

However, even with more than 10 master replicas, the problem remained, so the cause was not kube-apiserver load; GKE can carry 500 nodes with a single 32-core VM.

Reason:

Having ruled out the above, we started troubleshooting the remaining services on the master (etcd, kube-proxy).

We then started tuning etcd.

Using Datadog to watch etcd throughput, we found abnormal latency spikes of ~100 ms.

A benchmark with the fio tool showed that only about 10% of the available IOPS (Input/Output Operations Per Second) was being used, yet write latency of around 2 ms was already degrading performance.
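
A write-latency benchmark in the spirit of that test might look like the following sketch; the target directory, file size, and block size are illustrative assumptions, not the exact parameters used:

# Measure sync write latency/IOPS on the disk that backs etcd's data directory
fio --name=etcd-bench --directory=/var/lib/etcd-bench \
    --rw=write --ioengine=sync --fdatasync=1 \
    --bs=2k --size=256m --runtime=60 --time_based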

We changed etcd's storage from a network-attached disk to a local temporary SSD on each machine.

Result: latency dropped from ~100 ms to ~200 µs.
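
A minimal sketch of that change, assuming the local SSD is mounted at a hypothetical /mnt/local-ssd, is to point etcd's data directory at it:

# Keep the etcd data directory on the local SSD instead of a network disk
# (other cluster flags omitted)
etcd --data-dir=/mnt/local-ssd/etcd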

Problem 2: around 1,000 nodes

Problem:

kube-apiserver was reading about 500 MB per second from etcd.

Attempted fix:

We inspected the network traffic between containers with Prometheus.

Reason:

It turned out that Fluentd and Datadog, running on every node, were polling for data too frequently.

After reducing the polling frequency of these two services, the traffic dropped from 500 MB/s to almost zero.
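
A hedged sketch of the kind of PromQL query that can surface this, assuming the standard cAdvisor container network metrics are scraped (the label is pod_name on older versions, pod on newer ones):

# Per-pod transmit bandwidth, in bytes per second, averaged over 5 minutes
sum by (pod_name) (rate(container_network_transmit_bytes_total[5m]))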

etcd tip: with --etcd-servers-overrides, Kubernetes Event data can be split off and handled by a separate set of etcd machines, as shown below.

--etcd-servers-overrides=/events#https://0.example.com:2381;https://1.example.com:2381;https://2.example.com:2381
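
In context, the kube-apiserver invocation would look roughly like the following sketch; the main --etcd-servers endpoint is an assumed placeholder, and the override reuses the example hosts above:

# Route Event objects to a dedicated etcd cluster, separate from the main one
kube-apiserver \
  --etcd-servers=https://main-etcd.example.com:2379 \
  --etcd-servers-overrides=/events#https://0.example.com:2381;https://1.example.com:2381;https://2.example.com:2381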

Problem 3: 1,000 ~ 2,000 nodes

Problem:

etcd could no longer accept writes and reported cascading failure errors.

kubernetes-ec2-autoscaler did not report the problem until every etcd instance had stopped working, and it then shut everything down.

Attempted fix:

We guessed that etcd's disk was full, but checking the SSD showed plenty of free space.

We checked whether there was a preset storage limit and found a default 2 GB cap.

Solution:

Add --quota-backend-bytes to the etcd startup flags to raise the limit, as in the sketch below.
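
For example (a sketch; the 8 GB value is only an illustrative choice):

# Raise etcd's storage quota from the default ~2 GB to 8 GB
etcd --quota-backend-bytes=8589934592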

Modify the kubernetes-ec2-autoscaler logic: only shut down the cluster when more than 50% of it is having problems.

Optimizing the high availability of the kube master services

Generally speaking, our architecture is one kube-master (the node running the main Kubernetes control-plane components: kube-apiserver, kube-scheduler, and kube-controller-manager) plus multiple workers. To achieve high availability, the following practices are worth following:

Run multiple kube-apiserver instances, restarting each of them with the --apiserver-count parameter set accordingly.
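
For instance, with three API server replicas (the count of 3 is an assumed example), each instance would be started with:

# Tell each replica how many API servers exist in total (other flags omitted)
kube-apiserver --apiserver-count=3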

kubernetes-ec2-autoscaler can automatically shut down idle resources for us. This runs somewhat counter to the default Kubernetes scheduler behaviour of spreading pods out, but with the settings below we can pack pods onto as few nodes as possible.

{"kind": "Policy", "apiVersion": "v1", "predicates": [{"name": "GeneralPredicates"}, {"name": "MatchInterPodAffinity"}, {"name": "NoDiskConflict"}, {"name": "NoVolumeZoneConflict"}, {"name": "PodToleratesNodeTaints"}], "priorities": [{"name": "MostRequestedPriority", "weight": 1} {"name": "InterPodAffinityPriority", "weight": 2}]}

The above is an example of adjusting the Kubernetes scheduler policy; increasing the weight of InterPodAffinityPriority achieves our goal. See the reference examples for more demonstrations.

Note that the Kubernetes Scheduler Policy does not support dynamic switching, so kube-apiserver needs to be restarted after a change (issue: 41600).
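
For reference, one common way to load a Policy file is the scheduler's legacy --policy-config-file flag; the path below is hypothetical, and the article does not specify how the policy was actually loaded:

# Load the custom scheduling policy at scheduler startup
kube-scheduler --policy-config-file=/etc/kubernetes/scheduler-policy.json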

The impact of adjusting scheduler policy

OpenAI used KubeDNS, but soon ran into problems:

Problem:

DNS lookups frequently failed (at random).

Domain lookups exceeded ~200 QPS.

Attempted fix:

While trying to work out why this was happening, we found more than 10 KubeDNS replicas running on some nodes.

Solution:

Because of the scheduler policy above, many pods were packed onto the same nodes.

KubeDNS is lightweight, so its replicas easily ended up on the same node, concentrating all the domain lookups there.

We needed to modify the pod affinity (see the related introduction) so that KubeDNS replicas are spread across different nodes as much as possible:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values:
          - kube-dns
      topologyKey: kubernetes.io/hostname

Slow Docker image pulls when creating new nodes

Problem:

Every time a new node was brought up, the Docker image pulls took about 30 minutes.

Attempted fix:

A very large container image (the Dota image, almost 17 GB) was holding up image pulling for the entire node.

We started checking whether kubelet offered other image-pull options.

Solution:

Add the --serialize-image-pulls=false option to the kubelet startup flags so that image pulls run in parallel and other services can start pulling earlier (see the kubelet startup options).

This option requires the Docker storage driver to be switched to overlay2 (refer to the Docker documentation).

Storing Docker images on SSD also makes image pulls faster.

Supplement: tracing the source code

// serializeImagePulls when enabled, tells the Kubelet to pull images one
// at a time. We recommend *not* changing the default value on nodes that
// run docker daemon with version < 1.9 or an Aufs storage backend.
// Issue #10959 has more details.
SerializeImagePulls *bool `json:"serializeImagePulls"`

Increasing the speed of docker image pulls

In addition, pull speed can be improved in the following ways:

Increase the kubelet --image-pull-progress-deadline parameter to 30 minutes.

Set the Docker daemon's max-concurrent-downloads parameter to 10 to allow more parallel download threads.
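
As a sketch, those two settings could look like this; the file path follows the usual Docker convention and the values mirror the ones above:

# kubelet startup flags (other flags omitted)
kubelet --image-pull-progress-deadline=30m --serialize-image-pulls=false

# /etc/docker/daemon.json
{
  "max-concurrent-downloads": 10,
  "storage-driver": "overlay2"
}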

Network performance improvement

Flannel performance limit

Network traffic between OpenAI's nodes can reach 10 ~ 15 Gbit/s, but with Flannel in the path the throughput dropped to ~2 Gbit/s.

The solution is to remove Flannel and use the underlying network directly:

hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
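
In a pod manifest these two settings sit under the pod's spec; a minimal sketch, with a hypothetical pod name and image:

apiVersion: v1
kind: Pod
metadata:
  name: host-network-example
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: app
    image: example/app:latest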

That covers the problems and solutions encountered in scaling Kubernetes to 2,500 nodes. I hope the content above has been helpful. If you found the article worthwhile, feel free to share it so more people can see it.
