Cluster-proportional-autoscaler Source Code Analysis and Solving the KubeDNS Performance Bottleneck

This article walks through the source code of cluster-proportional-autoscaler and shows how it can be used to relieve the KubeDNS performance bottleneck.
Working mechanism
Cluster-proportional-autoscaler is one of the Kubernetes incubator projects. It dynamically scales a target object in a given namespace according to the size of the cluster; only ReplicationController, ReplicaSet, and Deployment are supported as targets, not StatefulSet. At present there are only two autoscale modes, linear and ladder, and the code interfaces are clear enough that new modes can easily be added.
The working mechanism of cluster-proportional-autoscaler is very simple: it repeats the following steps at a fixed interval (configured with --poll-period-seconds, default 10s), as sketched in the code after this list:
Count the SchedulableNodes and SchedulableCores in the cluster
Get the latest configmap data from apiserver
Parse the configmap parameters according to the corresponding autoscale mode
Calculate the new expected replica count according to the corresponding autoscale mode
If the expected replica count differs from the current one, call the Scale API to trigger autoscaling
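The loop can be summarized with the following minimal Go sketch. It only illustrates the sequence of steps listed above; the k8sClient and controller interfaces here are assumed placeholders and do not match the project's real types.

// Sketch of the poll loop described above. Assumed simplification, not the project's actual code.
package sketch

import (
    "log"
    "time"
)

// Placeholder interfaces standing in for the roles described in the text.
type k8sClient interface {
    GetClusterStatus() (schedulableNodes, schedulableCores int32, err error)
    FetchConfigMap() (map[string]string, error)
    UpdateReplicas(expected int32) (previous int32, err error)
}

type controller interface {
    SyncConfig(data map[string]string) error
    GetExpectedReplicas(schedulableNodes, schedulableCores int32) (int32, error)
}

func run(client k8sClient, ctrl controller, pollPeriod time.Duration) {
    ticker := time.NewTicker(pollPeriod) // --poll-period-seconds, default 10s
    defer ticker.Stop()
    for range ticker.C {
        nodes, cores, err := client.GetClusterStatus() // step 1: count schedulable nodes/cores
        if err != nil {
            log.Printf("get cluster status: %v", err)
            continue
        }
        cm, err := client.FetchConfigMap() // step 2: get the latest configmap
        if err != nil {
            log.Printf("fetch configmap: %v", err)
            continue
        }
        if err := ctrl.SyncConfig(cm); err != nil { // step 3: parse params for the configured mode
            log.Printf("sync config: %v", err)
            continue
        }
        expected, err := ctrl.GetExpectedReplicas(nodes, cores) // step 4: compute expected replicas
        if err != nil {
            log.Printf("compute replicas: %v", err)
            continue
        }
        // step 5: UpdateReplicas only calls the Scale API when the count changed
        if _, err := client.UpdateReplicas(expected); err != nil {
            log.Printf("update replicas: %v", err)
        }
    }
}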
Configuration description
Cluster-proportional-autoscaler has six flags:
--namespace: the namespace of the target object to autoscale
--target: the object to autoscale; only deployment/replicationcontroller/replicaset are supported, and the value is case-insensitive
--configmap: the name of the configmap that stores the mode to use and its parameters; a concrete example is given later
--default-params: if the configmap specified by --configmap does not exist, or is later deleted, a new configmap is created from this configuration; it is recommended to set it
--poll-period-seconds: the check period, default 10s
--version: print the version and exit
Source code analysis
pollAPIServer
pkg/autoscaler/autoscaler_server.go:82

func (s *AutoScaler) pollAPIServer() {
    // Query the apiserver for the cluster status --- number of nodes and cores
    clusterStatus, err := s.k8sClient.GetClusterStatus()
    if err != nil {
        glog.Errorf("Error while getting cluster status: %v", err)
        return
    }
    glog.V(4).Infof("Total nodes %5d, schedulable nodes: %5d", clusterStatus.TotalNodes, clusterStatus.SchedulableNodes)
    glog.V(4).Infof("Total cores %5d, schedulable cores: %5d", clusterStatus.TotalCores, clusterStatus.SchedulableCores)

    // Sync autoscaler ConfigMap with apiserver
    configMap, err := s.syncConfigWithServer()
    if err != nil || configMap == nil {
        glog.Errorf("Error syncing configMap with apiserver: %v", err)
        return
    }

    // Only sync updated ConfigMap or before controller is set.
    if s.controller == nil || configMap.ObjectMeta.ResourceVersion != s.controller.GetParamsVersion() {
        // Ensure corresponding controller type and scaling params.
        s.controller, err = plugin.EnsureController(s.controller, configMap)
        if err != nil || s.controller == nil {
            glog.Errorf("Error ensuring controller: %v", err)
            return
        }
    }

    // Query the controller for the expected replicas number
    expReplicas, err := s.controller.GetExpectedReplicas(clusterStatus)
    if err != nil {
        glog.Errorf("Error calculating expected replicas number: %v", err)
        return
    }
    glog.V(4).Infof("Expected replica count: %3d", expReplicas)

    // Update resource target with expected replicas.
    _, err = s.k8sClient.UpdateReplicas(expReplicas)
    if err != nil {
        glog.Errorf("Update failure: %s", err)
    }
}

GetClusterStatus
GetClusterStatus counts the SchedulableNodes and SchedulableCores in the cluster, which are used later to calculate the expected replica count.
pkg/autoscaler/k8sclient/k8sclient.go:142

func (k *k8sClient) GetClusterStatus() (clusterStatus *ClusterStatus, err error) {
    opt := metav1.ListOptions{Watch: false}

    nodes, err := k.clientset.CoreV1().Nodes().List(opt)
    if err != nil || nodes == nil {
        return nil, err
    }
    clusterStatus = &ClusterStatus{}
    clusterStatus.TotalNodes = int32(len(nodes.Items))
    var tc resource.Quantity
    var sc resource.Quantity
    for _, node := range nodes.Items {
        tc.Add(node.Status.Capacity[apiv1.ResourceCPU])
        if !node.Spec.Unschedulable {
            clusterStatus.SchedulableNodes++
            sc.Add(node.Status.Capacity[apiv1.ResourceCPU])
        }
    }

    tcInt64, tcOk := tc.AsInt64()
    scInt64, scOk := sc.AsInt64()
    if !tcOk || !scOk {
        return nil, fmt.Errorf("unable to compute integer values of schedulable cores in the cluster")
    }
    clusterStatus.TotalCores = int32(tcInt64)
    clusterStatus.SchedulableCores = int32(scInt64)
    k.clusterStatus = clusterStatus
    return clusterStatus, nil
}
When counting nodes, Unschedulable nodes are excluded.
When counting cores, the Capacity of Unschedulable nodes is likewise excluded.
Note that when counting cores, the node's Capacity is used, not Allocatable.
In my opinion it would be better to use Allocatable rather than Capacity; a sketch of an Allocatable-based count follows the explanation below.
The difference shows up in large clusters. For example, if each node's Allocatable is 1 CPU and 4GB less than its Capacity, then in a 2000-node cluster the difference amounts to 2000 CPUs and 8000GB, which can change the target's replica count significantly.
Some readers may ask: what is the difference between a node's Allocatable and its Capacity?
Capacity is the full set of resources provided by the node's hardware: how much memory the server has, how many CPU cores, and so on, determined by the hardware.
Allocatable is Capacity minus the kube-reserved and system-reserved resources configured via kubelet flags, i.e. the amount of resources Kubernetes can actually allocate to applications.
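For comparison, here is a minimal sketch of what an Allocatable-based count could look like. This is an assumed modification for illustration, not code from the project; it mirrors GetClusterStatus above, uses the old client-go List signature without a context (newer client-go versions also take a context), and the import paths may differ depending on the client-go version.

// Sketch: count schedulable CPU from Allocatable instead of Capacity.
// Assumed illustration, mirroring GetClusterStatus above; not project code.
package sketch

import (
    "fmt"

    apiv1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

func schedulableAllocatableCores(clientset kubernetes.Interface) (int32, error) {
    nodes, err := clientset.CoreV1().Nodes().List(metav1.ListOptions{Watch: false})
    if err != nil || nodes == nil {
        return 0, err
    }
    var sc resource.Quantity
    for _, node := range nodes.Items {
        if !node.Spec.Unschedulable {
            // Allocatable already excludes kube-reserved and system-reserved resources.
            sc.Add(node.Status.Allocatable[apiv1.ResourceCPU])
        }
    }
    scInt64, ok := sc.AsInt64()
    if !ok {
        return 0, fmt.Errorf("unable to compute integer value of schedulable allocatable cores")
    }
    return int32(scInt64), nil
}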
SyncConfigWithServer
SyncConfigWithServer fetches the latest configmap data from the apiserver. Note that the configmap is not watched; it is fetched periodically according to --poll-period-seconds (default 10s), so by default there can be a delay of up to 10s before configuration changes take effect.
pkg/autoscaler/autoscaler_server.go:124

func (s *AutoScaler) syncConfigWithServer() (*apiv1.ConfigMap, error) {
    // Fetch autoscaler ConfigMap data from apiserver
    configMap, err := s.k8sClient.FetchConfigMap(s.k8sClient.GetNamespace(), s.configMapName)
    if err == nil {
        return configMap, nil
    }
    if s.defaultParams == nil {
        return nil, err
    }
    glog.V(0).Infof("ConfigMap not found: %v, will create one with default params", err)
    configMap, err = s.k8sClient.CreateConfigMap(s.k8sClient.GetNamespace(), s.configMapName, s.defaultParams)
    if err != nil {
        return nil, err
    }
    return configMap, nil
}
If the configmap specified by --configmap exists in the cluster, the latest configmap is fetched from the apiserver and returned.
If it does not exist, a configmap is created from the contents of --default-params and returned.
If it does not exist and --default-params is not configured either, nil is returned, the sync fails, and the whole poll round ends. Keep this in mind when using the autoscaler.
It is therefore recommended to configure --default-params: the configmap referenced by --configmap may be deleted by an administrator or user, intentionally or not, and if --default-params is not set, pollAPIServer simply returns at that point. The target is no longer autoscaled, and worse, you may not even notice it.
EnsureController
EnsureController is used to create the corresponding Controller and parse parameters according to the controller type configured in configmap.
pkg/autoscaler/controller/plugin/plugin.go:32

// EnsureController ensures controller type and scaling params
func EnsureController(cont controller.Controller, configMap *apiv1.ConfigMap) (controller.Controller, error) {
    // Expect only one entry, which uses the name of control mode as the key
    if len(configMap.Data) != 1 {
        return nil, fmt.Errorf("invalid configMap format, expected only one entry, got: %v", configMap.Data)
    }
    for mode := range configMap.Data {
        // No need to reset controller if control pattern doesn't change
        if cont != nil && mode == cont.GetControllerType() {
            break
        }
        switch mode {
        case laddercontroller.ControllerType:
            cont = laddercontroller.NewLadderController()
        case linearcontroller.ControllerType:
            cont = linearcontroller.NewLinearController()
        default:
            return nil, fmt.Errorf("not a supported control mode: %v", mode)
        }
        glog.V(1).Infof("Set control mode to %v", mode)
    }

    // Sync config with controller
    if err := cont.SyncConfig(configMap); err != nil {
        return nil, fmt.Errorf("Error syncing configMap with controller: %v", err)
    }
    return cont, nil
}
Check that the configmap data contains exactly one entry; otherwise the configmap is invalid and the process ends.
Check whether the controller type is linear or ladder and call the corresponding constructor to create the controller; otherwise a failure is returned.
linear --> NewLinearController
ladder --> NewLadderController
Call the controller's SyncConfig to parse the parameters in the configmap data and record the configmap's ResourceVersion on the controller object.
GetExpectedReplicas
The linear and ladder controllers each implement their own GetExpectedReplicas method to calculate the expected replica count for the current round. See the sections on Linear Controller and Ladder Controller below for details.
UpdateReplicas
UpdateReplicas takes the expected replica count calculated by GetExpectedReplicas and applies it through the Scale API of the target (rc/rs/deploy), which scales the target up or down.
pkg/autoscaler/k8sclient/k8sclient.go:172

func (k *k8sClient) UpdateReplicas(expReplicas int32) (prevRelicas int32, err error) {
    scale, err := k.clientset.Extensions().Scales(k.target.namespace).Get(k.target.kind, k.target.name)
    if err != nil {
        return 0, err
    }
    prevRelicas = scale.Spec.Replicas
    if expReplicas != prevRelicas {
        glog.V(0).Infof("Cluster status: SchedulableNodes[%v], SchedulableCores[%v]", k.clusterStatus.SchedulableNodes, k.clusterStatus.SchedulableCores)
        glog.V(0).Infof("Replicas are not as expected: updating replicas from %d to %d", prevRelicas, expReplicas)
        scale.Spec.Replicas = expReplicas
        _, err = k.clientset.Extensions().Scales(k.target.namespace).Update(k.target.kind, scale)
        if err != nil {
            return 0, err
        }
    }
    return prevRelicas, nil
}
The following is a code analysis of the implementation of Linear Controller and Ladder Controller.
Linear Controller
Let's first take a look at the parameters of linear Controller:
pkg/autoscaler/controller/linearcontroller/linear_controller.go:50

type linearParams struct {
    CoresPerReplica           float64 `json:"coresPerReplica"`
    NodesPerReplica           float64 `json:"nodesPerReplica"`
    Min                       int     `json:"min"`
    Max                       int     `json:"max"`
    PreventSinglePointFailure bool    `json:"preventSinglePointFailure"`
}
When writing configmap, refer to the following:
kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-autoscaler
  namespace: default
data:
  linear: |-
    {
      "coresPerReplica": 2,
      "nodesPerReplica": 1,
      "preventSinglePointFailure": true,
      "min": 1,
      "max": 100
    }
The other parameters are self-explanatory, but PreventSinglePointFailure deserves a mention. As the name suggests, it prevents a single point of failure. It is a bool value with no explicit initialization in the code, so it defaults to false. You can set "preventSinglePointFailure": true in the configmap data or in --default-params; when it is true and schedulableNodes > 1, the controller ensures that the target has at least 2 replicas, i.e. it prevents the target itself from becoming a single point of failure.
pkg/autoscaler/controller/linearcontroller/linear_controller.go:101

func (c *LinearController) GetExpectedReplicas(status *k8sclient.ClusterStatus) (int32, error) {
    // Get the expected replicas for the currently schedulable nodes and cores
    expReplicas := int32(c.getExpectedReplicasFromParams(int(status.SchedulableNodes), int(status.SchedulableCores)))

    return expReplicas, nil
}

func (c *LinearController) getExpectedReplicasFromParams(schedulableNodes, schedulableCores int) int {
    replicasFromCore := c.getExpectedReplicasFromParam(schedulableCores, c.params.CoresPerReplica)
    replicasFromNode := c.getExpectedReplicasFromParam(schedulableNodes, c.params.NodesPerReplica)
    // Prevent single point of failure by having at least 2 replicas when
    // there are more than one node.
    if c.params.PreventSinglePointFailure &&
        schedulableNodes > 1 &&
        replicasFromNode < 2 {
        replicasFromNode = 2
    }

    // Returns the results which yields the most replicas
    if replicasFromCore > replicasFromNode {
        return replicasFromCore
    }
    return replicasFromNode
}

func (c *LinearController) getExpectedReplicasFromParam(schedulableResources int, resourcesPerReplica float64) int {
    if resourcesPerReplica == 0 {
        return 1
    }
    res := math.Ceil(float64(schedulableResources) / resourcesPerReplica)
    if c.params.Max != 0 {
        res = math.Min(float64(c.params.Max), res)
    }
    return int(math.Max(float64(c.params.Min), res))
}
According to schedulableCores and the coresPerReplica in the configmap, replicasFromCore is calculated with the following formula
replicasFromCore = ceil(schedulableCores * 1/coresPerReplica)
According to schedulableNodes and the nodesPerReplica in the configmap, replicasFromNode is calculated with the following formula
replicasFromNode = ceil(schedulableNodes * 1/nodesPerReplica)
If min or max is configured in the configmap, replicas is clamped to the [min, max] range
replicas = min(replicas, max)
replicas = max(replicas, min)
If preventSinglePointFailure is set to true and schedulableNodes > 1, replicasFromNode is forced to be at least 2 to prevent a single point of failure, as described earlier
replicasFromNode = max(2, replicasFromNode)
The larger of replicasFromNode and replicasFromCore is returned as the expected replica count.
To sum up, the linear controller calculates replicas as follows; a small worked sketch follows the formula:
replicas = max(ceil(cores * 1/coresPerReplica), ceil(nodes * 1/nodesPerReplica))
replicas = min(replicas, max)
replicas = max(replicas, min)
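As a worked example, here is a small self-contained Go sketch of this formula (not the project's code). The parameters come from the example configmap above (coresPerReplica=2, nodesPerReplica=1, min=1, max=100); the cluster figures of 8 schedulable nodes with 4 cores each are made up, and the preventSinglePointFailure adjustment is omitted for brevity.

// Worked example of the linear formula; assumed illustration, not project code.
package main

import (
    "fmt"
    "math"
)

func linearReplicas(cores, nodes, coresPerReplica, nodesPerReplica, min, max float64) int {
    replicas := math.Max(math.Ceil(cores/coresPerReplica), math.Ceil(nodes/nodesPerReplica))
    if max != 0 {
        replicas = math.Min(replicas, max) // clamp to max
    }
    return int(math.Max(replicas, min)) // clamp to min
}

func main() {
    // 8 nodes * 4 cores = 32 schedulable cores:
    // max(ceil(32/2), ceil(8/1)) = max(16, 8) = 16, already within [1, 100]
    fmt.Println(linearReplicas(32, 8, 2, 1, 1, 100)) // prints 16
}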
Ladder Controller
The following is the parameter structure of the ladder controller:
pkg/autoscaler/controller/laddercontroller/ladder_controller.go:66

type paramEntry [2]int
type paramEntries []paramEntry

type ladderParams struct {
    CoresToReplicas paramEntries `json:"coresToReplicas"`
    NodesToReplicas paramEntries `json:"nodesToReplicas"`
}
When writing configmap, refer to the following:
kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-autoscaler
  namespace: default
data:
  ladder: |-
    {
      "coresToReplicas":
      [
        [1, 1],
        [3, 3],
        [256, 4],
        [512, 5],
        [1024, 7]
      ],
      "nodesToReplicas":
      [
        [1, 1],
        [2, 2],
        [100, 5],
        [200, 12]
      ]
    }
The following is the method for calculating the expected replica value for ladder Controller.
func (c *LadderController) GetExpectedReplicas(status *k8sclient.ClusterStatus) (int32, error) {
    // Get the expected replicas for the currently schedulable nodes and cores
    expReplicas := int32(c.getExpectedReplicasFromParams(int(status.SchedulableNodes), int(status.SchedulableCores)))

    return expReplicas, nil
}

func (c *LadderController) getExpectedReplicasFromParams(schedulableNodes, schedulableCores int) int {
    replicasFromCore := getExpectedReplicasFromEntries(schedulableCores, c.params.CoresToReplicas)
    replicasFromNode := getExpectedReplicasFromEntries(schedulableNodes, c.params.NodesToReplicas)

    // Returns the results which yields the most replicas
    if replicasFromCore > replicasFromNode {
        return replicasFromCore
    }
    return replicasFromNode
}

func getExpectedReplicasFromEntries(schedulableResources int, entries []paramEntry) int {
    if len(entries) == 0 {
        return 1
    }
    // Binary search for the corresponding replicas number
    pos := sort.Search(
        len(entries),
        func(i int) bool {
            return schedulableResources < entries[i][0]
        })
    if pos > 0 {
        pos = pos - 1
    }
    return entries[pos][1]
}
According to the range into which schedulableCores falls in the configmap's coresToReplicas table, select the preset expected replica count.
According to the range into which schedulableNodes falls in the configmap's nodesToReplicas table, select the preset expected replica count.
The larger of the two is returned as the expected replica count.
Note:
In ladder mode there is no setting to prevent a single point of failure; keep this in mind when writing the configmap.
In ladder mode, if the nodesToReplicas or coresToReplicas entries are missing or empty, the corresponding replicas value is 1.
For example, with the configmap above, if the cluster has schedulableCores=400 (which maps to 4 expected replicas) and schedulableNodes=120 (which maps to 5 expected replicas), the final expected replica count is 5.
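This example can be reproduced with the following self-contained Go sketch, which uses the same sort.Search-based lookup as the code above (a simplified illustration, not the project's code):

// Sketch of the ladder lookup, using the example configmap above; assumed illustration.
package main

import (
    "fmt"
    "sort"
)

// lookup returns the replica count of the last entry whose threshold is <= resources.
func lookup(resources int, entries [][2]int) int {
    if len(entries) == 0 {
        return 1
    }
    pos := sort.Search(len(entries), func(i int) bool {
        return resources < entries[i][0]
    })
    if pos > 0 {
        pos--
    }
    return entries[pos][1]
}

func main() {
    coresToReplicas := [][2]int{{1, 1}, {3, 3}, {256, 4}, {512, 5}, {1024, 7}}
    nodesToReplicas := [][2]int{{1, 1}, {2, 2}, {100, 5}, {200, 12}}

    fromCores := lookup(400, coresToReplicas) // 400 falls in [256, 512) -> 4
    fromNodes := lookup(120, nodesToReplicas) // 120 falls in [100, 200) -> 5
    if fromCores > fromNodes {
        fmt.Println(fromCores)
    } else {
        fmt.Println(fromNodes) // prints 5
    }
}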
Using kube-dns-autoscaler to solve KubeDNS performance bottleneck
Create the kube-dns-autoscaler Deployment and configmap with the following yaml file. kube-dns-autoscaler checks the replica count every 30 seconds and triggers autoscaling when necessary.
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: |-
    {
      "nodesPerReplica": 10,
      "min": 1,
      "max": 50,
      "preventSinglePointFailure": true
    }
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        k8s-app: kube-dns-autoscaler
    spec:
      imagePullSecrets:
      - name: harborsecret
      containers:
      - name: autoscaler
        image: registry.vivo.xyz:4443/bigdata_release/cluster_proportional_autoscaler_amd64:1.0.0
        resources:
          requests:
            cpu: "50m"
            memory: "100Mi"
        command:
        - /cluster-proportional-autoscaler
        - --namespace=kube-system
        - --configmap=kube-dns-autoscaler
        - --target=Deployment/kube-dns
        - --default-params={"linear":{"nodesPerReplica":10,"min":1}}
        - --logtostderr=true
        - --v=2

Summary and Prospect
The cluster-proportional-autoscaler code base is small and its working mechanism is simple. We hope to use it to scale KubeDNS dynamically with cluster size, to address the large-scale domain name resolution performance problems in our TensorFlow on Kubernetes project.
Currently it only supports autoscaling based on SchedulableNodes and SchedulableCores. In AI scenarios, cluster resources are extremely contended and the numbers of services and pods fluctuate widely, so later we may develop a controller that autoscales KubeDNS based on the number of services.
In addition, I am also considering isolating the KubeDNS deployment from the AI training servers, because CPU usage on those servers often exceeds 95% during training, and if KubeDNS is co-located with them, its performance will inevitably suffer.