How does the kubernetes scheduler implement its priority (scoring) phase with a map/reduce pattern? Many newcomers are not very clear about this, so the following explains it in detail; those who need it can follow along, and hopefully gain something.
In the priority phase, parallel computation across multiple nodes and multiple algorithms is achieved through a map/reduce pattern, and the final result storage is designed around a secondary index, making the whole computation lock-free. Meanwhile, to guarantee randomness in allocation, the final node is chosen at random among nodes of the same priority. If you have similar requirements later, you may wish to learn from this design.
1. Design Basics
1.1 Two Phases: Single Point and Aggregation
In the priority phase, apart from the final summation step, each individual algorithm's computation runs in two phases: single point and aggregation.
In the single-point phase, the current algorithm computes a score for a single node; in the aggregation phase, the results just produced in the single-point phase are aggregated across all nodes.
1.2 Parallelism: Nodes and Algorithms
Both the single-point and aggregation phases run in parallel, but over different objects: the single-point phase parallelizes over individual nodes, while the aggregation phase parallelizes at the algorithm level. This design partitions the computation so that multiple goroutines never compete for the same data, accelerating the priority calculation without any locking.
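The following is a minimal, self-contained sketch of this two-dimensional parallel pattern (not the scheduler's actual code; the names nodes, algos, and scores are illustrative only):

package main

import (
    "fmt"
    "sync"
)

func main() {
    nodes := []string{"node-a", "node-b", "node-c"}
    algos := []func(node string) int{
        func(node string) int { return len(node) }, // toy scoring algorithm 1
        func(node string) int { return 10 },        // toy scoring algorithm 2
    }

    // scores[algoIdx][nodeIdx]: every goroutine writes its own cells, so no lock is needed.
    scores := make([][]int, len(algos))
    for i := range scores {
        scores[i] = make([]int, len(nodes))
    }

    // "map" stage: parallel over nodes; each goroutine runs every algorithm for one node.
    var wg sync.WaitGroup
    for n := range nodes {
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            for a := range algos {
                scores[a][n] = algos[a](nodes[n])
            }
        }(n)
    }
    wg.Wait()

    // "reduce" stage: parallel over algorithms; each goroutine reads one algorithm's row.
    for a := range algos {
        wg.Add(1)
        go func(a int) {
            defer wg.Done()
            max := 0
            for _, s := range scores[a] {
                if s > max {
                    max = s
                }
            }
            fmt.Printf("algorithm %d: max score %d\n", a, max)
        }(a)
    }
    wg.Wait()
}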
1.3 Map and Reduce
Map and reduce are the concrete implementations of the parallelism described above: map is responsible for scoring a single node, while reduce aggregates the scores from the map stage and performs a second round of score computation based on the aggregated results.
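For reference, the function types behind these two stages looked roughly like the following in the scheduler sources of that era (a paraphrased sketch, not authoritative definitions; package and field names vary across versions):

// Map scores a single node for a pod (the single-point stage).
type PriorityMapFunction func(pod *v1.Pod, meta interface{}, nodeInfo *schedulernodeinfo.NodeInfo) (schedulerapi.HostPriority, error)

// Reduce post-processes all of one algorithm's node scores (the aggregation stage).
type PriorityReduceFunction func(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, result schedulerapi.HostPriorityList) error

// The deprecated all-in-one form; algorithms still using it are skipped by the map/reduce path.
type PriorityFunction func(pod *v1.Pod, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, nodes []*v1.Node) (schedulerapi.HostPriorityList, error)

// PriorityConfig ties a map function, an optional reduce function, and a weight together.
type PriorityConfig struct {
    Name     string
    Map      PriorityMapFunction
    Reduce   PriorityReduceFunction
    Function PriorityFunction
    Weight   int
}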
1.4 Weight
Each algorithm computed through the map/reduce phases carries a weight. If we want to make custom adjustments, we can tune the weights of individual algorithms to customize our priority (scoring) process.
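A quick worked example of how the weight enters the final sum (illustrative numbers only; the real accumulation appears in section 2.4): if algorithm A scores a node 5 with weight 1, and algorithm B scores it 7 with weight 2, the node's total is 5*1 + 7*2 = 19. As a minimal sketch:

package main

import "fmt"

func main() {
    // Hypothetical per-algorithm scores and weights for a single node.
    scores := []int{5, 7}
    weights := []int{1, 2}

    total := 0
    for i := range scores {
        total += scores[i] * weights[i] // the same accumulation the scheduler performs per node
    }
    fmt.Println(total) // 19
}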
1.5 Random Distribution
When priorities are compared, several nodes will inevitably end up with the same priority. While picking the preferred node, a random computation decides whether a node of equal priority should replace the best node found so far.
2. Source Code Analysis
The core flow of the priority phase lives mainly in PrioritizeNodes; only the key data structure designs are introduced here.
2.1 Lock-Free Storage of Calculation Results
Lock-free storage of the calculation results relies on the two-dimensional array below. To store one algorithm's result for one node, only two indexes are needed: the algorithm index and the node index. Likewise, if a goroutine is assigned the index of a single node, its writes are disjoint from every other goroutine's, so they can all compute in parallel.
// At computation time, a nodes []*v1.Node array holding all the nodes is passed in;
// the node index refers to positions in this array
results := make([]schedulerapi.HostPriorityList, len(priorityConfigs), len(priorityConfigs))
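One detail that matters for the indexed writes below (paraphrased from the surrounding scheduler code, which this article does not quote): each inner per-algorithm slice has to be pre-allocated to len(nodes) before the map stage runs, otherwise results[i][index] would write into a nil slice and panic. Roughly:

for i := range priorityConfigs {
    if priorityConfigs[i].Function == nil {
        // Pre-allocate one row per map/reduce-style algorithm so each goroutine
        // can later write results[i][nodeIndex] without any coordination.
        results[i] = make(schedulerapi.HostPriorityList, len(nodes))
    }
}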
2.2 Map Computation Based on the Node Index
The implementation of the ParallelizeUntil function was introduced back in the preselection (predicate) stage: it generates computation indexes from the count passed in and puts them into a chan; multiple goroutines then pull the indexes out of the chan and run the computation directly.
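Here is a minimal sketch of that pattern (not the real client-go implementation; it omits the context cancellation that the real workqueue.ParallelizeUntil supports):

package main

import (
    "fmt"
    "sync"
)

// parallelizeUntil distributes piece indexes [0, pieces) over a fixed number of workers.
func parallelizeUntil(workers, pieces int, doWorkPiece func(piece int)) {
    toProcess := make(chan int, pieces)
    for i := 0; i < pieces; i++ {
        toProcess <- i
    }
    close(toProcess)

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for piece := range toProcess {
                doWorkPiece(piece)
            }
        }()
    }
    wg.Wait()
}

func main() {
    parallelizeUntil(4, 10, func(piece int) {
        fmt.Println("processing piece", piece)
    })
}

The scheduler's actual map-stage invocation is shown next.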
workqueue.ParallelizeUntil(context.TODO(), 16, len(nodes), func(index int) {
    // Compute based on the node and the configured algorithms
    nodeInfo := nodeNameToInfo[nodes[index].Name]
    // Iterate over the algorithm index
    for i := range priorityConfigs {
        if priorityConfigs[i].Function != nil {
            continue
        }
        var err error
        // The node index selects where to save this single node's result
        results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo)
        if err != nil {
            appendError(err)
            results[i][index].Host = nodes[index].Name
        }
    }
})
2.3 Reduce Computation Based on the Algorithm Index
Parallelism at the algorithm level starts one goroutine per algorithm. Each goroutine uses its algorithm index to read all of that algorithm's map-stage results and compute over them; the outcome is stored back in the same locations.
// Compute the score for each strategy
for i := range priorityConfigs {
    if priorityConfigs[i].Reduce == nil {
        continue
    }
    wg.Add(1)
    go func(index int) {
        defer wg.Done()
        if err := priorityConfigs[index].Reduce(pod, meta, nodeNameToInfo, results[index]); err != nil {
            appendError(err)
        }
        if klog.V(10) {
            for _, hostPriority := range results[index] {
                klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), hostPriority.Host, priorityConfigs[index].Name, hostPriority.Score)
            }
        }
    }(i)
}
// Wait for all computations to be finished.
wg.Wait()
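As a concrete example of what a reduce function typically does, the scheduler of that era shipped a normalization helper that rescales an algorithm's node scores so the best node receives the maximum score. A simplified, self-contained sketch of the idea (the real NormalizeReduce helper also supports reversing the scale):

package main

import "fmt"

// normalizeScores rescales one algorithm's node scores so the highest becomes maxPriority.
func normalizeScores(scores []int, maxPriority int) {
    maxCount := 0
    for _, s := range scores {
        if s > maxCount {
            maxCount = s
        }
    }
    if maxCount == 0 {
        return // all zeros: nothing to rescale
    }
    for i := range scores {
        scores[i] = maxPriority * scores[i] / maxCount
    }
}

func main() {
    scores := []int{3, 6, 12}
    normalizeScores(scores, 10)
    fmt.Println(scores) // [2 5 10]
}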
2.4 Summing Up the Priority Scores
With the map/reduce stages done, all that remains is to accumulate, for every node, the results computed by all the algorithms.
// Summarize all scores.
result := make(schedulerapi.HostPriorityList, 0, len(nodes))
for i := range nodes {
    result = append(result, schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0})
    // Iterate over all the algorithm configurations
    for j := range priorityConfigs {
        result[i].Score += results[j][i].Score * priorityConfigs[j].Weight
    }
    for j := range scoresMap {
        result[i].Score += scoresMap[j][i].Score
    }
}
2.5 Randomly Selecting a Host by Priority
Random selection here means that when multiple hosts share the same priority, there is some probability that the current node replaces the previously selected node of equal priority (the highest-priority node so far). This is implemented mainly through cntOfMaxScore and rand.Intn(cntOfMaxScore).
func (g *genericScheduler) selectHost(priorityList schedulerapi.HostPriorityList) (string, error) {
    if len(priorityList) == 0 {
        return "", fmt.Errorf("empty priorityList")
    }
    maxScore := priorityList[0].Score
    selected := priorityList[0].Host
    cntOfMaxScore := 1
    for _, hp := range priorityList[1:] {
        if hp.Score > maxScore {
            maxScore = hp.Score
            selected = hp.Host
            cntOfMaxScore = 1
        } else if hp.Score == maxScore {
            cntOfMaxScore++
            if rand.Intn(cntOfMaxScore) == 0 {
                // Replace the candidate with probability of 1/cntOfMaxScore
                selected = hp.Host
            }
        }
    }
    return selected, nil
}
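Why this produces a uniform choice (a short check, not part of the original source): the loop is reservoir sampling with a reservoir of size one. The k-th node to tie the max score becomes the candidate with probability 1/k, and survives the arrival of each later tied node j with probability 1 - 1/j = (j-1)/j. For n tied nodes in total, the terms telescope:

P(node k selected) = (1/k) * (k/(k+1)) * ((k+1)/(k+2)) * ... * ((n-1)/n) = 1/n

so every top-scoring node ends up chosen with the same probability 1/n.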
3. Design Summary
The priority phase uses a map/reduce pattern to compute across multiple nodes and multiple algorithms in parallel, and designs the final result storage around a secondary index, achieving a lock-free design for the whole computation. Meanwhile, to guarantee randomness in allocation, the final node is assigned at random among nodes of equal priority. If you have similar requirements later, feel free to borrow from this design.