What are the key knowledge points of the Borg architecture?


Today I would like to share with you the key knowledge points of the Borg architecture. The content is detailed and the logic is clear. I believe most people still know too little about this topic, so I am sharing this article for your reference; I hope you gain something from reading it.

Borg architecture

A Borg cell consists of a set of machines, a logically centralized control service called the Borgmaster, and a Borglet agent process that runs on each machine (see Figure 1). All of Borg's components are written in C++.

1 Borgmaster

The cell's Borgmaster consists of two processes: the main Borgmaster process and a separate scheduler (§3.2). The main Borgmaster process handles all client RPC requests, both those that change state (such as creating a job) and those that read data (such as looking up a job). It also manages the state machines of all the objects in the system (machines, tasks, allocs, and so on), communicates with the Borglets, and provides a web UI as a backup to Sigma.

The Borgmaster is logically a single process, but it actually runs as five replicas. Each replica maintains an in-memory copy of the cell state, which is also recorded in a highly available, distributed, Paxos-based store [55] on the replicas' local disks. A single elected master per cell serves as both the Paxos leader and the state mutator, handling all requests that change the cell's state, such as submitting a job or terminating a task on a machine. A master is elected (a Paxos operation) when the cell starts up or when the previous master fails; the election requires acquiring a Chubby lock so that other systems can find the master. Electing a master or failing over to a new one typically takes about 10 s, but it can take up to a minute in a large cell because some in-memory state has to be reconstructed. When a replica recovers from a network partition, it dynamically resynchronizes its state from the other Paxos replicas.

The state of the Borgmaster at a point in time is called a checkpoint, and it is stored in the Paxos store as a snapshot plus a change log. Checkpoints have many uses: restoring the Borgmaster's state to an arbitrary point in the past (for example, to just before the request that triggered a software defect); fixing the state by hand in extreme cases; building a persistent log of events for future queries; and offline simulations.
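
As a concrete illustration of the snapshot-plus-change-log idea, here is a minimal sketch of how a replica could rebuild its in-memory state: load the most recent snapshot, then replay only the log entries recorded after it. All of the types here are hypothetical stand-ins, not Borg's actual data model.

```cpp
// Minimal sketch of checkpoint recovery: load the latest snapshot, then
// replay the change log recorded after it. All types are hypothetical
// stand-ins for Borg's real (unpublished) data model.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

enum class Op { kCreateJob, kKillJob };

struct Mutation {            // one entry in the Paxos change log
  int64_t seq;               // sequence number assigned by the leader
  Op op;
  std::string name;          // job affected by this mutation
};

struct CellState {           // the in-memory copy each replica keeps
  int64_t applied_seq = 0;   // last log entry folded into this state
  std::map<std::string, std::string> jobs;  // name -> status

  void Apply(const Mutation& m) {
    if (m.op == Op::kCreateJob) jobs[m.name] = "pending";
    else jobs.erase(m.name);
    applied_seq = m.seq;
  }
};

// Rebuild state from a snapshot plus the log entries recorded after it.
CellState Recover(const CellState& snapshot,
                  const std::vector<Mutation>& log) {
  CellState state = snapshot;
  for (const Mutation& m : log)
    if (m.seq > state.applied_seq) state.Apply(m);  // skip already-applied
  return state;
}

int main() {
  CellState snap;            // pretend this snapshot was read from disk
  snap.Apply({1, Op::kCreateJob, "websearch"});
  std::vector<Mutation> log = {{1, Op::kCreateJob, "websearch"},
                               {2, Op::kCreateJob, "batch-sort"},
                               {3, Op::kKillJob, "websearch"}};
  CellState state = Recover(snap, log);
  std::cout << "jobs after recovery: " << state.jobs.size() << "\n";  // 1
}
```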

A high-fidelity Borgmaster simulator called Fauxmaster reads checkpoint files; it contains a complete copy of the Borgmaster code with stub interfaces to the Borglets. It accepts RPCs that change the state machine and perform operations, such as "schedule all pending tasks". We use it to debug failures: interacting with it works the same as interacting with a live Borgmaster, and a simulated Borglet replays the real interactions recorded in the checkpoint, so a user can step through and watch all of the past changes to the system. Fauxmaster is also useful for capacity planning ("how many jobs of this type would fit?") and for sanity checks before changing a cell's configuration ("will this change evict any important jobs?").

2 Scheduling

When a job is submitted, the Borgmaster stores it persistently in the Paxos store and adds the job's tasks to the pending queue. The queue is scanned asynchronously by the scheduler, which assigns tasks to machines with sufficient available resources. (The scheduler mainly operates on tasks, not jobs.) The scan runs from high priority to low priority, modulated by round-robin within each priority level, to ensure fairness between users and to avoid head-of-line blocking behind a large job. The scheduling algorithm has two parts: feasibility checking, which finds machines on which the task could run, and scoring, which picks the most suitable of those machines.
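
To make the scan order concrete, here is a minimal sketch, under assumed simplified types, of a pending queue that serves higher priorities first and round-robins across users within a priority so that one user's large job cannot block the head of the queue. It illustrates the stated policy only; it is not Borg's implementation.

```cpp
// Sketch of the pending-queue scan order: highest priority first, and
// within a priority, one task per user in rotation (round-robin), so a
// single large job cannot monopolize the head of the queue.
#include <deque>
#include <functional>
#include <iostream>
#include <map>
#include <string>

struct Task { std::string user; std::string name; };

// Each priority level holds a rotation of per-user FIFOs.
struct UserQueue { std::string user; std::deque<Task> tasks; };
using PendingQueue = std::map<int, std::deque<UserQueue>, std::greater<int>>;

// Pop the next task to consider: highest priority first; within a
// priority, take one task from the user at the front of the rotation,
// then move that user to the back so users share the level fairly.
bool NextTask(PendingQueue& q, Task* out) {
  for (auto& [prio, rotation] : q) {
    while (!rotation.empty()) {
      UserQueue uq = std::move(rotation.front());
      rotation.pop_front();
      if (uq.tasks.empty()) continue;        // drop exhausted users
      *out = uq.tasks.front();
      uq.tasks.pop_front();
      rotation.push_back(std::move(uq));     // rotate user to the back
      return true;
    }
  }
  return false;
}

int main() {
  PendingQueue q;
  q[200].push_back({"alice", {{"alice", "web.0"}, {"alice", "web.1"}}});
  q[200].push_back({"bob",   {{"bob", "cache.0"}}});
  q[100].push_back({"carol", {{"carol", "batch.0"}}});
  Task t;
  while (NextTask(q, &t)) std::cout << t.user << "/" << t.name << "\n";
  // Order: alice/web.0, bob/cache.0, alice/web.1, carol/batch.0
}
```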

In the feasibility-checking phase, the scheduler finds the set of machines that meet the task's constraints and have enough available resources, including resources that could be freed up by evicting lower-priority tasks. In the scoring phase, the scheduler picks the "best" of the feasible machines. The score takes user preferences into account but is mostly driven by built-in criteria, such as minimizing the number and priority of preempted tasks, preferring machines that already have a copy of the task's packages, spreading tasks across power and failure domains, and mixing high- and low-priority tasks on a single machine so that the high-priority ones can expand during load spikes.
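
The two-phase structure might look like the following sketch. The types and scoring terms are simplified assumptions (the real scorer weighs many more criteria); only the feasibility-then-scoring shape comes from the text above.

```cpp
// Two-phase placement sketch: (1) feasibility keeps only machines whose
// available resources (including what evictable lower-priority tasks
// hold) can fit the task; (2) scoring ranks the feasible machines.
#include <iostream>
#include <optional>
#include <vector>

struct Resources { double cpu = 0, ram_gib = 0; };

struct Machine {
  int id = 0;
  Resources free;            // currently unused resources
  Resources preemptible;     // held by lower-priority tasks we could evict
  bool has_packages = false; // are the task's packages already cached here?
};

struct Task { Resources req; };

// Phase 1: can the task fit at all, counting resources we could free up?
bool Feasible(const Machine& m, const Task& t) {
  return m.free.cpu + m.preemptible.cpu >= t.req.cpu &&
         m.free.ram_gib + m.preemptible.ram_gib >= t.req.ram_gib;
}

// Phase 2: rank feasible machines. Higher is better; we prefer machines
// needing no preemption and already holding the packages (faster startup).
double Score(const Machine& m, const Task& t) {
  double s = 0;
  if (m.free.cpu >= t.req.cpu && m.free.ram_gib >= t.req.ram_gib)
    s += 10;                  // no preemption required
  if (m.has_packages) s += 5; // package locality
  return s;
}

std::optional<int> Place(const std::vector<Machine>& cell, const Task& t) {
  const Machine* best = nullptr;
  for (const Machine& m : cell) {
    if (!Feasible(m, t)) continue;                 // phase 1: filter
    if (!best || Score(m, t) > Score(*best, t))    // phase 2: pick best
      best = &m;
  }
  if (!best) return std::nullopt;
  return best->id;
}

int main() {
  std::vector<Machine> cell = {{1, {4, 8}, {0, 0}, false},
                               {2, {8, 16}, {0, 0}, true}};
  Task t{{2, 4}};
  if (auto id = Place(cell, t))
    std::cout << "placed on machine " << *id << "\n";  // machine 2
}
```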

Borg originally used a variant of E-PVM [4] for scoring, which generates a single cost value across heterogeneous resources and minimizes the change in cost when placing a task. In practice, though, E-PVM ends up spreading load evenly across all machines, leaving headroom for load spikes, but at the cost of increased fragmentation, especially for large tasks that need most of a machine; we sometimes nickname this allocation "worst fit".

At the other end of the spectrum of allocation strategies is "best fit", which packs machines as tightly as possible. This leaves some machines empty of user jobs (though they still run storage services), so placing large tasks is straightforward, but the tight packing punishes users who underestimate the resources they need. This strategy hurts applications with bursty loads and is particularly unfriendly to batch jobs that request little CPU so they can be scheduled easily into unused resources: 20% of non-prod tasks request less than 0.1 CPU cores.

Our current scoring model is a hybrid that tries to reduce the amount of stranded resources, that is, resources left unusable because another resource on the same machine is already fully allocated. Compared with best fit, this model provides about 3% better packing efficiency (as defined in [78]).
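
The three policies can be contrasted with toy scoring functions over a two-resource machine model. This is a hypothetical simplification for illustration; the published description does not give the actual hybrid formula, so the "imbalance" criterion below is only one plausible way to penalize stranding.

```cpp
// Toy contrast of the three scoring policies discussed above, for a
// single CPU+RAM machine model. Lower score = better candidate here.
#include <cmath>
#include <iostream>

struct Machine { double cpu_free, ram_free, cpu_total, ram_total; };
struct Task { double cpu, ram; };

// Worst fit (E-PVM-like outcome): prefer the emptiest machine, which
// spreads load and leaves per-machine headroom but fragments the cell.
double WorstFitScore(const Machine& m, const Task& t) {
  return -((m.cpu_free - t.cpu) / m.cpu_total +
           (m.ram_free - t.ram) / m.ram_total);
}

// Best fit: prefer the fullest machine that still fits, packing tightly
// but punishing any underestimated resource request.
double BestFitScore(const Machine& m, const Task& t) {
  return (m.cpu_free - t.cpu) / m.cpu_total +
         (m.ram_free - t.ram) / m.ram_total;
}

// Hybrid: penalize "stranded" resources, i.e. leftover CPU with no
// matching leftover RAM (or vice versa), which nothing can then use.
double HybridScore(const Machine& m, const Task& t) {
  double cpu_left = (m.cpu_free - t.cpu) / m.cpu_total;
  double ram_left = (m.ram_free - t.ram) / m.ram_total;
  return std::abs(cpu_left - ram_left);  // imbalance = stranding risk
}

int main() {
  Machine a{8, 8, 16, 32}, b{2, 30, 16, 32};  // b's spare RAM would strand
  Task t{1, 1};
  std::cout << "hybrid prefers a: "
            << (HybridScore(a, t) < HybridScore(b, t)) << "\n";  // 1
}
```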

If the machine chosen in the scoring phase does not have enough available resources to run the new task, Borg preempts (evicts) lower-priority tasks, working upward from the lowest priority, until enough resources are free. The preempted tasks are put back on the scheduler's pending queue rather than being migrated or hibernated.
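
A minimal sketch of that preemption step, simplified to a single CPU dimension with hypothetical types: victims are collected from the lowest priority upward, and placement on this machine fails if evicting every lower-priority task still is not enough.

```cpp
// Preemption sketch: evict running tasks from lowest priority upward
// until the new task fits, returning the victims so they can go back
// onto the pending queue. CPU-only simplification.
#include <algorithm>
#include <iostream>
#include <vector>

struct RunningTask { int priority; double cpu; };

// Returns the tasks to evict, or an empty vector if we cannot free enough
// (in which case placement on this machine fails).
std::vector<RunningTask> PickVictims(std::vector<RunningTask> running,
                                     double free_cpu, double needed_cpu,
                                     int new_task_priority) {
  std::vector<RunningTask> victims;
  if (free_cpu >= needed_cpu) return victims;   // already fits, no eviction
  std::sort(running.begin(), running.end(),
            [](const RunningTask& a, const RunningTask& b) {
              return a.priority < b.priority;   // lowest priority first
            });
  for (const RunningTask& t : running) {
    if (t.priority >= new_task_priority) break; // never evict equal/higher
    victims.push_back(t);
    free_cpu += t.cpu;
    if (free_cpu >= needed_cpu) return victims; // enough freed: done
  }
  return {};  // even evicting everything lower-priority was not enough
}

int main() {
  std::vector<RunningTask> running = {{100, 2.0}, {50, 1.0}, {200, 4.0}};
  std::vector<RunningTask> victims =
      PickVictims(running, /*free_cpu=*/0.5, /*needed_cpu=*/3.0,
                  /*new_task_priority=*/150);
  std::cout << "evicting " << victims.size() << " task(s)\n";  // 2
}
```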

Task startup latency (the time from job submission to a task running) is an area of ongoing attention. It is highly variable; generally speaking, it is around 25 s. Package installation takes about 80% of that time: a known bottleneck is contention for the local hard disk. To reduce startup time, the scheduler prefers machines that already have the task's packages (programs and data) installed: most packages are read-only, so they can be shared and cached. This is the only form of data locality that the Borg scheduler supports. Incidentally, Borg distributes packages to machines using tree-shaped and BitTorrent-like protocols.

In addition, the scheduler uses several techniques to scale to cells with tens of thousands of machines (§3.4).

3 Borglet

The Borglet is a local Borg agent deployed on every machine in a cell. It starts and stops tasks; restarts them if they fail; manages local resources by manipulating OS kernel settings; rolls over debug logs; and reports the machine's state to the Borgmaster and other monitoring systems.

The Borgmaster polls every Borglet every few seconds to retrieve the machine's current state and to send it any outstanding requests. This lets the Borgmaster control the rate of communication, avoids the need for an explicit flow-control mechanism, and prevents recovery storms [9].

The elected master is responsible for preparing the messages sent to the Borglets and for updating the cell's state with their responses. For performance scalability, each Borgmaster replica runs a stateless link shard that handles communication with a subset of the Borglets; the partitioning is recalculated whenever a Borgmaster election occurs. For resiliency, the Borglet always reports its full state, but the link shards aggregate and compress this information, reporting only differences to the state machines, to reduce the load on the elected master.

If a Borglet does not respond to several poll requests, its machine is marked as down and the tasks it was running are rescheduled onto other machines. If communication is later restored, the Borgmaster tells that Borglet to kill the tasks that have since been rescheduled, to avoid duplicates. A Borglet continues its regular operations even if it loses contact with the Borgmaster, so currently running tasks and services stay up even if all Borgmaster replicas fail.
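
Pulling the last three paragraphs together, here is a sketch (hypothetical types; the real wire protocol is not public) of a link shard that diffs each full Borglet report against the previous one, forwards only the changes to the master, and marks a machine down after several consecutive missed polls.

```cpp
// Link-shard sketch: poll each Borglet, compare the full report to the
// last one, forward only the diff to the elected master, and mark a
// machine down after kMaxMissedPolls consecutive failures.
#include <iostream>
#include <map>
#include <optional>
#include <string>

constexpr int kMaxMissedPolls = 3;

using BorgletState = std::map<std::string, std::string>;  // key -> value

struct MachineRecord {
  BorgletState last_reported;
  int missed_polls = 0;
  bool down = false;
};

// Which keys changed between two full reports?
BorgletState Diff(const BorgletState& prev, const BorgletState& curr) {
  BorgletState changed;
  for (const auto& [k, v] : curr) {
    auto it = prev.find(k);
    if (it == prev.end() || it->second != v) changed[k] = v;
  }
  return changed;
}

// One poll round for one machine. 'report' is empty if the poll timed
// out. Returns the diff to forward to the master, if there is one.
std::optional<BorgletState> HandlePoll(
    MachineRecord& rec, const std::optional<BorgletState>& report) {
  if (!report) {
    if (++rec.missed_polls >= kMaxMissedPolls && !rec.down)
      rec.down = true;  // master will reschedule this machine's tasks
    return std::nullopt;
  }
  rec.missed_polls = 0;
  rec.down = false;
  BorgletState d = Diff(rec.last_reported, *report);
  rec.last_reported = *report;
  if (d.empty()) return std::nullopt;  // nothing changed; spare the master
  return d;
}

int main() {
  MachineRecord rec;
  auto d1 = HandlePoll(rec, BorgletState{{"task.web.0", "running"}});
  auto d2 = HandlePoll(rec, BorgletState{{"task.web.0", "running"}});
  std::cout << (d1.has_value() ? "diff sent" : "no diff") << ", "
            << (d2.has_value() ? "diff sent" : "no diff") << "\n";
  // prints: diff sent, no diff
}
```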

4 Scalability

We do not know where the ultimate scalability limit of Borg lies; every time we have approached a limit, we have crossed it. A single Borgmaster can manage many thousands of machines in a cell, and several cells handle more than 10,000 tasks per minute. A busy Borgmaster uses 10 to 14 CPU cores and 50 GB of memory. We used several techniques to achieve this scalability.

Early versions of the Borgmaster had a simple, synchronous loop that processed requests, scheduled tasks, and communicated with Borglets. To handle large cells, we split the scheduler into a separate process so that it could run in parallel with the other Borgmaster functions, which are replicated for fault tolerance. A scheduler replica operates on a cached copy of the cell state. It repeats the following: retrieve state changes from the elected master (covering both assigned and pending work); update its local copy; do a scheduling pass to assign pending tasks; and inform the elected master of those assignments. The master accepts and applies the assignments unless they are inappropriate (for example, based on out-of-date state), in which case they are reconsidered in the scheduler's next pass. All of this is in line with the optimistic concurrency strategy of Omega [69], and we recently added the ability for Borg to use different schedulers for different workloads.
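
A minimal sketch of that optimistic loop follows, with a hypothetical version-number scheme standing in for whatever consistency check Borg actually uses: the scheduler proposes assignments computed against its cached state, and the master rejects any proposal that has become stale, bouncing the task back to the pending queue.

```cpp
// Optimistic-concurrency sketch: the scheduler works from a cached copy
// of cell state; the master rejects proposals based on out-of-date state,
// and the rejected tasks are simply retried in the next pass.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Assignment {
  std::string task;
  int machine;
  int64_t based_on_version;  // cell-state version the scheduler saw
};

class Master {
 public:
  // Applies a proposal only if it was computed against the current state
  // of the target machine; otherwise the scheduler retries later.
  bool Propose(const Assignment& a) {
    if (a.based_on_version < machine_version_[a.machine])
      return false;                   // stale: reconsider next pass
    placements_[a.task] = a.machine;
    ++machine_version_[a.machine];    // the machine's state just changed
    return true;
  }
  int64_t MachineVersion(int machine) { return machine_version_[machine]; }

 private:
  std::map<std::string, int> placements_;
  std::map<int, int64_t> machine_version_;
};

// One scheduling pass over the pending queue, using versions snapshotted
// into the scheduler's cached copy at the start of the pass.
std::vector<std::string> SchedulePass(Master& master,
                                      std::vector<std::string> pending) {
  int machine = 0;  // placeholder: real code runs feasibility + scoring
  int64_t cached_version = master.MachineVersion(machine);
  std::vector<std::string> still_pending;
  for (const std::string& task : pending) {
    if (!master.Propose({task, machine, cached_version}))
      still_pending.push_back(task);  // stale; retry next pass
  }
  return still_pending;
}

int main() {
  Master master;
  std::vector<std::string> pending = {"web.0", "web.1"};
  auto leftover = SchedulePass(master, pending);
  std::cout << leftover.size() << " task(s) bounced to the next pass\n";
}
```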

To improve response times, we added separate threads to communicate with the Borglets and to respond to read-only RPCs. For higher performance, we sharded (partitioned) these functions across the five Borgmaster replicas (§3.3). Together, these keep the 99th-percentile UI response time below 1 s and the 95th-percentile Borglet polling interval below 10 s.

Several techniques make the Borg scheduler more scalable (a combined sketch follows the list):

Score caching: evaluating the feasibility and score of a machine is expensive, so Borg caches scores until the machine or the task changes; for example, a task on the machine terminates, an attribute is altered, or the task's requirements change. Ignoring small changes in resource quantities keeps cache entries valid for longer.

Equivalence classes: tasks in the same Borg job usually have identical requirements and constraints, so instead of having every pending task search for available machines, which would score all available machines n times, Borg checks feasibility and scores machines just once per equivalence class of identical tasks.

Relaxed randomization: it is wasteful to evaluate feasibility and scores for every machine in a large cell, so the scheduler examines machines in random order until it finds enough feasible machines to score, then picks the best of that set. This reduces the amount of scoring and cache invalidation as tasks enter and leave the system. Relaxed randomization is somewhat like the batch sampling technique of Sparrow [65], while additionally handling priorities, evictions, heterogeneous machines, and the cost of package installation.
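
The sketch promised above combines all three techniques: a score cache keyed by (machine, equivalence class) and invalidated by a version bump, feasibility and scoring done once per class of identical tasks, and a randomized scan that stops once enough feasible machines have been found. The types and the scoring criterion are simplified assumptions.

```cpp
// Combined sketch: score caching + equivalence classes + relaxed
// randomization. All types are simplified, hypothetical stand-ins.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <random>
#include <utility>
#include <vector>

struct Machine {
  int id;
  double cpu_free;
  int64_t version;  // bumped whenever the machine's properties change
};
struct TaskClass { int id; double cpu; };  // one per group of identical tasks

constexpr std::size_t kEnoughFeasible = 8;  // sample size before picking

class Scheduler {
 public:
  explicit Scheduler(std::vector<Machine> cell) : cell_(std::move(cell)) {}

  // Places one task of the given equivalence class. Every identical task
  // in the class reuses the same feasibility/scoring work via the cache.
  std::optional<int> Place(const TaskClass& tc) {
    std::vector<int> order(cell_.size());
    for (std::size_t i = 0; i < order.size(); ++i)
      order[i] = static_cast<int>(i);
    std::shuffle(order.begin(), order.end(), rng_);  // relaxed randomization

    std::vector<std::pair<double, int>> feasible;  // (score, machine id)
    for (int idx : order) {
      const Machine& m = cell_[idx];
      if (m.cpu_free < tc.cpu) continue;             // infeasible: skip
      feasible.emplace_back(Score(m, tc), m.id);
      if (feasible.size() >= kEnoughFeasible) break; // "enough" found: stop
    }
    if (feasible.empty()) return std::nullopt;
    return std::max_element(feasible.begin(), feasible.end())->second;
  }

 private:
  // Score cache: an entry stays valid until the machine's version changes
  // (e.g. a task on it ends or an attribute is altered).
  double Score(const Machine& m, const TaskClass& tc) {
    auto key = std::make_pair(m.id, tc.id);
    auto it = cache_.find(key);
    if (it != cache_.end() && it->second.first == m.version)
      return it->second.second;                      // cache hit
    double s = m.cpu_free - tc.cpu;  // placeholder scoring criterion
    cache_[key] = {m.version, s};
    return s;
  }

  std::vector<Machine> cell_;
  // (machine id, class id) -> (machine version at scoring time, score)
  std::map<std::pair<int, int>, std::pair<int64_t, double>> cache_;
  std::mt19937 rng_{42};
};

int main() {
  std::vector<Machine> cell;
  for (int i = 0; i < 100; ++i) cell.push_back({i, 4.0 + i % 8, 0});
  Scheduler sched(std::move(cell));
  TaskClass tc{1, 2.0};
  if (auto m = sched.Place(tc))
    std::cout << "placed task class 1 on machine " << *m << "\n";
}
```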

In our experiments (§5), scheduling a cell's entire workload from scratch took a few hundred seconds, but it did not finish after more than 3 days with the above techniques disabled. Normally, though, an online scheduling pass over the pending queue completes in less than half a second.

5 Availability

Failures are common in large distributed systems. Figure 3 shows the causes of task eviction in 15 cells. Applications that run on Borg need to be able to handle such events, using techniques such as replication, storing data in distributed storage, and taking snapshots regularly. Even so, we try our best to mitigate the impact of these events. For example, Borg:

automatically reschedules evicted tasks, on a new machine if necessary;

reduces correlated failures by spreading the tasks of a job across availability domains such as machines, racks, and power domains;

limits the rate at which tasks are disrupted and the number of tasks in a job that can be down at the same time during maintenance activities such as machine or OS upgrades;

uses declarative desired-state representations and idempotent state-changing operations, so that a failed client can harmlessly resubmit a request or safely forget it (see the sketch after this list);

rate-limits the rescheduling of tasks from unreachable machines, because it is hard to distinguish a large-scale machine failure from a network partition;

avoids repeating task-to-machine pairings that have caused task or machine crashes; and

repeatedly re-runs a logsaver task to recover critical intermediate data written to the local disk, even if the alloc the task was attached to is terminated or moved to another machine; users can set how long the system keeps retrying (a few days is common).
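Here is the sketch referenced in the list above, showing one way (a hypothetical API, not Borg's) to make mutations declarative and idempotent: each request carries a stable client-chosen id and states the desired end state, so a timed-out client can blindly resend without damage.

```cpp
// Sketch of idempotent, declarative mutations: applying the same request
// twice yields the same result, because the operation sets a desired
// state rather than encoding an increment or a toggle.
#include <iostream>
#include <map>
#include <string>

struct Request {
  std::string request_id;     // chosen by the client, stable across retries
  std::string job;
  std::string desired_state;  // e.g. "running" or "stopped"
};

class JobController {
 public:
  void Apply(const Request& r) {
    if (seen_.count(r.request_id)) return;  // duplicate resend: a no-op
    seen_.insert({r.request_id, true});
    job_state_[r.job] = r.desired_state;    // declarative: set, don't toggle
  }
  std::string State(const std::string& job) { return job_state_[job]; }

 private:
  std::map<std::string, bool> seen_;
  std::map<std::string, std::string> job_state_;
};

int main() {
  JobController ctl;
  Request r{"req-7", "websearch", "running"};
  ctl.Apply(r);
  ctl.Apply(r);  // client timed out and retried: harmless
  std::cout << ctl.State("websearch") << "\n";  // "running"
}
```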

A key Borg design feature is that already-running tasks continue to run even if the Borgmaster or a Borglet goes down. Keeping the master running is still important, though, because while it is down, new jobs cannot be submitted, existing jobs cannot be updated, and tasks on failed machines cannot be rescheduled.

In practice, the Borgmaster achieves 99.99% availability through a combination of techniques: replication to handle machine failures; admission control to handle overload; and deploying instances with simple, low-level tools to minimize external dependencies (translator's guess: something like rsync or scp). Each cell is independent of the others, which limits correlated operator errors and failure contagion; these goals, rather than scalability limits, are the main reason we do not build very large cells.

That concludes this article on the key knowledge points of the Borg architecture. Thank you for reading! I believe you will gain a lot from it. The editor updates different knowledge for you every day; if you want to learn more, please follow the industry information channel.
