2025-01-14 Update From: SLTechnology News&Howtos
Before talking about Task Manager itself, let me introduce a few concepts it relies on.

In the graph database Nebula Graph, some tasks run in the background for a long time; we call these Jobs. The operations a DBA issues at the storage layer, such as triggering a global compaction after data has been imported, all fall into the Job category.

As a distributed system, a Job in Nebula Graph is carried out by multiple storaged instances, and we call the Job subtask running on a storaged a Task. The Job Manager on metad is responsible for controlling Jobs, while the Task Manager on storaged is responsible for controlling Tasks.

In this article, we focus on how the time-consuming Tasks are managed and scheduled to further improve database performance.
The problems Task Manager solves

As mentioned above, the Tasks that Task Manager controls on storaged are subtasks of a Job controlled by metad. So what problems does Task Manager solve on its own? In Nebula Graph, it mainly solves the following two:
1. Replacing the previous HTTP transport with RPC (Thrift)

When building a cluster, ordinary users know that storaged instances communicate via the Thrift protocol and will open the firewall for the ports Thrift requires, but they may not realize that Nebula Graph also needs HTTP ports. We have run into many cases where community users forgot to open the HTTP ports.

2. Giving storaged its own Task scheduling capability
The latter is discussed in the following sections of this article.

The position of Task Manager in Nebula Graph
Meta in Task Manager system
In the Task Manager system, the job of metad (JobManager) is to select the appropriate storaged hosts for a Job Request received from graphd, assemble the corresponding Task Requests, and send them to those storaged instances. It is not hard to see that the routine of accepting a Job Request, assembling Task Requests, sending them out, and collecting the Task results is stable, while how a Task Request is assembled, and which storaged it is sent to, varies with the Job type. JobManager therefore uses a template strategy plus a simple factory to accommodate future extensions.
Future Jobs simply inherit from MetaJobExecutor and implement the prepare() and execute() methods.
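As a sketch of this extension point, the template strategy plus simple factory might look like the following. Only MetaJobExecutor, prepare() and execute() come from the article; the concrete job classes, the factory function, and the return strings are illustrative assumptions, not Nebula Graph's actual source.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Base class: the stable routine lives here, the variable parts are virtual.
class MetaJobExecutor {
 public:
  virtual ~MetaJobExecutor() = default;
  // The stable routine: prepare (pick storaged hosts, assemble Task
  // Requests), then execute (send the requests and collect results).
  std::string run() {
    prepare();
    return execute();
  }

 protected:
  virtual void prepare() = 0;
  virtual std::string execute() = 0;
};

class CompactionJobExecutor : public MetaJobExecutor {  // hypothetical
 protected:
  void prepare() override { /* choose storaged hosts for the compaction */ }
  std::string execute() override { return "compaction dispatched"; }
};

class RebuildIndexJobExecutor : public MetaJobExecutor {  // hypothetical
 protected:
  void prepare() override { /* choose hosts holding the index parts */ }
  std::string execute() override { return "rebuild-index dispatched"; }
};

// Simple factory: adding a new Job kind means one new subclass above plus
// one branch here.
std::unique_ptr<MetaJobExecutor> makeExecutor(const std::string& jobType) {
  if (jobType == "compaction") {
    return std::make_unique<CompactionJobExecutor>();
  }
  if (jobType == "rebuild_index") {
    return std::make_unique<RebuildIndexJobExecutor>();
  }
  return nullptr;
}
```

The point of the split is that run() never changes when a new Job type is added; only a subclass and a factory branch do.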
Scheduling Control of Task Manager
As mentioned earlier, Task Manager's scheduling control aims to do two things:

When system resources are sufficient, execute Tasks with as much concurrency as possible; when system resources are tight, keep the resources occupied by all running Tasks under a configurable threshold.

Highly concurrent execution of Tasks
Task Manager calls the threads it holds (its system resource) Workers. Task Manager has a real-life prototype: the business hall of a bank. Imagine going to a bank to do business; it involves the following steps:

Scene 1: take a number from the ticket machine at the door.
Scene 2: find a seat in the hall and play with your phone while waiting to be called.
Scene 3: when your number is called, go to the designated window.

Along the way, you may run into problems of one kind or another:

Scene 4: a VIP can jump the queue.
Scene 5: while queuing you may, for some reason, give up on the errand.
Scene 6: the bank may close while you are still waiting in line.

Sorted out, these are exactly the basic requirements of Task Manager:
Tasks are executed in FIFO order, and different Tasks have different priorities: a high-priority Task can jump the queue.
A user can cancel a queued Task.
storaged may shut down at any time, stopping the Tasks on it.
To execute with as much concurrency as possible, a Task is split into multiple SubTasks; a SubTask is the actual piece of work each Worker executes.
Task Manager is a single global instance, so multi-thread safety must be considered.
As a result, the implementation is as follows:
Implementation 1: a Task is identified by the JobId and TaskId in the Thrift structure; together they are called the Task Handle.

Implementation 2: TaskManager has a Blocking Queue that queues the Task Handles for execution (the ticket machine); the Blocking Queue itself is thread-safe.

Implementation 3: the Blocking Queue also supports priorities, dequeuing higher priorities first (the VIP queue-jumping feature).

Implementation 4: Task Manager maintains a globally unique Map whose key is the Task Handle and whose value is the Task itself (the bank's hall). Nebula Graph uses folly's Concurrent Hash Map, a thread-safe Map.

Implementation 5: when a user cancels a Task, it is looked up in the Map by its Handle and marked cancelled; the Handle still sitting in the queue is left untouched.

Implementation 6: if a Task is running, a shutdown of storaged does not return until the SubTask currently being executed by that Task has finished.

Limit the threshold of resources consumed by Tasks
Keeping under the threshold is easy: since every Worker is a thread and all Workers come from one thread pool, the maximum number of Workers is bounded. The tricky part is distributing the SubTasks evenly among the Workers. Let's discuss the options:
Method 1: use Round-robin to add tasks
The easiest way is to add tasks in a Round-robin fashion: after a Task is decomposed into Sub Tasks, the Sub Tasks are appended to each Worker in turn.
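A toy sketch of this round-robin appending, where the "task-subtask" labels and function name are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Append each sub task to the workers in turn: sub task i goes to
// worker i % numWorkers, so one worker's backlog can mix sub tasks
// belonging to different Tasks.
std::vector<std::vector<std::string>> roundRobin(
    const std::vector<std::string>& subTasks, std::size_t numWorkers) {
  std::vector<std::vector<std::string>> workers(numWorkers);
  for (std::size_t i = 0; i < subTasks.size(); ++i) {
    workers[i % numWorkers].push_back(subTasks[i]);
  }
  return workers;
}
```

With 3 workers and the sub tasks of Task 1 appended before those of Task 2, worker 0 ends up holding a Task 1 sub task in front of a Task 2 sub task, which is exactly how Task 2's completion time gets coupled to Task 1.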
But problems can arise. Suppose, for example, there are 3 Workers and 2 Tasks (Task 1 in blue, Task 2 in yellow):
Round-robin figure 1
If the Sub Tasks of Task 2 execute much faster than those of Task 1, a good parallel strategy would look like this:
Round-robin figure 2
Simple-and-crude Round-robin makes the completion time of Task 2 depend on Task 1 (see Round-robin figure 1).
Method 2: a dedicated group of Workers per Task
To address the situation that can occur with method 1, dedicate specific Workers to a single Task, so that multiple Tasks no longer depend on one another. But this is still not good enough. For example:
It is hard to guarantee that every Sub Task takes roughly the same time. Suppose Sub Task 1 runs significantly slower than the other Sub Tasks; then a good execution strategy would look like this:
This scheme still cannot avoid the "one core struggles while the other cores look on" problem: one Worker stays overloaded while the rest sit idle.
Method 3: the solution adopted by Nebula Graph
In Nebula Graph, Task Manager hands the Task's Handle to N Workers, where N is determined by the total number of Workers, the total number of Sub Tasks, and the concurrency parameter the DBA specified when submitting the Job.
Each Task maintains its own internal Blocking Queue (the Sub Task Queue in the figure below) that stores its Sub Tasks. When a Worker runs, it first locates the Task by the Handle it holds, then fetches Sub Tasks from that Task's Blocking Queue.
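A minimal threaded sketch of this scheme, under the assumption that std::thread and a mutex-guarded std::queue stand in for Nebula Graph's actual Worker pool and blocking queue; all names here are illustrative:

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One Task owns a queue of its Sub Tasks. N workers holding the same Task
// handle compete to pop from that queue, so a faster worker simply ends up
// processing more Sub Tasks instead of sitting idle.
struct TaskState {
  std::mutex mu;
  std::queue<int> subTasks;  // the per-Task "Sub Task Queue"
  int done = 0;              // guarded by mu
};

void workerLoop(TaskState& task) {
  for (;;) {
    int sub = -1;
    {
      std::lock_guard<std::mutex> g(task.mu);
      if (task.subTasks.empty()) return;  // this Task is drained
      sub = task.subTasks.front();
      task.subTasks.pop();
    }
    (void)sub;  // ... the real system would execute sub task `sub` here ...
    {
      std::lock_guard<std::mutex> g(task.mu);
      ++task.done;
    }
  }
}

// Run one Task of `numSubTasks` sub tasks on `numWorkers` workers and
// return how many sub tasks completed.
int runTask(int numSubTasks, int numWorkers) {
  TaskState task;
  for (int i = 0; i < numSubTasks; ++i) task.subTasks.push(i);
  std::vector<std::thread> workers;
  for (int i = 0; i < numWorkers; ++i) {
    workers.emplace_back(workerLoop, std::ref(task));
  }
  for (auto& w : workers) w.join();
  return task.done;
}
```

Because the workers pull from the shared queue rather than being assigned fixed slices up front, uneven Sub Task durations balance out automatically.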
Supplementary instructions for design
Question 1: why not put the Tasks themselves in the Blocking Queue? Why split the design in two, keeping the Tasks in a Map while only the Task Handles queue?
The main reason is that C++'s multithreading infrastructure does not support that logic well. A Task needs to support cancellation; if Tasks were placed directly in the Blocking Queue, the Blocking Queue would have to support locating a particular Task inside it, and currently none of the Blocking Queues in folly offers such an interface.
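The handle-queue-plus-map arrangement from implementations 1-5 can be sketched with the C++ standard library; a mutex-guarded std::map and std::priority_queue stand in for folly's concurrent structures, and everything except TaskHandle, JobId and TaskId is an illustrative name:

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <tuple>

struct TaskHandle {
  int jobId;
  int taskId;
  bool operator<(const TaskHandle& rhs) const {
    return std::tie(jobId, taskId) < std::tie(rhs.jobId, rhs.taskId);
  }
};

struct Task {
  std::string kind;
  bool cancelled = false;  // implementation 5: cancel by marking the Task
};

class TaskManager {
 public:
  void addTask(const TaskHandle& h, Task t, int priority) {
    std::lock_guard<std::mutex> g(mu_);
    tasks_[h] = std::move(t);            // implementation 4: the "hall"
    queue_.push({priority, seq_++, h});  // implementations 2/3: ticket machine
  }

  void cancelTask(const TaskHandle& h) {  // implementation 5
    std::lock_guard<std::mutex> g(mu_);
    auto it = tasks_.find(h);
    if (it != tasks_.end()) it->second.cancelled = true;
    // the Handle left in the queue is simply skipped when popped
  }

  // Pop the next runnable Task; returns false if nothing is queued.
  bool next(Task& out) {
    std::lock_guard<std::mutex> g(mu_);
    while (!queue_.empty()) {
      TaskHandle h = queue_.top().handle;
      queue_.pop();
      auto it = tasks_.find(h);
      if (it != tasks_.end() && !it->second.cancelled) {
        out = it->second;
        return true;
      }
    }
    return false;
  }

 private:
  struct Entry {
    int priority;
    long seq;  // preserves FIFO order among equal priorities
    TaskHandle handle;
    bool operator<(const Entry& rhs) const {
      // higher priority first; among equals, earlier (smaller seq) first
      if (priority != rhs.priority) return priority < rhs.priority;
      return seq > rhs.seq;
    }
  };
  std::mutex mu_;
  long seq_ = 0;
  std::priority_queue<Entry> queue_;
  std::map<TaskHandle, Task> tasks_;
};
```

Cancellation never has to dig into the queue: the Map is the source of truth, and stale Handles are discarded as they surface.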
Question 2: what kind of Job gets the VIP treatment?
The compaction / rebuild index Jobs currently supported by Task Manager are not sensitive to execution time, while support for query-style operations such as count(*) is still under development. Since users want count(*) to finish in a relatively short time, if storaged happens to be running multiple compactions, you still want count(*) to run first rather than wait for all the compactions to finish.
If there are mistakes or omissions in this article, please file an issue on GitHub: https://github.com/vesoft-inc/nebula, or post in the suggestion-feedback category of the official forum: https://discuss.nebula-graph.com.cn/. To join the NebulaGraph community chat, contact NebulaGraph's official assistant on WeChat: NebulaGraphbot.

A word from the author: Hi, I'm lionel.liu, an R&D engineer on the graph database Nebula Graph, with a strong interest in database query engines. I hope this write-up is helpful, and if anything here is off, I'd appreciate your corrections. Thank you.