Construction practice of Task scheduling platform for Yixin Micro Service | sharing record

Updated 2025-07-12. From: SLTechnology News&Howtos (Shulou.com)

Content source: the 4th Technical Salon of the Yixin Institute of Technology (online live broadcast) | Construction practice of the task scheduling platform for Yixin micro-services

Speaker: Liang Xin, Senior architect & head of Development platform

Guide: Internet applications and enterprise applications alike are full of batch tasks, and we often need a task scheduling system to manage them. As architectures have evolved, monolithic systems have gradually given way to distributed and micro-service architectures.

In this context, many earlier task scheduling platforms can no longer meet the needs of business systems, and a number of distributed task scheduling platforms have appeared. These platforms have their own strengths, but also their own shortcomings: they do not support task orchestration, are tightly coupled with business code, or do not work across platforms. None of them met our company's needs well, so we developed a micro-service task scheduling platform (SIA-TASK). This talk focuses on the SIA platform: its research and development background, design ideas and technical architecture, and how it supports the business side.

I. Background of SIA-TASK

Internet applications and enterprise applications alike are full of batch tasks, and we often need a task scheduling system to manage them. As architectures have evolved, monolithic systems have gradually given way to distributed and micro-service architectures.

In this context, many earlier task scheduling platforms and components can no longer meet the needs of business systems, and a number of distributed task scheduling platforms have appeared. These platforms have their own strengths, but also their own shortcomings: they do not support task orchestration, are tightly coupled with business code, do not work across platforms, and so on.

1.2 Types

According to the relationship between tasks and time, we divide batch tasks into three categories: airplane, subway and bus.

Airplane type: a task performed at a fixed time every year / month / week / day. This kind of task is very common in business systems: running a batch task at 1:00 every day to clean up the previous day's logs, or paying all company staff on the 10th of each month, are both airplane tasks.

Subway type: a task executed at regular intervals, without concurrency. We often meet such batch tasks: until the first run has finished, the second run cannot start.

Bus type: a task executed at regular intervals, with concurrency allowed. Even if the previous run has not finished, the next run starts on time.

1.3 Problems

The following problems come up when running batch tasks:

Forgotten scheduled tasks that are still running. Our company had such a case. One winter several years ago, one of our project teams spent three months building a project. After it had run for a while, the results turned out to be unsatisfactory, so we stopped all the related programs, but forgot one node that was still running batch tasks. Two years later, the logs generated by that node filled the disk and triggered a monitoring alarm, and only then did we find out.

Single point of failure. There is no hot standby: the batch task runs on a single node, and if something goes wrong it has to be handled manually.

Dependencies handled by time offsets, which repeatedly cause data problems. Tasks in a project sometimes depend on each other. For example, batch process A must run before batch process B. The project team sets A to run at 2:00 a.m. and B to run at 4:00 a.m., enforcing the order through timing alone. If A takes longer than two hours, B starts before A has finished, which leads to data problems that have to be fixed manually.

1.4 Relationships

As mentioned above, tasks are related to one another. What kinds of relationship are there? I think there are three main types:

Serial: the two tasks run in sequence. Task A runs first, and task B runs after task A completes.

Parallel: the tasks can run concurrently. For example, if tasks B and C both follow task A, then once task A completes, B and C can run at the same time; B and C are parallel.

Branch: the follow-up task is chosen according to the result returned by the previous task. For example, task A runs when 0 is returned and task B runs when 1 is returned.

1.5 Considerations

Given these relationships, we considered two aspects when building the task scheduling platform:

Platform. Project teams want to spend their energy on business development and to hand everything unrelated to it over to the architecture team as far as possible. They want a platform that runs tasks: they put the business logic they have written onto the platform, the platform does the rest, and the project team only needs to care about business logic.

Micro-services. To better meet project needs, we want to separate a task's business logic from its scheduling, build the task scheduling platform on a registration and discovery mechanism, and leave the business-related parts to the project team and the rest to the task platform.

1.6 Factors

Beyond these two considerations, we also had to take the following eight factors into account.

Task orchestration. Scheduled tasks in different businesses must follow a process order. As mentioned earlier, tasks can be related in parallel, serially or by branching, and we want the platform to provide orchestration functions that handle these relationships.

Task slicing. A large task needs to be split into pieces that execute in parallel.

Cross-platform. Besides projects on the Java technology stack (Spring Boot, Spring, etc.), the platform must also support applications written in other languages.

Non-intrusive. Business code should not be tightly coupled with scheduling and should only care about its own execution logic. The platform should not intrude on business code, keeping its impact to a minimum.

High availability / failover. The scheduling system itself must be highly available, with no single point of failure, and must provide compensation measures for problems during task execution so that they are handled smoothly with less manual intervention.

Visualization. Task scheduling operations are provided through visual pages that are easy to use.

Real-time monitoring. The platform should monitor tasks in real time and report their execution status.

Dynamic editing. A business may change a task's clock parameters. Building on visualization, every operation performed on a task takes effect in the business system in real time, without stopping and redeploying.

Based on this background and these considerations, we built the micro-service task scheduling platform SIA-TASK.

II. Core design ideas of SIA-TASK

SIA is the abbreviation of "Simple is Awesome".

SIA-TASK (micro-service task scheduling platform) is one of the important SIA products. SIA-TASK fits the current micro-service architecture model and is cross-platform, orchestrable, highly available, non-intrusive, consistent, asynchronous and parallel, dynamically scalable, and monitored in real time.

SIA-TASK is an integrated solution for task scheduling: it collects task metadata, lets tasks be orchestrated visually, schedules them, and monitors the whole process. It is easy to use and does not intrude on business code, and the expected scheduling model can be produced through simple, flexible configuration.

Drawing on micro-service design ideas, SIA-TASK collects the task metadata distributed across the task executors and uploads it to the task registry. With online orchestration, the task clock can be modified dynamically. HTTP is used as the task scheduling protocol with JSON as the uniform data format; the scheduling center parses the clock, executes the task flow, and notifies the tasks.

2.2 Terminology

Here is a brief introduction to SIA-TASK terminology.

Task: the basic execution unit, an HTTP interface exposed by the executor.

Job: one or more Tasks that are logically related (serial / parallel); the smallest unit scheduled by the task scheduling center.

Plan: several Jobs executed in sequence. Each Job has its own execution cycle; the Plan itself has no execution cycle.

Task scheduling center (Scheduler): schedules according to the execution cycle of each Job, that is, issues HTTP requests following the logic of Plans, Jobs and Tasks. It is a separate node.

Task orchestration center (Config): uses Tasks to create Plans and Jobs.

Task executor (Executer): receives HTTP requests and executes the business logic.

Hunter: a Spring extension package responsible for capturing the Tasks in an executor and uploading them to the registry. Business code can depend on this component to write Tasks.

The relationship among Job, Task and Plan

A Task is the basic unit of business execution: an HTTP interface exposed by the executor. Several Tasks form a Job, and a Plan is composed of several Jobs executed in sequence.

Why is a Plan needed? Sometimes two tasks are not only ordered (task A runs before task B) but must also meet time requirements: for example, task A runs at 10:00 and task B at 14:00, and task A must start on time at 10:00.

Take a football match broadcast live at 8 o'clock tonight. If I cannot get home by 8 p.m., I will miss the live broadcast. If I leave work early and get home shortly after 6 p.m., I still have to wait until 8 o'clock for the match to start. This is where the Plan concept comes from.
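The relationship among Task, Job and Plan can be pictured with a small data model. This is only an illustrative sketch in Python, not SIA-TASK's actual classes: the field names, the URLs and the Quartz-style six-field Cron strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """Basic execution unit: one HTTP endpoint exposed by an executor."""
    name: str
    url: str  # hypothetical endpoint, for illustration only


@dataclass
class Job:
    """One or more related Tasks; the smallest unit the scheduler dispatches."""
    key: str
    tasks: List[Task]
    cron: Optional[str] = None  # each Job carries its own execution cycle


@dataclass
class Plan:
    """Jobs executed in order; the Plan itself has no execution cycle."""
    name: str
    jobs: List[Job] = field(default_factory=list)

    def order(self) -> List[str]:
        # Jobs run in list order, each still waiting for its own clock.
        return [job.key for job in self.jobs]


job_a = Job("job-a", [Task("export", "http://10.0.0.1:8080/export")], cron="0 0 10 * * ?")
job_b = Job("job-b", [Task("report", "http://10.0.0.1:8080/report")], cron="0 0 14 * * ?")
plan = Plan("daily-plan", [job_a, job_b])
print(plan.order())
```

The Plan fixes the order (job-a before job-b), while each Job keeps its own clock, which matches the 10:00 / 14:00 example above.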

2.3 Composition

The SIA-TASK task scheduling platform consists of the following parts:

Task executor: where your business code lives; it belongs to the project team.

Task registry: we use ZooKeeper.

Task orchestration center, with persistent storage: we use MySQL.

Task scheduling center.

2.4 Running logic

Next, the running logic of SIA-TASK is described in detail.

First, the tasks in the task executor are annotated so that they are reported to the task registry. We provide an annotation called @OnlineTask: as long as you put it on a controller method, the HTTP interface is captured automatically when the executor starts and reported to the task registry, for which we use ZooKeeper.

The task orchestration center obtains data from the task registry, orchestrates it, and saves it to persistent storage. In other words, the executor captures the URL, port and other details of the HTTP API that the business exposes and uploads them to ZooKeeper; the orchestration center then reads the tasks from ZooKeeper and saves the information about each task in MySQL.

Here we need to distinguish between a task and a task instance. Their relationship is a bit like that between a class and an object. The same business logic code may be deployed on multiple nodes; when tasks are captured at run time, the code on every node is captured, and for the business it is all one task. Each IP address and port, however, corresponds to a separate task instance. This happens, for example, with hot standby for high availability. After processing, we save the information about the task itself to persistent storage, while the instance information stays only in ZooKeeper.
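The task / instance distinction can be sketched as a simple grouping step. The record layout `(ip, port, path)` and the choice of the request path as the task's identity are assumptions made for this example; the talk does not describe the real registry format.

```python
from collections import defaultdict

# Hypothetical records as an executor might report them: (ip, port, path).
reported = [
    ("10.0.0.1", 8080, "/clean-logs"),
    ("10.0.0.2", 8080, "/clean-logs"),
    ("10.0.0.1", 8080, "/pay-salaries"),
]


def group_instances(records):
    """Map each task (identified here by its path) to its instances (ip:port)."""
    tasks = defaultdict(list)
    for ip, port, path in records:
        tasks[path].append(f"{ip}:{port}")
    return dict(tasks)


tasks = group_instances(reported)
print(tasks)
```

In this sketch the keys play the role of tasks (what would be persisted), and the lists play the role of instances (what would stay in the registry).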

The task orchestration center configures tasks using the information in ZooKeeper and MySQL: it adds a clock and a strategy to the captured Tasks, arranges them into Jobs and Plans, and saves the result to MySQL.

The task scheduling center reads the scheduling information from persistent storage, so it knows the scheduling logic: Jobs, Plans, clocks, policies and so on. Following this logic, the scheduling center calls the task executors and schedules the Tasks that were captured from them.

This is the running logic of SIA-TASK. The scheduling logs are stored in Kafka.

2.5 Features

1) Automatic task capture based on annotations

Add the @OnlineTask annotation to a method exposed as an HTTP service. @OnlineTask automatically captures the method's IP address, port, request path, request method, request parameter format and other information, uploads it to the task registry (ZooKeeper), and synchronously writes the task information to persistent storage.
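The idea of capturing task metadata from an annotation can be illustrated with a Python decorator. This is a loose analogue, not the @OnlineTask implementation: `online_task`, `REGISTRY`, the default host and port, and the metadata fields are all hypothetical names chosen for the sketch, and an in-memory dictionary stands in for the registry.

```python
import inspect

REGISTRY = {}  # stands in for the task registry (ZooKeeper in the talk)


def online_task(path, method="POST", host="127.0.0.1", port=8080):
    """Hypothetical analogue of an annotation that records task metadata."""
    def decorator(func):
        REGISTRY[path] = {
            "name": func.__name__,
            "address": f"{host}:{port}",
            "path": path,
            "method": method,
            "params": list(inspect.signature(func).parameters),
        }
        return func  # the business function itself is left unchanged
    return decorator


@online_task("/clean-logs")
def clean_logs(day):
    return f"cleaned logs for {day}"


print(REGISTRY["/clean-logs"])
```

The business function is returned untouched, which is the non-intrusive property the text describes: registration happens as a side effect of the marker, not through extra code in the method body.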

2) Non-intrusive multithread control based on annotations

Each task instance must run in a single thread. The task scheduling framework automatically intercepts methods annotated with @OnlineTask to enforce this, so that a task that is still running is not scheduled again. The whole control process is transparent to the developer.

In other words, on a given task instance the task is guaranteed to run single-threaded. The user decides whether this applies: a task that must be single-threaded can be controlled here, and one that may run multithreaded can be left uncontrolled. No additional code is needed; it is configured on the annotation.
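One common way to get this kind of guard is a non-blocking lock wrapped around the task. The sketch below shows that technique in Python; it is an assumption about how such control could work, not a description of SIA-TASK's interceptor, and the `"REJECTED"` return value is a made-up status.

```python
import functools
import threading


def single_run(func):
    """Reject a call while a previous call of the same task is still running."""
    lock = threading.Lock()

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not lock.acquire(blocking=False):
            return "REJECTED"  # hypothetical status: task already running
        try:
            return func(*args, **kwargs)
        finally:
            lock.release()
    return wrapper


started = threading.Event()
release = threading.Event()


@single_run
def slow_task():
    started.set()
    release.wait(timeout=5)
    return "DONE"


results = []
worker = threading.Thread(target=lambda: results.append(slow_task()))
worker.start()
started.wait(timeout=5)        # the first call is now inside the task
results.append(slow_task())    # a second call arrives while the first runs
release.set()
worker.join()
print(results)
```

The second call returns immediately instead of queuing behind the first, so an overlapping schedule never produces two concurrent runs of the same task on one instance.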

3) Highly flexible task orchestration

The design idea of SIA-TASK is to treat Tasks as atoms and combine several Tasks into a Job according to their execution relationships. The runtime is split into a task orchestration center and a task scheduling center, so that Job orchestration and Job scheduling are separated and do not affect each other. When we need to adjust the flow of a Job, we only need to change it in the orchestration center. The orchestration center lets Tasks be related serially, in parallel, or by branching, and different instances of the same Task can be dispatched in several ways. All of this orchestration is done on a web page, which makes it very easy to use and is one of the highlights of the SIA-TASK platform.
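Serial and parallel relationships inside a Job amount to a dependency graph. As a rough sketch, Python's standard `graphlib.TopologicalSorter` can turn such a graph into "waves" of Tasks, where the Tasks in one wave may run in parallel. The task names and the graph are hypothetical, branching is not covered, and this is not how SIA-TASK itself stores or executes a Job.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key lists the tasks that must finish before it can start.
# A -> (B and C in parallel) -> D; task names are hypothetical.
depends_on = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def execution_waves(graph):
    """Return groups of tasks; tasks in one group may run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(depends_on)
print(waves)
```

Here A runs alone, B and C form a parallel wave once A is done, and D waits for both.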

4) Scheduler adaptive task allocation

When a failure or exception occurs during task execution, the task can be re-triggered on multiple nodes according to the strategy configured for it, so that execution is not interrupted. We have defined many strategies. What should happen when a Task goes wrong: trigger it again, ignore it, or raise an alarm for manual intervention? Each of these cases is covered by a configurable strategy.
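A minimal sketch of such failure handling is shown below, using the four policy names the platform lists in section 3.5 (STOP, IGNORE, TRANSFER, MULTI_CALLS_TRANSFER). The function name, the return values `"OK"` / `"SKIPPED"` / `"STOPPED"`, and the default of three retries are assumptions for illustration only.

```python
def call_with_policy(call, instances, policy, retries=3):
    """Run call(instance) and apply a failure policy.

    Returns "OK", "SKIPPED" (IGNORE) or "STOPPED" (the Job should stop).
    """
    def try_once(instance):
        try:
            call(instance)
            return True
        except Exception:
            return False

    first, others = instances[0], instances[1:]
    if try_once(first):
        return "OK"
    if policy == "IGNORE":
        return "SKIPPED"  # skip this Task, later Tasks continue
    if policy == "MULTI_CALLS_TRANSFER":
        # Retry the same instance several times before transferring.
        if any(try_once(first) for _ in range(retries)):
            return "OK"
        policy = "TRANSFER"
    if policy == "TRANSFER":
        # Try the other instances of the same Task.
        if any(try_once(other) for other in others):
            return "OK"
    return "STOPPED"  # STOP, or every fallback failed


def flaky(instance):
    # Simulated executor call: one instance is permanently down.
    if instance == "10.0.0.1:8080":
        raise RuntimeError("instance down")


instances = ["10.0.0.1:8080", "10.0.0.2:8080"]
outcomes = {
    p: call_with_policy(flaky, instances, p)
    for p in ("STOP", "IGNORE", "TRANSFER", "MULTI_CALLS_TRANSFER")
}
print(outcomes)
```

The fallbacks chain in the order the policies describe: repeated calls fall back to transfer, and transfer falls back to stop.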

2.6 Key points

Now that we understand the features of the platform, let's go through the key technical points of SIA-TASK.

Task flow. The flow relationships between tasks are configurable and form a directed acyclic graph (DAG). A task flow can be started by a timer (Cron expression) or by an external request (an API address is provided) and then executes according to the DAG logic.

Metadata management. The metadata of each task in the micro-services is managed centrally, with capture and entry kept in sync.

Intelligent operation and maintenance. Visual real-time task monitoring, with everything visible on the page; a real-time early-warning mechanism that e-mails or texts the relevant people when something goes wrong; and semi-intelligent self-repair, which probes and retries without human intervention.

Resource isolation. Resources are isolated between processes and within processes, to improve system throughput and stability. The clock uses Core Schedule. Within a scheduling center, each project team gets its own Core Schedule, so even when several project teams are scheduled on the same scheduler they are isolated from one another. If something goes wrong with one project team, the others are not affected; this is load balancing with isolation.

Load balancing. The Jobs handled by the scheduling center have different execution times: some take a long time, some a short time. The schedulers' resources also differ: some have more CPU, some less. So how do we keep scheduling load balanced, and how do we balance load under resource isolation? We take into account the historical values of each kind of task (task duration) and the performance of the machine itself, so that every scheduler carries roughly the same amount of scheduling work and resource consumption. This is a different kind of load balancing, not simple traffic balancing.

III. SIA-TASK modules

3.1 Home page

The home page of task scheduling management has three main parts: scheduler information, number of dispatches, and connected project details.

Scheduler information: the number of schedulers in the scheduling center.

Number of dispatches: the total number of Job dispatches performed by the scheduling center.

Connected project details: the total number of project teams and Jobs connected to the scheduling center.

At present, 51 projects are connected to the SIA-TASK platform, with more than 600 Jobs running on it. Since this year's version went live, Jobs have been run more than 30 million times.

Each scheduler exposes three metrics:

Job upper limit: the dynamic threshold of Jobs the scheduler can load.

Running Job count: the number of Jobs the scheduler is currently running.

Job early-warning value: when the number of Jobs run by the scheduler exceeds this value, the administrator is notified by e-mail.

3.2 Scheduler management

The scheduler view shows several pieces of information. As shown in the figure, clicking on a scheduler (a bar in the bar chart) displays a list of the Jobs preempted by that scheduler, with the following details:

JobKey: the configured Job name; each Job has its own name.

Type: the scheduled task type configured for the Job, either Cron or fixRate.

Job type value: for Cron, the six-field Cron expression; for fixRate, the interval between runs.

Alert mailbox: the alert mailbox configured for this Job.

Description: the Job's function, so that the administrator can quickly understand the Jobs preempted by a scheduler.
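The two Job types carry differently shaped values, which a small check can make concrete. This is a shape check only, not a Cron parser. The field order (second, minute, hour, day-of-month, month, day-of-week) follows the common Quartz-style six-field layout and is an assumption, as is treating the fixRate value as a number of seconds; the talk specifies neither.

```python
def describe_schedule(job_type, value):
    """Check a Job's type value; a shape check only, not a Cron parser."""
    if job_type == "Cron":
        fields = value.split()
        if len(fields) != 6:
            raise ValueError(f"expected 6 Cron fields, got {len(fields)}")
        # Assumed field order (Quartz-style); not confirmed by the talk.
        names = ("second", "minute", "hour", "day-of-month", "month", "day-of-week")
        return dict(zip(names, fields))
    if job_type == "fixRate":
        seconds = int(value)  # assumed unit: seconds
        if seconds <= 0:
            raise ValueError("interval must be positive")
        return {"every_seconds": seconds}
    raise ValueError(f"unknown job type: {job_type}")


print(describe_schedule("Cron", "0 0 1 * * ?"))   # e.g. the daily 1:00 clean-up
print(describe_schedule("fixRate", "30"))
```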

Scheduler management covers working schedulers, offline schedulers, disconnected schedulers and the whitelist.

Working scheduler: a scheduler that can preempt and schedule Jobs. If it is taken offline, it immediately loses the ability to preempt Jobs; Jobs it has already preempted are released automatically after they finish executing and are then preempted by other schedulers. After being taken offline, the scheduler moves to the offline scheduler list. The working scheduler list provides offline and batch offline operations. Put simply, a working scheduler is one that is currently working.

Offline scheduler: the scheduler process is still alive but has lost the ability to preempt Jobs and take part in scheduling; Jobs it already holds still run to completion. Bringing such a scheduler online moves it to the working scheduler list, where it can again preempt and schedule Jobs. The offline scheduler list provides online and batch online operations.

Disconnected scheduler: the scheduler process is no longer alive. When the process of an offline scheduler dies, it automatically enters the disconnected scheduler list; when the process is restarted, it automatically returns to the offline scheduler list. The disconnected scheduler list also provides delete and batch delete operations. Generally, a disconnected scheduler has a problem: either the process is down or the network is down.

Whitelist: once an IP is added to the whitelist, it has permission to call all executor instances. The whitelist provides batch deletion, and the permission is lost automatically when the IP is deleted.

3.3 Scheduling monitoring

The scheduling monitoring page of SIA-TASK is divided into areas, one per project group. At present, 51 projects are connected to SIA-TASK; more than 500 Jobs are in the ready state and 25 are running.

Some Jobs execute very quickly and finish in a few seconds, while others are slow and take a long time. When we sample Job states, we can only catch the long-running ones, which are shown as running. Short Jobs finish between samples, so although they are being executed they are not caught, and they are shown as ready.

Some Jobs do not need to run for a period of time and can be stopped manually. The remainder are Jobs that stopped abnormally, for which an e-mail alarm is sent.

We also provide search, and different project groups can log in to query the running status of their own projects.

3.4 Task Management

In the Task management interface, Tasks are displayed grouped by project group, and the interface provides Task configuration, modification and deletion. Tasks come from two sources. Some are captured automatically through standard annotations by the sia-Task-hunter component; these Tasks cannot be modified. The others are added manually by users who know the HTTP URL to call; these Tasks work across platforms and can be modified and deleted.

Each Task entry shows the project name, application name, task name, machine address and description, along with operations such as view / modify / connectivity test. Entries with the same Task name but different machine addresses are different instances of the same task.

3.5 Job Management

As mentioned earlier, a Job is composed of several Tasks. Each column in the figure shows a project name. Click the drop-down list to display all the projects; from there you can filter, add Jobs, view status, and so on.

A Job's status can be changed manually: it can be stopped or activated. A newly configured Job is inactive and must be activated before it runs. You can also modify a Job's information, configure it, and so on.

How do I add a Job? If I want to add a Job of the Cron expression type, what do I need to provide?

Because the Job is of the Cron expression type, I first enter the six-field Cron expression. I also add an alert mailbox and a description of the Job. Every Job has a key, so finally I add the Job_key. With that, the new Job is added.

After a Job is added, its Task information has to be configured, which is a more complex process. A Job consists of several Tasks, and we can drag and drop them to set their order according to the relationships between them. Different colors can be used to represent different projects. Only administrators have permission to see all projects; the person in charge of each project can see only the status of their own project.

Tasks are uploaded with parameters, so parameter handling is also involved: parameter type, parameter value, expiration time and so on. Let's focus on the expiration time.

Calling over HTTP raises a problem: we do not know when the Task will finish. To solve it, each Task is given an expiration time. As soon as that time is up, handling switches to another strategy, such as abandoning the call or handing it over for manual processing, because an asynchronous caller cannot wait endlessly for the other side to return a result.

Of course, it can also happen that the result I got was a timeout, while the task actually executed correctly and returned its result a little later. We designed a queue-based compensation mechanism for this case, but it does not seem to add much value. This is only a possibility; it has not happened on the production platform so far.
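The expiration idea can be sketched with a bounded wait on a future. This uses Python's `concurrent.futures` as a stand-in for the asynchronous HTTP call; the function name, the `"ABANDON"` fallback label and the timings are assumptions for illustration, not SIA-TASK behavior.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_expiration(task, expire_seconds, on_expire="ABANDON"):
    """Wait at most expire_seconds for task(); otherwise apply a fallback."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(task)
    try:
        return ("OK", future.result(timeout=expire_seconds))
    except FutureTimeout:
        return ("EXPIRED", on_expire)  # hand over to the fallback strategy
    finally:
        pool.shutdown(wait=False)  # do not block on the late task


def quick():
    return 42


def slow():
    time.sleep(0.5)  # finishes correctly, but after the expiration time
    return "late result"


print(call_with_expiration(quick, expire_seconds=1.0))
print(call_with_expiration(slow, expire_seconds=0.1))
```

The `slow` case is exactly the situation described above: the caller sees an expiration, yet the task still completes a little later in the background.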

Currently, the platform offers two strategies for selecting a Task instance:

Random: an instance (that is, an IP + port) is chosen at random from the optional list.

Fixed IP: the instance is specified manually from the optional list.
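The two selection strategies are simple enough to sketch in a few lines. The strategy labels `"RANDOM"` and `"FIXED_IP"` and the function signature are hypothetical names for this example.

```python
import random


def pick_instance(instances, strategy="RANDOM", fixed=None, rng=random):
    """Choose the ip:port that will receive the HTTP call."""
    if not instances:
        raise ValueError("no instance available")
    if strategy == "RANDOM":
        return rng.choice(instances)
    if strategy == "FIXED_IP":
        if fixed not in instances:
            raise ValueError(f"{fixed} is not in the optional list")
        return fixed
    raise ValueError(f"unknown strategy: {strategy}")


instances = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
print(pick_instance(instances))
print(pick_instance(instances, "FIXED_IP", fixed="10.0.0.2:8080"))
```

Passing `rng` explicitly (for example `random.Random(0)`) makes the random choice reproducible in tests.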

The platform supports four strategies for handling a failed Task call:

STOP (stop policy): if the call fails, the entire Job stops and no subsequent Task is executed.

IGNORE (ignore policy): if the call fails, the Task is skipped and the subsequent Tasks continue to execute.

TRANSFER (transfer policy): if the call fails, another instance of the Task is selected for execution. If it still fails, the stop policy is used.

MULTI_CALLS_TRANSFER (multiple calls, then transfer): the Task is called repeatedly. If it still fails, the transfer policy is used.

3.6 Scheduling log

Log management provides the running logs of Jobs, grouped by project group. The key elements of a Job log are:

Execution status: the result of the Job execution.

Execution time: the time at which the scheduler dispatched the Job.

Execution completion time: the time at which the Job execution completed.

Scheduling information: the scheduler instance that executed the Job.

Execution information: the details of the Job execution. Each Job log is linked to the execution logs of the Tasks it references. Logs are kept for seven days by default.

IV. Open source

As an important product of the SIA team, SIA-TASK is connected to dozens of projects and runs hundreds of Jobs inside the company, and it has stood the test of stability.

The SIA-TASK micro-service task scheduling platform has been open source since May. Open source address: https://github.com/siaorg/sia-Task. Interested readers can visit it for a detailed introduction.

Shared by: Liang Xin

Source: Yixin Institute of Technology
