How to solve the problem of SLA Governance

2026-02-14 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how to solve SLA governance problems.

I. Background

SLA (Service Level Agreement): for Internet companies, an agreement that guarantees the availability of the services on a website. A data SLA is a guarantee of data availability, and the data production time is generally used as the SLA.

When developing tasks over massive data, diverse business needs, large data volumes, and complex data tasks make task dependency chains (links) long and complex, with many nodes that depend on other teams. As a result, during development and operation, a task owner who wants to guarantee the timely output of their own data runs into the following difficulties:

High communication costs: task owners try to agree on SLAs with upstream task owners, but there can be thousands of upstream tasks spanning multiple teams, so communication is expensive.

Unclear rights and responsibilities: with such complex links, how should an SLA be established, and who is responsible for guaranteeing it?

High O&M pressure: upstream delays cannot be discovered in time, so downstream task owners bear most of the O&M pressure, and the results are poor: by the time a delay is found, the window for remediation has often passed.

To solve these problems, the ByteDance data platform uses its self-developed SLA assurance platform to standardize and promote task link governance across business teams. This effectively guarantees data SLAs, and the data SLA compliance rate has reached 99.1%.

The following figure shows the relationship between task completion times and the corresponding SLAs for an ideal set of tasks: each task and its upstream tasks complete before their SLAs. This is also the governance goal of the platform.

II. Application scenarios

In addition to solving the difficulties mentioned above, the SLA assurance platform serves the following scenarios for different users:

Data business side: "Our team's business depends on an important piece of data. We want it guaranteed, and we want the upstream to commit to an SLA."

Data director: "Our team has promised SLAs for a lot of data to other teams. We want a platform that manages SLAs centrally and provides statistical dashboards, risk analysis, and so on."

Data governance party: "We want to improve the quality of core data within the team, align SLA management, discover risks in time, review and improve after incidents, and ultimately keep optimizing data quality."

The SLA assurance platform offers a solution for each of these roles. For team data governance, it provides comprehensive governance dashboards. For the difficulty of agreeing on SLAs across complex task links, it simplifies the process of reaching an SLA through several optimizations. For the high O&M pressure on downstream tasks, it optimizes the notification system and broadcasts SLA status promptly.

So, what are the core modules of the SLA assurance platform, and how does the platform work?

III. Introduction to core concepts

3.1 Roles:

The SLA assurance platform currently has three core roles:

Declarant: the person who submits an SLA declaration, generally the data business party, in order to guarantee the SLA of their business data;

Administrator: a role created for the data governance party, responsible for reviewing, approving, managing, and collecting statistics on declarations, as well as problem registration and post-incident reviews, with the aim of continuously optimizing the team's data quality;

Task owner: the owner of a task in the data link whose SLA is to be guaranteed, responsible for determining and signing the SLA of their own tasks; the platform then guarantees the tasks according to the signed SLAs.

3.2 Tasks:

A task is a job that produces data. From the metadata of the data tasks, a complete DAG of the entire data production link can be constructed. On this platform, the task metadata generally needs to include the following:

3.3 Declaration form

A declaration filed by a declarant is called a "declaration form." The core contents of a declaration form are generally as follows:

Declared task: the task declared by the declarant, that is, the task the declarant hopes to guarantee, also known as the starting-point task.

Expected SLA: the output time the declarant hopes for the declared task; the declared task is signed directly at this time.

Governance team: the data governance party; the declaration form is approved and managed by the administrator of the governance team.
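The three elements above can be pictured as a small record type. This is only a minimal sketch: the class name, field names, and sample values are hypothetical and are not taken from the platform.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class DeclarationForm:
    """Hypothetical model of the core elements of a declaration form."""
    declared_task: str    # the starting-point task the declarant wants guaranteed
    expected_sla: time    # the output time the declarant hopes for
    governance_team: str  # the team whose administrator approves the form


# Made-up example values, for illustration only.
form = DeclarationForm(
    declared_task="dwd_orders_daily",
    expected_sla=time(6, 30),
    governance_team="data-governance-a",
)
print(form)
```

Making the record immutable (`frozen=True`) is a design choice of this sketch, on the assumption that a submitted form is changed through an explicit approval step rather than edited in place.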

IV. Declaration and signing

SLA guarantees are based on SLA agreements. On the SLA platform, an SLA agreement is reached through a declaration. The platform's core feature is optimizing how SLAs are reached: first, "stuck point calculation" reduces the number of tasks that need to be signed; next, "SLA recommendation calculation" signs some of those tasks automatically; finally, the platform recommends a suitable SLA for each remaining task to be signed, further reducing the signing cost.

During declaration and signing, every change at each step is sent to the responsible person through the notification module. Real-time notification reduces the cost of exchanging information and speeds up agreement on SLAs.

4.1 Process brief

The above figure shows the general declaration and signing process. In practice, the process is adjusted for special cases such as task link changes or SLA times that are still under discussion.

First, the declarant fills in the declaration form. After it is submitted, the system pulls all upstream tasks of the declared task to form a complete DAG and performs task link analysis. The result of link analysis is the input to the subsequent algorithms and an important reference for the administrator's approval; it lets users quickly see where their tasks sit in the link and how the upstream and downstream tasks are running.

Ideally, for a declaration to proceed smoothly, every upstream task of the declared task must sign an SLA before signing is complete. However, the large number of upstream tasks in complex links, the high cost of cross-team communication, and the difficulty of determining SLAs are the biggest obstacles to reaching the overall SLA. "Stuck point calculation" and "SLA recommendation calculation" overcome these obstacles.
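Pulling all upstream tasks of a declared task amounts to a graph traversal. The sketch below assumes the dependency metadata is available as a plain mapping from each task to its direct upstream tasks; the function name, the mapping, and the task names are hypothetical.

```python
from collections import deque


def upstream_closure(task, upstream_of):
    """Return every direct or indirect upstream task of `task` (breadth-first)."""
    seen = set()
    queue = deque([task])
    while queue:
        current = queue.popleft()
        for parent in upstream_of.get(current, ()):
            if parent not in seen:  # visit each task once, even if shared
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical dependency metadata: task -> direct upstream tasks.
upstream_of = {
    "report": ["join_a", "join_b"],
    "join_a": ["raw_1", "raw_2"],
    "join_b": ["raw_2"],
}
print(sorted(upstream_closure("report", upstream_of)))
# ['join_a', 'join_b', 'raw_1', 'raw_2']
```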

4.2 Stuck point calculation

The system applies a "stuck point strategy" to select the tasks in the DAG that need to be signed. These tasks are called "stuck point tasks", and the process is called "stuck point calculation". Once the stuck point tasks have been calculated, the other tasks can be ignored during signing, which greatly reduces the signing cost.

A declaration is associated with multiple tasks (the declared task and its upstream stuck point tasks). Likewise, a task can be associated with multiple declarations, because a declared task may be any node in the DAG. The relationship between the two is therefore N:N.

When two declaration forms have partially overlapping task lists (for example, Task4 is associated with both), the task's declarants, governance teams, and similar data are the de-duplicated union of the two declaration forms, and its grade is the highest grade among all its declaration forms.
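The article does not say which stuck point strategy the platform uses. Purely as an illustration, the sketch below assumes one possible strategy: starting from the declared task, follow at each step only the upstream task that typically finishes latest, since that task gates the downstream start time. The function name, the strategy, and the sample data are all assumptions.

```python
def stuck_points(task, upstream_of, typical_finish):
    """Select upstream tasks by following the latest-finishing parent at each step.

    This is an illustrative strategy only, not the platform's actual one.
    `typical_finish` maps a task to its usual finish time in minutes after midnight.
    """
    selected = []
    current = task
    while upstream_of.get(current):
        current = max(upstream_of[current], key=lambda t: typical_finish[t])
        selected.append(current)
    return selected


# Hypothetical link and finish times.
upstream_of = {
    "report": ["join_a", "join_b"],
    "join_a": ["raw_1", "raw_2"],
    "join_b": ["raw_2"],
}
typical_finish = {"join_a": 300, "join_b": 240, "raw_1": 120, "raw_2": 200}
print(stuck_points("report", upstream_of, typical_finish))
# ['join_a', 'raw_2']
```

Under this assumed strategy, only two of the four upstream tasks need to be signed, which is the kind of reduction the section describes.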
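The merge rule for a task shared by several declaration forms can be sketched as follows. The dictionary keys are hypothetical, and the sketch assumes grades are numbers where a larger value means a higher grade.

```python
def merge_task_view(declarations):
    """Combine the declaration forms that share one task.

    People and teams are a de-duplicated union; the grade is the highest one.
    Assumes a larger number means a higher grade.
    """
    return {
        "declarants": sorted({d["declarant"] for d in declarations}),
        "governance_teams": sorted({d["governance_team"] for d in declarations}),
        "grade": max(d["grade"] for d in declarations),
    }


# Two hypothetical declaration forms that both include Task4.
forms_for_task4 = [
    {"declarant": "alice", "governance_team": "team_x", "grade": 2},
    {"declarant": "bob", "governance_team": "team_x", "grade": 3},
]
merged = merge_task_view(forms_for_task4)
print(merged)
# {'declarants': ['alice', 'bob'], 'governance_teams': ['team_x'], 'grade': 3}
```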

4.3 SLA recommendation calculation

The recommended SLA of a task is obtained from the historical run information of the task and of its upstream and downstream tasks, combined with a recommendation algorithm. This process is called SLA recommendation calculation.

Before the task owner signs, the SLA recommendation algorithm calculates a recommended SLA for each task and automatically signs some of the tasks to be signed, further reducing the signing cost. According to platform statistics, this feature automatically signs nearly 40% of SLAs, making it one of the platform's core features.

For the remaining tasks to be signed, the recommended SLA is offered to the task owner, who can sign with it directly or choose a different SLA. In general, the recommended SLA already meets most requirements, so task owners make signing decisions faster and the signing cost falls again.
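The recommendation algorithm itself is not described in the article. As a stand-in, the sketch below assumes a simple heuristic: take a high percentile of the task's historical finish times and add a safety buffer. The function name, the percentile, the buffer, and the history values are all hypothetical.

```python
import math


def recommend_sla(finish_minutes, percentile=0.95, buffer_minutes=15):
    """Nearest-rank percentile of historical finish times plus a safety buffer.

    A stand-in heuristic, not the platform's algorithm.
    Times are minutes after midnight.
    """
    ordered = sorted(finish_minutes)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1] + buffer_minutes


# Hypothetical finish times of one task over the last ten runs.
history = [300, 305, 310, 298, 340, 302, 307, 299, 301, 303]
recommended = recommend_sla(history)
print(f"{recommended // 60:02d}:{recommended % 60:02d}")
# 05:55
```

A real recommender would also have to respect the SLAs of upstream and downstream tasks, which this sketch ignores.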

4.4 System assurance monitoring

Once a declaration is signed, the platform guarantees the tasks in it. The core of the assurance service is to monitor SLA status changes and broadcast notifications promptly, so that the responsible person gets first-hand information in time and O&M costs fall. For an offline task, the SLA is evaluated mainly from its completion time and its promised SLA. There are four SLA statuses:

SLA not reached: the task has not produced output and the current time is before the promised SLA (monitoring continues);

Achieved: the task completed before the promised SLA (a ready notification is sent);

Delayed: the task has not completed and the current time is after the promised SLA (a delay notification is sent);

Delayed (output): the task has completed, but after the promised SLA (a delayed-output notification is sent).

The figure below shows how the SLA status changes over time in the two cases, achieved and not achieved.

The real-time SLA status is important to the data business party, so the platform monitors the SLAs of all tasks and notifies the relevant people in real time when a status changes. From these notifications they learn the specific SLA situation and can take countermeasures.
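The four statuses follow from two facts: whether the task has finished, and how the relevant time compares with the promised SLA. A minimal sketch, in which the names are hypothetical and finishing exactly at the SLA time is assumed to count as achieved:

```python
from datetime import datetime
from enum import Enum


class SlaStatus(Enum):
    NOT_REACHED = "SLA not reached"
    ACHIEVED = "Achieved"
    DELAYED = "Delayed"
    DELAYED_OUTPUT = "Delayed (output)"


def sla_status(now, sla, finished_at=None):
    """Classify a task from the current time, its promised SLA, and its finish time."""
    if finished_at is not None:
        # Finished: compare the completion time with the promise.
        # Assumption: finishing exactly at the SLA time counts as achieved.
        return SlaStatus.ACHIEVED if finished_at <= sla else SlaStatus.DELAYED_OUTPUT
    # Not finished yet: compare the current time with the promise.
    return SlaStatus.DELAYED if now > sla else SlaStatus.NOT_REACHED


sla = datetime(2024, 1, 1, 6, 0)
print(sla_status(datetime(2024, 1, 1, 5, 0), sla).value)  # SLA not reached
print(sla_status(datetime(2024, 1, 1, 7, 0), sla).value)  # Delayed
print(sla_status(datetime(2024, 1, 1, 8, 0), sla,
                 finished_at=datetime(2024, 1, 1, 7, 30)).value)  # Delayed (output)
```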
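Notifying only on status changes means comparing each monitoring pass with the previous one. The sketch below assumes statuses are kept as plain strings in a dictionary keyed by task name; the function and task names are hypothetical.

```python
def status_changes(previous, current):
    """Return (task, old_status, new_status) for every task whose status changed."""
    return [
        (task, previous.get(task), new)
        for task, new in current.items()
        if previous.get(task) != new
    ]


# Two hypothetical monitoring passes.
previous = {"t1": "SLA not reached", "t2": "SLA not reached"}
current = {"t1": "Achieved", "t2": "SLA not reached", "t3": "Delayed"}

for task, old, new in status_changes(previous, current):
    print(f"{task}: {old} -> {new}")
# t1: SLA not reached -> Achieved
# t3: None -> Delayed
```

A task seen for the first time has no previous status, so it is reported with `None` as the old value.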

V. Detailed explanation of review management

Review management (post-incident review) is how the platform implements its responsive governance service, and it is a key focus of the data governance party. It is divided into problem management and incident management. Problem management focuses on "why", that is, sorting out and analyzing the causes of SLA breaches; incident management focuses on "what to do", that is, how to handle SLA breach incidents.

5.1 Problem management

The goal of the problem management module is to let the data governance team register and manage SLA problems, to support root cause analysis along different dimensions once problems are registered, to help users manage root causes, and to accumulate problem management experience.

During system assurance monitoring, the platform broadcasts notifications when an SLA is delayed and keeps reminding the responsible person to register the problem. For registration, the platform provides a root cause tree that helps pin down the root cause category and makes statistical analysis easier. After the task owner registers the problem, the accumulated data is displayed on the problem dashboard, where the data governance party analyzes and summarizes it.

The platform keeps a one-to-one correspondence between SLA delay records and problems, and links SLA details on the problem dashboard, including task links, responsible persons, and task start and end times.

Problem registration usually goes from many to few. As early problems are solved one by one, they become a useful reference and warning for later governance. The data is valuable for:

Showing the trend and distribution of SLA problem types, so that problems can be governed in a targeted way

Showing how many SLA problems share the same root cause and how many data assets are affected

Showing which data assets frequently have SLA problems, how those problems are classified, and which root causes are behind them

Summarizing experience with SLA problems, so that when similar problems recur the platform can make recommendations that help locate root causes quickly

According to the platform's operation records, common problems include resource queue congestion, upstream task failure, and data skew. The bimonthly problem registrations of one data team are summarized below; both the number of problems and the number of root cause categories converge effectively:
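The second point in this list, counting problems and affected data assets per root cause, is a simple aggregation over registered problem records. A minimal sketch, with hypothetical record fields and made-up sample records:

```python
from collections import Counter, defaultdict


def summarize_by_root_cause(problems):
    """Map each root cause to (number of problems, number of distinct assets affected)."""
    counts = Counter(p["root_cause"] for p in problems)
    assets = defaultdict(set)
    for p in problems:
        assets[p["root_cause"]].add(p["asset"])
    # most_common() orders root causes from most to least frequent.
    return {cause: (n, len(assets[cause])) for cause, n in counts.most_common()}


# Made-up problem records, for illustration only.
problems = [
    {"root_cause": "resource queue congestion", "asset": "table_a"},
    {"root_cause": "resource queue congestion", "asset": "table_b"},
    {"root_cause": "upstream task failure", "asset": "table_a"},
    {"root_cause": "resource queue congestion", "asset": "table_a"},
]
summary = summarize_by_root_cause(problems)
print(summary)
# {'resource queue congestion': (3, 2), 'upstream task failure': (1, 1)}
```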

Bimonthly period    Number of issues    Root cause categories
2019-07/08          77                  12
2019-09/10          58                  10
2019-11/12          33                  7
2020-01/02          23                  5
2020-03/04          17                  4
2020-05/06          9                   2
2020-07/08          9                   2

5.2 Incident management

Incident management records SLA breach incidents and manages their improvement. Each incident corresponds to at least one SLA problem record, but not every SLA problem causes an incident.

An incident can occur at any node. Generally, an incident must be registered when an SLA breach has actual business impact, and the registration is associated with the relevant SLA information. The incident handling process is as follows:

As shown in the figure, an incident record mainly includes the SLA incident details, the root cause, the improvement plan, and SLA consumption. Note the following points:

When an incident is registered, its root cause is confirmed from the incident details, and the responsible person proposes an improvement plan.

Subscribers can subscribe to incidents and are notified when the recovery status or the completion status of the improvement plan changes.

The task's improvement plan is sent to the responsible person as a reminder every day until the plan is completed.

The data on the SLA incident management platform is an important basis for measuring the governance results of the data governance party, and it also reflects how effective the whole SLA assurance platform is. The data is valuable for:

Archiving incident reviews, so that they can be consulted later at any time and the related SLA information located

Comparing the overall SLA incident situation across data teams, so that teams can learn from each other

Tracking the improvement plans of incidents and checking the effectiveness of SLAs.

Here are the bimonthly incident statistics for one team:

Bimonthly period    Number of incidents    Change from previous period
2019-07/08          46                     -
2019-09/10          26                     -43%
2019-11/12          18                     -31%
2020-01/02          13                     -28%
2020-03/04          7                      -46%
2020-05/06          6                      -14%
2020-07/08          5                      -16%
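The percentage column is the change in count relative to the previous two-month period. The sketch below recomputes it from the counts in the table, rounding to the nearest whole percent. The rounding rule is an assumption: it reproduces every listed value except the last row, which comes out as -17 where the table lists -16%.

```python
# Counts per bimonthly period, copied from the table above.
incidents = [
    ("2019-07/08", 46),
    ("2019-09/10", 26),
    ("2019-11/12", 18),
    ("2020-01/02", 13),
    ("2020-03/04", 7),
    ("2020-05/06", 6),
    ("2020-07/08", 5),
]


def period_changes(series):
    """Percentage change of each period relative to the one before it."""
    changes = []
    for (_, prev), (period, cur) in zip(series, series[1:]):
        changes.append((period, round((cur - prev) / prev * 100)))
    return changes


for period, pct in period_changes(incidents):
    print(f"{period}  {pct:+d}%")
```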

The data above shows that the platform effectively guarantees the stable output of core tasks and helps reduce the probability of stability incidents. The number of such incidents per two-month period has now stayed in single digits for a long time.

VI. Summary of Platform Architecture

The platform as a whole is mainly divided into three parts: basic components, planning governance services and responsive governance services. The system component architecture diagram is as follows:

6.1 Planning governance services

"Planning governance" means governance before problems are discovered: task output is guaranteed through proactive planning and SLAs. Planning governance is also the process in which SLA-related problems are discovered.

The planning governance service provides the means to reach SLA agreements through declaration form signing. It covers lifecycle management of declaration forms, link analysis of declared tasks, and system assurance monitoring after an SLA is reached, all in service of the declaration and signing process.

6.2 Responsive governance services

Responsive governance is the process of registering, managing, and reviewing SLA-related incidents and problems through the review management module. After an SLA-related problem is discovered, it must be handled to close the loop; governance after a problem is discovered is called responsive governance.

The responsive governance service module abstracts the problem registration and incident management modules so that it can serve problem attribution and incident statistics for data SLAs more flexibly.

6.3 Basic components

The basic components provide basic functional modules such as configuration, broadcast, and dashboards. They give the planning and responsive governance services the support they need and are an indispensable part of the overall SLA assurance service.

6.3.1 System configuration

Governance team configuration

The governance team is the SLA management team. Each declaration form must be bound to a governance team, which is mainly responsible for approving the declaration form.

Data team configuration

The data team is the owner of the data, and one data team corresponds to one business team. The data team design meets each business team's need for independent governance. Through flexible data team configuration, the platform can attribute data and tasks at a finer granularity, which solves the problem of unclear rights and responsibilities.

Subscription configuration

Subscription management is where subscription information is configured. On this platform, a subscription is a notification broadcast for SLA monitoring; through subscription management, notifications can be directed to individuals or groups. Subscription management is an integral part of the SLA monitoring and assurance service.
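Directing notifications to individuals or groups can be sketched as expanding each subscription target into people. The configuration shape, function name, and all names in the sample data are hypothetical.

```python
def recipients_for(task, subscriptions, groups):
    """Expand a task's subscription targets (people or groups) into individuals."""
    people = set()
    for target in subscriptions.get(task, ()):
        # A target that is not a known group is treated as an individual.
        people.update(groups.get(target, [target]))
    return sorted(people)


# Hypothetical subscription configuration.
subscriptions = {"dwd_orders_daily": ["alice", "oncall-group"]}
groups = {"oncall-group": ["bob", "carol", "alice"]}

print(recipients_for("dwd_orders_daily", subscriptions, groups))
# ['alice', 'bob', 'carol']
```

Collecting recipients in a set means a person who subscribes both directly and through a group is notified once.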

6.3.2 Notification broadcast

Notification broadcast is the platform's basic notification capability, and it is an important means of reducing communication costs, delivering the assurance service, and improving the user experience. Notifications are broadcast on important node changes, user operations, SLA status changes, and so on. They take different forms depending on the scenario: ordinary text messages, urgent messages, card notifications, email notifications, telephone notifications, and so on.

6.3.3 SLA Panel

The SLA panel is the part the data governance party cares about most. It provides rich information such as the day's overall SLA statistics, SLA delay trend analysis, SLA level distribution details, task health details, and team SLA achievement statistics. It is an important reference for many team data governance indicators.
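One of the panel's statistics, team SLA achievement, can be computed as the share of a team's tasks that met their SLA on a given day. A minimal sketch with a hypothetical record shape and made-up data:

```python
from collections import defaultdict


def achievement_rate_by_team(records):
    """records: iterable of (team, achieved) pairs, one per task for the day."""
    totals = defaultdict(int)
    achieved = defaultdict(int)
    for team, ok in records:
        totals[team] += 1
        achieved[team] += int(ok)
    return {team: achieved[team] / totals[team] for team in totals}


# Made-up daily results, for illustration only.
records = [
    ("team_a", True),
    ("team_a", True),
    ("team_a", False),
    ("team_b", True),
]
rates = achievement_rate_by_team(records)
for team, rate in rates.items():
    print(f"{team}: {rate:.1%}")
# team_a: 66.7%
# team_b: 100.0%
```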
