How to develop highly reliable services based on K8s


How do you develop highly reliable services based on K8s? This article analyzes the problem in detail and presents the corresponding solutions, in the hope of helping readers facing the same problem find a simpler, more practical approach.

MySQL on k8s

Application design and development cannot be separated from business requirements. The requirements for our MySQL application are as follows:

High reliability of data

High service availability

Easy to use

Easy operation and maintenance

Meeting these requirements relies on the cooperation of K8s and the application itself; in other words, developing highly reliable applications on K8s requires not only K8s knowledge but also domain knowledge of the application.

The following sections analyze the solution for each of these requirements.

1. High reliability of data

High data reliability generally depends on these aspects: redundancy, and backup / recovery.

We use Percona XtraDB Cluster as the MySQL cluster scheme. It is a multi-master MySQL architecture in which real-time data synchronization between instances is based on Galera replication. Compared with a master-slave architecture, this clustering scheme avoids the data loss that can occur during master-slave switchover and further improves data reliability.

For backup, we use xtrabackup as the backup / recovery solution. It performs hot backups, so users' normal access to the cluster is not affected while a backup is running.

In addition to "scheduled backup", we also provide "manual backup" to meet business needs for backing up data.
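As an illustration, a minimal Go sketch of a backup worker that shells out to xtrabackup for a hot backup is shown below; the xtrabackup flags are real, but the directory layout, timestamp naming, and function names are assumptions rather than the article's actual implementation.

```go
// backup.go: minimal sketch of a backup worker that shells out to xtrabackup.
// Directory layout and timestamp naming are illustrative assumptions.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

// runHotBackup performs a hot backup of the local MySQL data directory
// into a timestamped target directory using xtrabackup.
func runHotBackup(baseDir string) error {
	target := fmt.Sprintf("%s/%s", baseDir, time.Now().Format("20060102-150405"))
	cmd := exec.Command("xtrabackup", "--backup", "--target-dir="+target)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("xtrabackup failed: %v, output: %s", err, out)
	}
	return nil
}

func main() {
	if err := runHotBackup("/backups"); err != nil {
		log.Fatal(err)
	}
}
```

A "scheduled backup" can call such a worker from a timer or a K8s CronJob, while a "manual backup" can invoke it on demand.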

2. High service availability

We analyze this from two perspectives: the "data link" and the "control link".

"data link" is the link for users to access MySQL services. We use the MySQL cluster scheme with three master nodes to provide access to users through TLB (Qiniu's self-developed layer-4 load balancing service). TLB not only realizes the load balancing of the MySQL instance at the access level, but also realizes the health detection of the service, automatically removes the abnormal node, and automatically adds the node when the node is restored. As shown below:

With this MySQL cluster scheme and TLB, the failure of one or two nodes does not affect users' normal access to the MySQL cluster, ensuring high availability of the MySQL service.

"Control link" is the management link of MySQL cluster, which is divided into two levels: global control management of each MySQL cluster. Global control management is mainly responsible for "creating / deleting clusters", "managing all MySQL cluster states" and so on. It is realized based on the concept of Operator. Each MySQL cluster has a controller, which is responsible for "task scheduling", "health detection", "automatic fault handling" and so on.

This split delegates the management of each cluster to the cluster itself, which reduces interference between the control links of different clusters and relieves pressure on the global controller. As shown below:

Here is a brief introduction to the concept and implementation of Operator.

Operator is a concept proposed by CoreOS for creating, configuring and managing complex applications. It consists of two parts: the Resource, a custom resource that gives users a simple way to describe their expectations of the service; and the Controller, which is created alongside the Resource, listens for Resource changes, and fulfills the users' expectations of the service.

The workflow is shown in the following figure:

That is:

Register CR (CustomResource) resources

Listen for changes to CR objects

The user performs CREATE/UPDATE/DELETE operations on the CR resource

Trigger the corresponding handler for processing

Based on practice, we have abstracted the development of Operator as follows:

The CR is abstracted into a structure like this:
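A minimal Go sketch of such a structure follows; it uses the QiniuMySQL resource named later in this article, but the Spec/Status fields are illustrative assumptions, not the actual definition.

```go
// cr_types.go: a minimal sketch of how a CR can be modeled in Go.
// The fields inside Spec/Status are illustrative assumptions.
package types

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// QiniuMySQL describes a user's expectation of a MySQL cluster.
type QiniuMySQL struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   QiniuMySQLSpec   `json:"spec"`
	Status QiniuMySQLStatus `json:"status,omitempty"`
}

// QiniuMySQLSpec is the desired state supplied by the user.
type QiniuMySQLSpec struct {
	Replicas int    `json:"replicas"` // e.g. 3 master nodes
	Version  string `json:"version"`  // MySQL / PXC version
}

// QiniuMySQLStatus is the observed state maintained by the controller.
type QiniuMySQLStatus struct {
	State string `json:"state"` // Green / Yellow / Red
}
```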

The handling of CR ADD/UPDATE/DELETE events is abstracted into an API like the following:
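A minimal sketch of what such an API could look like in Go; the interface name and method signatures are assumptions, not the framework's actual definitions.

```go
// handler.go: sketch of the CR event-handling API.
package types

// Handler is implemented by an Operator to react to CR lifecycle events.
type Handler interface {
	// OnAdd is called when a new CR object is created.
	OnAdd(obj *QiniuMySQL) error
	// OnUpdate is called when an existing CR object is modified.
	OnUpdate(oldObj, newObj *QiniuMySQL) error
	// OnDelete is called when a CR object is removed.
	OnDelete(obj *QiniuMySQL) error
}
```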

On top of these abstractions, Qiniu provides a simple Operator framework that transparently registers the CR and listens for CR events, making Operator development easier.

We developed MySQL Operator and MySQL Data Operator to be responsible for "creating / deleting clusters" and "manual backup / restore", respectively.

Each MySQL cluster runs many kinds of task logic, such as "data backup", "data recovery", "health detection" and "automatic fault handling". Running these tasks concurrently can cause anomalies, so a task scheduler is needed to coordinate their execution, and the Controller plays this role:

Through the Controller and the various Workers, each MySQL cluster is able to operate and maintain itself.

For "health detection", we implement two mechanisms: passive detection and active detection. With "passive detection", each MySQL instance reports its health status to the Controller; with "active detection", the Controller requests the health status from each MySQL instance. The two mechanisms complement each other and improve the reliability and timeliness of health detection.
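A minimal sketch of how the two mechanisms could be combined in a controller, assuming an HTTP report endpoint for passive detection and a periodic TCP probe for active detection; the endpoint path, ports, and probe interval are illustrative assumptions, not the article's implementation.

```go
// health.go: sketch of passive + active health detection in a controller.
package main

import (
	"encoding/json"
	"log"
	"net"
	"net/http"
	"sync"
	"time"
)

type Controller struct {
	mu     sync.Mutex
	states map[string]string // instance address -> last known state
}

// reportHandler implements passive detection: instances POST their status here.
func (c *Controller) reportHandler(w http.ResponseWriter, r *http.Request) {
	var report struct {
		Addr  string `json:"addr"`
		State string `json:"state"`
	}
	if err := json.NewDecoder(r.Body).Decode(&report); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	c.mu.Lock()
	c.states[report.Addr] = report.State
	c.mu.Unlock()
}

// probeLoop implements active detection: the controller dials each instance.
func (c *Controller) probeLoop(addrs []string, interval time.Duration) {
	for range time.Tick(interval) {
		for _, addr := range addrs {
			state := "Green"
			conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
			if err != nil {
				state = "Unknown"
			} else {
				conn.Close()
			}
			c.mu.Lock()
			c.states[addr] = state
			c.mu.Unlock()
		}
	}
}

func main() {
	c := &Controller{states: map[string]string{}}
	go c.probeLoop([]string{"mysql-0:3306", "mysql-1:3306", "mysql-2:3306"}, 10*time.Second)
	http.HandleFunc("/health/report", c.reportHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```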

The health check data is used by both the Controller and the Operator, as shown in the following figure:

The Controller uses the health check data to detect anomalies in the MySQL cluster in time and to handle the corresponding faults, so it needs accurate and timely health status information. It keeps the state of all MySQL instances in memory, updates each instance's status according to the results of "active detection" and "passive detection", and acts on it accordingly.

The Operator uses the health check data to expose the operating status of the MySQL cluster to the outside world, and to intervene in the cluster's fault handling when the Controller itself is abnormal.

In practice, health checks run at a relatively high frequency and produce a large number of health states. If every health state were persisted, the Operator and the APIServer would come under heavy access pressure. Since only the latest health states are meaningful, at the Controller level the health states to be reported to the Operator are inserted into a fixed-capacity queue, and when the queue is full the oldest states are discarded.
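A minimal sketch of such a fixed-capacity queue, assuming a plain slice guarded by a mutex; the type names and the Drain method are illustrative.

```go
// queue.go: sketch of a fixed-capacity queue that drops the oldest entry when full.
package main

import "sync"

type HealthState struct {
	Instance string
	State    string // Green / Yellow / Red-clean / Red-unclean / Unknown
}

type BoundedQueue struct {
	mu    sync.Mutex
	items []HealthState
	cap   int
}

func NewBoundedQueue(capacity int) *BoundedQueue {
	return &BoundedQueue{cap: capacity}
}

// Push appends a state; if the queue is full, the oldest state is discarded.
func (q *BoundedQueue) Push(s HealthState) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) >= q.cap {
		q.items = q.items[1:] // drop the oldest
	}
	q.items = append(q.items, s)
}

// Drain returns all queued states and empties the queue, e.g. before
// reporting them to the Operator.
func (q *BoundedQueue) Drain() []HealthState {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := q.items
	q.items = nil
	return out
}
```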

When the Controller detects a MySQL cluster exception, the fault is handled automatically.

We first define the principles of fault handling: do not lose data; affect availability as little as possible; handle known faults that can be handled automatically; do not automatically handle unknown faults that cannot be handled, but require manual intervention instead. The key questions are then: what types of faults exist, how to detect in time whether a fault has occurred, what type the fault is, and how to handle it. To address these questions, we define three levels of cluster state:

Green: the cluster can serve external requests, and the number of running nodes meets expectations.

Yellow: the cluster can serve external requests, but the number of running nodes does not meet expectations.

Red: the cluster cannot serve external requests.

At the same time, the following states are defined for each mysqld node:

Green: the node is running and is part of the MySQL cluster.

Yellow: the node is running but is not part of the MySQL cluster.

Red-clean: the node has exited gracefully.

Red-unclean: the node has exited ungracefully.

Unknown: the node status is unknown.

After the Controller collects the states of all MySQL nodes, it computes the state of the MySQL cluster from those node states. When it detects that the cluster state is not Green, it triggers the "fault handling" logic, which handles the fault according to the known fault-handling schemes; if the fault type is unknown, manual intervention is required. The whole process is shown below:

Since fault scenarios and handling schemes differ from application to application, the specific handling methods are not described here.
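As a concrete illustration of the state-computation step, here is a minimal Go sketch that aggregates the node states defined above into a cluster state; the aggregation rule (treating loss of quorum as Red) is an illustrative assumption consistent with the state definitions, not necessarily the article's exact logic.

```go
// state.go: sketch of deriving the cluster state from individual node states.
package main

import "fmt"

// Node states, following the definitions above.
const (
	NodeGreen      = "Green"
	NodeYellow     = "Yellow"
	NodeRedClean   = "Red-clean"
	NodeRedUnclean = "Red-unclean"
	NodeUnknown    = "Unknown"
)

// clusterState aggregates node states into a cluster-level Green / Yellow / Red.
// expected is the expected number of running nodes (e.g. 3).
func clusterState(nodeStates []string, expected int) string {
	running := 0
	for _, s := range nodeStates {
		if s == NodeGreen {
			running++
		}
	}
	switch {
	case running*2 <= expected:
		return "Red" // lost quorum: cannot serve external requests
	case running < expected:
		return "Yellow" // serving, but fewer running nodes than expected
	default:
		return "Green" // serving, and node count meets expectations
	}
}

func main() {
	fmt.Println(clusterState([]string{NodeGreen, NodeGreen, NodeRedClean}, 3)) // Yellow
}
```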

3. Easy to use

Based on the Operator concept, we implemented a highly reliable MySQL service and defined two types of resources for users, namely QiniuMySQL and QiniuMySQLData. The former describes the user's configuration of a MySQL cluster, while the latter describes a manual backup / restore task. Taking QiniuMySQL as an example:

Users can trigger the creation of a MySQL cluster through the following simple yaml file:
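A minimal sketch of what such a manifest might look like; the apiVersion, field names, and values are illustrative assumptions, not the actual schema from the article.

```yaml
# Illustrative sketch of a QiniuMySQL manifest; field names and values
# are assumptions, not the actual schema.
apiVersion: mysql.qiniu.com/v1alpha1
kind: QiniuMySQL
metadata:
  name: demo-cluster
spec:
  replicas: 3          # three master nodes, per the cluster scheme above
  version: "5.7"       # MySQL / PXC version
```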

After the cluster is created, users can obtain the cluster status through the status field of the CR object.

Another concept is introduced here: Helm.

Helm is a package management tool for K8s. By packaging an application as a Chart, it standardizes the process of delivering, deploying and using K8s applications.

A Chart is essentially a collection of K8s YAML files and parameter files, so an application can be delivered as a single Chart. By operating on Charts, Helm can deploy and upgrade applications with one click.

For reasons of space, and because Helm usage is fairly standard, the specific usage process is not described here.

4. Easy operation and maintenance

In addition to the "health detection" and "automatic fault handling" described above, and to delivering and deploying the application through Helm, the following issues need to be considered for operation and maintenance: monitoring / alerting, and log management.

We use Prometheus + Grafana for monitoring and alerting. The service exposes its metric data to Prometheus through an HTTP API, which the Prometheus server scrapes periodically. Developers visualize the monitoring data held in Prometheus on Grafana dashboards; based on their understanding of the charts and the application, they set alert thresholds on the charts, and alerting is implemented through Grafana.

This approach of visualizing first and alerting second greatly improves our understanding of the application's operating characteristics, clarifies which indicators and thresholds deserve attention, and reduces the number of false alarms.
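A minimal sketch of exposing metrics over HTTP with the Prometheus Go client; the metric name, labels, and port are illustrative assumptions.

```go
// metrics.go: sketch of exposing Prometheus metrics over an HTTP endpoint.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// backupTotal counts completed backups; the name and labels are illustrative.
var backupTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "mysql_backup_total",
		Help: "Total number of completed backups.",
	},
	[]string{"cluster", "result"},
)

func main() {
	prometheus.MustRegister(backupTotal)

	// Example of updating the metric from application code.
	backupTotal.WithLabelValues("demo-cluster", "success").Inc()

	// Expose /metrics for the Prometheus server to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9104", nil))
}
```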

During development, we use gRPC for communication between services. In the gRPC ecosystem there is an open source project called go-grpc-prometheus; by inserting a few lines of code into the service, it can monitor all RPC requests handled by the gRPC server.
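A minimal sketch of wiring go-grpc-prometheus into a gRPC server using the project's interceptors; the ports and the placeholder service registration are illustrative assumptions.

```go
// grpc_metrics.go: sketch of instrumenting a gRPC server with go-grpc-prometheus.
package main

import (
	"log"
	"net"
	"net/http"

	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	// Create the gRPC server with the Prometheus interceptors installed.
	server := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)

	// Register application services here, e.g. pb.RegisterFooServer(server, &fooImpl{}).

	// Initialize the metrics for all registered RPC methods.
	grpc_prometheus.Register(server)

	// Expose /metrics alongside the gRPC listener.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		log.Fatal(http.ListenAndServe(":9105", nil))
	}()

	lis, err := net.Listen("tcp", ":8081")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(server.Serve(lis))
}
```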

For containerized services, log management involves two aspects: "log collection" and "log rotation".

We write the service logs to syslog and then, through certain means, forward the syslog output to the container's stdout/stderr, so that external systems can collect the logs in the usual way. At the same time, logrotate is configured for syslog to rotate the logs automatically, preventing service failures caused by logs filling up the container's disk space.

To improve development efficiency, we use https://github.com/phusion/baseimage-docker as the base image, which has syslog and logrotate built in. The application only needs to write its logs to syslog and does not need to care about log collection or log rotation.
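A minimal sketch of writing application logs to the local syslog daemon with Go's standard log/syslog package; the tag name is an illustrative assumption.

```go
// logging.go: sketch of sending application logs to the local syslog daemon.
package main

import (
	"log"
	"log/syslog"
)

func main() {
	// Connect to the local syslog daemon (built into the base image) and
	// tag entries with the service name.
	w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, "mysql-operator")
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	// Route the standard logger through syslog; collection and rotation
	// are handled outside the application.
	log.SetOutput(w)
	log.Println("service started")
}
```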

Summary

From the above description, the complete MySQL application architecture is as follows:

While developing the highly reliable MySQL application on K8s, as our understanding of K8s and MySQL deepened, we kept abstracting and gradually implemented the following general logic and best practices as modules: an Operator development framework, a health detection service, an automatic fault handling service, a task scheduling service, a configuration management service, a monitoring service, a logging service, and so on.

With this general logic and these best practices packaged as modules, developers building new highly reliable applications on K8s can assemble the K8s-related interactions quickly, like building blocks, and such applications are highly reliable from the start because they are built on best practices. At the same time, developers can shift their attention from the steep K8s learning curve to the application itself and improve the reliability of the service at the application level.

This is the answer to the question of how to develop highly reliable services based on K8s. I hope the content above is of some help to you.
