
How to build an etcd Monitoring platform under the scenario of 10,000-level Kubernetes Cluster


This article explains how we built an etcd monitoring platform for a Kubernetes environment with more than 10,000 clusters.

Background

As Kubernetes has become the de facto standard for container orchestration, more and more businesses use it to deploy and manage services at scale in production. Tencent Cloud TKE is built on native Kubernetes and provides a container-centric, highly scalable, high-performance container management service. Since its launch in 2017, as Kubernetes has grown in popularity, the number of clusters we manage has grown past 10,000. Along the way, our basic components, especially etcd, have faced the following challenges:

How do we collect metrics from etcd and other components across more than 10,000 TKE clusters with a single monitoring system?

How do we efficiently manage more than 10,000 clusters and proactively discover faults and potential risks?

How do we quickly detect anomalies and achieve rapid handling, or even self-healing?

To address these challenges, we built a visual etcd platform on top of the Kubernetes extension mechanism that covers etcd cluster management, scheduling, migration, monitoring, backup, and inspection. This article focuses on how our etcd monitoring platform solves the three challenges above.

To handle large-scale monitoring data collection, our solution evolved from a single Prometheus instance in the early days of TKE, to dynamically creating multiple Prometheus instances and monitoring targets with Prometheus-Operator, and finally to a scalable monitoring system built on the TKE cloud native Prometheus product. It now provides stable etcd storage and monitoring for more than 10,000 Kubernetes clusters, and the number of Kubernetes clusters governed by the etcd monitoring platform has grown from single digits to thousands and then tens of thousands. A single region currently has tens of thousands of Prometheus targets and tens of millions of metric series per unit of time, yet monitoring data availability remains above 99.99%.

To cope with the uncontrollable human errors and hardware and software failures that can occur in a complex distributed environment, we built a multi-dimensional, extensible inspection system based on the Kubernetes extension mechanism and our accumulated etcd knowledge and experience. It helps us efficiently manage more than 10,000 clusters and proactively discover potential risks.

To cope with massive monitoring data and a flood of alerts, we established a standardized data operation system on top of highly available monitoring data and real operation scenarios. This greatly reduced invalid alerts, improved alert accuracy, and, by further introducing multi-dimensional SLOs to converge alert metrics, gave business teams intuitive service-level indicators. Through this standardized operation system, alert classification, alert follow-up, escalation mechanisms, and self-healing strategies for simple scenarios, faults can be handled quickly and even recovered from automatically.

Next, we describe in detail how we solved these three challenges and quickly built a scalable business monitoring system.

How to build a highly available and scalable monitoring data acquisition service?

The first question is how to collect metrics from the etcd components of more than 10,000 TKE clusters with a single monitoring system.

As is well known, etcd is an open source distributed key-value store and serves as the metadata store for Kubernetes. Any instability in etcd directly makes the services above it unavailable, so monitoring etcd is critical.

When TKE launched in 2017, there were few clusters, so a single Prometheus instance could handle all monitoring.

In 2018, Kubernetes grew more popular and we had more and more TKE clusters, so we introduced Prometheus-Operator to manage Prometheus instances dynamically. With multiple Prometheus instances we could basically cover the monitoring needs of thousands of Kubernetes clusters. The architecture is shown below.

Prometheus-Operator architecture

We deployed Prometheus-Operator in each region and created separate Prometheus instances for different business types. Every time a Kubernetes/etcd cluster is added, we create a ServiceMonitor resource through the API to tell Prometheus to collect the new cluster's monitoring data.
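A minimal sketch of such a ServiceMonitor, assuming a per-cluster Service named gz-qcloud-etcd-03 that exposes an etcd metrics port (the names, namespace, and labels here are illustrative, not our actual resource definitions):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gz-qcloud-etcd-03          # hypothetical: one ServiceMonitor per etcd cluster
  namespace: monitoring
  labels:
    app: etcd                      # label matched by the Prometheus CRD's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      clusterName: gz-qcloud-etcd-03
  namespaceSelector:
    matchNames:
    - gz
  endpoints:
  - port: metrics                  # named port on the Service exposing etcd's /metrics
    interval: 15s
    scheme: http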

However, as the number of Kubernetes/etcd clusters kept growing, the stability of etcd monitoring and alerting in this scheme came under pressure: the monitoring link was unstable, monitoring curves showed breakpoints, alerts were rampant, false alarms were common, and follow-up was difficult.

Pain points

What are the specific problems?

Here we analyze them from two angles: monitoring instability and operation and maintenance cost.

Monitoring instability

Unstable monitoring components: the sheer volume of monitoring data frequently caused Prometheus instances to OOM, and frequent changes (every etcd cluster provisioned modifies the Prometheus instances) triggered Prometheus stalls.

Coupling between monitoring and business: to avoid OOM caused by excessive data volume, Prometheus instances had to be split manually to shard the data. This not only increased maintenance cost, but also strongly coupled the automatic onboarding mechanism with manual sharding, which hindered later operation and feature expansion.

Unstable monitoring link: the monitoring link mainly consists of the lower-level Prometheus instances and an upper-level Top-Prometheus. Because Top-Prometheus is shared with other businesses and carries a large amount of data, it frequently OOMed and restarted; and because it stores a large amount of data on local disk, startup was slow and restarts took a long time, further widening the impact and often producing long breakpoints in the data.

Operation and maintenance cost

Self-maintained monitoring components: sharding monitoring data required manually splitting Prometheus instances, and the monitoring components themselves had to be maintained by us to keep monitoring available.

Hard-to-maintain alert rules: alert rules relied heavily on regular-expression matching of etcd cluster names, which made them hard to maintain. To add a new alert rule you had to understand the existing configuration and first add exclusion logic for specific etcd clusters to the existing rules, so new rules often affected existing alerts.

Hard-to-follow-up alerts: there were many metrics and a huge volume of alerts that did not accurately reflect business problems. Alert metrics carried no business semantics, so business teams could not understand them directly, alert information could not be fed back to the business side, and alert follow-up was difficult.

In addition, with open source Prometheus, adding monitoring targets caused Prometheus exceptions, service restarts, and data breakpoints, and the large volume of monitoring data caused frequent OOMs, so monitoring service availability was low.

Problem analysis

As shown in the figure above, the monitoring service mainly consists of lower-level Prometheus Servers and an upper-level Top-Prometheus.

Why does the change get stuck?

As shown in the figure above, Secret resources are generated for each etcd cluster. Prometheus-Operator generates the static_config for each Prometheus instance based on the Secret, the Prometheus CRD, and ServiceMonitors, and the Prometheus instance ultimately pulls data according to this config file.

More etcd clusters => more Secrets and Prometheus CRD updates => frequent static_config updates => the Prometheus scrape configuration changes so frequently that Prometheus cannot work properly.
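For illustration, the regenerated configuration has roughly the following shape (a minimal sketch of a Prometheus scrape_configs stanza using static_configs, as the article describes; the job name and addresses are hypothetical). Every new etcd cluster appends another stanza, so the whole file keeps getting re-rendered and reloaded:

scrape_configs:
- job_name: etcd-gz-qcloud-etcd-03        # hypothetical job, one per etcd cluster
  scrape_interval: 15s
  metrics_path: /metrics
  static_configs:
  - targets:
    - 10.0.0.11:2379                      # illustrative etcd member addresses
    - 10.0.0.12:2379
    - 10.0.0.13:2379
    labels:
      clusterName: gz-qcloud-etcd-03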

Where does the capacity problem come from?

Against the background of continuously growing TKE clusters and etcd moving into production on the cloud, the number of etcd clusters keeps increasing and etcd exposes a large number of metrics. At the same time, to manage the clusters effectively and discover hidden risks in advance, we introduced inspection strategies, pushing the volume of metric data into the millions.

Besides etcd metrics, Top-Prometheus also collects metrics from other supporting services, so Top-Prometheus frequently OOMed and the monitoring service became unavailable.

Scalable Prometheus architecture

How to solve the above pain points?

The cloud native Prometheus product launched by the TKE team was created precisely to solve the pain points of large-scale data scenarios. To address the pain points above, keep the foundation of standardized data operations stable, and provide a highly available monitoring service, we decided to migrate the etcd monitoring platform to the TKE cloud native Prometheus monitoring system.

TKE cloud native Prometheus introduces a file-sync service to hot-reload configuration files, which avoids restarting Prometheus on every change and solved the pain point in our core scenario.

At the same time, TKE cloud native Prometheus uses Kvass to shard monitoring data elastically, effectively spreading the large data volume and achieving stable collection of tens of millions of series.

Most importantly, the Kvass project is open source. Its architecture diagram is shown below; for more details see the article "How to use Prometheus to monitor a 100,000-container Kubernetes cluster" and the GitHub source code.

Cloud Native Extensible Prometheus Architecture

The image above shows our scalable TKE cloud native Prometheus architecture. Below is a brief introduction to each component.

Introduction of centralized Thanos

Thanos mainly consists of two services: thanos-query and thanos-rule. thanos-query queries monitoring data, while thanos-rule aggregates monitoring data and evaluates alerts against it.

thanos-query: by configuring store endpoints, thanos-query can query multiple Prometheus instances and aggregate data from TKE cloud native Prometheus or the original Prometheus. It also provides a unified data source for upper-level dashboards and alerting, acting as the single converged entry point for data queries.

thanos-rule: thanos-rule relies on the data exposed by thanos-query, aggregates it, and evaluates alerts according to the configured alert rules. Converging the alerting capability and centralizing alert configuration keep the alert link stable no matter how the underlying Prometheus services change.
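A minimal sketch of how the thanos-query container might be pointed at multiple underlying StoreAPI endpoints (the image version and addresses are hypothetical; in practice the store list can also come from service discovery):

containers:
- name: thanos-query
  image: quay.io/thanos/thanos:v0.31.0          # illustrative version
  args:
  - query
  - --http-address=0.0.0.0:9090
  - --grpc-address=0.0.0.0:10901
  - --store=tke-prometheus-gz.example:10901     # hypothetical TKE cloud native Prometheus endpoint
  - --store=legacy-prometheus-gz.example:10901  # hypothetical original Prometheus (via Thanos sidecar)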

Smooth migration

TKE cloud native Prometheus is fully compatible with the open source Prometheus-Operator solution, so during migration all the original Prometheus-Operator configuration could be kept; we only needed to add the corresponding labels so TKE cloud native Prometheus could identify the resources. However, because metrics are now exposed from outside the cluster rather than inside it, services that depend on the monitoring metrics, both inside and outside the cluster, are affected.

External exposure: by introducing a centralized thanos-query, the metrics of each region are exposed through thanos-query. With this centralized query layer on top, the underlying TKE cloud native Prometheus instances can be migrated or scaled out without external consumers of the metrics, such as dashboards and alerting, noticing anything.

Internal dependencies: the custom-metrics service inside the cluster depends on monitoring metrics. With TKE cloud native Prometheus, metrics can no longer be collected through an in-cluster Service. We therefore created a private-network load balancer in the cluster where cloud native Prometheus runs, so that it can be reached from the supported environment, and pointed custom-metrics at this private-network LB to keep collecting monitoring metrics.

TKE cloud native Prometheus results

Monitoring availability: TKE cloud native Prometheus measures the availability of its own monitoring service through metrics it exposes, such as prometheus_tsdb_head_series and up. prometheus_tsdb_head_series measures the total volume of monitoring data being collected, and the up metric reflects whether each collection task is healthy. Together, these two metrics give an overall picture of monitoring service availability.
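These two indicators can be watched with simple expressions, for example as recording rules (a minimal sketch; the job selector and rule names are illustrative):

groups:
- name: etcd-monitoring-availability          # illustrative rule group
  rules:
  - record: job:etcd_targets_up:ratio         # share of healthy etcd scrape targets
    expr: avg(up{job=~"etcd.*"})
  - record: job:etcd_head_series:sum          # overall volume of collected series
    expr: sum(prometheus_tsdb_head_series)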

Data collection success rate: as the business side, we care more about the collection success rate of our specific business metrics. To measure availability effectively, we sample and persist the business metrics, collect data before and after the migration at a 15-second interval, and compare against the theoretical data volume to compute the data drop rate, which reflects the availability of the monitoring service. According to our statistics for the past 30 days:

After introducing TKE cloud native Prometheus, the total volume of monitoring data reached tens of millions of series, the monitoring and alerting link stayed stable, and inspection data coverage exceeded 70%. A short-term fluctuation in the success rate was caused by the refactoring of the etcd service platform; apart from that, the collection success rate of monitoring metrics stayed above 99.99%, and has remained at 100% for the past 7 days, keeping the monitoring service highly available.

How to efficiently manage etcd clusters and find hidden dangers in advance?

Next is the second question: how do we efficiently manage more than 10,000 clusters and proactively discover faults and potential risks?

The first part solved metrics collection for large-scale etcd clusters. Metrics allow us to spot some risks, but they are not enough to meet our need for efficient governance of etcd clusters.

Why is that?

Pain points

When using etcd at scale, we may run into a variety of risks and problems, for example:

Data inconsistency in an etcd cluster caused by process restarts, node restarts, etc.

A business writing large key-values, causing a sharp drop in etcd performance.

A business abnormally writing a very large number of keys, threatening etcd stability.

Abnormal write QPS on a few business keys, triggering rate limiting and other errors in the etcd cluster.

Having to manually check cluster health from multiple dimensions after restarting or upgrading etcd.

An operational mistake during etcd cluster changes potentially splitting the cluster.

...

To govern etcd clusters effectively, we turned these potential risks into automated check items, for example:

How to efficiently monitor etcd data inconsistency?

How to find large key-values in time?

How to detect abnormal growth in the number of keys in time through monitoring?

How to detect abnormal write QPS in time?

How to automate multi-dimensional health checks of a cluster so that changes can be made with more confidence?

...

How do we feed these etcd best practices back into the governance of large-scale etcd clusters in production?

The answer is inspection.

We built a multi-dimensional, extensible inspection system based on the Kubernetes extension mechanism and our etcd knowledge and experience, which helps us efficiently manage more than 10,000 clusters and proactively discover potential risks.

Why do we build the etcd platform based on the Kubernetes extension mechanism?

Introduction to the etcd cloud native platform

To solve the series of pain points in our business, our design goals for the etcd cloud native platform are as follows:

Observability. Cluster creation and migration processes are visualized, current progress can be viewed at any time, and pause, rollback, grayscale, batching, and so on are supported.

High development efficiency. Fully reuse the community's existing infrastructure components and platforms, focus on the business, and iterate rapidly.

High availability. No component is a single point of failure and every component can scale horizontally; the migration module claims tasks through distributed locks and can migrate clusters concurrently.

Scalability. Migration objects, migration algorithms, cluster management, scheduling policies, and inspection policies are abstracted and pluggable, so the platform can support multiple Kubernetes cluster types, multiple migration algorithms, multiple cluster runtimes (CVM, containers, etc.), multiple migration policies, multiple Kubernetes versions, and multiple inspection policies.

Looking back at these design goals, observability and high development efficiency map particularly well onto Kubernetes and its declarative programming model, as detailed below.

Observability. Migration progress is reported in real time through Events, and all kinds of tasks can be viewed, started, and paused through kubectl or the visual container console.

High development efficiency. The REST API design in Kubernetes is elegant: once the custom API is defined, the SDK is generated automatically, which greatly reduces development work and lets us focus on the business domain. In addition, the monitoring and backup modules can be built on existing Kubernetes community components, covering our needs and solving the pain points.

Kubernetes is a highly scalable and configurable distributed system with rich extension patterns and extension points in every module. After choosing the Kubernetes-based programming model, we abstracted etcd clusters, migration tasks, monitoring tasks, backup tasks, migration policies, and so on into Kubernetes custom resources and implemented the corresponding controllers.

The following is an architecture diagram of the etcd cloud native platform.

Taking the creation and allocation of an etcd cluster as an example, here is a brief introduction to how the etcd platform works:

Creating an etcd cluster through kubectl or the visual web console is essentially submitting an EtcdCluster custom resource (a sketch of such a resource appears after this list).

etcd-apiserver writes the CRD object to an independent etcd store. When the etcd-lifecycle operator observes the new cluster, it creates the etcd cluster either on CVM or in containers according to the backend Provider declared in the EtcdCluster.

After the cluster is created, the etcd-lifecycle operator also attaches a set of backup policies, monitoring policies, and inspection policies, which are themselves just a set of CRD resources.

When a business needs an etcd cluster, the scheduling service filters out a set of candidate clusters that satisfy the business conditions. How do we then return the best etcd cluster to the user? We support several evaluation strategies: for example, with the minimum-connection strategy, the scheduler queries each candidate cluster's connection count from Prometheus through the Kubernetes API and preferentially returns the cluster with the fewest connections, so a freshly created cluster is allocated first.
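A minimal sketch of what such an EtcdCluster custom resource might look like (the spec fields are illustrative guesses, not the platform's actual schema; only the API group is taken from the EtcdMonitor example later in the article):

apiVersion: etcd.cloud.tencent.com/v1beta1   # group from the EtcdMonitor example; version assumed
kind: EtcdCluster
metadata:
  name: gz-qcloud-etcd-03
  namespace: gz
spec:
  provider: container        # hypothetical backend Provider: "container" or "cvm"
  replicas: 3
  version: 3.4.13            # illustrative etcd version
  resources:
    cpu: 4
    memory: 8Gi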

An etcd inspection case study

How do we add a rule to the inspection system?

An inspection rule corresponds to a CRD resource, as in the following YAML file, which adds a data-difference (node key diff) inspection policy to the cluster gz-qcloud-etcd-03.

apiVersion: etcd.cloud.tencent.com/v1beta1
kind: EtcdMonitor
metadata:
  creationTimestamp: "2020-06-15T12:19:30Z"
  generation: 1
  labels:
    clusterName: gz-qcloud-etcd-03
    region: gz
    source: etcd-life-cycle-operator
  name: gz-qcloud-etcd-03-etcd-node-key-diff
  namespace: gz
spec:
  clusterId: gz-qcloud-etcd-03
  metricName: etcd-node-key-diff
  metricProviderName: cruiser
  name: gz-qcloud-etcd-03
  productName: tke
  region: gz
status:
  records:
  - endTime: "2021-02-25T11:22:26Z"
    message: collectEtcdNodeKeyDiff,etcd cluster gz-qcloud-etcd-03,total key num is 122143,nodeKeyDiff is 0
    startTime: "2021-02-25T12:39:28Z"
  updatedAt: "2021-02-25T12:39:28Z"

After this YAML is created, the inspection service executes the inspection policy and exposes the relevant metrics for the Prometheus service to collect. The final effect is shown below.

How to quickly perceive anomalies and achieve rapid disposal and even self-healing?

With the stable TKE cloud native Prometheus monitoring link and comprehensive inspection capability in place, the etcd platform can already provide all kinds of monitoring metrics related to etcd cluster availability. However, with so many clusters, so many metrics, so many user scenarios, and a complex deployment environment, it is still hard to quickly locate the cause of an anomaly and achieve rapid handling and recovery.

To improve our ability to detect anomalies and achieve rapid handling and self-healing, we mainly faced the following problems.

Faced with etcd clusters of various specifications and complicated business application scenarios, how do we standardize monitoring and alerting?

etcd's business scenarios differ from its operational ones. Driven by operational requirements, we standardized how etcd clusters are onboarded and provided standardized monitoring metrics for operations, then landed standardized alerts according to the standardized business and etcd specifications, making monitoring and alerting operations standardized.

Faced with a huge number of metrics, how do we effectively converge them, quickly measure etcd cluster availability, and detect anomalies?

When etcd availability degrades, the associated monitoring signals often differ from case to case, and no single metric measures availability. We therefore introduced SLOs to effectively reflect etcd service availability and built a multi-dimensional monitoring system around them, enabling rapid anomaly detection and problem location and, in turn, rapid recovery.

Below we address these problems one by one and build an efficient data operation system to detect anomalies quickly.

Access standardization

etcd operation and maintenance information captured in a CRD: the continuous operation and maintenance of etcd is configured through CRDs, fully following the Kubernetes specification. Basic etcd information is defined in the Spec, and business information is extended through Annotations; a single CRD carries all the information needed to operate etcd.

Cloud native data collection: open source Prometheus configures collection tasks via static_config, while TKE cloud native Prometheus makes full use of the ServiceMonitor resource from the open source Prometheus-Operator; configuring a few filter labels is enough to onboard a component's metrics automatically. etcd itself, as a data store, usually runs outside the operation and management cluster, so to collect etcd's own metrics we use a Kubernetes Service without selectors and directly configure the Endpoints with the addresses of the etcd nodes.
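A minimal sketch of that pattern, assuming three etcd members at illustrative addresses (the names, namespace, and port are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: gz-qcloud-etcd-03        # hypothetical per-cluster Service with no selector
  namespace: gz
  labels:
    clusterName: gz-qcloud-etcd-03
spec:
  clusterIP: None                # headless; endpoints are managed manually below
  ports:
  - name: metrics
    port: 2379
    targetPort: 2379
---
apiVersion: v1
kind: Endpoints
metadata:
  name: gz-qcloud-etcd-03        # must match the Service name
  namespace: gz
subsets:
- addresses:
  - ip: 10.0.0.11                # illustrative etcd member IPs
  - ip: 10.0.0.12
  - ip: 10.0.0.13
  ports:
  - name: metrics
    port: 2379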

Standardization: through ServiceMonitor's relabel capability, etcd monitoring metrics carry three kinds of labels, product, scenario, and specification, which standardize the operational information. The product label reflects the product category the etcd serves; the scenario label is derived from how the etcd is used; and the specification label divides clusters into small, default, and large according to the etcd node specification and usage.
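A minimal sketch of injecting such labels with relabeling rules in a ServiceMonitor endpoint, assuming the values live as hypothetical labels on the discovered Service (the meta label names follow standard Prometheus Kubernetes service discovery):

endpoints:
- port: metrics
  interval: 15s
  relabelings:
  - sourceLabels: [__meta_kubernetes_service_label_product]   # e.g. "tke"
    targetLabel: product
  - sourceLabels: [__meta_kubernetes_service_label_scene]     # e.g. "business"
    targetLabel: scene
  - sourceLabels: [__meta_kubernetes_service_label_spec]      # e.g. "small" / "default" / "large"
    targetLabel: spec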

Unified alert standard: with standardization in place, alert rules no longer rely on lots of regular-expression matching. The thresholds of the corresponding alert metrics are determined by scenario and specification, and alert rules are built by combining them with the alert metric expressions. Thanks to this clean segmentation by scenario and specification, new alert rules can be added without touching existing ones. The scenario and specification labels are also carried into our in-house alert system, where they identify who should handle an alert, enabling targeted alert delivery and alert grading and improving alert accuracy.
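A minimal sketch of a rule keyed on those labels rather than on cluster-name regular expressions (the metric is a standard etcd metric; the threshold, labels, and rule names are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-standard-alerts          # illustrative
  namespace: monitoring
spec:
  groups:
  - name: etcd-latency
    rules:
    - alert: EtcdHighFsyncLatencySmallSpec
      expr: |
        histogram_quantile(0.99,
          sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{spec="small"}[5m])) by (clusterName, le)
        ) > 0.1
      for: 5m
      labels:
        severity: warning
        scene: business               # routed by scene/spec instead of cluster-name regex
      annotations:
        summary: "99th percentile WAL fsync latency above 100ms on a small-spec etcd cluster"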

This standardization process is not limited to cloud native components. For components running as binaries on machines, you can likewise collect their metrics through a self-built no-selector Service. Once a component has had its operational labels determined from information such as its usage scenario, the ServiceMonitor relabel capability quickly links it into TKE cloud native Prometheus, completing the monitoring and alerting link and joining the standardized data operation system.

Based on the standardization process above, operational support for production etcd follows the same path: as etcd goes to production, the ServiceMonitor relabel capability gives us "onboarding is operations" without any change to the monitoring layer:

Defined onboarding specification: business and specification operational labels are introduced, and the etcd usage scenario is reflected in the monitoring metrics through these labels, providing the data basis for the multi-level monitoring dashboards. Alert rule configuration and operations are also organized around these labels.

Direct fit of general alert rules: around the business and specification operational labels, combined with monitoring metrics and thresholds, general alert rules are generated directly, giving alerts along different dimensions.

Analysis views: based on the business scenario and the different monitoring metrics, the standardized monitoring views are applied directly to generate business-dimension etcd monitoring dashboards.

Introducing SLO to build the data operation system

How to abstract an SLO: an SLO, or service level objective, is mainly internal-facing and used to measure service quality. Before determining an SLO, you must first determine the SLIs (service level indicators). A service is user-facing, so an important measure is how users perceive it, with error rate and latency being the most obvious; at the same time, the service itself and the third-party services it depends on also determine service quality. For etcd, the SLI elements can therefore be determined as request error rate and latency, whether a Leader exists, and node disk IO. Since node disk IO is to some extent already reflected in the error rate and latency of read operations, the SLIs can be further layered into etcd availability and read/write availability. Combined with Prometheus's real-time computation capability, a preliminary formula for the etcd SLO can be determined.

Calculating the SLO: the SLO measures service quality, which is determined by user perception, service state, and the underlying services it depends on. The SLO is therefore composed of the latency of etcd's core RPC interfaces (Range/Txn/Put, etc.), disk IO, whether a Leader exists, and related inspection metrics.

Operating the SLO: analyzing the etcd service yields a preliminary SLO formula and concrete SLO metrics, but that is only a first implementation. The SLO must be continuously corrected against real anomalies to improve its accuracy. After a period of observation and revision, the SLO metrics became increasingly accurate and the operating model below gradually took shape: by linking the SLO with monitoring, alerting, and production incidents, we improve operational efficiency and proactive service. After a period of operation, SLO phone alerts surfaced problems promptly in several incidents, achieving proactive anomaly discovery.

Landing the SLO on TKE cloud native Prometheus: introducing recording rules

The key metrics for building the SLO, such as etcd availability and latency, are already collected through TKE cloud native Prometheus, and the SLO can be computed with Prometheus's query capability. However, because the SLO expression is large and the etcd volume is high, SLO computation had high latency and many breakpoints.

Recording rules are Prometheus's pre-computation mechanism: an expression is evaluated ahead of time and its result is saved as a new time series. In this way, the complex SLO formula can be decomposed into smaller units, spreading the computation load and avoiding data breakpoints, and because the results are already stored, querying historical SLO data is very fast. In addition, Prometheus reloads recording rules when it receives a SIGHUP signal, so rule changes take effect in real time, which makes it easy to keep revising the formula and optimizing the SLO in practice.
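A minimal sketch of how such a decomposition might look (the intermediate rule names, threshold, and final composition are illustrative, not our actual formula; the etcd metric names are standard, though the gRPC latency histograms require etcd's extensive metrics mode):

groups:
- name: etcd-slo-intermediate
  interval: 30s
  rules:
  - record: cluster:etcd_rpc_error_rate:ratio_rate5m        # request error rate per cluster
    expr: |
      sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) by (clusterName)
      / sum(rate(grpc_server_handled_total[5m])) by (clusterName)
  - record: cluster:etcd_wal_fsync_p99:5m                   # disk IO health via WAL fsync latency
    expr: |
      histogram_quantile(0.99,
        sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (clusterName, le))
  - record: cluster:etcd_has_leader:min                     # 0 if any member has lost the leader
    expr: min(etcd_server_has_leader) by (clusterName)
- name: etcd-slo
  interval: 30s
  rules:
  - record: cluster:etcd_slo:score                          # illustrative composition of the SLO score
    expr: |
      (1 - clamp_max(cluster:etcd_rpc_error_rate:ratio_rate5m, 1))
      * cluster:etcd_has_leader:min
      * (cluster:etcd_wal_fsync_p99:5m < bool 0.1)

Splitting the formula this way means each intermediate series is computed and stored once per evaluation interval, and the final SLO expression only combines pre-aggregated series, which is what keeps query latency low and avoids breakpoints.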

Building the data value operation system

With the SLO in place, monitoring and alerting on the etcd platform are unified around the SLO as a single entry point. Given etcd's many usage scenarios and the difficulty of daily troubleshooting and problem analysis, we built SLO rapid-troubleshooting views and a multi-level SLO monitoring system around the SLO, as shown in the figure below.

Operational requirements

Overall picture: monitoring should give an overall view of etcd, such as capacity information, component stability, and service availability.

Scenario characteristics: different application scenarios have different priorities and different monitoring dimensions, so the monitoring dashboards must reflect the characteristics of each scenario.

Operational troubleshooting: when the underlying IaaS layer jitters, quickly identify the affected etcd clusters; when a failure occurs, quickly determine the blast radius and further confirm the cause through the alert views.

Multi-level monitoring

The etcd platform monitoring views are shown in the figure below and are divided into level-1, level-2, level-3, and troubleshooting views. Level 1 is the overall monitoring dashboard, level 2 is split by scenario, and level 3 is single-cluster monitoring, which is the key to pinning down specific problems. The troubleshooting views link etcd and Kubernetes for two-way lookups.

Level-1 monitoring view: the SLO, computed from multiple monitoring metrics, effectively measures etcd availability, converges the monitoring metrics, and provides a unified entry point. A multi-region monitoring dashboard built on the SLO gives an overall picture of etcd and quickly identifies the blast radius of a failure.

Level-2 monitoring view: organized by etcd application scenario, the level-2 views cover business, key customers, and other scenarios to meet the characteristics and needs of each. The business view reflects the overall availability of each region and whether each region has enough etcd resources; the key-customer view reflects capacity at the customer's scale and also accounts for what is exposed to the customer.

Level-3 monitoring view: the level-3 view is the single-cluster monitoring view. Through it you can identify the specific problem of an etcd cluster, providing the basis for fault recovery.

SLO troubleshooting view: etcd is the underlying store of Kubernetes, and troubleshooting often requires cross-checking etcd and Kubernetes in both directions. To improve troubleshooting efficiency, the SLO troubleshooting views consist of forward and reverse lookup views between etcd and Kubernetes clusters.

Operational effectiveness

The SLO monitoring system basically covers all operation scenarios and has played a key role many times in actual operations.

Underlying IaaS jitter: the level-1 monitoring quickly identifies the blast radius, and the scenario views further confirm the affected etcd clusters, so the impact can be determined quickly.

Problem location: after receiving an SLO alert, the level-3 monitoring can determine the cause of the alert and confirm the affected metrics, enabling rapid fault recovery. At the same time, the forward and reverse lookups between etcd and Kubernetes not only help identify etcd problems but are also a powerful tool for confirming Kubernetes problems.

Proactive service: the SLO monitoring dashboards have discovered etcd anomalies in advance many times, and proactively feeding them back to the teams owning the upper-level services has effectively nipped failures in the bud.

Self-healing: a failed etcd node affects etcd availability. With SLO monitoring and alerting, anomalies are detected quickly, and thanks to containerized deployment, the nodes of production etcd clusters run as Pods; when a node fails, the abnormal Pod is automatically removed and a new node is added, achieving fault self-healing without users noticing.

That is how we built an etcd monitoring platform for a Kubernetes environment with more than 10,000 clusters.
