This article explains how to use Prometheus to monitor a Kubernetes cluster with 100,000 containers. The content is straightforward and easy to follow.
Prometheus
Relying on its strong single-node performance, flexible PromQL and active community ecosystem, Prometheus has gradually become the core monitoring component of the cloud native era and is used by developers around the world to monitor their core business.
However, when facing large-scale monitoring targets (tens of millions of series), because native Prometheus only runs as a single node and provides no clustering capability, operators have to keep upgrading the machine specification to satisfy Prometheus's ever-growing memory demand.
Single machine performance bottleneck
We stress-tested a standalone Prometheus to determine the reasonable load of a single Prometheus shard. The stress test had two goals:
Determine the relationship between the number of targets and the Prometheus load
Determine the relationship between the number of series and the Prometheus load
Target correlation
We keep the total number of series fixed at 1 million and observe how the Prometheus load changes as the number of targets changes.
Stress test results:
| Number of targets | CPU (cores) | Memory (GB) |
| --- | --- | --- |
| 100 | 0.17 | 4.6 |
| 500 | 0.19 | 4.2 |
| 1000 | 0.16 | 3.9 |
| 5000 | 0.3 | 4.6 |
From the table we can see that the number of targets has no strong effect on the Prometheus load: with a 50-fold increase in the number of targets, CPU consumption rises only slightly and memory barely changes.
Series correlation
We keep the number of targets unchanged and observe how the Prometheus load changes as the total number of series changes.
Stress test results:
| Number of series (×10,000) | CPU (cores) | Memory (GB) | Query of 1000 series over 15m of data (s) |
| --- | --- | --- | --- |
| 100 | 0.191 | 3.15 | 0.2 |
| 300 | 0.939 | 20.14 | 1.6 |
| 500 | 2.026 | 30.57 | 1.5 |
From the table, the Prometheus load is strongly affected by the number of series: the more series, the greater the resource consumption.
When the number of series exceeds 3 million, Prometheus memory grows significantly and a machine with more memory is required to run it.
During the stress test we used a tool to generate the expected number of series. The label names and label values generated by the tool are short, fixed at about 10 characters; our purpose was only to observe the relative change in load. In real production, because labels are longer and the service discovery mechanism adds its own overhead, the load caused by the same number of series will be much higher than in the stress test.
Existing clustering schemes
Given the performance bottleneck of standalone Prometheus under large-scale monitoring, the community has proposed several sharding schemes, mainly the following.
Hash_mod
Prometheus officially supports hashing the scraped data through the relabel mechanism: each Prometheus instance keeps a different hash remainder in its configuration file, so the scraped targets are split across instances, and the data is then aggregated through federation, Thanos or similar means. This is illustrated in the figure below; readers can also refer directly to the [official documentation].
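For illustration, a minimal sketch of such a hashmod-based relabel configuration, assuming 3 Prometheus instances (the job name and shard count are placeholders):

```yaml
# Instance 0 of 3: keeps only targets whose address hashes to remainder 0.
# The other two instances use the same file with regex: "1" and "2".
scrape_configs:
  - job_name: node
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3               # total number of Prometheus shards
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"               # this instance's shard ID
        action: keep
```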
Configuration file segmentation
Another method is to split the scrape configuration at the job level according to the business: different Prometheus instances use completely independent scrape configurations, each containing a different set of jobs, as sketched below.
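A hypothetical sketch of this split: two configuration files, each handed to its own Prometheus instance (the job names are illustrative):

```yaml
# prometheus-0.yaml: this instance scrapes only the kubelet job
scrape_configs:
  - job_name: kubelet
    kubernetes_sd_configs:
      - role: node

# prometheus-1.yaml: this instance scrapes only the application pods job
scrape_configs:
  - job_name: app-pods
    kubernetes_sd_configs:
      - role: pod
```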
Problems with the above schemes
Whether hash_mod or configuration file segmentation is used, the essence is to split the data into multiple scrape configurations that are collected by different Prometheus instances. Both approaches have the following shortcomings.
* **Prior knowledge of the monitored data is required:** the prerequisite for the above methods is that users already understand the data the monitoring targets will report; for example, they must know which label the targets report before they can hash_mod on it, or must know the overall size of each job before splitting by job.
* **Instance load is unbalanced:** although the above schemes are expected to scatter data across different Prometheus instances, in practice hash_mod on certain label values, or simply splitting by job, does not guarantee that the number of series finally collected by each instance is balanced, so an instance may still use too much memory.
* **Configuration files are modified intrusively:** users must modify the original configuration file, adding relabel-related configuration or splitting one configuration file into several. Because the configuration is no longer a single file, modifying it becomes much more difficult.
* **No dynamic scaling:** since the configuration in the above schemes is tailored to the actual data scale of the monitoring targets, there is no unified scaling mechanism that can add Prometheus instances as the data scale grows. Users can of course write scaling tools for their own business, but such tools cannot be reused across different businesses.
* **Some APIs no longer work:** the above schemes scatter the data across different instances and then aggregate global monitoring data through federation or Thanos, but without extra processing some native Prometheus APIs cannot return correct results; the most typical is /api/v1/targets, whose global value cannot be obtained under these schemes.
The principles and design goals of Kvass
To solve the above problems, we want to design a non-intrusive clustering scheme that presents to the user a virtual Prometheus: its configuration file is identical to native Prometheus, its API is compatible, and it can be scaled up and down. Specifically, we have the following design goals.
* **Non-intrusive, single configuration file:** what the user sees is a native configuration file without any special modification.
* **No need to know the monitoring targets in advance:** users should no longer need to understand the scraped objects beforehand or take part in the sharding process.
* **Instance load as balanced as possible:** we want to distribute the collection tasks according to the actual load of the monitoring targets, so that the instances are as balanced as possible.
* **Dynamic scaling:** the system should scale up and down dynamically as the size of the scraped targets changes, keeping the data continuous with no gaps or loss.
* **Compatible with the core Prometheus APIs:** core APIs such as the /api/v1/targets API mentioned above should work normally.
Architecture
Kvass consists of several components. The figure below shows the Kvass architecture. We use Thanos in the architecture diagram; in fact Kvass does not strongly depend on Thanos, which can be replaced with another TSDB.
Kvass sidecar: receives the collection tasks sent by the Coordinator, generates a new configuration file for Prometheus, and maintains target load information.
Kvass coordinator: the central controller of the cluster, responsible for service discovery, load detection, target distribution, and so on.
Thanos components: only Thanos sidecar and Thanos query are used in the figure, to aggregate the sharded data into a unified data view.
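For reference, a minimal sketch of a Thanos sidecar container running next to each Prometheus shard (the image tag, mount paths and object storage settings are assumptions, not copied from the Kvass example manifests):

```yaml
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.28.0        # assumed version
  args:
    - sidecar
    - --tsdb.path=/prometheus                 # same data directory as Prometheus
    - --prometheus.url=http://127.0.0.1:9090
    # - --objstore.config-file=/etc/thanos/objstore.yaml  # optional: upload blocks to object storage
  volumeMounts:
    - name: data
      mountPath: /prometheus
```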
Coordinator
The Kvass coordinator performs service discovery for the collection targets in place of Prometheus and obtains, in real time, the list of targets that need to be scraped.
For these targets, the Kvass coordinator is responsible for load detection, that is, estimating the number of series of each target. Once a target's load has been detected successfully, the Kvass coordinator allocates the target, in the next computation cycle, to a shard whose load is below the threshold.
The Kvass coordinator is also responsible for scaling the shard cluster out and in.
Service discovery
The Kvass coordinator reuses the service discovery code of native Prometheus to provide Prometheus-compatible service discovery. For the targets to be scraped that service discovery returns, the Coordinator applies the relabel_configs from the configuration file and obtains the processed targets and their label sets. The targets obtained after service discovery are then sent to the load detection module.
Load detection
The load detection module obtains the processed targets from the service discovery module, scrapes each target using the scrape settings in the configuration file (proxy, certificates, etc.), and then parses the scrape result to compute the series scale of the target.
The load detection module does not store any scraped metric data; it only records the load of each target. Load detection probes a target only once and does not track the target's subsequent load changes. The load information of long-running targets is maintained by the Sidecar, which we introduce in a later section.
Target allocation and scale-out
In the section on Prometheus performance bottlenecks we saw that Prometheus memory is related to series; more precisely, it is directly related to the head series. Prometheus caches in memory the series information of recently scraped data (the last 2 hours by default). If we can control the number of head series in each shard, we can effectively control the memory usage of each shard, and controlling head series really means controlling the list of targets currently scraped by the shard.
Based on this idea, the Kvass coordinator periodically manages the target list of each shard: assigning new targets and removing invalid ones.
In each cycle, the Coordinator first obtains the current running state from every shard, including the number of series currently in the shard's memory and the list of targets it is scraping. It then processes the global target information obtained from the service discovery module as follows:
If a target is already being scraped by a shard, it continues to be assigned to that shard, and the shard's series count is left unchanged.
If a target is not being scraped by any shard, its series count is obtained from the load detection module (if detection has not finished, the target is skipped until the next cycle); a shard whose in-memory series count would still be below the threshold after adding this target is then selected, and the target is assigned to it.
If none of the current shards can accommodate all the targets waiting to be allocated, the cluster is scaled out, and the number of new shards is proportional to the total amount of global series.
Target migration and scale-in
While the system is running, targets may be deleted. If a target has been deleted for more than 2 hours, its head series disappear from the shard, leaving part of the shard's capacity idle. Because targets are spread across different shards, deleting a large number of targets leaves many shards with a very low memory footprint; in that case the resource utilization of the system is very low and we need to scale it in.
When this happens, the Coordinator migrates targets: targets in shards with higher ordinals (shards are numbered from 0) are moved to shards with lower ordinals, so the load of the low-ordinal shards grows while the high-ordinal shards become completely idle. If the storage layer is Thanos with data uploaded to object storage (COS), an idle shard is deleted after 2 hours (which guarantees that its data has already been uploaded to COS).
Multiple replicas
Currently, Kvass shards can only be deployed as StatefulSets.
The Coordinator obtains all shard StatefulSets through a label selector. Each StatefulSet is treated as one replica; Pods with the same ordinal across StatefulSets are considered to belong to the same shard group, and Pods in the same shard group are assigned the same targets and are expected to carry the same load.
/api/v1/targets interface
As mentioned above, the Coordinator performs service discovery according to the configuration file and obtains the target list, so the Coordinator can in fact assemble the result set that the /api/v1/targets API needs to return. However, because the Coordinator only does service discovery and does not actually scrape, it cannot directly know a target's collection status (health, last scrape time, etc.).
When the Coordinator receives a /api/v1/targets request, it asks the Sidecars (for targets that have been allocated) or the load detection module (for targets not yet allocated) for the collection status of the targets found by service discovery, merges the answers, and returns a correct /api/v1/targets result.
Sidecar
The previous sections introduced the basic functions of the Kvass coordinator. For the system to work, it needs the cooperation of the Kvass sidecar. Its core idea is to replace every service discovery mechanism in the configuration file with static_configs and write the already relabeled target information directly into the configuration, thereby removing service discovery and relabel behavior from the shard so that it scrapes only its own portion of the targets.
Each shard has one Kvass sidecar. Its core functions include receiving from the Kvass coordinator the list of targets this shard is responsible for and generating a new configuration file for the shard's Prometheus. In addition, the Kvass sidecar hijacks scrape requests in order to maintain up-to-date target load information, and it acts as a gateway in front of the Prometheus APIs, correcting some of the responses.
Configuration file generation
After service discovery, relabeling and load detection, the Coordinator assigns a target to a shard and sends the target information to the Sidecar, including:
The address of the target
The estimated series count of the target
The hash value of the target
The label set after relabel processing
Based on the target information obtained from the Coordinator, combined with the original configuration file, the Sidecar generates a new configuration file for Prometheus to use, with the following changes:
* Every service discovery mechanism is changed to static_configs mode and the target list is written in directly, each target carrying its relabeled label values.
* Since the targets have already been relabeled, the relabel_configs entry is removed from the job configuration, but metric_relabel_configs is kept.
* The scheme field in every target's labels is replaced with http, and the original scheme is added to the label set as a request parameter.
* The target's job_name is added to the label set as a request parameter.
* A proxy_url is injected so that all scrape requests are proxied to the Sidecar.
Let's look at an example. Suppose the original configuration contains a kubelet scrape job:
global:
  evaluation_interval: 30s
  scrape_interval: 15s
scrape_configs:
- job_name: kubelet
  honor_timestamps: true
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - role: node
  bearer_token: xxx
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
After injection, the newly generated configuration file looks like this:
global:
  evaluation_interval: 30s
  scrape_interval: 15s
scrape_configs:
- job_name: kubelet
  honor_timestamps: true
  metrics_path: /metrics
  scheme: https
  proxy_url: http://127.0.0.1:8008          # all scrape requests are proxied to the Sidecar
  static_configs:
  - targets:
    - 111.111.111.111:10250
    labels:
      __address__: 111.111.111.111:10250
      __metrics_path__: /metrics
      __param__hash: "15696628886240206341"
      __param__jobName: kubelet
      __param__scheme: https               # save the original scheme
      __scheme__: http                     # set the new scheme; all scrape requests proxied to the Sidecar are http requests
      # the following is the label set after relabel_configs processing
      beta_kubernetes_io_arch: amd64
      beta_kubernetes_io_instance_type: QCLOUD
      beta_kubernetes_io_os: linux
      cloud_tencent_com_auto_scaling_group_id: asg-b4pwdxq5
      cloud_tencent_com_node_instance_id: ins-q0toknxf
      failure_domain_beta_kubernetes_io_region: sh
      failure_domain_beta_kubernetes_io_zone: "200003"
      instance: 172.18.1.106
      job: kubelet
      kubernetes_io_arch: amd64
      kubernetes_io_hostname: 172.18.1.106
      kubernetes_io_os: linux
The newly generated configuration file above is the one Prometheus actually uses. The Sidecar generates it from the target list issued by the Coordinator, which makes Prometheus scrape selectively.
Scrape hijacking
In the configuration generation above, we inject a proxy_url into each job and set scheme to http in each target's labels, so all Prometheus scrape requests are proxied to the Sidecar. This is needed because the Sidecar must keep track of the latest series size of each target, which is reported to the Coordinator as a reference for target migration.
As seen in the configuration generation above, several additional request parameters are also sent to the Sidecar:
hash: the hash value of the target, which the Sidecar uses to identify which target a scrape request belongs to. The hash value is computed by the Coordinator from the target's label set and passed to the Sidecar.
jobName: which job the scrape request belongs to; the Sidecar uses it to issue the real request to the scraped target according to that job's request settings in the original configuration file (such as the original proxy_url, certificates, etc.).
scheme: the final protocol value the target obtained through relabeling. Although the job configuration has a scheme field, Prometheus also supports specifying the request protocol of an individual target via relabeling. When generating the new configuration we save the real scheme in this parameter and then set every scheme to http.
With the above parameters, the Sidecar can issue a correct request to the scraped target and obtain the monitoring data. After counting the series in the target's scrape result, the Sidecar copies the monitoring data to Prometheus.
API proxy
Because of the Sidecar, some API requests sent to Prometheus need special handling, including:
/-/reload: since the configuration file Prometheus actually uses is generated by the Sidecar, the Sidecar handles this interface and, on success, calls Prometheus's /-/reload API.
/api/v1/status/config: this interface is handled by the Sidecar and returns the original configuration file.
Other interfaces are forwarded directly to Prometheus.
Global data view
Because the collection targets are spread across different shards, each shard's data is only part of the global data, so we need additional components to aggregate all the data and deduplicate it (in the multi-replica case) to obtain a global data view.
Take Thanos as an example
Thanos is a very good solution: by adding the Thanos components, you can easily obtain a global data view of the Kvass cluster. Of course, we can also use other TSDBs, such as InfluxDB or M3, by adding a remote write configuration.
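For example, a minimal remote write sketch (the endpoint hostnames are placeholders for whatever InfluxDB or M3 coordinator you actually run):

```yaml
remote_write:
  # InfluxDB 1.x remote write endpoint
  - url: "http://influxdb.monitoring:8086/api/v1/prom/write?db=prometheus"
  # Or, for an M3 coordinator:
  # - url: "http://m3coordinator.monitoring:7201/api/v1/prom/remote/write"
```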
Usage example
In this section we take a concrete look at the effect of Kvass through a deployment example. The relevant yaml files can be found at https://github.com/tkestack/kvass/tree/master/examples; readers can clone the project locally and enter the examples directory.
git clone https://github.com/tkestack/kvass.git
cd kvass/examples
Deploy the metrics data generator
We provide a metrics data generator that can generate a specified number of series. In this example, we deploy 6 replicas of the metrics generator, each generating 10045 series (45 of which are golang metrics).
kubectl create -f metrics.yaml
Deploy Kvass
Now we deploy a Kvass-based Prometheus cluster to collect the metrics of these 6 metrics generators.
First, deploy the RBAC-related configuration:
kubectl create -f kvass-rbac.yaml
Then deploy a Prometheus configuration file; this is our original configuration, in which we use kubernetes_sd for service discovery.
kubectl create -f config.yaml
The configuration is as follows:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: custom
scrape_configs:
- job_name: 'metrics-test'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
    regex: metrics
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    action: replace
    regex: (.*)
    replacement: ${1}:9091
    target_label: __address__
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
Now let's deploy the Kvass coordinator:
kubectl create -f coordinator.yaml
In the Coordinator's startup parameters, we set the maximum number of head series per shard to no more than 30000:
--shard.max-series=30000
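As a rough sketch, this flag sits in the coordinator container's args in coordinator.yaml (the image reference and surrounding fields here are illustrative, not copied from the example manifest):

```yaml
containers:
  - name: kvass-coordinator
    image: tkestack/kvass:latest       # illustrative image reference, see the example manifest
    args:
      - --shard.max-series=30000       # cap on head series per shard
```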
We can now deploy the Prometheus with the Kvass sidecar; here we deploy only a single replica:
kubectl create -f prometheus-rep-0.yaml
Deploy thanos-query
To obtain the global data, we need to deploy a thanos-query:
kubectl create -f thanos-query.yaml
View the results
According to the calculation above, the monitoring targets total 6 targets and 60270 series. With our setting that each shard must not exceed 30000 series, 3 shards are expected.
We find that the Coordinator has successfully changed the replica count of the StatefulSet to 3.
Looking at the number of series in a single shard, we find that it scrapes only 2 targets.
We then query the global data through thanos-query and find that the data is complete (metrics0 is the name of the metric produced by the metrics generator).
Cloud native monitoring
The Tencent Cloud container team has further optimized the design ideas of Kvass to build a high-performance cloud native monitoring service that supports multi-cluster monitoring. The product has officially entered public beta.
Large cluster monitoring
In this section we use the cloud native monitoring service directly to monitor a real, large cluster and test Kvass's ability to monitor large clusters.
Cluster scale
The scale of the associated cluster is roughly as follows:
1060 nodes
64000+ Pods
96000+ containers
Collection configuration
We directly use the collection configuration that the cloud native monitoring service adds to the associated cluster by default, which currently covers the mainstream monitoring metrics of the community:
Kube-state-metrics
node-exporter
Kubelet
Cadvisor
Kube-apiserver
Kube-scheduler
kube-controller-manager
Test results
In total, 3400+ targets and 27+ million series.
The cluster scaled out to 17 shards in total.
The number of series in each shard is stable below 2 million.
Each shard consumes about 6-10 GB of memory.
The default Grafana dashboards provided by cloud native monitoring also load normally.
The targets list can also be retrieved normally.
Multi-cluster monitoring
It is worth mentioning that the cloud native monitoring service not only supports monitoring a single large-scale cluster, but can also monitor multiple clusters with the same instance, and supports collection and alerting template features: a collection or alerting template can be distributed with one click to the clusters of every region, completely eliminating the need to add the same configuration to each cluster repeatedly.
Thank you for reading. That covers "how to use Prometheus to monitor a 100,000-container Kubernetes cluster". After studying this article you should have a deeper understanding of the topic; the specific usage still needs to be verified in practice.