What are the pitfalls in practicing a high-availability Prometheus architecture?


This article walks through the pitfalls we ran into while building a high-availability Prometheus architecture in practice, and how we dealt with them.

Several principles

Monitoring is infrastructure. Its purpose is to solve problems, not to collect everything; unnecessary metric collection in particular wastes human effort and storage (To B commercial products excepted).

Only send alerts that need to be handled, and every alert that is sent must be handled.

A simple architecture is the best architecture: the business system may go down, but monitoring must not. Google SRE also advises avoiding "magic" systems such as machine-learned alert thresholds and automatic remediation. Opinions differ on this point; plenty of companies are indeed working on intelligent AIOps.

Limitations of Prometheus

Prometheus is metric-based monitoring; it is not suited to logs, events, or tracing.

Prometheus uses a pull model by default. Plan your network accordingly and try to avoid relaying or forwarding.

For clustering and horizontal scaling, neither the official project nor the community offers a silver bullet; you have to choose sensibly among Federation, Cortex, Thanos, and other schemes.

In general, availability matters more than consistency for a monitoring system: losing some replica data is tolerable as long as query requests succeed. This comes up again later when Thanos deduplication is discussed.

Prometheus does not necessarily guarantee data accuracy. First, functions such as rate and histogram_quantile perform statistical estimation and extrapolation, which can produce counterintuitive results, expanded on in detail later. Second, over long query ranges downsampling inevitably loses precision; that is a characteristic of time-series data and another difference from log systems.

Exporters commonly used in a Kubernetes cluster

Prometheus is a CNCF project with a complete open source ecosystem. Unlike a traditional all-in-one agent such as Zabbix, it offers a wealth of Exporters to meet your needs; official and unofficial Exporters are listed here [2]. If none of them fits, you can also write your own, which is simple, convenient, free, and open, and that is an advantage.

But that much openness brings selection and trial-and-error costs. What used to take a few lines of Zabbix Agent configuration now needs many Exporters, and every one of them has to be maintained and monitored; upgrading Exporter versions is especially painful, and unofficial Exporters carry plenty of bugs. This is a weakness in practice, although it also follows from Prometheus's design philosophy.

All components in the Kubernetes ecosystem expose a /metrics endpoint for self-monitoring. Here is the list of what we use:

cAdvisor: integrated into the Kubelet.

Kubelet: port 10255 is the unauthenticated port, 10250 the authenticated one.

Apiserver: port 6443; watch request counts, latency, and so on.

Scheduler: port 10251.

Controller-manager: port 10252.

Etcd: metrics such as write/read latency and storage capacity.

Docker: requires enabling experimental features and configuring metrics-addr; exposes metrics such as container creation time.

Kube-proxy: listens on 127.0.0.1 by default, port 10249. To scrape it externally, change it to listen on 0.0.0.0; it exposes metrics such as the time spent writing iptables rules.

Kube-state-metrics: official Kubernetes project that collects metadata about Pods, Deployments, and other resources.

Node-exporter: official Prometheus project; collects machine metrics such as CPU, memory, and disk.

Blackbox_exporter: official Prometheus project for network probing: DNS, ping, HTTP monitoring.

Process-exporter: collects process metrics.

NVIDIA Exporter: we run GPU workloads and need GPU monitoring data.

Node-problem-detector: NPD; strictly speaking not an Exporter, but it also watches machine health and taints nodes that report anomalies.

Application-layer Exporters: MySQL, Nginx, MQ, and so on, depending on business requirements.

There are also custom Exporters for various scenarios, such as log extraction, described later.

Kubernetes core component monitoring and Grafana panels

When operating a Kubernetes cluster, the state and performance of the core components deserve attention, for example the kubelet and apiserver. Based on the Exporter metrics listed above, you can draw charts like the following in Grafana:

For templates, refer to Grafana Dashboards for Kubernetes Administrators [3], and keep adjusting alert thresholds as you operate.

One thing worth mentioning: although Grafana supports template variables, which make multi-level drop-down selection easy, alert rules cannot be configured on templated panels; see the related issue [4].

This feature has been requested many times, but the latest version still does not support it. To quote a complaint from the issue:

It would be great to add templates support in alerts. Otherwise the feature looks useless a bit.

For the basic usage of Grafana, you can see this article [5].

An all-in-one collection module

In the Prometheus ecosystem, Exporters are independent of each other and each does its own job: Node-Exporter for machine resources, NVIDIA Exporter for GPUs, and so on. But the more Exporters you run, the heavier the operational burden, especially agent resource control and version upgrades. We tried to combine some of them; there are two approaches:

Have a main process launch the N Exporters as child processes, so you can still pick up updates and bug fixes from the community versions.

Use Telegraf, which supports many kinds of inputs, combining N collectors in one agent.

Also, Node-Exporter does not support process monitoring. You can add Process-Exporter, or use the Telegraf mentioned above with its procstat input to collect process metrics.

Choosing the golden signals sensibly

There are a lot of metrics being collected; which ones should we focus on? Google proposes the "four golden signals" in the SRE book: latency, traffic, errors, and saturation. In practice you can use the USE or RED method as a guide: USE for resources, RED for services.

USE method: Utilization, Saturation, Errors. For example, cAdvisor data.

RED method: Rate, Errors, Duration. For example, Apiserver performance metrics (a small sketch follows below).
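As a concrete illustration of RED, here is a minimal recording-rule sketch over apiserver metrics. It assumes your apiserver version exposes apiserver_request_total with verb and code labels; the record names are our own convention, and the Duration signal would use histogram_quantile as discussed later in this article.

groups:
- name: apiserver-red
  rules:
  # Rate: requests per second, per verb
  - record: verb:apiserver_request:rate5m
    expr: sum by (verb) (rate(apiserver_request_total[5m]))
  # Errors: share of requests answered with a 5xx code
  - record: verb:apiserver_request_errors:ratio5m
    expr: sum by (verb) (rate(apiserver_request_total{code=~"5.."}[5m])) / sum by (verb) (rate(apiserver_request_total[5m]))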

Prometheus collection commonly covers three kinds of services:

Online services: e.g. web services and databases. Generally we care about request rate, latency, and error rate, i.e. the RED method.

Offline services: e.g. log processing and message queues. Generally we care about the number of items queued, the number in progress, the processing rate, and errors, i.e. the USE method.

Batch tasks: very similar to offline services, except that offline services run long-term while batch tasks run on a schedule, e.g. continuous integration, corresponding to Job or CronJob in Kubernetes. We generally care about elapsed time and error counts; because runs are short, they may finish before being scraped, so Pushgateway is generally used to push the results instead.

For practical examples of USE and RED, see "Container Monitoring in Practice: Analysis of Common Kubernetes Metrics" [6].

cAdvisor metric compatibility in Kubernetes 1.16

In Kubernetes 1.16, the cAdvisor metrics dropped the pod_name and container_name labels and replaced them with pod and container. If you used the old labels in queries or Grafana panels, you need to update the PromQL. Because we support multiple Kubernetes versions at the same time, we keep the original *_name labels via relabel configuration:

metric_relabel_configs:
- source_labels: [container]
  regex: (.+)
  target_label: container_name
  replacement: $1
  action: replace
- source_labels: [pod]
  regex: (.+)
  target_label: pod_name
  replacement: $1
  action: replace

Note that this uses metric_relabel_configs, not relabel_configs: the replacement is applied after scraping.

Prometheus scraping external Kubernetes clusters, and multi-cluster scraping

Deploying Prometheus inside a Kubernetes cluster is very convenient with the official YAML, but because of permissions and networking we needed to deploy it outside the cluster, running the binary, and scrape multiple Kubernetes clusters.

Running in-cluster as a Pod (in-cluster mode) needs no certificate, but outside the cluster you have to provide credentials such as a token and rewrite __address__ so that scraping goes through the Apiserver proxy. Taking cAdvisor as an example, the job is configured as:

- job_name: cluster-cadvisor
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: https://xx:6443
    role: node
    bearer_token_file: token/cluster.token
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: xx:6443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
  metric_relabel_configs:
  - source_labels: [container]
    separator: ;
    regex: (.+)
    target_label: container_name
    replacement: $1
    action: replace
  - source_labels: [pod]
    separator: ;
    regex: (.+)
    target_label: pod_name
    replacement: $1
    action: replace

bearer_token_file must be generated in advance; see the official documentation for that, and remember to base64-decode it.

For cAdvisor, __metrics_path__ is rewritten to /api/v1/nodes/${1}/proxy/metrics/cadvisor, which means the Apiserver proxies the request to the Kubelet.

If the network is reachable, you can also target the Kubelet's port 10255 directly, written as ${1}:10255/metrics/cadvisor, requesting the Kubelet itself. At larger scale this reduces the pressure on the Apiserver: service discovery still uses the Apiserver, but scraping no longer goes through it.

Because cAdvisor is exposed on a host port, the configuration is relatively simple. For a Deployment such as kube-state-metrics, exposed through an Endpoint, the job is written as follows:

- job_name: cluster-service-endpoints
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: https://xxx:6443
    role: endpoints
    bearer_token_file: token/cluster.token
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: xxx:6443
    action: replace
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_endpoints_name, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: (.+);(.+);(.*)
    target_label: __metrics_path__
    replacement: /api/v1/namespaces/${1}/services/${2}:${3}/proxy/metrics
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace

For the Endpoint type, __metrics_path__ must be rewritten to /api/v1/namespaces/${1}/services/${2}:${3}/proxy/metrics, substituting the namespace, service name, and port.

This only works for Exporters that expose /metrics; if yours uses another path you must replace it here, or, as we did, enforce a convention that everything uses this path.

__meta_kubernetes_service_annotation_prometheus_io_port comes from the annotations written when the Exporter is deployed. Most articles only mention prometheus.io/scrape: 'true', but you can also declare the port, path, and protocol in annotations to make rewriting during scraping easier.
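For example, the Service of an Exporter might carry annotations like the ones below. prometheus.io/scrape is the key most articles mention; the port/path/scheme keys are the naming convention implied by the relabel rules above, so adjust them to whatever annotation names your relabel_configs actually read:

apiVersion: v1
kind: Service
metadata:
  name: my-exporter            # hypothetical Exporter service
  namespace: monitoring
  annotations:
    prometheus.io/scrape: "true"    # read as __meta_kubernetes_service_annotation_prometheus_io_scrape
    prometheus.io/scheme: "http"
    prometheus.io/port: "9100"
    prometheus.io/path: "/metrics"  # the job above hard-codes /metrics; a path annotation only helps if you relabel on it
spec:
  selector:
    app: my-exporter
  ports:
  - name: metrics
    port: 9100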

Some of the other relabel rules, such as kubernetes_namespace, are there to preserve the original metadata and make filtering in PromQL queries easier.

For multiple clusters, the same configuration is simply repeated per cluster. Generally, a cluster is configured with these kinds of Jobs:

role: node, covering cAdvisor, node-exporter, the kubelet's summary, kube-proxy, Docker, and other metrics.

role: endpoints, covering kube-state-metrics and other custom Exporters.

Collecting GPU metrics

nvidia-smi shows the GPU resources on a machine, while cAdvisor actually exposes metrics for containers' GPU usage:

container_accelerator_duty_cycle
container_accelerator_memory_total_bytes
container_accelerator_memory_used_bytes

If you want more detailed GPU data, you can install dcgm-exporter, but it requires Kubernetes 1.13 or later.

Changing the display time zone of Prometheus

To avoid time zone confusion, all Prometheus components use Unix time internally and UTC for display. Setting a time zone in the configuration file is not supported, and the machine's /etc/timezone is not read.

In practice this restriction does not really hurt:

If you call the API and get timestamps back, you can convert them however you like.

If the built-in UI showing UTC rather than local time bothers you, the new Web UI introduced in version 2.16 has a Local Timezone option, as shown in the figure below.

If you still want to change the Prometheus code to suit your time zone, you can refer to this article [7].

For a discussion of timezone, you can see this issue [8].

How to collect metrics from RS behind an LB

If you have a load balancer (LB), and the Prometheus on your network can reach the LB itself but not the real servers (RS) behind it, how do you collect the metrics the RS expose?

Add a sidecar proxy to the RS services, or a native proxy component, so that Prometheus can reach them.

Configure the LB to forward, say, /backend1 and /backend2 to the two separate backends, and have Prometheus scrape through the LB.

Choosing a version

The latest Prometheus version is 2.16. Prometheus is still iterating quickly, so use the latest version where possible; the 1.x series no longer needs to be considered.

Version 2.16 ships an experimental UI for inspecting TSDB status, including the top-10 labels and metrics.

Prometheus's large-memory problem

As scale grows, Prometheus needs more CPU and memory, and memory usually hits the ceiling first. At that point you either add memory or shard the cluster to reduce the targets per instance. Here we first discuss the memory problem of a single Prometheus instance.

Causes:

Prometheus's memory consumption comes mainly from writing a block to disk every 2 hours; until the data is flushed, it is all held in memory, so memory is tied to the volume of collected data.

Loading historical data goes from disk into memory; the larger the query range, the more memory is used. There is some room for optimization here.

Some unreasonable queries also inflate memory, such as heavy group operations or large-range rate.

How much memory do my metrics need?

The author of Prometheus provides a calculator: enter the number of series, the scrape interval, and so on, and it computes the theoretical memory Prometheus needs; see the calculation formula [9].

Taking one of our Prometheus servers as an example: it retains only 2 hours of data locally, with 950,000 series, and its memory usage is roughly as follows:

Optimization options:

If the sample count exceeds about 2 million series, do not use a single instance; shard, then merge the data with VictoriaMetrics, Thanos, Trickster, or similar solutions.

TSDB status can be inspected in the UI from version 2.14 onward.

When querying, avoid very large queries, pay attention to the ratio of time range to step, and use group with caution.

If you need a correlated query, first consider whether the extra label can be added to the original data via relabeling. If one query can answer the question, why use join? A time-series database is not a relational database.

Prometheus memory usage analysis:

Via pprof: https://www.robustperception.io/optimising-prometheus-2-6-0-memory-usage-with-pprof

Memory usage in 1.x: https://www.robustperception.io/how-much-ram-does-my-prometheus-need-for-ingestion

Related issues:

https://groups.google.com/forum/#!searchin/prometheus-users/memory%7Csort:date/prometheus-users/q4oiVGU6Bxo/uifpXVw3CwAJ

https://github.com/prometheus/prometheus/issues/5723

https://github.com/prometheus/prometheus/issues/1881

Prometheus capacity planning

Besides the memory discussed above, capacity planning also covers disk storage, which depends on your Prometheus architecture.

With the Thanos scheme, the local disk can mostly be ignored (only 2 hours of data), and you size the object storage instead.

Prometheus compresses the data buffered in memory into a block on disk every 2 hours, including chunks, indexes, tombstones, and metadata, all of which take space. In general, each sample stored in Prometheus takes roughly 1-2 bytes (about 1.7 bytes). You can use PromQL to see how much space each sample takes on average:

rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h])
/
rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])

{instance="0.0.0.0:8890", job="prometheus"}   1.252747585939941

If you roughly estimate the size of the local disk, you can use the following formula:
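(The formula below is the one given in the official Prometheus storage documentation.)

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample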

With the retention time (retention_time_seconds) and sample size (bytes_per_sample) fixed, the only way to reduce local disk requirements is to reduce the number of samples ingested per second (ingested_samples_per_second).

To view the current number of samples ingested per second:

rate(prometheus_tsdb_head_samples_appended_total[1h])

There are two levers: reduce the number of time series, or increase the scrape interval. Since Prometheus compresses time series well, reducing the number of series is more effective.

For example:

With a 30s scrape interval, 1,000 machines, and 6,000 metric series per machine: 1000 × 6000 × 2 × 60 × 24 ≈ 20 billion samples per day, which is roughly 30 GB of disk.

Collect only the metrics you actually need, for example filtering with match[] (a federation sketch follows below), or audit which metrics are most commonly used and which perform worst.
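For reference, restricting a /federate pull with match[] looks roughly like this; a sketch in which the target address and the metric whitelist are placeholders:

- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{__name__=~"node_cpu_seconds_total|container_memory_working_set_bytes"}'
  static_configs:
  - targets:
    - 'prometheus-shard-a:9090'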

The disk estimate above does not include the WAL: the official Prometheus documentation states that at least 3 write-ahead log files are kept, each up to 128 MB (in practice you will see more).

A block is cut from the WAL every 2 hours, and (with Thanos) blocks are uploaded to object storage every 2 hours, so there is basically no pressure on the local disk.

For more on the Prometheus storage mechanism, see this article [10].

Impact on Apiserver performance

Of course, if a single Kubernetes cluster grows too large it is usually split, but you still need to watch the pressure you put on the Apiserver at all times.

When collecting the cAdvisor, Docker, and kube-proxy metrics, we initially proxied through the Apiserver to the corresponding node ports, which made uniform configuration convenient; later we switched to pulling the nodes directly, with the Apiserver used only for service discovery.

The computational logic of rate

The Counter type in Prometheus exists mainly for rate, i.e. computing a rate of change; a raw Counter count is not very meaningful, because once the Counter resets, the total loses meaning.

rate automatically handles Counter resets, and a Counter normally only ever grows. Take an Exporter that starts and then crashes: if it was incrementing at about 10 per second but ran for only half an hour, rate(x_total[1h]) returns a result of about 5 per second. In addition, any decrease in a Counter is treated as a reset; for example, a series with values [5, 10, 4, 6] is treated as [5, 10, 14, 16].

Rate values are rarely exact. Scrapes of different targets happen at different times, so jitter accumulates over time; query_range evaluation rarely lines up perfectly with scrape times, and scrapes can fail. Faced with these challenges, rate has to be designed to be robust.

rate does not try to capture every increment, because increments are sometimes lost, for example when an instance dies within a scrape interval. If a Counter changes only slowly, say a few times per hour, this can create illusions: for a Counter series currently at 100, you cannot tell whether those increments happened just now or whether the target has been running for years and is only now starting to report.

It is recommended that the range used for rate be at least four times the scrape interval. This guarantees that even if scraping is slow or one scrape fails, there are always two samples available. Such problems come up often in practice, so keeping this resilience matters. For example, with a 1-minute scrape interval you could use a 4-minute rate window, but it is usually rounded up to 5 minutes.
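A hedged sketch of that rule of thumb: assuming a 1-minute scrape_interval in prometheus.yml, the rate window is rounded up to 5 minutes (the metric and record names here are illustrative only, not from this article):

groups:
- name: example
  rules:
  # With a 1m scrape interval, a rate window of at least 4m (rounded to 5m)
  # always spans two samples even if one scrape fails.
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))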

If there are gaps in the range rate is computed over, it extrapolates based on the trend, for example:

For details, please take a look at this video [11].

Counterintuitive P95 statistics

histogram_quantile is a commonly used Prometheus function; for example, a service's P95 response time is often used to measure service quality. However, it can be hard to explain what it really means, especially to non-technical colleagues, and you will face plenty of "soul-searching questions".

When we say the P95 response latency is 100ms, we mean that of all collected response latencies, 5% of requests are above 100ms and 95% are below. The histogram_quantile function takes a decimal between 0 and 1; multiplying it by 100 gives the corresponding percentile, e.g. 0.95 corresponds to P95, and you can also go finer than whole percentiles, e.g. 0.9999.
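As an example, a typical P95 computed from a histogram looks like the rule below; http_request_duration_seconds_bucket is a hypothetical histogram exposed by your own service, not one from this article:

groups:
- name: latency
  rules:
  # P95 request latency per job, derived from histogram buckets over a 5m window.
  - record: job:http_request_duration_seconds:p95
    expr: histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))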

When you use histogram_quantile to draw response-time trend charts, you may be asked: why is the P95 greater (or less) than my average?

Just as the median can be larger or smaller than the average, it is entirely possible for P99 to be smaller than the average. In general P99 is almost always larger than the average, but if the distribution is extreme, the largest 1% can be so ridiculously large that it drags the average up. One possible example:

1, 1, ..., 1, 901   // 100 data points in total, average = 10, P99 = 1

Service X consists of two sequential steps, A and B. The P99 of X takes 100ms and the P99 of A takes 50ms; what is the P99 time of step B?

Intuitively, since X = A + B, the answer might be 50ms, or at least no more than 50ms. In fact B's P99 can exceed 50ms: as long as A and B do not happen to hit their largest 1% at the same time, B's P99 can be very large:

A = 1, 1, ..., 1, 50, 50        // 100 data points, P99 = 50
B = 1, 1, ..., 1, 99, 99        // 100 data points, P99 = 99
X = 2, 2, ..., 2, 51, 51, 100, 100   // 100 data points, P99 = 100

Conversely, if the largest 1% of A is close to 100ms, we can construct a B with a very small P99:

A = 50, 50, ..., 50, 99         // 100 data points, P99 = 50
B = 1, 1, ..., 1, 50            // 100 data points, P99 = 1
X = 51, 51, ..., 51, 100, 100   // 100 data points, P99 = 100

So the only thing the problem statement lets us conclude is that B's P99 cannot exceed 100ms; the fact that A's P99 is 50ms is actually of no use.

There are many similar questions, so histogram_quantile can produce counterintuitive results. The best remedy is to keep adjusting your bucket boundaries so that more request durations fall into finer-grained ranges, which keeps the statistics meaningful.

Slow queries

Prometheus provides PromQL as its query language. When debugging in the Graph view, it shows the query's response time; if that is too slow, pay attention: there may be something wrong with your usage.

To evaluate Prometheus's overall query response time, you can use this built-in metric:

prometheus_engine_query_duration_seconds{}

In general, slow responses are caused by misuse of PromQL, or by problems in metric design, for example:

Heavy use of join to combine metrics or attach labels, such as joining the meta labels from kube-state-metrics and node attribute labels from node-exporter onto cAdvisor container data, e.g. computing Pod memory utilization grouped by the node's machine type, or grouping by RSS.

Range queries where the step is tiny relative to a large time range, producing huge numbers of data points.

rate automatically handles counter resets; let PromQL do this work. Don't pull out all the raw data and compute the rate yourself in your program.

When using rate, the range duration must be greater than or equal to the step, otherwise some data will be missed.

Prometheus has basic forecasting functions: deriv and predict_linear (more accurate) can predict future trends from existing data.

If a query is complex and slow, you can use recording rules to precompute it, reducing the number of series touched at query time and making queries more efficient (a sketch follows below). But do not add recording rules for every metric; more than half of metrics are hardly ever queried. Also, do not put label values into the name of a recording rule.
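For instance, an aggregation that dashboards hit constantly can be precomputed. A minimal sketch using the cAdvisor metric container_cpu_usage_seconds_total; the record name is our own convention and, as noted above, contains no label values:

groups:
- name: precompute
  rules:
  # Precompute per-namespace CPU usage so dashboards don't rescan every container series.
  - record: namespace:container_cpu_usage_seconds:sum_rate5m
    expr: sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))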

The high-cardinality problem

High cardinality is an unavoidable topic for databases. In a DB like MySQL, cardinality is the number of unique values in a particular column or field; the lower the cardinality, the more repeated values in the column. For a time-series database, it is the number of distinct tag/label values.

For example, suppose Prometheus has a metric http_request_count{method="get", path="/abc", originIP="1.1.1.1"} recording visit counts, where method is the request method and originIP is the client IP. The possible values of method are limited, but originIP is unbounded, and its combinations with the other labels are unbounded too and carry no useful correlation. Such high-cardinality values are not suitable as metric labels; they belong in logs rather than in metric monitoring.

The time-series database indexes these labels to speed up queries, so that series matching all specified labels can be found quickly. If there are too many values, the index loses its value, and calculations such as P95 in particular end up scanning huge numbers of series.

The official documentation's advice on labels:

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

To view the current label distribution, you can use the TSDB tool that ships with Prometheus, either from the command line or, from version 2.16 onward, in the Prometheus Graph UI.

[work@xxx bin]$ ./tsdb analyze ../data/prometheus/
Block ID: 01E41588AJNGM31SPGHYA3XSXG
Duration: 2h0m0s
Series: 955372
Label names: 301
Postings (unique label pairs): 30757
Postings entries (total label pairs): 10842822
...

Top 10 high-cardinality metrics:

Highest cardinality metric names:
87176 apiserver_request_latencies_bucket
59968 apiserver_response_sizes_bucket
39862 apiserver_request_duration_seconds_bucket
37555 container_tasks_state
...

High-cardinality labels:

Highest cardinality labels:
4271 resource_version
3670 id
3414 name
1857 container_id
1824 __name__
1297 uid
1276 pod
...

Finding the largest metrics or jobs

Top 10 metrics by series count, grouped by metric name:

topk(10, count by (__name__)({__name__=~".+"}))

apiserver_request_latencies_bucket{}   62544
apiserver_response_sizes_bucket{}      44600

Top 10 jobs by series count, grouped by job name:

topk(10, count by (__name__, job)({__name__=~".+"}))

{job="master-scrape"}             525667
{job="xxx-kubernetes-cadvisor"}   50817
{job="yyy-kubernetes-cadvisor"}   44261

Slow restarts of Prometheus, and hot reloading

When Prometheus restarts, it has to load the WAL contents back into memory. The longer the retention and the larger the WAL file, the longer the restart takes. This is inherent to Prometheus's mechanism; there is no way around it. So if a reload is enough, do not restart: a restart inevitably means a short outage, which is exactly when Prometheus high availability matters most.

Startup time has also been optimized: version 2.6 improved the WAL load speed, with the goal that restarts take no more than 1 minute.

Prometheus supports hot reloading, but you must enable the web.enable-lifecycle option. After changing the configuration, call the reload API with curl. In prometheus-operator, configuration changes trigger a reload by default. If you do not use the Operator and want to reload when a ConfigMap changes, you can try this simple script:

#!/bin/sh
FILE=$1
URL=$2
HASH=$(md5sum $(readlink -f $FILE))
while true; do
  NEW_HASH=$(md5sum $(readlink -f $FILE))
  if [ "$HASH" != "$NEW_HASH" ]; then
    HASH="$NEW_HASH"
    echo "[$(date +%s)] Trigger refresh"
    curl -sSL -X POST "$2" > /dev/null
  fi
  sleep 5
done

Mount the same ConfigMap as Prometheus and pass arguments like these:

args:
- /etc/prometheus/prometheus.yml
- http://prometheus.kube-system.svc.cluster.local:9090/-/reload

args:
- /etc/alertmanager/alertmanager.yml
- http://prometheus.kube-system.svc.cluster.local:9093/-/reload

How many metrics should your application expose?

When you instrument your own service, you may expose data such as specific request counts or Goroutine counts; how many metrics is appropriate?

Although the number of metrics depends on the size of your application, there are some suggestions (from Brian Brazil): a simple service such as a cache, similar to Pushgateway, exposes around 120 metrics; Prometheus itself exposes about 700; if your application is very large, try not to exceed 10,000 metrics, and keep your labels under sensible control.

The problem with node-exporter

node-exporter does not support process monitoring, as mentioned earlier.

node-exporter only supports *nix systems; for Windows machines use wmi_exporter. So when deploying it via YAML, the nodeSelector should specify the OS type.

Because node_exporter is an older component, some best practices were only merged later, such as following the Prometheus naming conventions, so version 0.16/0.17 or newer is recommended.

Some metric names changed:

* node_cpu -> node_cpu_seconds_total
* node_memory_MemTotal -> node_memory_MemTotal_bytes
* node_memory_MemFree -> node_memory_MemFree_bytes
* node_filesystem_avail -> node_filesystem_avail_bytes
* node_filesystem_size -> node_filesystem_size_bytes
* node_disk_io_time_ms -> node_disk_io_time_seconds_total
* node_disk_reads_completed -> node_disk_reads_completed_total
* node_disk_sectors_written -> node_disk_written_bytes_total
* node_time -> node_time_seconds
* node_boot_time -> node_boot_time_seconds
* node_intr -> node_intr_total

If you were on an older version of the Exporter, the metric names in Grafana will differ; there are two workarounds:

Run both versions of node-exporter on the machine and have Prometheus scrape both.

Use a metric converter that maps the old names to the new ones.

The problem with kube-state-metrics

For the usage and internals of kube-state-metrics, see this article [13].

Besides the uses described there, kube-state-metrics has another very important scenario: combining with cAdvisor metrics. Raw cAdvisor data only contains Pod information; it does not know which Deployment or StatefulSet a Pod belongs to. After a join with kube_pod_info from kube-state-metrics, that becomes visible (a sketch follows below). The metadata metrics of kube-state-metrics go a long way toward extending cAdvisor's labels, and many prometheus-operator recording rules use kube-state-metrics for such compound queries.

kube-state-metrics can also expose Pod label information, which makes grouping cAdvisor data easy, for example by the Pod's runtime environment. However, it does not expose Pod annotations, because of the high-cardinality problem discussed earlier: annotations simply contain too much data to be used as metric labels.
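A hedged sketch of such a join, written as a recording rule; the label names follow kube_pod_info and post-1.16 cAdvisor conventions, so verify them against your versions:

groups:
- name: join-example
  rules:
  # Attach the workload that created each Pod (from kube-state-metrics)
  # to cAdvisor memory data, so results can be grouped by Deployment, StatefulSet, etc.
  - record: workload:container_memory_working_set_bytes:sum
    expr: |
      sum by (namespace, pod, created_by_kind, created_by_name) (
          container_memory_working_set_bytes{container!=""}
        * on (namespace, pod) group_left(created_by_kind, created_by_name)
          kube_pod_info
      )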

relabel_configs and metric_relabel_configs

relabel_configs is applied before scraping, metric_relabel_configs after scraping; combined sensibly they cover many configuration scenarios.

For example:

metric_relabel_configs:
- separator: ;
  regex: instance
  replacement: $1
  action: labeldrop
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_endpoints_name, __meta_kubernetes_service_annotation_prometheus_io_port]
  separator: ;
  regex: (.+);(.+);(.*)
  target_label: __metrics_path__
  replacement: /api/v1/namespaces/${1}/services/${2}:${3}/proxy/metrics
  action: replace

Prometheus's prediction capability

Scenario 1: the free space on a disk keeps shrinking at a fairly steady rate; you want to know roughly when it will hit the threshold, and to alert at some point before then.

Scenario 2: a Pod's memory usage keeps rising; you want to know when it will hit the Limit, and to alert before it gets killed so you can investigate.

Prometheus's deriv and predict_linear functions cover this kind of need: basic prediction that extrapolates the value after some time from the current rate of change.

Take mem_free as an example; the free value has been falling over the last hour:

mem_free is only an example; in practice you would base this on mem_available.

The deriv function shows the rate of change of the metric over a period of time:

The predict_linear function extrapolates from that rate to predict the value after a given amount of time, here 2 hours ahead, converted to GB:

predict_linear(mem_free{instanceIP="100.75.155.55"}[1h], 2*3600) / 1024 / 1024

You can then set a reasonable alert rule on top of this, for example alerting when the predicted value drops below 10 GB:

rule: predict_linear(mem_free{instanceIP="100.75.155.55"}[1h], 2*3600) / 1024 / 1024 < 10
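A sketch of wrapping the expression above in an alerting rule; the alert name, for duration, and labels are illustrative, not from the article:

groups:
- name: capacity
  rules:
  - alert: NodeMemoryPredictedLow
    # Free memory predicted 2 hours ahead, converted to GB, below 10 GB.
    expr: predict_linear(mem_free{instanceIP="100.75.155.55"}[1h], 2*3600) / 1024 / 1024 < 10
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Free memory is predicted to drop below 10GB within 2 hours"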

Without label filtering, this alert applies to all machines. If you want to exclude some of them, you have to append a filter such as {instance!=""} to disk_used and disk_total. These operations are simple in PromQL, but once you put them behind a form, you have to integrate with the internal CMDB to do the filtering.

For simple requirements we use Grafana's alerting: it is WYSIWYG, configured right under the chart, and the threshold and state are very clear. But Grafana's alerting is weak, still an experimental feature, and only really usable for debugging.

For common Pod or application monitoring we built form-based configuration, as shown in the figure below: common metrics such as CPU, memory, and disk IO are extracted as options to make configuration easy.

We use webhooks to extend alerting and have modified Alertmanager: encryption and authentication when sending messages, integration with internal alerting channels and the user system, plus rate limiting and permission control.

We call the Alertmanager API to query alert events for display and statistics.

For users, a wrapped Alertmanager YAML becomes easy to use, but it also limits its power. Adding alert configurations takes some cooperation between development and operations: if you wrote a custom Exporter, you should pick out the metrics users need and tune the PromQL used for display and alerting. Alert templates, exposing raw PromQL, user grouping, and so on all have to be weighed against user needs.

A wrong high-availability design

To improve scalability and availability, some people have proposed solutions of this kind:

The application pushes metrics to a message queue such as Kafka, an Exposer consumes them, and Prometheus then pulls from the Exposer. This design usually arises from historical baggage, reuse of existing components, or the wish to improve scalability through MQ.

This approach has several problems:

1. An extra Queue component means an extra layer of dependency. If the connection between the App and the Queue breaks, do you cache the monitoring data locally in the App?

2. Scrape times can get out of sync, and delayed data may be marked as stale. You can work around that by attaching timestamps, but then you lose the logic for handling stale data.

3. Scalability: Prometheus is suited to a large number of small targets, not one huge target. If all data is funneled into one Exposer, the single Prometheus job pulling it becomes a CPU bottleneck. This is similar to Pushgateway: unless there is a genuinely necessary scenario, it is not officially recommended.

4. Without service discovery and pull control, Prometheus only knows about the one Exposer: it does not know the individual targets or their up time, cannot query scrape_* metrics about them, and cannot apply limits such as sample_limit.

If your architecture is at odds with Prometheus's design philosophy, you may need to redesign the solution, otherwise scalability and reliability will suffer.

Where prometheus-operator fits

If you deploy Prometheus in a Kubernetes cluster, you will most likely use prometheus-operator. It wraps Prometheus configuration in CRDs, making it easier for users to extend Prometheus instances. It also ships rich Grafana templates, including the master-component dashboards mentioned above, which work out of the box once the Operator starts, saving the trouble of building panels.

Operator has many advantages that I won't list one by one; here are just its limitations:

Because it is an Operator, it depends on a Kubernetes cluster. If you need to deploy Prometheus as a binary, for example outside the cluster, prometheus-operator is hard to use, e.g. in multi-cluster scenarios. You can of course deploy the Operator inside one Kubernetes cluster to monitor other Kubernetes clusters, but there are many tricks involved and some configuration must be modified.

The Operator hides a lot of detail, which is good for users but leaves a gap in understanding the Prometheus architecture. For example, some users install the Operator with one click but cannot troubleshoot abnormal Grafana charts, or configure things directly without understanding recording rules and service discovery. It is recommended to be familiar with basic Prometheus usage before adopting the Operator.

The Operator makes it convenient to extend and configure Prometheus, and to run multiple replicas of Alertmanager and Exporters for high availability. But it does not solve Prometheus's own high availability, because it cannot handle data inconsistency; that is not the Operator's current goal, which is what differentiates it from Thanos, Cortex, and similar solutions, explained in detail below.

High-availability schemes

There are several options for Prometheus high availability:

Basic HA: two Prometheus instances collect exactly the same data, with a load balancer in front.

HA + remote storage: besides the basic replicas, data is also written to remote storage via Remote Write to solve persistence (a minimal sketch follows this list).

Federation: sharding by function, where different shards collect different data and a global node stores the aggregate, to address the scale of monitoring data.

Use Thanos or VictoriaMetrics to solve global querying and the deduplication/joining of multi-replica data.
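A minimal remote_write sketch for the HA + remote storage option; the endpoint URL and label values are placeholders, and an external replica label is commonly added so that a query layer such as Thanos can deduplicate the two copies:

global:
  external_labels:
    cluster: cluster-a        # placeholder
    replica: prometheus-0     # distinguishes the two HA replicas
remote_write:
- url: http://remote-storage.example.com/api/v1/write   # placeholder endpoint
  queue_config:
    max_samples_per_send: 1000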

Even with the officially recommended replicas + federation, you still run into problems:

The official suggestion is to shard the data and then use federation on top to achieve high availability.

But the edge nodes and the global node are still single points, so at each layer you must decide whether to run duplicate collection for redundancy; either way, a single-machine bottleneck remains.

Also, sensitive alerts should as far as possible not be triggered from the global node: the stability of the link from shard nodes to the global node affects how promptly data arrives, and therefore how effective the alert is.

For example, we trigger alerts such as service up/down status and abnormal API requests from the shard nodes.

The root cause is that Prometheus's local storage has no data-synchronization capability, so it is hard to keep data consistent while staying available, and basic HA behind a proxy cannot satisfy this. For example:

If instances A and B sit behind the load balancer with no synchronization between them, and A is down for a while and misses some data, then with normal round-robin balancing any request that lands on A returns anomalous data.

If A and B have different start times and clocks, the timestamps they attach to the same scraped data also differ, so they are not truly multiple replicas of the same data.

Even with remote storage, A and B cannot push into the same TSDB as one; if each pushes to its own TSDB, which copy queries should read from becomes a problem.

So the solutions ensure data consistency at either the storage level or the query level:

Storage angle: with Remote Write, both A and B can write through an Adapter that holds the primary-selection logic, so only one copy of the data is pushed to the TSDB. This guarantees that if one replica fails the other can still push successfully and no data is lost, while remote storage holds a single shared copy. A scheme along these lines is described in this article [15].

Query angle: the scheme above is complex and risky to implement, so most current solutions work at the query layer instead, e.g. Thanos or VictoriaMetrics. The data still exists as two copies, but it is deduplicated and joined at query time. The difference is that Thanos ships the data to object storage via its Sidecar, while VictoriaMetrics remote-writes the data to its own server instances; the query-layer logic of Thanos Query and of Promxy is basically the same.

We use Thanos to support multi-region monitoring data; see this article for details [16].

Container logs and events

This article is mainly about Prometheus monitoring, so here is only a brief overview of log and event handling in Kubernetes and how it pairs with Prometheus.

Log processing:

Log collection and shipping: generally Fluentd/Fluent Bit/Filebeat collects logs and ships them to ES, COS, or Kafka; logs are best left to a dedicated EFK stack. This covers both container stdout and log files inside containers.

Parsing logs into metrics: some metrics can be extracted from logs and converted into Prometheus format, e.g. counting occurrences of a specific string, or parsing Nginx logs into QPS and request latency. Common options are mtail and grok.

Log collection approaches:

Sidecar: the log directory is shared with the business container, and a sidecar ships the logs; generally used in multi-tenant scenarios.

DaemonSet: a collection process runs on each machine and ships the logs out.

Note: for container stdout, the default log path is /var/lib/docker/containers/xxx; the kubelet symlinks the logs into /var/log/pods, and /var/log/containers is in turn a symlink to /var/log/pods. However, the log directory layout differs across Kubernetes versions, so collect according to the version:

1.15 and below: /var/log/pods/{pod_uid}/

Above 1.15: /var/log/pods/{pod_name+namespace+rs+uuid}/

Events: Kubernetes Events are also critical when troubleshooting cluster problems, but by default they are retained for only 1 hour, so they need to be persisted. There are two common approaches:

Use a component such as kube-eventer to collect Events and push them to ES.

Use a component such as event_exporter to convert Events into Prometheus metrics; Google Cloud's stackdriver also provides an event-exporter.
