

Practical case: using Prometheus Operator for cluster monitoring


According to the container report released by Sysdig, the use of containers and of orchestration tools such as Kubernetes has grown by more than 51%, and more and more teams host and manage their workloads in a cluster. Given the ephemeral nature of cluster workloads, there is a strong requirement for end-to-end cluster monitoring that covers nodes, containers, and pods in detail.

IT engineers need to manage applications and clusters (nodes and data), reduce the effort of manually configuring services, targets, and data stores, and keep monitoring working every time an application goes down and comes back. This calls for seamless deployment and management of a highly available monitoring system such as Prometheus, which, combined with an Operator, can handle dynamic configuration of scrape targets, service discovery, and alerting rules for the various targets in the cluster. At the same time, the Operator pattern encodes this logic in code, reducing human intervention.

In this article, we will focus on how Prometheus Operator works and how a ServiceMonitor discovers targets and metrics in Prometheus Operator.

The role of Prometheus Operator in cluster monitoring

Ability to seamlessly install Prometheus Operator using native Kubernetes configuration options

Ability to create and destroy a Prometheus instance in a Kubernetes namespace, so that a particular application or team can easily use the Operator

Ability to preconfigure settings such as the version, persistence, retention policy, and replicas of the Kubernetes resources (see the sketch after this list)

Target services can be discovered using labels, and the monitoring target configuration is automatically generated based on familiar Kubernetes label queries.

For example, when a pod/service is destroyed and comes back, Prometheus Operator automatically creates a new configuration without human intervention.
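
For illustration, a minimal sketch of a Prometheus custom resource with such preconfigured settings might look like the following (field names follow the prometheus-operator Prometheus CRD; the namespace, retention period, replica count, and storage size are illustrative assumptions):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring          # assumed namespace
spec:
  replicas: 2                    # number of Prometheus replicas
  retention: 15d                 # retention policy for stored metrics
  serviceMonitorSelector:
    matchLabels:
      release: prometheus-operator
  storage:                       # persistence via a volume claim template
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 10Gi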

Required components in the Operator pattern:

Custom Resource Definition (CRD): defines a new custom resource, including its name and a schema that can be specified without any programming. The Kubernetes API serves and handles the storage of the custom resources.

Custom resources: objects that extend the Kubernetes API or allow a custom API to be introduced into a Kubernetes cluster.

Custom controllers: handle built-in Kubernetes objects such as Deployment and Service, or manage custom resources in a new way, just as native Kubernetes components are managed.

Operator pattern (CRD plus custom controller): the Operator adds configuration on top of Kubernetes resources and controllers, allowing it to perform common application tasks.
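
As a hedged illustration of what such a CRD definition looks like, here is a minimal sketch for a hypothetical Backup resource (the group example.com, the resource names, and the schedule field are made up for this example):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.example.com            # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    kind: Backup
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:               # requests are validated against this OpenAPI v3 schema
          type: object
          properties:
            spec:
              type: object
              properties:
                schedule:
                  type: string         # e.g. a cron expression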

The Operator workflow

The Operator does the following in the background to manage custom resources:

1. CRD creation: the CRD defines the specification and metadata from which custom resources should be created. When a CRD creation request is made, Kubernetes validates its metadata against the internal schema type (OpenAPI v3 schema) and then creates the CustomResourceDefinition (CRD) object.

2. Custom resource creation: objects are validated against the metadata and specification of the CRD, and the custom object is created accordingly.

3. The Operator (custom controller) starts watching events and state changes, and manages custom resources based on the CRD. It exposes events for CRUD operations on custom resources, so that whenever the state of a custom resource changes, the corresponding event is triggered.
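
Continuing the hypothetical Backup CRD from the previous section, a custom resource instance that the Operator would watch might look like this:

apiVersion: example.com/v1
kind: Backup
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"      # any create/update/delete of such objects triggers the controller's event handlers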

Service discovery and automatic configuration of scrape targets

Prometheus Operator uses the ServiceMonitor CRD to perform automatic discovery and automatic configuration of scrape targets.

ServiceMonitor-based monitoring involves the following components (a combined example follows the list):

Service: the actual service/deployment, which exposes metrics at defined endpoints and ports and is identified by corresponding labels. Whenever a service or pod fails, the service comes back with the same labels, which keeps it discoverable by the ServiceMonitor.

ServiceMonitor: a custom resource through which services can be discovered based on matching labels. The ServiceMonitor lives in the namespace where the Prometheus CRD is deployed, but by using a namespaceSelector it can still discover services deployed in other namespaces.

Prometheus CRD: matches ServiceMonitors based on labels and generates the Prometheus configuration from them.

Prometheus Operator: invokes the config-reloader component to automatically update the YAML configuration, which contains the details of the scrape targets.
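
To make the label matching concrete, here is a minimal, hedged sketch of how a Service and a ServiceMonitor fit together (the application name, port, and path are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app                        # label the ServiceMonitor selects on
spec:
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: prometheus-operator       # matched by the Prometheus CRD's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app                      # discovers any Service carrying this label
  endpoints:
    - port: metrics                    # port name defined on the Service
      path: /metrics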

Next, let's look at a simple use case to understand how a service is monitored with Prometheus Operator.

Use case: using Prometheus Operator for Gerrit service monitoring

Gerrit is a code review tool that is mainly used in DevOps CI pipelines to review each commit before the code is merged into the repository. This article assumes that Gerrit is already running in the Kubernetes cluster, so I will not repeat the steps for running Gerrit as a service in Kubernetes.

If you don't already have Prometheus Operator, you can install it with the helm chart or use Rancher directly. In Rancher 2.2 and above, Rancher deploys a Prometheus Operator in each newly added cluster. The following components are downloaded and installed by default:

prometheus-operator

Prometheus

Alertmanager

node-exporter

kube-state-metrics

Grafana

Service monitors to scrape internal Kubernetes components: kube-apiserver, kube-scheduler, kube-controller-manager, etcd, kube-dns/coredns

The following steps show how Prometheus Operator automatically discovers a Gerrit service running on a Kubernetes cluster and how to scrape metrics from Gerrit.

Expose metrics using the Gerrit-Prometheus plug-in

You can use the Prometheus jar plug-in to expose Gerrit metrics, but you need to install the plug-in on the Gerrit instance and have it running in advance.

The Prometheus jar plug-in can be downloaded from https://gerrit-ci.gerritforge.com/. Put the jar into the Gerrit plug-in directory /var/gerrit/review_site/plugins/ and restart the Gerrit service.
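
When Gerrit itself runs in Kubernetes, one hypothetical way to place the jar into that directory is an init container that downloads it into a volume shared with the Gerrit container; the image, download URL, and volume name below are assumptions, not part of the original setup:

# Fragment of the Gerrit pod spec (illustrative only)
initContainers:
  - name: install-prometheus-plugin
    image: curlimages/curl:8.5.0              # hypothetical downloader image
    command: ["sh", "-c"]
    args:
      - curl -Lo /plugins/metrics-reporter-prometheus.jar <plugin-jar-url>   # jar from gerrit-ci.gerritforge.com
    volumeMounts:
      - name: gerrit-plugins                  # volume also mounted at /var/gerrit/review_site/plugins in the Gerrit container
        mountPath: /plugins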

Verify the Prometheus plug-in in the administrator's web interface: Gerrit -> Plugins -> Prometheus plugin.

Create an account and group and give access to view metrics

Log in to Gerrit's web interface with administrator privileges and go to: Projects > List > All-Projects. Click the [Access] tab, and then click the [Edit] button.

In the Global Capabilities block, click [Add Permission] and select [View Metrics] in the drop-down list.

Generate a token for the user in Gerrit.

Select the group "Prometheus Metrics" we created earlier and click the [Add] button.

Scroll to the bottom of the page and click the [Save Changes] button.

Create a secret to access the Gerrit service

After generating a token in Gerrit, Base64-encode the user ID and token; the encoded values are used to store the credentials in Kubernetes.

Create a yaml file with the secret details and create the secret in Kubernetes:
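
A minimal sketch of gerrit-secret.yaml might look like the following, assuming the ServiceMonitor will later read the credentials via basicAuth and that the keys are named username and password (the secret name and namespace are also assumptions):

apiVersion: v1
kind: Secret
metadata:
  name: gerrit-secret                  # referenced later from the ServiceMonitor's basicAuth
  namespace: monitoring                # assumed to be the namespace where Prometheus Operator runs
type: Opaque
data:
  username: <base64-encoded-gerrit-user-id>
  password: <base64-encoded-gerrit-token>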

kubectl apply -f gerrit-secret.yaml

Apply labels to the service

Label the Gerrit service with two labels, for example app: gerrit and release: prometheus-operator:

kubectl label svc gerrit app=gerrit release=prometheus-operator

Create a Service Monitor for Gerrit

Add the endpoint details and a selector with matching labels to the ServiceMonitor so that it can discover the Gerrit service metrics, as follows:

Labeled service selector

The labels under selector are the labels used to identify the service:

selector:
  matchLabels:
    app: gerrit
    release: prometheus-operator

ServiceMonitor selector

The labels under the metadata section are used to identify the ServiceMonitor from the Prometheus CRD:

metadata:
  labels:
    app: gerrit
    release: prometheus-operator

namespaceSelector: specifies the namespaces in the Kubernetes cluster where the Gerrit service runs. The service can run in any namespace, but the ServiceMonitor can only be created in the namespace where Prometheus Operator runs, so that the Prometheus CRD can recognize the ServiceMonitor object. A consolidated example is shown below.
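
Putting the pieces above together, a ServiceMonitor for Gerrit might look like the sketch below; the port name, the target namespace, and the basicAuth key names are assumptions that need to be adapted to the actual Gerrit service and secret:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gerrit
  labels:
    app: gerrit
    release: prometheus-operator       # matched by the Prometheus object's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: gerrit
      release: prometheus-operator     # labels applied to the Gerrit service earlier
  namespaceSelector:
    matchNames:
      - default                        # namespace where the Gerrit service runs (assumption)
  endpoints:
    - port: http                       # name of the Gerrit service port (assumption)
      path: /a/plugins/metrics-reporter-prometheus/metrics
      basicAuth:                       # credentials from the secret created earlier
        username:
          name: gerrit-secret
          key: username
        password:
          name: gerrit-secret
          key: password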

Match Service Monitor selector in Prometheus

Verify the serviceMonitorSelector portion of the Prometheus object using the following command:

kubectl get prometheus

Match and apply the label from the ServiceMonitor (see the ServiceMonitor selector section above) on the Prometheus object:

serviceMonitorSelector:
  matchLabels:
    release: prometheus-operator

Note: if prometheus-operator is deployed using helm, the label release=prometheus-operator is already applied to the Prometheus object. We still need to match this label in the ServiceMonitor, because the Prometheus CRD uses it to determine the appropriate ServiceMonitors.

The ServiceMonitor creation steps above can also be done through the prometheus-operator helm chart's custom values.yaml.
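
For reference, a hedged sketch of what that could look like in a custom values.yaml, assuming the chart exposes an additionalServiceMonitors list as the stable/prometheus-operator chart did (key names may differ between chart versions):

prometheus:
  additionalServiceMonitors:
    - name: gerrit
      selector:
        matchLabels:
          app: gerrit
          release: prometheus-operator
      namespaceSelector:
        matchNames:
          - default                    # assumption: namespace of the Gerrit service
      endpoints:
        - port: http                   # assumption: name of the Gerrit service port
          path: /a/plugins/metrics-reporter-prometheus/metrics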

Automatic discovery of Gerrit services

After the labels are updated, the Prometheus custom object automatically invokes config-reloader to pick up the new targets and update the Prometheus configuration file. This is a benefit of Prometheus Operator: no manual intervention is needed to create the Prometheus configuration file or to update the scrape configuration.

1. Open the Prometheus URL: http://<prometheus-ip>:<nodeport>

Run kubectl get svc prometheus to get the NodePort details, and replace the IP with the node's address.

2. Go to the menu Status -> Configuration to view the Prometheus configuration that was loaded automatically, including the scrape configuration. In the scrape_configs section you can see the details of the Gerrit ServiceMonitor.

3. Go to the menu Status -> Targets or Service Discovery. If the ServiceMonitor has successfully scraped the Gerrit metrics, the target should be shown as healthy [1/1 up].

Gerrit health metrics in Grafana

Gerrit exposes various metrics, such as JVM runtime, thread memory, heap size, errors, and so on. These can be configured in a Grafana dashboard to monitor the performance and health of Gerrit.

Gerrit metrics are exposed under the scrape URL:

http://<gerrit-svc-ip>:<nodeport>/a/plugins/metrics-reporter-prometheus/metrics

Run kubectl get svc to get the service NodePort.

Replace <gerrit-svc-ip> and <nodeport> with the IP and NodePort details of the Gerrit service; the exposed metrics will then be shown.

The value of a metric can be evaluated in the expression field under Prometheus -> Graph, for example caches_disk_cached_git_tags.

Configure the metrics in Grafana to monitor the health of Gerrit: select Prometheus as the data source and configure widgets in a dashboard. Some key metrics that have been configured are JVM threads, uptime, HTTP plugin errors, memory usage, events, and so on.

Prometheus Operator enables seamless deployment and management of Prometheus, dynamic configuration of scrape targets, service discovery, extensibility, and built-in SRE expertise, all of which accelerates cluster monitoring.

Out-of-the-box Prometheus

At the end of 2018, Rancher Labs announced enhanced support for Prometheus, providing greater visibility across multiple Kubernetes clusters and isolated tenant environments. In Rancher 2.2 and above, whenever a new Kubernetes cluster is added to Rancher, Rancher deploys a Prometheus Operator in the cluster and then creates a Prometheus deployment there. In addition, the following two features are supported:

Cluster-wide Prometheus deployments are used to store cluster metrics (such as node CPU and memory consumption) as well as project-level metrics collected from applications deployed by individual users.

Communication between the project-level Grafana and Prometheus goes through a security proxy that implements multi-tenancy for Prometheus. The proxy constrains PromQL statements to ensure that queries can only reach the namespaces of the user's own project.

Rancher's enhanced support for Prometheus ensures efficient deployment and effective monitoring for all Kubernetes clusters, projects, and users. The security proxy ensures that data is not shared across tenants and that multi-tenancy stays isolated. In addition, Rancher collects any custom metrics from exposed endpoints using the data processed by Prometheus. All metrics can be used for alerting and decision-making within Rancher: simple actions such as notifying users via Slack or PagerDuty, and more complex actions such as horizontally scaling a workload when the load increases. Rancher also now provides full security isolation and RBAC for cluster-level and project-level metrics and dashboards.
