How to do micro service monitoring based on Prometheus

2025-07-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Micro-service architecture is widely adopted by major Internet companies. In this architecture, the system is split into a number of small, independent services, each of which runs in its own process and can be developed and deployed independently. When the business changes rapidly, the single responsibility and autonomy of each micro-service make system boundaries clearer and improve maintainability. Deployment also becomes simpler, because a single micro-service can be upgraded and released on its own, and as the business grows each service can be scaled independently.

Although the micro-service architecture brings many benefits, it also brings new problems. In a monolithic application, troubleshooting usually means reading the logs to find error messages and exception stack traces. A micro-service architecture contains many services, so locating the source of a problem is much harder. In addition, micro-services are often built by combining existing services, so the failure of one service can trigger an avalanche effect that makes the whole system unavailable. How to monitor the running state of micro-services and raise an alarm quickly when something goes wrong is therefore a major challenge for developers.

First, a brief introduction to the monitoring system

1. Several main ways of monitoring

In a micro-service architecture, each dimension is monitored in a different way.

1) Health check. A health check monitors the health of the application itself and checks whether the service is still alive.

2) Logs. Logs are the main tool for troubleshooting and provide rich information for locating and solving problems.

3) Call chain monitoring. Call chain monitoring presents the full picture of a request, including the chain of service invocations and the time spent in each.

4) Metric monitoring. Metrics are discrete data points organized as time series; aggregating and computing over them reveals the trend of important indicators.

Of these four methods, health checks are provided by infrastructure such as the cloud platform, and logs are generally handled by a separate log center that collects, stores, processes and queries them. Call chain monitoring also usually has its own independent solution for instrumenting, collecting, processing and querying service calls. This article focuses on the fourth method, metric monitoring.

2. Technology selection of micro-service monitoring

The characteristics of micro-service architecture mean that some traditional monitoring solutions no longer fit. In traditional application monitoring, Zabbix is the most commonly used solution. Its advantages are maturity and reliability, strong community support, and years of accumulated experience and tooling. Its shortcomings are also clear: first, it is hard to use and has a steep learning curve; second, its unit of monitoring is the host, which does not suit the cloud native environment that micro-services run in.

After research, we finally adopted Prometheus. The main reasons for choosing Prometheus are:

1) Mature community support. Prometheus is open source monitoring software with an active community, and it works well in cloud native environments.

2) Easy to deploy and operate. The Prometheus core is a single binary with no third-party dependencies, so deployment and operations are very convenient.

3) Pull model. Prometheus pulls monitoring data from each monitoring target over HTTP. A Push model, by contrast, generally collects data through an Agent that pushes it to the collector. Each service's Agent must be configured with the data items to monitor and the address of the monitoring server, which makes operations harder when there are many services. With the Push model, the monitoring server also receives a large number of requests and data at once during traffic peaks, which puts it under heavy pressure and in serious cases can make it unavailable.

4) Powerful data model. The monitoring data collected by Prometheus is stored as metrics in a built-in time series database. In addition to the basic metric name, custom labels are supported. Labels define rich dimensions that make it easy to aggregate and compute over monitoring data.

5) Powerful query language, PromQL. With PromQL we can query and aggregate monitoring data and use the results for visualization and alarms.

6) Complete ecosystem. Prometheus offers integrations for common operating systems, databases, middleware, libraries and programming languages, and provides client SDKs for Java, Golang, Ruby, Python and other languages, so custom monitoring logic can be implemented quickly.

7) High performance. A single Prometheus instance can handle hundreds of monitoring metrics and process hundreds of thousands of data points per second, and it performs very well in both data collection and querying.

Because the collected data may be lost, Prometheus is not suitable for scenarios where 100% accuracy is required for the collected data. In fact, for the scenario of the monitoring system, occasional data loss is perfectly acceptable.
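
To make the pull model and the labeled data model more concrete, here is a minimal Python sketch using only the standard library. The metric name, label names and values are invented for illustration. `render` produces the kind of text a pull target might return from its metrics endpoint, in the Prometheus text exposition format, and `sum_by` roughly mimics what a PromQL `sum by (service) (...)` computes over those samples.

```python
from collections import defaultdict

# Hypothetical samples: (metric name, labels, value).
SAMPLES = [
    ("http_requests_total", {"service": "order", "status": "200"}, 1200.0),
    ("http_requests_total", {"service": "order", "status": "500"}, 30.0),
    ("http_requests_total", {"service": "user", "status": "200"}, 800.0),
]


def render(samples):
    """Render samples as lines of the Prometheus text exposition format."""
    lines = []
    for name, labels, value in samples:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


def sum_by(samples, label):
    """Sum sample values grouped by one label, like `sum by (label)`."""
    totals = defaultdict(float)
    for _name, labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


if __name__ == "__main__":
    print(render(SAMPLES), end="")
    print(sum_by(SAMPLES, "service"))
```

Because every sample carries its labels, the same data can be grouped by `service`, by `status`, or by both without changing how it is collected.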

Second, micro-service monitoring scheme based on Prometheus

1. The business characteristics of iQiyi

The iQiyi platform described here is the platform on which iQiyi focuses on the creation, distribution and monetization of video content. It carries self-media, NetUniversity, online dramas, children's content, animation, knowledge, online variety shows, documentaries, literature, light novels, comics and other content, and it is an important part of iQiyi's content ecosystem.

The platform as a whole adopts a micro-service architecture and is divided into micro-services by function and domain. External traffic passes through DNS, QLB, front machines, the gateway and other layers, which perform unified authentication, load balancing, rate limiting and similar operations, and is then routed to the micro-service instances inside the system. Apart from dedicated resources such as MySQL, Redis and MQ, the micro-services in the system share service governance capabilities such as service registration and discovery and the configuration center.

The overall architecture of the system is shown in the following figure:

The platform serves content creators. Its service quality directly determines the creators' experience, affects their enthusiasm for creating content, and in turn affects the health of the content ecosystem, so the requirements for service quality are high. At the same time, as a front-end business, it relies on many internal company services and middle-platform services, whose stability directly affects the quality of its own services.

Based on these business characteristics, when building the micro-service monitoring system we focused on monitoring our own service interfaces and third-party service interfaces.

2. Overview of micro-service monitoring system

We have built a micro-service monitoring system based on Prometheus, which is suitable for our business characteristics. Prometheus has provided a very rich range of components, and we have also developed some components to meet our monitoring needs.

The overall structure of the micro-service monitoring system is shown in the following figure:

Spring Boot Actuator and Micrometer collect each service's monitoring data and expose it for Prometheus to pull.

A collection tool gathers monitoring data for third-party service interfaces.

The qae-monitor component collects monitoring data for the containers in which the services run.

File-based dynamic service discovery provides Prometheus with its pull targets.

The Alert proxy service delivers alarm content to the unified alarm platform.

Prometheus is deployed in federated cluster mode, and Grafana is used to display the monitoring data.

3. Overall monitoring of the service

Monitoring systems generally divide monitored objects into layers. In our monitoring system, we focus on the following types of monitored objects:

Container environment monitoring, which covers monitoring data about the environment in which the service runs.

Application service monitoring, which covers the basic metrics and the running state of the service itself.

Third-party interface monitoring, which covers calls to other external service interfaces.

For application service and third-party interface monitoring, the metrics we commonly use are response time, request volume (QPS) and success rate.

1) Container environment monitoring

Micro-service applications are deployed on iQiyi's internal application cloud platform (QAE). On the cloud platform, multiple container instances run on one host at the same time. The resource usage and performance data collected by host monitoring therefore describe the host, not the individual running containers.

Although Prometheus supports container monitoring through cAdvisor, cAdvisor has to be installed on the host. QAE is a shared platform, so installing and deploying additional software on it ourselves is not realistic. Fortunately, QAE provides an open API, which solves this problem well.

The QAE platform has built-in monitoring at both the container level and the application level. The data can be viewed on the QAE web pages and is also exposed through an HTTP interface, which makes unified collection of monitoring data possible.

We developed qae-monitor, a service that collects QAE container monitoring data. The service implements a custom Prometheus Collector: it periodically calls the HTTP interface of the QAE platform to fetch container monitoring data and converts it into the Prometheus data format.

qae-monitor itself exposes a metrics endpoint through Micrometer, and Prometheus scrapes the collected monitoring data from this endpoint.
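
The conversion step can be pictured with a small Python sketch. The QAE API is internal, so the payload shape, field names and metric names below are all assumptions made for illustration; this is not the actual qae-monitor code, which is built on Micrometer. The function only shows how fetched container data might be turned into Prometheus exposition lines.

```python
def to_prometheus(app, containers):
    """Convert a hypothetical QAE payload into Prometheus exposition text.

    `containers` is assumed to be a list of dicts such as
    {"id": "c-1", "cpu_percent": 37.5, "memory_bytes": 512000000}.
    """
    # Hypothetical metric names mapped to the assumed payload fields.
    metrics = {
        "qae_container_cpu_usage_percent": "cpu_percent",
        "qae_container_memory_usage_bytes": "memory_bytes",
    }
    lines = []
    for metric, field in metrics.items():
        lines.append(f"# TYPE {metric} gauge")
        for container in containers:
            labels = f'app="{app}",container="{container["id"]}"'
            lines.append(f"{metric}{{{labels}}} {container[field]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"id": "c-1", "cpu_percent": 37.5, "memory_bytes": 512000000}]
    print(to_prometheus("order", sample), end="")
```

Labelling each sample with the application and container is what lets container-level data be aggregated per application later.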

2) Application service monitoring

Basic monitoring data mainly means metrics such as the runtime state and resource usage of application service instances. Micrometer provides a rich set of application metrics by default, so once a service is integrated with Micrometer these can be collected directly, including:

System information, including uptime, CPU utilization, system load, etc.

Memory usage, including heap and non-heap memory usage

Thread usage, including thread count, daemon thread count, peak thread count, etc.

Class loading information

GC information, including GC count, time spent in GC, etc.

HTTP request metrics describe the performance of HTTP requests and are a very important part of monitoring. For HTTP services, the QPS, response time and success rate need to be measured.
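
The three HTTP metrics can all be derived from cumulative counters sampled at two points in time, which is essentially what a PromQL `rate(...)` over a request counter does. The Python sketch below assumes a simplified snapshot shape (total request count, total request time, error count); the field names are hypothetical and chosen only for this example.

```python
def http_stats(prev, curr, interval_seconds):
    """Derive QPS, mean latency and success rate from two counter snapshots.

    Each snapshot is a hypothetical dict of cumulative values:
    {"count": total requests, "sum_seconds": total time, "errors": failures}.
    """
    requests = curr["count"] - prev["count"]
    if requests <= 0:
        # No traffic in the window (or a counter reset): report neutral values.
        return {"qps": 0.0, "avg_seconds": 0.0, "success_rate": 1.0}
    seconds = curr["sum_seconds"] - prev["sum_seconds"]
    errors = curr["errors"] - prev["errors"]
    return {
        "qps": requests / interval_seconds,
        "avg_seconds": seconds / requests,
        "success_rate": (requests - errors) / requests,
    }


if __name__ == "__main__":
    before = {"count": 1000, "sum_seconds": 50.0, "errors": 10}
    after = {"count": 1600, "sum_seconds": 110.0, "errors": 16}
    print(http_stats(before, after, interval_seconds=60))
```

Working from differences of cumulative counters means an occasional missed scrape only widens the window rather than losing requests from the totals.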

3) Third-party interface monitoring

In a micro-service architecture, new services are created by calling and combining existing services, and a third-party interface directly affects the services that depend on it, so calls to third-party service interfaces also deserve attention. There are two main ways to collect monitoring data for third-party service interfaces:

① Explicit active collection

Monitoring data is collected actively at the place where the third-party interface is called, either hard-coded directly or through annotations or aspects. The advantage is that the solution is simple; the disadvantage is that it intrudes into existing business code.

② Implicit collection in components

Instrumentation logic is added to the HTTP/RPC component in use. The advantage is that business code does not need to change; the disadvantage is that the HTTP/RPC component has to be extended and upgraded.

We finally chose the second option, mainly for the following reasons:

First, our technical stack is relatively uniform: services call each other over HTTP, and the HTTP client component we use (fluent-hc) is a secondary encapsulation of OkHttp3, which makes a unified change convenient.

Second, Micrometer supports collecting custom metrics through its SDK and also provides ready-made instrumentation for many common components, OkHttp3 among them, which further reduces the difficulty of collecting monitoring data for third-party interfaces.

Specifically, Micrometer provides the OkHttpMetricsEventListener component for collecting OkHttp monitoring data. We only need to pass in an OkHttpMetricsEventListener instance when building the OkHttp client, or pass in an EventListener.Factory instance whose create method returns the OkHttpMetricsEventListener instance. OkHttp officially added the EventListener feature in version 3.11.0, so pay attention to the OkHttp version when using it.
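
The idea behind implicit collection can be illustrated independently of OkHttp. The Python sketch below is an analogue of the approach, not Micrometer's or OkHttp's API: a shared client function is wrapped once so that every outbound call is counted and timed per host and outcome, while calling code stays unchanged. The `send(host, path)` signature and the class and function names are assumptions for this example.

```python
import time
from collections import defaultdict


class CallMetrics:
    """Accumulates call count and total duration per (host, outcome)."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def record(self, host, outcome, seconds):
        self.count[(host, outcome)] += 1
        self.total_seconds[(host, outcome)] += seconds


def instrument(send, metrics, clock=time.perf_counter):
    """Wrap a client's `send(host, path)` so that every call is timed.

    `send` is assumed to return an HTTP status code. Business code keeps
    calling the returned function; only the shared client changes.
    """
    def wrapper(host, path):
        start = clock()
        outcome = "error"  # stays "error" if send() raises
        try:
            status = send(host, path)
            outcome = "success" if status < 400 else "error"
            return status
        finally:
            metrics.record(host, outcome, clock() - start)
    return wrapper


if __name__ == "__main__":
    metrics = CallMetrics()
    call = instrument(lambda host, path: 200, metrics)
    call("pay.example", "/charge")
    print(dict(metrics.count))
```

Recording in the `finally` block ensures that calls which raise an exception are still counted, as errors.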

Through the third-party interface monitoring dimension, we can easily associate our own services with the third-party services they use. A unified view shows which third-party service interfaces a service uses, along with the response time and success rate of those interfaces. When a service misbehaves, this helps greatly in locating the problem. In addition, some internal services may not have comprehensive monitoring and alarms of their own, and monitoring them as third parties can help them improve their service quality.

4. File-based service discovery

The Pull model adopted by Prometheus requires it to know which targets to monitor. Prometheus supports both static and dynamic configuration of monitoring targets, including static configuration files, file-based service discovery and Consul service discovery. It also supports service discovery through DNS, Microsoft Azure, Amazon EC2, Google GCE, Kubernetes and others.

Static configuration is the simplest, but it is not practical in a real production environment: containers can be created and destroyed at any time, so monitoring targets cannot be maintained through static configuration. At first we chose Consul service discovery, which introduces a centralized registry. When a micro-service starts, it registers its instance with the registry, and Prometheus queries the registry for service instances to use as monitoring targets.

However, in the end we did not adopt Consul, for two main reasons:

First, connecting a micro-service to Consul requires code changes. Each change is small, but rolling it out to a large number of services is still costly.

Second, a Consul environment has to be deployed and maintained separately, which adds new maintenance costs.

The principle of Prometheus service discovery is very simple. Prometheus queries the list of targets to monitor through an interface provided by a third party, then polls those targets in turn to obtain monitoring data. Because QAE is a private cloud platform, Prometheus cannot support it directly, but based on this principle we can implement a similar service discovery mechanism.

We developed prom-sd-qae, a file-based service discovery program. prom-sd-qae is a stand-alone program deployed on the same machine as the Prometheus service. It periodically fetches the list of container services through the HTTP interface of the QAE platform and writes JSON or YAML files to the local disk in the format Prometheus requires, defining the full list of monitoring targets. Prometheus periodically reads the latest monitoring targets from the files and pulls monitoring data from them.
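
A rough sketch of the file-writing half of such a program is shown below in Python. It is not the actual prom-sd-qae code: the shape of the instance list and the `app` label are assumptions, and fetching the list from QAE is left out because that API is internal. The output follows the JSON layout that Prometheus file-based service discovery reads, a list of objects with `targets` and `labels`.

```python
import json
import os
import tempfile


def build_target_groups(instances):
    """Group hypothetical QAE instances into file_sd target groups.

    `instances` is assumed to look like
    [{"app": "order", "host": "10.0.0.1", "port": 8080}, ...].
    """
    by_app = {}
    for inst in instances:
        address = f'{inst["host"]}:{inst["port"]}'
        by_app.setdefault(inst["app"], []).append(address)
    return [
        {"targets": sorted(targets), "labels": {"app": app}}
        for app, targets in sorted(by_app.items())
    ]


def write_targets(path, groups):
    """Write the file atomically so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by owner only
    os.replace(tmp_path, path)


if __name__ == "__main__":
    found = [{"app": "order", "host": "10.0.0.1", "port": 8080}]
    write_targets("qae_targets.json", build_target_groups(found))
```

Writing to a temporary file and renaming it into place is a design choice for this sketch: it avoids Prometheus picking up a half-written target list during a refresh.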

With this scheme, monitoring targets may be destroyed or created between two refreshes, so the target list can be briefly out of date. Even so, the scheme balances the dynamic nature of service discovery with simplicity, and it remains a simple and effective choice.

5. Unified alarm

Prometheus allows alarm trigger conditions to be defined with PromQL. Prometheus evaluates the PromQL expressions periodically and sends alarm information to Alertmanager when a condition is met.

When configuring alarm rules, we put the rules of each service into one group, and each group defines several alarm rules, including a response time alarm, an interface success rate alarm, a QPS alarm, a third-party interface alarm and so on. The advantage is that alarm rules are aggregated by service, which makes them easier to view and configure. In addition, different services may need different thresholds for the same alarm rule, and this layout lets each be configured independently.

The following figure is an example of an alarm rule:
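
As a rough illustration of one group per service with per-service thresholds, the Python sketch below generates rule groups from a threshold table. The service names, threshold values, alert names and the `application` label are assumptions; the metric names follow what Micrometer typically exposes for Spring Boot HTTP server requests. The result is rendered as JSON purely for illustration, since real Prometheus rule files are normally written in YAML.

```python
import json

# Hypothetical per-service thresholds; the values are illustrative only.
THRESHOLDS = {
    "order": {"min_success_rate": 0.99, "max_avg_seconds": 0.5},
    "user": {"min_success_rate": 0.95, "max_avg_seconds": 1.0},
}


def rule_group(service, limits):
    """Build one rule group that holds all alarm rules for a service."""
    selector = f'application="{service}"'  # assumed label name
    count = f"http_server_requests_seconds_count{{{selector}}}"
    errors = f'http_server_requests_seconds_count{{{selector},status=~"5.."}}'
    total = f"http_server_requests_seconds_sum{{{selector}}}"
    error_ratio = f"sum(rate({errors}[1m])) / sum(rate({count}[1m]))"
    avg_seconds = f"sum(rate({total}[1m])) / sum(rate({count}[1m]))"
    return {
        "name": service,
        "rules": [
            {
                "alert": "LowSuccessRate",
                "expr": f"1 - {error_ratio} < {limits['min_success_rate']}",
                "for": "2m",
                "labels": {"severity": "critical", "service": service},
            },
            {
                "alert": "SlowResponse",
                "expr": f"{avg_seconds} > {limits['max_avg_seconds']}",
                "for": "2m",
                "labels": {"severity": "warning", "service": service},
            },
        ],
    }


def render_rules(thresholds):
    """Render all groups as JSON text, one group per service."""
    groups = [rule_group(s, limits) for s, limits in sorted(thresholds.items())]
    return json.dumps({"groups": groups}, indent=2)


if __name__ == "__main__":
    print(render_rules(THRESHOLDS))
```

Keeping thresholds in a table per service mirrors the point above: the rule shapes are shared, while each service tunes its own limits.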

After receiving an alarm, Alertmanager can apply additional processing such as grouping, inhibition and silencing, and then route it to different receivers. Alertmanager supports a variety of notification methods: besides the commonly used e-mail notification, alarms can be delivered to DingTalk, WeCom and other channels, and custom notifications are supported through webhooks.

iQiyi's unified alarm platform handles alarm topics, alarm contents, alarm channels and alarm subscriptions in one place. To make full use of it, we developed the Alert-proxy alarm agent service. Alertmanager sends alarms to Alert-proxy through a webhook, Alert-proxy forwards them to the unified alarm platform, and the platform finally delivers them to Hot Chat, e-mail, SMS and other receivers. Alert-proxy delivers alarms to a default alarm Topic on the unified alarm platform and also supports delivery to other Topics. Topics can be set separately for different services and different alarm levels, so that notifications reach the right recipients and draw the right attention.
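
The translation step inside such a proxy might look like the Python sketch below. It is not the actual Alert-proxy code: the topic names, the override table and the output message shape are assumptions, and the call to the unified alarm platform is omitted because that API is internal. The input follows the general shape of an Alertmanager webhook payload, where each entry in `alerts` carries `status`, `labels` and `annotations`.

```python
DEFAULT_TOPIC = "monitor-default"  # hypothetical default Topic name

# Hypothetical overrides keyed by (service, severity).
TOPIC_OVERRIDES = {
    ("order", "critical"): "order-oncall",
}


def to_platform_messages(payload):
    """Translate an Alertmanager webhook payload into platform messages."""
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        key = (labels.get("service"), labels.get("severity"))
        status = alert.get("status", "firing")
        name = labels.get("alertname", "unknown")
        messages.append({
            "topic": TOPIC_OVERRIDES.get(key, DEFAULT_TOPIC),
            "title": f"[{status}] {name}",
            "content": annotations.get("summary", ""),
        })
    return messages


if __name__ == "__main__":
    example = {"alerts": [{
        "status": "firing",
        "labels": {"alertname": "LowSuccessRate", "service": "order",
                   "severity": "critical"},
        "annotations": {"summary": "success rate below threshold"},
    }]}
    print(to_platform_messages(example))
```

Falling back to a default Topic keeps every alarm deliverable even when no service-specific routing has been configured.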

Alarms cover the service's HTTP interfaces, third-party HTTP interfaces, and the state of the JVM and containers, which basically meets our needs.
