How to do black-box monitoring well: lessons from reading the Kafka Monitor source code

2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article reads through parts of the Kafka Monitor source code to show how to do black-box monitoring well.

This is the first article in the "Monitoring" series.

Monitoring is commonly divided into black-box and white-box monitoring. Black-box monitoring checks the externally visible functions of a system by simulating external users. It is an important part of monitoring because it lets the relevant people be notified quickly when a system or service fails.

White-box monitoring data usually comes from the service or system itself (for example CPU load, stack information, or the number of connections), so it is easy to collect. Black-box monitoring data, by contrast, usually comes from outside the system or service, so dedicated functional monitoring modules have to be developed to collect it. So how should black-box monitoring be done? How can service failures be detected in time without the monitoring itself causing other problems?

The following shares some of the practical experience of Jingdongyun (JD Cloud) with Kafka black-box monitoring. It focuses on interpreting part of the code behind Kafka Monitor's monitoring logic, so that you can gain a deeper understanding of its design. It then draws on our black-box monitoring practice for other services to try to answer the questions raised above. Enjoy:

Kafka Monitor introduction

Kafka Monitor is an excellent black-box monitoring tool for Kafka, open-sourced by LinkedIn. It achieves black-box monitoring by simulating client behavior: it produces and consumes data and collects performance and availability metrics such as message latency, error rate, and duplication rate.

The main concepts of Kafka

Before looking at how Kafka Monitor does its monitoring, let's review several main Kafka concepts:

Broker: a Kafka cluster contains one or more servers, each of which is called a broker.

Topic: every message published to a Kafka cluster has a category, called its Topic. Physically, messages of different Topics are stored separately. Logically, the messages of one Topic are stored on one or more brokers, but users only need to specify the Topic to produce or consume data and do not need to care where the data is stored.

Partition: a Partition is a physical concept; each Topic contains one or more Partitions.

Producer: the message producer, a client that publishes messages to Kafka brokers.

Consumer: the message consumer, a client that reads messages from Kafka brokers.

Consumer Group: each Consumer belongs to a specific Consumer Group.

Figure 1 Kafka architecture diagram

Kafka Monitor module composition

1. Kafka Monitor consists of the following five services

Jetty Service: provides the HTTP service behind the Web UI

Jolokia Service: provides an HTTP interface to JMX

Produce Service: the producer service, which reports the production rate and production availability

Consumer Service: the consumer service, which reports the consumption rate and availability, message latency, loss rate, and duplication rate

Metrics Service: gathers the monitoring metrics reported by Produce Service and Consumer Service

2. The structure diagram of the services is as follows

Figure 2 Kafka Monitor structure diagram

Monitoring workflow and code interpretation

1. After Producer Service starts, it produces data at a fixed interval (configuration item: produce.record.delay.ms, default: 100 ms). Note that Producer Service starts a separate production task for each Partition, so that the data produced in each cycle covers all Partitions.

Figure 3 Producer Service code interpretation
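
The per-Partition production tasks described in step 1 can be sketched roughly as follows. This is a minimal illustration, not Kafka Monitor's actual code: the function name start_partition_producers and the send callback are hypothetical, and send stands in for whatever call actually publishes a record to a given Partition.

```python
import threading
from typing import Callable, List, Optional


def start_partition_producers(
    partitions: List[int],
    send: Callable[[int, int], None],
    delay_ms: int = 100,
    stop: Optional[threading.Event] = None,
) -> List[threading.Thread]:
    """Start one periodic produce loop per partition (hypothetical helper)."""
    stop_event = stop if stop is not None else threading.Event()

    def produce_loop(partition: int) -> None:
        sequence = 0
        while not stop_event.is_set():
            # `send` is an assumed callback that publishes one record.
            send(partition, sequence)
            sequence += 1
            # Sleep for the configured delay, waking early if asked to stop.
            stop_event.wait(delay_ms / 1000.0)

    threads = [
        threading.Thread(target=produce_loop, args=(p,), daemon=True)
        for p in partitions
    ]
    for thread in threads:
        thread.start()
    return threads
```

Because every Partition has its own loop and its own sequence counter, a slow or failing Partition does not stop records from reaching the others.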

2. Each message consists of the following:

Message sequence number, which is used to check whether messages are lost or duplicated during consumption

Timestamp, which is used to calculate the latency of a message from production to consumption

Message size, which specifies the size of the serialized data (configuration item: produce.record.size.byte, default: 100 bytes)

Topic and Producer ID, which ensure that the data consumed comes from the same Topic and Producer

Each message is serialized and sent to the specified Topic in Kafka, and the success or failure of the send is then reported through the _sensors object

Figure 4 Producer Service Code interpretation 2
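
To make the message layout in step 2 concrete, here is a small sketch that builds a record carrying a sequence number, a timestamp, the Topic and the Producer ID, padded to a target size. The JSON encoding, the field names and the build_record function are assumptions made for illustration; the source does not say how Kafka Monitor serializes its records.

```python
import json
import time
from typing import Optional


def build_record(
    topic: str,
    producer_id: str,
    index: int,
    size_bytes: int = 100,
    now_ms: Optional[int] = None,
) -> bytes:
    """Serialize a probe record and pad it to size_bytes (hypothetical format)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    record = {
        "topic": topic,
        "producerId": producer_id,
        "index": index,      # sequence number, used to detect loss and duplication
        "time": now_ms,      # produce timestamp, used to compute latency
        "content": "",       # padding, filled in below
    }
    base_size = len(json.dumps(record).encode("utf-8"))
    # Pad with ASCII filler. If the fixed fields alone already exceed
    # size_bytes, the record is left unpadded and comes out larger.
    record["content"] = "x" * max(0, size_bytes - base_size)
    return json.dumps(record).encode("utf-8")
```

With the short field values used here, a call such as build_record("kafka-monitor-topic", "producer-1", 7, size_bytes=200) returns exactly 200 bytes, which keeps every probe the same size on the wire.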

3. Consumer Service consumes messages from the specified Topic. Each message is deserialized and validated, and monitoring metrics such as latency, errors, and duplication are then calculated and reported to Metrics Service through the _sensors object.

Figure 5 Consumer Service code interpretation
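
The consumer-side bookkeeping in step 3 can be approximated with a per-Partition "next expected sequence number". The sketch below is one plausible way to derive loss, duplication and latency from the fields described above; the ConsumeStats class and its field names are hypothetical, and it assumes sequence numbers start at 0 and increase by 1 within each Partition.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ConsumeStats:
    """Running counters for one consumer (illustrative only)."""
    consumed: int = 0
    lost: int = 0
    duplicated: int = 0
    latencies_ms: List[int] = field(default_factory=list)
    next_index: Dict[int, int] = field(default_factory=dict)

    def on_record(self, partition: int, index: int,
                  produced_ms: int, now_ms: int) -> None:
        # Latency comes straight from the timestamp carried in the message.
        self.latencies_ms.append(now_ms - produced_ms)
        expected = self.next_index.get(partition, 0)
        if index == expected:
            self.consumed += 1
            self.next_index[partition] = index + 1
        elif index < expected:
            # Already seen this sequence number: a duplicate delivery.
            self.duplicated += 1
        else:
            # Skipped ahead: every missing sequence number counts as lost.
            self.lost += index - expected
            self.consumed += 1
            self.next_index[partition] = index + 1
```

For example, receiving sequence numbers 0, 1, 1, 4 on one Partition yields three consumed records, one duplicate, and two lost records (2 and 3).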

Summary of Kafka Monitor advantages

1. It ensures that monitoring covers all Partitions by starting a separate production task for each Partition.

Note that Kafka Monitor only guarantees that monitoring covers all Partitions, not all Brokers. To cover all Brokers as well, we rely on Kafka's balanced assignment of Partitions across Brokers and configure the Kafka Monitor Topic with a number of Partitions equal to (or an integer multiple of) the number of Brokers.

2. Each produced message carries a timestamp and a sequence number, from which Kafka Monitor can compute message latency, loss rate, and duplication rate.

3. Traffic is controlled by setting the frequency at which messages are produced.

4. Messages are serialized to a configurable size, which has the following benefits:

A configurable message length makes it easy to verify how Kafka handles data of different sizes.

A uniform message size reduces the monitoring error caused by Kafka's processing performance varying with data size.

5. Operating on the Kafka cluster with a dedicated Topic and Producer ID avoids contaminating online data and achieves a degree of data isolation.
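
The note under point 1, about sizing the monitoring Topic so that every Broker is covered, can be checked with a few lines of arithmetic. Both helpers below are hypothetical, and the assignment argument (a mapping from Partition number to the Broker ID that leads it) is assumed to come from whatever cluster metadata source is available.

```python
import math
from typing import Dict, Iterable, Set


def partitions_for_brokers(broker_count: int, min_partitions: int = 1) -> int:
    """Smallest multiple of broker_count that is >= min_partitions."""
    if broker_count < 1:
        raise ValueError("broker_count must be at least 1")
    multiple = max(1, math.ceil(min_partitions / broker_count))
    return broker_count * multiple


def uncovered_brokers(assignment: Dict[int, int],
                      brokers: Iterable[int]) -> Set[int]:
    """Brokers that lead no partition of the monitoring topic."""
    return set(brokers) - set(assignment.values())
```

A matching Partition count only helps if the leaders really are spread across Brokers, so it is worth running a check like uncovered_brokers periodically rather than assuming the balance holds after Brokers are added or reassigned.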

How to do black box monitoring

The sections above should have given you some understanding of how Kafka Monitor implements black-box monitoring. Drawing on the problems we have run into in practice, the rest of this article summarizes the main points to watch in black-box monitoring, along with some suggestions:

Collection of monitoring indicators

Black-box monitoring collects two main types of metrics: performance and availability. For collecting them, we suggest the following:

In read and write operations, measure latency by carrying a Timestamp in the message body

Verify semantic correctness with fixed strings, rather than judging only by the returned status code
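
Both suggestions can be combined in a single probe: write a fixed string plus a timestamp, read it back, and only report success if the content matches. The sketch below assumes the service under test is reachable through two hypothetical callables, write and read; PROBE_TEXT and run_probe are illustrative names, not part of any library.

```python
import time
from typing import Callable, NamedTuple

PROBE_TEXT = "blackbox-probe"  # fixed string; the value itself is arbitrary


class ProbeResult(NamedTuple):
    ok: bool
    latency_ms: int
    reason: str


def run_probe(
    write: Callable[[str], bool],
    read: Callable[[], str],
    clock: Callable[[], float] = time.time,
) -> ProbeResult:
    """Write a timestamped fixed string, read it back, and verify it."""
    sent_ms = int(clock() * 1000)
    payload = f"{PROBE_TEXT}|{sent_ms}"
    if not write(payload):
        return ProbeResult(False, 0, "write rejected")
    echoed = read()
    now_ms = int(clock() * 1000)
    # Check the content itself, not just that the call returned.
    text, _, stamp = echoed.rpartition("|")
    if text != PROBE_TEXT or not stamp.isdigit():
        return ProbeResult(False, 0, "content mismatch")
    # Latency is computed from the timestamp carried in the body.
    return ProbeResult(True, now_ms - int(stamp), "")
```

A service that answers every request with a success status but returns the wrong body would pass a status-code check and fail this one.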

Sample coverage

The samples collected by black-box monitoring should cover all nodes as far as possible, so that faults caused by a node going down are found in time. Sample coverage should be measurable and quantifiable. In practice, we recommend that monitoring requests carry a specific tag that the server node can identify (for example a specific source IP, user name, or request header), which makes it easy to compute sample coverage.
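
Once tagged probe requests can be counted per node, coverage becomes a simple ratio. The helper below is a sketch under that assumption: seen_by_node is a hypothetical mapping from node name to the number of tagged requests that node logged in the last window.

```python
from typing import Dict, List, Sequence, Tuple


def sample_coverage(
    all_nodes: Sequence[str],
    seen_by_node: Dict[str, int],
) -> Tuple[float, List[str]]:
    """Return (fraction of nodes that saw a probe, nodes that saw none)."""
    nodes = set(all_nodes)
    if not nodes:
        return 0.0, []
    covered = {n for n in nodes if seen_by_node.get(n, 0) > 0}
    missing = sorted(nodes - covered)
    return len(covered) / len(nodes), missing
```

The list of missing nodes is usually more actionable than the ratio alone, since it points at the nodes whose failure the probes would not currently notice.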

Necessary flow control

Black-box monitoring is not a stress test, and it should not affect online services with excessive traffic. Where flow control is needed, its settings have to take two indicators into account: node coverage and functional coverage. For example, in our black-box monitoring of ZooKeeper, the read and write paths have different logic and therefore different load limits, so we configure the monitoring samples for reads and writes separately. That way the samples for both functions meet the coverage requirement without affecting the online service.
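
One simple way to give reads and writes separate budgets is to keep an independent limiter per operation type. The RateLimiter class below is a minimal sketch written for this article, not a library API, and the rates in the example are made-up numbers.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Allow at most max_per_second calls by enforcing a minimum interval."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._next_allowed: Optional[float] = None

    def try_acquire(self) -> bool:
        """Return True if a probe may be sent now, False if it should be skipped."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            return False
        self._next_allowed = now + self._min_interval
        return True


# Separate budgets per function; the rates here are illustrative assumptions.
limits = {"read": RateLimiter(10), "write": RateLimiter(1)}
```

A probe loop would call limits["write"].try_acquire() before each write and skip the sample when it returns False, so a burst of scheduled probes cannot exceed the budget set for that function.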

Data isolation

By its nature, black-box monitoring simulates user behavior and reads and writes directly against online services, so data isolation is essential. The specific isolation method depends on the business scenario. For example, in our black-box monitoring of HDFS, we use a separate unprivileged account, isolated from business accounts, to read and write data under a designated path.

Functional coverage

Black-box monitoring should cover all important functional scenarios as far as possible. This requires a good understanding of the service and of how it is used online.

Timeout processing

Every monitoring request should have a timeout, so that slow service responses do not cause requests to pile up and affect the service.
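
When the client library in use does not expose a timeout of its own, a probe can be wrapped in one. The sketch below uses Python's standard concurrent.futures module; call_with_timeout is a hypothetical helper name, and probe is any zero-argument callable that performs one monitoring request.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Tuple


def call_with_timeout(probe: Callable[[], Any],
                      timeout_s: float) -> Tuple[str, Any]:
    """Run probe() and return ("ok", value), ("timeout", None) or ("error", exc)."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe)
    try:
        return "ok", future.result(timeout=timeout_s)
    except FutureTimeout:
        return "timeout", None
    except Exception as exc:
        return "error", exc
    finally:
        # Do not block on a stuck probe; the worker thread finishes on its own.
        executor.shutdown(wait=False)
```

This only stops the monitor from waiting; the underlying call keeps running in its worker thread until it returns. Where the client offers a native request timeout, setting that as well is the more reliable way to keep requests from accumulating.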

Keep it as simple as possible

The implementation of black-box monitoring should be as simple as possible while still fully simulating the behavior of external users, and it should minimize dependencies on external services. This reduces anomalies in monitoring data caused by problems in those dependencies or in the monitoring itself.
