
How to verify the reliability of a Kafka system

Updated 2025-04-27. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article discusses how to verify the reliability of a system built on Kafka, a topic many people are not familiar with. The following summary is intended to make it easier to understand.

Suppose a system built on Kafka must provide a specific level of reliability, and we have already configured Kafka and handled the producer and consumer applications accordingly. How do we verify that the whole system actually achieves the desired reliability?

1. Overview

Again, reliability is not easy to obtain, and verifying it is not simple either. Verification is divided into three stages:

Verify the Kafka configuration without the application's producers and consumers, and confirm that Kafka behaves as expected.

Add the producer and consumer applications, and confirm that they behave as expected.

After the application goes live, monitor the metrics and logs of both the application and Kafka, and find and fix reliability problems.

2. Verify the configuration

Verification is really a test of whether the actual behavior matches the expected behavior, so the expected results must be defined before verifying. If they are wrong, the verification is unlikely to succeed.

Verifying the configuration does not mean reading the configuration file by eye to check that it is correct; it means using the tools that Kafka provides. Kafka ships two classes in the org.apache.kafka.tools package, VerifiableProducer and VerifiableConsumer, which can be run from the command line or used from various testing frameworks.

VerifiableProducer sends a given number of messages according to the parameters we specify. Each message value is an integer from an increasing sequence. The parameters include acks, the number of retries, the send rate, and so on. While running, it prints whether each message succeeded or failed. VerifiableConsumer consumes the messages produced by VerifiableProducer, prints the message contents in consumption order, and prints events for offset commits and partition reassignments.
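Because both tools report what they did, their output can be compared mechanically. The following is a minimal sketch, assuming each tool prints one JSON object per line, that acknowledged sends are named `producer_send_success`, and that consumed records are named `record_data` (which the consumer is assumed to print only in verbose mode). The sample lines are invented for illustration, and the exact field names may differ between Kafka versions.

```python
import json
from collections import Counter


def parse_events(lines):
    """Return the JSON objects found in tool output, skipping other lines."""
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON event, e.g. a plain log line
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # truncated or malformed line
    return events


def reconcile(producer_lines, consumer_lines):
    """Compare values the producer saw acknowledged with values consumed."""
    acked = [e["value"] for e in parse_events(producer_lines)
             if e.get("name") == "producer_send_success"]
    counts = Counter(e["value"] for e in parse_events(consumer_lines)
                     if e.get("name") == "record_data")
    return {
        "acked": len(acked),
        "consumed": sum(counts.values()),
        "lost": [v for v in acked if v not in counts],
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
    }


# Invented sample output (assumed event shape, not captured from a real run).
producer_out = [
    '{"timestamp":1000,"name":"startup_complete"}',
    '{"timestamp":1001,"name":"producer_send_success","value":"0","partition":0,"offset":0}',
    '{"timestamp":1002,"name":"producer_send_success","value":"1","partition":0,"offset":1}',
    '{"timestamp":1003,"name":"producer_send_success","value":"2","partition":0,"offset":2}',
]
consumer_out = [
    '{"timestamp":2000,"name":"record_data","value":"0","partition":0,"offset":0}',
    '{"timestamp":2001,"name":"record_data","value":"1","partition":0,"offset":1}',
    '{"timestamp":2002,"name":"record_data","value":"1","partition":0,"offset":1}',
]

result = reconcile(producer_out, consumer_out)
print(result)
```

A non-empty `lost` list means an acknowledged message never reached the consumer, which is the outcome a reliability test is meant to catch. Duplicates may or may not be acceptable, depending on the delivery guarantee you expect.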

Let's take a look at the parameters of these two command-line tools:

Since this is a first trial, I will set only a few parameters:

Use VerifiableProducer to send data:

Then use VerifiableConsumer to consume it:
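As a rough sketch of what such invocations might look like, the helper below builds command lines for the `kafka-verifiable-producer.sh` and `kafka-verifiable-consumer.sh` wrapper scripts. The topic name, group id, and broker address are hypothetical. The `--bootstrap-server` flag is assumed to be available, as in recent Kafka releases; older releases use `--broker-list` instead, so check the tool's `--help` output for your version.

```python
import shlex


def producer_cmd(topic, bootstrap, max_messages, acks=-1, throughput=-1):
    """Build an argument list for the verifiable producer script."""
    return [
        "bin/kafka-verifiable-producer.sh",
        "--topic", topic,
        "--bootstrap-server", bootstrap,    # assumed flag; older versions: --broker-list
        "--max-messages", str(max_messages),
        "--acks", str(acks),                # -1 waits for all in-sync replicas
        "--throughput", str(throughput),    # -1 means no throttling
    ]


def consumer_cmd(topic, bootstrap, group_id, max_messages):
    """Build an argument list for the verifiable consumer script."""
    return [
        "bin/kafka-verifiable-consumer.sh",
        "--topic", topic,
        "--bootstrap-server", bootstrap,    # assumed flag; older versions: --broker-list
        "--group-id", group_id,
        "--max-messages", str(max_messages),
        "--reset-policy", "earliest",       # start from the beginning of the topic
        "--verbose",                        # also print each consumed record
    ]


# Hypothetical values: a local single-broker setup and a throwaway topic.
send = producer_cmd("reliability-test", "localhost:9092", max_messages=5)
receive = consumer_cmd("reliability-test", "localhost:9092",
                       group_id="verify-group", max_messages=10)

print(shlex.join(send))
print(shlex.join(receive))
```

The argument lists can be passed to `subprocess.run` as they are, which avoids shell quoting problems when the test is driven from a script.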

Because max-messages is set to 10 and the topic contains only five messages, the consumer does not exit: it keeps waiting for more messages.

The above is only a demonstration. With a single, very stable broker, nothing interesting happens, so real testing needs more complex scenarios:

Leader election: if the broker hosting the partition leader is shut down, how long do the producer and consumer take to recover?

Controller election: if the controller is restarted, how long does the whole system take to recover?

Rolling restart: can the brokers be restarted one by one without losing a single message?

Unclean leader election: when an unclean leader election occurs, what happens to the producer and consumer, and are the consequences acceptable?

Build the test scenarios according to your actual needs. Once all the tests have passed, you can move on to the next step.
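For the "how long does it take to recover" questions above, one simple measurement is the longest pause between acknowledged sends while the fault is injected. The sketch below assumes producer events carry a millisecond `timestamp` field and use the names `producer_send_success` and `producer_send_error`. The sample events are invented.

```python
def longest_ack_gap_ms(events):
    """Longest interval, in milliseconds, between two consecutive acked sends."""
    stamps = sorted(e["timestamp"] for e in events
                    if e.get("name") == "producer_send_success")
    if len(stamps) < 2:
        return 0
    return max(later - earlier for earlier, later in zip(stamps, stamps[1:]))


def count_send_errors(events):
    """Number of sends the producer reported as failed."""
    return sum(1 for e in events if e.get("name") == "producer_send_error")


# Invented timeline: the leader's broker is stopped shortly after t=1200 ms.
events = [
    {"timestamp": 1000, "name": "producer_send_success", "value": "0"},
    {"timestamp": 1100, "name": "producer_send_success", "value": "1"},
    {"timestamp": 1200, "name": "producer_send_success", "value": "2"},
    {"timestamp": 1300, "name": "producer_send_error", "value": "3"},
    {"timestamp": 2500, "name": "producer_send_error", "value": "4"},
    {"timestamp": 4700, "name": "producer_send_success", "value": "5"},
    {"timestamp": 4800, "name": "producer_send_success", "value": "6"},
]

gap = longest_ack_gap_ms(events)
errors = count_send_errors(events)
print(f"longest gap without acks: {gap} ms, failed sends: {errors}")
```

Comparing the measured gap with the recovery time you decided on beforehand turns the scenario into a pass or fail check.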

3. Validate the application

This step is verified in much the same way as the previous one, except that the producer and consumer are replaced by your own application code. Keep the Kafka configuration unchanged, start the application's producers and consumers, and test in the constructed scenarios, such as:

Producers and consumers are disconnected from the Kafka cluster

A leader election occurs

Brokers perform a rolling restart

Consumers perform a rolling restart

Producers perform a rolling restart

If the test results do not meet expectations, find the cause, fix it, and proceed to the next step after all verification has passed.
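One way to keep such a scenario list repeatable is a small harness that injects each fault and then runs the same check. This is only a sketch: the fault hooks and the `no_message_lost` check are hypothetical stand-ins that you would replace with code driving your own cluster and application.

```python
def run_scenarios(scenarios, verify):
    """Run each fault-injection hook, then record whether the check passed."""
    report = {}
    for name, inject_fault in scenarios.items():
        try:
            inject_fault()
            report[name] = "pass" if verify() else "fail"
        except Exception as exc:  # a broken hook should not stop the run
            report[name] = f"error: {exc}"
    return report


# Hypothetical stand-ins so that the sketch runs without a cluster.
state = {"lost_messages": 0}


def restart_brokers_one_by_one():
    pass  # a real hook would restart each broker in turn


def trigger_leader_election():
    state["lost_messages"] += 1  # simulate the application losing a message


def no_message_lost():
    return state["lost_messages"] == 0


report = run_scenarios(
    {
        "broker rolling restart": restart_brokers_one_by_one,
        "leader election": trigger_leader_election,
    },
    verify=no_message_lost,
)
print(report)
```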

4. Online monitoring

This step is very important: if the first two steps missed something, or there was no time to complete them, monitoring ensures that problems are found promptly and losses are avoided.

Monitoring can cover JMX metrics, logs, and more complex custom metrics.

JMX monitoring

Kafka exposes its own metrics over JMX. Brokers, producers, and consumers each have different metrics worth watching.

For brokers, many metrics are worth monitoring, such as the number of partitions below the minimum ISR size, the number of under-replicated partitions, the number of offline partitions, the number of active controllers, the rate of failed produce requests, and the rate and duration of leader elections.
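As an illustration, the check below evaluates already-collected broker metric values against the values a healthy cluster normally shows. It is a sketch under assumptions: the MBean names are the ones given in the Kafka monitoring documentation as far as I know, so verify them against your broker version. How the values are collected (a JMX exporter, Jolokia, and so on) is left out, and the sample readings are invented.

```python
UNDER_MIN_ISR = "kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount"
UNDER_REPLICATED = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
OFFLINE = "kafka.controller:type=KafkaController,name=OfflinePartitionsCount"
ACTIVE_CONTROLLER = "kafka.controller:type=KafkaController,name=ActiveControllerCount"

# Metrics that are expected to be zero on every broker of a healthy cluster.
EXPECT_ZERO = (UNDER_MIN_ISR, UNDER_REPLICATED, OFFLINE)


def check_cluster(samples):
    """samples maps broker id to {mbean name: value}; returns a list of problems."""
    problems = []
    for broker, metrics in samples.items():
        for mbean in EXPECT_ZERO:
            value = metrics.get(mbean)
            if value is not None and value != 0:
                problems.append(f"broker {broker}: {mbean} = {value}")
    # Exactly one broker in the cluster should report itself as active controller.
    controllers = sum(m.get(ACTIVE_CONTROLLER, 0) for m in samples.values())
    if controllers != 1:
        problems.append(f"cluster: active controller count = {controllers}")
    return problems


# Invented readings for a two-broker cluster.
healthy = {
    1: {UNDER_MIN_ISR: 0, UNDER_REPLICATED: 0, OFFLINE: 0, ACTIVE_CONTROLLER: 1},
    2: {UNDER_MIN_ISR: 0, UNDER_REPLICATED: 0, OFFLINE: 0, ACTIVE_CONTROLLER: 0},
}
degraded = {
    1: {UNDER_MIN_ISR: 0, UNDER_REPLICATED: 4, OFFLINE: 0, ACTIVE_CONTROLLER: 0},
    2: {UNDER_MIN_ISR: 0, UNDER_REPLICATED: 0, OFFLINE: 0, ACTIVE_CONTROLLER: 0},
}

print(check_cluster(healthy))
print(check_cluster(degraded))
```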

For producers, the two metrics related to reliability are the average per-second rate of record sends that resulted in errors (record-error-rate) and the average per-second rate of retried record sends (record-retry-rate). If either of them rises, something is probably wrong with the system.
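A very simple way to flag "rising" is to check whether the last few samples of a rate are strictly increasing. This is a sketch of one possible rule, not an established alerting policy: the window size is an arbitrary assumption, and the samples are invented values standing in for periodic readings of a producer's `record-error-rate`.

```python
def is_rising(samples, window=3):
    """True if the last `window` samples are strictly increasing."""
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough data to judge a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Invented readings of record-error-rate, one per scrape interval.
error_rate = [0.0, 0.0, 0.1, 0.4, 0.9]
steady_rate = [0.2, 0.1, 0.1]

print(is_rising(error_rate))
print(is_rising(steady_rate))
```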

For consumers, the most important metric is consumer lag, which indicates how far the consumer's current position is behind the latest message in each partition of the topic. Ideally it fluctuates between 0 and a very small value. If it grows beyond a certain threshold, it needs attention.
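Lag for a partition is the difference between the partition's log end offset and the consumer group's committed offset. The sketch below computes it from two plain dictionaries. How the offsets are fetched is left out, the topic name and threshold are hypothetical, and treating a partition without a committed offset as position 0 is an assumption made only for this example.

```python
def consumer_lag(end_offsets, committed):
    """Per-partition lag: log end offset minus committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: no committed offset yet means nothing has been consumed.
        position = committed.get(partition, 0)
        lag[partition] = max(end - position, 0)
    return lag


def over_threshold(lag, threshold):
    """Partitions whose lag exceeds the alert threshold."""
    return {p: v for p, v in lag.items() if v > threshold}


# Invented offsets for a hypothetical topic "orders" with three partitions.
end_offsets = {("orders", 0): 1500, ("orders", 1): 1480, ("orders", 2): 900}
committed = {("orders", 0): 1498, ("orders", 1): 310}

lag = consumer_lag(end_offsets, committed)
print(lag)
print(over_threshold(lag, threshold=1000))
```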

Log monitoring

Log monitoring for Kafka is not much different from log monitoring for other applications. Pay attention to WARN and ERROR entries in the logs, since any exception may affect reliability.
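A minimal sketch of such a scan is shown below. It assumes the log lines start with a bracketed timestamp followed by the level, as in `[2025-04-27 10:00:06,300] ERROR ...`; adjust the pattern if your logging layout differs. The sample lines are invented.

```python
import re
from collections import Counter

# Assumed layout: "[timestamp] LEVEL message"
LINE = re.compile(r"^\[(?P<time>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def count_levels(lines, levels=("WARN", "ERROR")):
    """Count log lines whose level is one of `levels`."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group("level") in levels:
            counts[match.group("level")] += 1
    return counts


# Invented log excerpt; stack-trace continuation lines do not match the pattern.
log_lines = [
    "[2025-04-27 10:00:00,001] INFO example startup message (example.Logger)",
    "[2025-04-27 10:00:05,120] WARN example warning text (example.Logger)",
    "[2025-04-27 10:00:06,300] ERROR example error text (example.Logger)",
    "    at example.Stack.trace(Stack.java:1)",
    "[2025-04-27 10:00:07,450] WARN another example warning (example.Logger)",
]

counts = count_levels(log_lines)
print(dict(counts))
```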

Other monitoring

If JMX and log monitoring are not enough, you can extend them or add other monitoring yourself. The metrics exposed over JMX can be extended and the log output can be increased, although this may require modifying the source code.

Monitoring systems

Generally speaking, Kafka monitoring should be handled by a dedicated monitoring and fault-management system. I have used two systems to monitor Kafka: Xiaomi's Open-Falcon and InfluxData's Telegraf + InfluxDB + Grafana suite. Both work well: they allow flexible customization of what is monitored and support a variety of alerting methods. For example, Open-Falcon supports email and WeChat alerts, and Grafana's dashboards look quite good. There are surely many other options, but I have not used them, so I will not comment on them.
