How to Analyze the Data Reliability of Apache TubeMQ



1. Preface

We have previously introduced the applicable scenarios of Apache TubeMQ: TubeMQ is suited to business scenarios that can tolerate a small amount of data loss in extreme cases. So under what circumstances can data loss occur in TubeMQ? Why is it designed this way? How do similar MQ systems handle this? This document answers these questions.

2. Conclusion

Apache TubeMQ stores data on a single node, with multi-disk RAID10 as its replication scheme. Data may be lost only in the following scenarios:

When the machine loses power, data that has been successfully acknowledged but not yet consumed and is still only in memory will be lost; after the machine comes back online, the data already stored on disk is unaffected.

When the RAID10 array cannot absorb a disk failure (a persistent disk-group exception), data that has been acknowledged but not yet consumed is affected; after the disks are repaired, the stored but unrecovered data is lost.

Ordinary day-to-day bad disks are absorbed by RAID10, and production and consumption on the affected Broker node are not impacted.

3. Can a quantitative data-reliability metric be given to evaluate the reliability of the Apache TubeMQ system?

This is the question that comes up most often in discussions. The intuitive reaction is that machines fail easily, TubeMQ has only a single node, so its data reliability must be low. My view, as expressed in other introductions, is that a data-reliability percentage is not an appropriate way to characterize the system's data reliability, because it only reflects the outcome and says nothing about what actually causes or prevents data loss. From the loss scenarios introduced above, it is clear that the condition of the production machine room, the server hardware, and whether the business consumes its data promptly directly determine the system's data reliability; it is failures in those areas that make the figure high or low. Without such failures, the data reliability of the TubeMQ system is effectively 100%. We should therefore evaluate and analyze the system by the failure rate of the environmental events that can cause data loss; in my view that is the fundamental metric.

According to the failure statistics of our online environment in 2019, the failure rate of events that could cause data loss in our TubeMQ clusters was about 2.67%: with roughly 1,500 machines in the entire TubeMQ cluster handling an average of about 35 trillion records per day, around 40 servers experienced machine ping failures or persistent RAID10 disk-group damage over the whole year. I believe this failure rate could be reduced further, because most of the machines in our TubeMQ clusters are second-hand servers retired from key businesses after several years of service.
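
For reference, the figure above is simply the number of affected servers over the cluster size, using the approximate numbers quoted in this section:

```latex
\text{failure rate} \approx \frac{40\ \text{affected servers}}{1500\ \text{servers}} \approx 2.67\%
```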

4. Why is Apache TubeMQ designed with single-node storage?

Cost reduction: as we all know, achieving 100% data reliability is very costly. To guarantee that no data is ever lost, an MQ pipeline would have to be built somewhat like a spacecraft, with multiple sets of independent nodes holding redundant backups of the data. From our analysis, about 90% of the business data transmitted through MQ can tolerate a small amount of loss in extreme cases, while about 10% must not lose a single record, such as transaction flows and money-related log data. If that 10% of highly reliable data is carried separately, a great deal of cost can be saved. Based on this idea, TubeMQ is responsible for the data services that require high performance and can tolerate a small amount of data loss in extreme cases.

In addition to the scheme considerations above, we also made careful trade-offs in the design of TubeMQ's storage layer: data and index files are organized by Topic, and a layer of memory storage is stacked on top of the files as an extension of the disk storage. At the same time, the fact that the business tolerates a small amount of loss does not mean we are careless about it. For example, data production uses a QoS1 scheme; data storage performs forced cache flushes (by message count, by time, by size); on disk failure, the server side automatically switches the Broker node to read-only or takes it offline according to operational policy (ServiceStatusHolder); and the production side automatically detects abnormal Broker nodes and shields them algorithmically based on their service quality. All of this is done to achieve high performance while improving data reliability as much as possible and reducing the chance of data loss.
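
To make the flush behaviour concrete, here is a minimal sketch of the count/time/size flush triggers described above, assuming hypothetical class and threshold names; this is not TubeMQ source code.

```java
// Minimal sketch of a count/time/size flush policy for an in-memory write cache.
// This is NOT TubeMQ source code; names and thresholds are illustrative only.
public final class FlushPolicy {
    private final int maxMessages;      // flush after this many buffered messages
    private final long maxBytes;        // flush after this many buffered bytes
    private final long maxIntervalMs;   // flush after this much time has passed

    private int bufferedMessages = 0;
    private long bufferedBytes = 0;
    private long lastFlushTimeMs = System.currentTimeMillis();

    public FlushPolicy(int maxMessages, long maxBytes, long maxIntervalMs) {
        this.maxMessages = maxMessages;
        this.maxBytes = maxBytes;
        this.maxIntervalMs = maxIntervalMs;
    }

    /** Record an appended message; return true if the cache should be flushed to disk now. */
    public synchronized boolean onAppend(long messageSizeBytes) {
        bufferedMessages++;
        bufferedBytes += messageSizeBytes;
        long now = System.currentTimeMillis();
        return bufferedMessages >= maxMessages
                || bufferedBytes >= maxBytes
                || (now - lastFlushTimeMs) >= maxIntervalMs;
    }

    /** Reset counters after the cache has actually been flushed. */
    public synchronized void onFlushed() {
        bufferedMessages = 0;
        bufferedBytes = 0;
        lastFlushTimeMs = System.currentTimeMillis();
    }
}
```

A power loss in the window between a message being acknowledged into such a cache and the next actual flush is exactly the in-memory loss scenario listed in the conclusion above.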

How much cost can be saved? Take the operating data publicly disclosed by several external companies using Kafka: for about 1 trillion records per day, Kafka needs roughly 200 to 300 ten-gigabit machines, while according to our 2019 operating data, TubeMQ needs roughly 40 to 50 ten-gigabit machines for the same volume. There are special cases to account for, such as independent clusters and differing numbers of specific business units, but the ratio between the two machine-cost figures should not differ much. Converted into money, the server cost saved is measured in the hundreds of millions.

Some will ask: could I simply run single-replica Kafka for my business and achieve the same cost-effectiveness as TubeMQ? My point is that if that were possible, we would not have spent so much time and so many resources improving TubeMQ; we would have used single-replica Kafka directly. In the early days of our open-source work we published a comprehensive single-Broker performance comparison, tubemq_perf_test_vs_Kafka, where the specific differences can be found.

What I want to convey here is that the data reliability of the TubeMQ system itself is not actually low. Have you ever considered how reliable the data really is under the multi-replica schemes of the various MQ systems?

5. Multi-replica scheme analysis of similar MQ systems:

Kafka: from my personal analysis, Kafka's multi-replica scheme in high-performance scenarios only makes a best effort to ensure that multiple copies of the data are not lost.

Kafka's replica mechanism tracks the assigned replicas and the online, in-sync replicas through the AR and ISR sets, and uses the replica.lag.time.max.ms parameter, together with the most recent synchronization time recorded for each replica, to decide whether a Follower is still in sync with the Leader. Before version 0.9.x Kafka also had a replica.lag.max.messages parameter (since removed), the number of messages by which a Follower lags behind the Leader, and the two were combined to identify failed replicas. At runtime, Kafka ensures that data lands on multiple nodes through the ISR size and the acknowledgement level requested: on the server side, min.insync.replicas sets the minimum number of in-sync replicas, and the client can specify the requested acks (0: no reply needed; 1: the Leader has stored it; -1: all ISR nodes have replied) to ensure data is received by multiple replicas.
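
To make the acknowledgement levels concrete, here is a minimal sketch using the standard Kafka producer API (the bootstrap address and topic name are placeholders) that requests acks=all, so a send only completes once every in-sync replica has the record; combined with a broker-side min.insync.replicas setting, this is the strongest durability Kafka's replica scheme offers:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");          // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all (-1): the Leader replies only after every replica in the ISR has the record.
        props.put("acks", "all");
        // Bound how long a send may keep retrying before it is reported as failed.
        props.put("delivery.timeout.ms", "30000");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The returned future completes only when the requested ack level has been satisfied.
            producer.send(new ProducerRecord<>("demo-topic", "key", "value")).get();
        }
    }
}
```

With, for example, min.insync.replicas=2 configured on the topic, such a send fails with NotEnoughReplicasException once the ISR shrinks below two members, which is the write-blocking behaviour discussed below.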

From the design of this mechanism it is clear that even Kafka's designers understood that data may not be synchronized from the Leader to every Follower in time. The ISR criterion was changed from (lag count, synchronization time) to (synchronization time) alone precisely because lag affected how many replicas counted as in-sync, and the ISR size in turn affects whether the corresponding Topic can be written: if a Topic's ISR count drops to 0, native Kafka cannot write messages to it. For this reason Kafka added the unclean.leader.election.enable parameter, which allows an out-of-sync replica to be elected Leader and provide best-effort service externally.
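
As an illustration of that availability-versus-durability trade-off, the snippet below uses Kafka's standard AdminClient (the bootstrap address and topic name are placeholders) to enable unclean leader election for one topic, explicitly trading the durability of unsynchronized records for availability:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class UncleanElectionToggle {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");          // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            // Allow an out-of-sync replica to become Leader: the topic stays writable when the
            // ISR is empty, at the price of losing records the new Leader never received.
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "true"),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> changes =
                    Collections.singletonMap(topic, Collections.singletonList(op));
            admin.incrementalAlterConfigs(changes).all().get();
        }
    }
}
```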

From the above analysis, Kafka's replica mechanism can satisfy replay (backtracking) consumption in big-data scenarios: after the machine holding the primary replica fails, data that was already synchronized to a follower replica can still be replayed by the consuming business. However, because of the loss problems described above, scenarios that require both replay and a guarantee of no data loss cannot be satisfied. At the same time, the scheme consumes a lot of resources at very low utilization: with 2 replicas configured, resources at least double and network bandwidth is reduced by a factor of one to two; and to avoid a 2-replica Topic's ISR dropping to 0, users are likely to configure 3 replicas, which increases resource usage further and lowers utilization even more. Maintaining a still-unreliable data service at such a high cost is neither cheap nor effective. Finally, when a partition has no usable replica, Kafka blocks production and traffic drops straight to zero, which is unacceptable to the business in a high-traffic environment; even with 3 replicas, since the survival of the replicas is dynamic, production can still be blocked in extreme cases.

Pulsar: from my personal analysis, it can guarantee that data is not lost, but this comes at a cost in big-data scenarios.

Pulsar adopts a mode similar to the Raft protocol: a write succeeds once the majority of replicas have it, with the server actively pushing requests to each BookKeeper replica node. This real-time multi-replica synchronization scheme meets the vast majority of high-reliability business needs, and users are discerning; I think Pulsar's recent popularity has much to do with it meeting this demand in the market. However, in big-data scenarios with thousands of Topics and tens of thousands of Partitions, such a multi-replica scheme consumes a great deal of machine resources. Our TEG data platform therefore uses Pulsar internally for highly reliable data, and we also contribute our improvements to Pulsar back to the community.
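
The toy sketch below (not Pulsar or BookKeeper code; all names are illustrative) shows the majority-acknowledgement idea in its simplest form: the client-visible write completes only once a quorum of replica nodes has confirmed it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Toy illustration of quorum-style replication acknowledgement; not Pulsar/BookKeeper code.
public final class QuorumWrite {
    private final CountDownLatch quorumLatch;

    public QuorumWrite(int replicaCount) {
        // A majority quorum: e.g. 2 acks out of 3 replicas.
        this.quorumLatch = new CountDownLatch(replicaCount / 2 + 1);
    }

    /** Called when one replica node confirms it has durably stored the entry. */
    public void onReplicaAck() {
        quorumLatch.countDown();
    }

    /** The client-visible write succeeds only after a majority of replicas have acked. */
    public boolean awaitQuorum(long timeoutMs) throws InterruptedException {
        return quorumLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```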

TubeMQ: as described in its applicable scenarios, TubeMQ is designed for business data-reporting pipelines that can tolerate a small amount of data loss in extreme cases. Following the business's cost and data-reliability requirements, it takes a different, self-developed path and achieves different results in how the system behaves under fatal exceptions:

As long as any Broker in the Broker set assigned to a Topic is alive, that Topic's external service remains available.

Building on the first point, as long as every Topic in the cluster still has at least one of its Brokers alive, the Topic services of the entire cluster remain available.

Even if all Master control nodes are down, new production and consumption in the cluster are affected, but already-registered producers and consumers are not, and they can continue to produce and consume.

On the premise of accepting lossy service, TubeMQ follows the idea of keeping data loss and service blocking to a minimum while keeping the scheme simple and easy to maintain. In TubeMQ's design, a partition failure does not affect the overall external service of a Topic: as long as the Topic has one live partition, its overall external service is not blocked. TubeMQ's data latency is at the millisecond level at P99, which lets businesses consume data as quickly as possible and thus lose as little as possible. TubeMQ's distinctive storage design delivers at least 50% higher TPS than Kafka on a single Broker (about double on some machine models), and the different storage scheme allows a single machine to hold more Topics and partitions, which makes the cluster larger and reduces maintenance cost. These different considerations and implementations combine to give TubeMQ a low-cost, high-performance, and highly stable foundation.
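
As a final illustration (a hypothetical sketch, not TubeMQ's actual producer code), the point that a Topic stays writable as long as any one of its partitions is alive can be expressed as a simple selection loop over the partitions the producer currently considers healthy:

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of round-robin partition selection that skips unavailable brokers;
// not TubeMQ source code.
public final class PartitionSelector {
    public static final class Partition {
        final String brokerId;
        final int partitionId;
        volatile boolean available;   // updated by broker health detection / shielding logic

        Partition(String brokerId, int partitionId, boolean available) {
            this.brokerId = brokerId;
            this.partitionId = partitionId;
            this.available = available;
        }
    }

    private final AtomicInteger cursor = new AtomicInteger();

    /** Pick the next available partition; empty only if every partition of the Topic is down. */
    public Optional<Partition> select(List<Partition> partitions) {
        for (int i = 0; i < partitions.size(); i++) {
            Partition p = partitions.get(
                    Math.floorMod(cursor.getAndIncrement(), partitions.size()));
            if (p.available) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }
}
```

Only when select() returns empty, meaning every partition of the Topic is currently unavailable, would production for that Topic actually be blocked.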

