How to understand distributed message system kafka 07/01 Update SLTechnology News&Howtos

How to understand distributed message system kafka

2025-07-01 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)05/31 Report--

Today, I will talk to you about how to understand the distributed messaging system kafka, which may not be well understood by many people. in order to make you understand better, the editor has summarized the following for you. I hope you can get something according to this article.

Kafka: a distributed messaging system

1. Background

Recently, due to the need of work, I have investigated the lightweight messaging system Kafka, which is in pursuit of high throughput, and intend to replace the ActiveMQ running online, mainly because the daily traffic of next year's budget is one billion, and the distributed implementation of ActiveMQ is very strange, so I hope to find a suitable distributed messaging system.

The following is the content of some knowledge and experience summarized in the process of research, welcome to clap bricks.

two。 Basic knowledge

2.1. What is message queuing

First of all, let's take a look at what a message queue is, and the Wikipedia explanation is translated as follows:

Queues provide an asynchronous communication protocol, which means that the sender and receiver of the message do not need to keep in touch with the message at the same time, and the message sent by the sender is stored in the queue until the receiver gets it.

Generally speaking, the sender of the message is called the producer, and the receiver of the message is called the consumer. Note that the two words in the definition are "asynchronous". Usually, the production speed of the producer is not equal to the consumption speed of the consumer; if the two programs always communicate synchronously, it is bound to have a free time. If the two programs continue to run, the average speed of consumers must be higher than that of producers, otherwise queue hoarding will increase; of course, if consumers do not have timely demand, they can also hoard messages in queues and concentrate on consumption.

At this point, let's talk about the classification of queues. Generally speaking, queues can be divided into three categories according to the differences between producers and consumers:

The first category is within an application (between processes or threads). It is believed that when you learn multithreading, you have written a "producer consumer" program, and the producer is responsible for production, putting the production results into a buffer (such as a shared array). Consumers take out consumption from the buffer, here, this buffer can be called "message queue".

In fact, the second category is also a special case of the first category, just as we like to treat operating systems and applications differently, the operating system has to deal with numerous complicated things, and the data exchange between processes and threads is supported by message queues.

The third category is "message queue" in a more general sense, which mainly acts on different applications, especially across machines and platforms, which makes the exchange of data more extensive. in general, an independent queue product not only realizes message transmission, but also provides the corresponding reliability, transaction, distributed and other characteristics, which decouple producers and consumers. Common consumer queue products can be divided into two categories according to whether they are open source or not:

Proprietary software: IBM WebSphere MQ,MSMQ...

Open source software: ActiveMQ, RabbitMQ, Kafka...

2.2.JMS and AMQP

Well, for the above third type of "message queue", in order to provide the function of message queue in different machines, there must be a unified specification, then SUN pops up, and as a cross-platform JAVA, it is bound to support cross-platform message transmission. based on this, SUN provides a set of message standards: Java Message Service, abbreviated JMS, but this set of specifications defines API-level standards, which can be easily exchanged in the JAVA system. However, for other platforms, it may be necessary for the message queuing product itself to support multi-protocols (such as OpenWire, STMOP).

The definition of AMQP is lower than JMS, which can be seen from the name (Advanced Message Queuing Protocol). It defines the protocol of Wire-level, which naturally has the characteristics of cross-platform and cross-language. The message queue based on this implementation can interact with any platform that supports the protocol.

One is API at JAVA level, and the other is Wire-level protocol, which is the most essential difference between JMS and AMQP. At the same time, there are two obvious differences between the two standards:

First, the message passing model; JMS is relatively simple, supporting the two most common Peer-2-Peer and publisher/subscriber; popular point-to-point and broadcast modes; while the definition of AMQP is more complex, it defines an exchange&binding mechanism, which supports five models: direct exchange, fanout exchange, topic exchange, headers exchange, system exchange, which is essentially the same as P2P and PUB/SUB, but more detailed.

The second is the supported message types. JMS supports multiple message models: TextMessage, MapMessage, BytesMessage, StreamMessage, ObjectMessage, Message, etc., while AMQP only has an byte array.

2.3.ActiveMQ

ActiveMQ is based on JMS implementation of Provider (can be understood as queue), it supports a variety of protocols, such as OpenWire,Stomp,AMQP, based on this, supports multiple platforms; supports transactions, supports distribution strategies, and a variety of message models above. Instead of going into the features of ActiveMQ, we will focus on the distributed model of ActiveMQ.

ActiveMQ supports distribution, it supports Master-Slave to provide high availability, and Broker-Cluster to provide load balancing, but its load is based on a Forwarding Bridge mechanism.

Under this mechanism, a cancellation will only be held by one broker at any time, and messages sent by producer may be forwarded by multiple broker before finally reaching consumer. It is conceivable that when there are more and more broker, almost every consumption will be forwarded, resulting in a significant decline in efficiency; and under this complex logic, the addition and removal of any broker appears to be very complex. These two points are the fundamental reasons why I do not recommend using ActiveMQ distributed clusters.

one

3.Kafka

OK, finally, let's talk about today's protagonist Kafka. I have never found an allusion to this strange name. Maybe it is the name of the developer's crush on the girl (gay friend) ^ _ ^. Kafka was developed by linkin. The original purpose is to deal with the huge activity stream data of linkin (login, browsing, clicking, sharing, liking, etc.). This part of the data capacity is huge, but the reliability requirement is not high. So increase throughput by sacrificing some reliability (which doesn't mean our data will be lost as a percentage, which we'll talk about later). It cuts out many complex features, such as transactions, distribution strategies, a variety of message models, etc.; it supports both online and offline consumption by persisting messages to disk through its own unique design; and it is designed for distribution. There is no stand-alone mode (or stand-alone mode is a special case of distribution) and can be well extended. In practical applications, Kafka can be used for message queuing, streaming (usually combined with storm), log aggregation and so on.

3.1. Architecture

two

Let's first take a macroscopic look at the architecture of Kafka. Producer cluster obtains the partition list corresponding to the written topic through zookeeper (what is actually written in broker list), and then sends messages sequentially (supporting its own distribution strategy). Broker cluster is responsible for message storage and delivery, supports Master Slaver model, and can be distributed and expandable; Consumer cluster obtains the partition list of topic from zookeeper, and then consumes. A partition can only be consumed by one consumer. Name Server clusters (usually zookeeper) provide coordination information such as name services. As for what is topic and what is partition, let's look at it next.

3.2.Topic

Topic is the queue identification of producer production and consumer consumption. A Topic consists of one or more partitions, and each partition can be stored on a separate broker. Consumers can send messages to any partition to achieve the distribution of production, and any partition can be and only be sent by one consumer message to achieve the distribution of consumption. Therefore, the design of partition provides a distributed basis.

three

At the same time, we can also find another advantage of this design from the figure above, because the messages in each partition are orderly, and a partition can only be consumed by one consumer, so Kafka can provide message ordering at the partition level, while traditional queues cannot guarantee order in the case of multiple consumer.

3.3. Message passing model

Traditional message queues provide at least two message models, a P2P and a PUB/SUB, but Kafka does not do so. Cleverly, it provides the concept of a consumer group, a message can be consumed by multiple consumer groups, but only by one consumer in a consumer group, so it is equivalent to the P2P model when there is only one consumer group, and the PUB/SUB model when there are multiple consumer groups.

four

3.4. Message persistence

In order to improve efficiency, many systems and components generally want to throw all the data into memory and then flush to disk regularly; but in fact, the same is true of modern operating systems, all modern operating systems are willing to convert free memory into disk cache (page cache), so it is difficult not to use it. For such a system, he keeps a copy of his data in memory as well as a copy in OS's page cache, which not only adds one step but also reduces memory usage by half; therefore, Kafka decides to use page cache directly. But random writes are slow, and additional operations and storage are needed to maintain the order of relationships between each other, which can be avoided by linear writes. In fact, the speed of linear writes (linear write) is about 300MB/ seconds, but then writes are only 50k/ seconds, and the difference is nearly 10000 times. In this way, the page cache-centered design of Kafka not only ensures efficiency, but also provides message persistence, and each consumer maintains the offser of the current read data (which can also be delegated to zookeeper), thus supporting both online and offline consumption.

3.5.Push vs. Pull

For message consumption, ActiveMQ uses the PUSH model, while Kafka uses the PULL model, both of which have their own advantages and disadvantages. For PUSH,broker, it is difficult to control the speed at which data is sent to different consumers, while PULL can be controlled by consumers themselves, but the PULL model may cause consumers to be blind without messages, which can be alleviated by long polling mechanism, while for streaming systems with messages delivered almost all the time. This effect is negligible.

3.6. Reliability.

Just said that Kafka sacrificed some reliability to improve throughput, many students may be worried about the loss of messages, so let's take a look at the reliability in various situations.

five

For the above model, let's look at it separately.

First, let's take a look at the reliability of message delivery. Kafka provides three modes for a message to be delivered successfully. The first one is regardless of everything, and sending it out is regarded as a success. Of course, this situation does not guarantee the successful delivery of the message to broker;. The second is for the Master Slave model, which is considered successful only when Master and all Slave receive messages. This model provides the highest delivery reliability, but damages the performance. The third model, that is, as long as the Master acknowledges the receipt of the message, it is considered to be delivered successfully; in actual use, it is selected according to the application characteristics, and the third model will neutralize the reliability and performance in most cases.

Let's look at the reliability of the message on broker, because the message is persisted to disk, so if you normally stop a broker, the data on it will not be lost. However, if the stop is not normal, the message that the page cache is too late to write to the disk may be lost, which can be alleviated by configuring the cycle and threshold of the flush page cache, but it will also affect the performance of writing to the disk frequently. It is another multiple choice question, which can be configured according to the actual situation.

Next, let's take a look at the reliability of message consumption. Kafka provides a "At least once" model, because the reading progress of messages is provided by offset, and offset can be maintained by consumers themselves or in zookeeper, but when consumer dies and offset does not write back immediately after message consumption, repeated reading may occur. This situation can also be alleviated by adjusting commit offset cycle and threshold. Even consumers themselves make consumption and commit offset a transaction solution, but if your app doesn't care about reconsumption, don't solve it at all in exchange for maximum performance.

Finally, let's take a look at the reliability of zookeeper. Obviously, he is going to die, everything is over, the earth will be destroyed, mankind will be extinct, and star travel cannot be saved. So the way to increase reliability is to deploy zookeeper as a cluster.

3.7. Performance

All right, having said that, let's actually test the performance of Kafka in various situations. For comparison, I also tested the performance of ActiveMQ in stand-alone mode. However, due to laziness, I did not set up an ActiveMQ cluster for testing, but based on its disgusting Forwarding Bridge model, I am also pessimistic.

First of all, the test environment is as follows:

Kafka:3 broker;8 core / 32G; default configuration

ActiveMQ:1 broker;8 core / 32G; default configuration

Producer: a machine simulates multi-producer;8 cores / 32G through multi-threading; default configuration, asynchronous sending

Consumer: a machine simulates multi-consumer;8 cores / 32G through multithreading; default configuration

Except for special instructions, production and consumption take place at the same time.

Then, I use the following characters to represent the various test conditions:

1T-1P3C-1P1C-1KW-1K:

1T:1 toipc

1P3C:1 partition 3 replication

1P1C:1 producer 1 consumer

1KW: 10 million messages

1K: 1K per message

I first tested ActiveMQ in the case of multi-Producer and multi-consumer on a single machine, and the result is better than I expected. The official data given is 1-2K data, 10-20K per second, so it is about 30-40MB/S, and the test result will be better in the case of multi-thread.

ActiveMQ-thread Produce Consume 1T-XXX-1P1C-1KW-1K 28.925MB/S 28.829MB/S 1T-XXX-3P3C-1KW-1K 43.711MB/S 41.791MB/S 1T-XXX-8P8C-1KW-1K 52.426MB/S 52.383MB/S

Then I tested the Kafka and simulated the stand-alone mode with a partition. As expected, under the stand-alone model, there was little difference between the two; while the official data said that producers could reach 50MB/S, consumers could reach 100MB/S, producers met the official data, but consumers I never pressed to that high speed.

Kafka- thread Produce Consume 1T-1P1C-1P1C-1KW-1K 29.214MB/S 29.117MB/S 1T-1P1C-3P3C-1KW-1K 46.168MB/S 43.018MB/S 1T-1P1C-8P8C-1KW-1K 52.140MB/S 51.975MB/S

For the next Kafka cluster, I wonder if the same number of messages will be affected by the increase in the number of topic. The test results are as follows, indicating that the more topic, the speed will slow down, which is also in line with expectations.

Kafka-topic Produce Consume 1T-3P3C-3P3C-1.2KW-1K 49.255MB/S 49.204MB/S 3T-3P3C-3P3C-0.4KW*3-1K 46.239MB/S 45.774MB/S

Then, in order to test the impact of partition on performance, the following tests are carried out, and we can see that the more the number of partition, the faster the overall production and consumption speed; but surprisingly, in the case of Only produce, the production efficiency is not significantly improved but slightly slower, it is suspected that it has something to do with page cache, there is no in-depth research.

Kafka-partition Produce Consume Only Produce Only Consume 1T-1P3C-1P1C-1KW-1K 29.213MB/S 29.117MB/S 28.941MB/S 34.360MB/S 1T-3P3C-3P3C-1KW-1K 47.103MB/S 46.966MB/S 46.540MB/S 66.219MB/S 1T-8P3C-8P8C-1KW-1K 61.522MB/S 61.412MB/S 60.703MB/S 72.701MB/S

To sum up, we can see that the performance and throughput of Kafka can be extended.

3.8. Risk point

For us, Kafka has two main risk points, first, in order to in-depth use must be familiar with the source code, and kafka source code is written in scala, we do not have the corresponding technical reserves, need to learn; second, kafka technology is relatively new, the current version is 0.8.1.1, looks immature.

After reading the above, do you have any further understanding of how to understand the distributed messaging system kafka? If you want to know more knowledge or related content, please follow the industry information channel, thank you for your support.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.