Summary of knowledge points of Kafka

This article introduces the knowledge behind this "summary of Kafka knowledge points". These questions come up frequently in real-world use, so the sections below walk through them one by one; I hope you read them carefully and take something useful away.

1. Kafka is a distributed messaging system based on the publish/subscribe model.

2. Comparison of commonly used message queues

RabbitMQ

RabbitMQ is an open-source message queue written in Erlang that supports many protocols: AMQP, XMPP, SMTP, STOMP. Because of this it is fairly heavyweight and better suited to enterprise development. It uses a broker architecture, meaning messages are queued in a central queue before being delivered to clients, and it offers good support for routing, load balancing, and data persistence.

Redis

Redis is a NoSQL database based on key-value pairs, with very active development and maintenance. Although it is a key-value storage system, it also supports MQ-style operations and can therefore be used as a lightweight queuing service. In one comparison of enqueue and dequeue operations against RabbitMQ, each operation was executed one million times and the execution time recorded every 100,000 operations, using payloads of four sizes: 128 bytes, 512 bytes, 1 KB, and 10 KB. The results showed that for enqueueing, Redis outperformed RabbitMQ when the payload was small, but became unbearably slow once the payload exceeded 10 KB; for dequeueing, Redis performed very well regardless of payload size, while RabbitMQ's dequeue performance was far below Redis's.

ZeroMQ

ZeroMQ claims to be the fastest message queuing system, especially for high-throughput scenarios. ZMQ can implement advanced and complex queues that RabbitMQ is not good at, but developers need to assemble several technical frameworks themselves, and this technical complexity is a challenge to applying it successfully. ZeroMQ has a unique broker-less model: you do not need to install and run a message server or middleware, because your application itself plays that role. You simply reference the ZeroMQ library (installable via NuGet) and can then happily send messages between applications. However, ZeroMQ only provides non-persistent queues, which means that if there is an outage, data will be lost. Notably, Twitter's Storm used ZeroMQ as its default data-stream transport in versions prior to 0.9; since version 0.9, Storm has supported both ZeroMQ and Netty as transport modules.

ActiveMQ

ActiveMQ is a sub-project of Apache. Like ZeroMQ, it can implement queues in both broker and peer-to-peer modes; like RabbitMQ, it can efficiently implement advanced application scenarios with a small amount of code.

Kafka/Jafka

Kafka, a sub-project of Apache, is a high-performance, cross-language, distributed publish/subscribe message queuing system, and Jafka was incubated from Kafka as an upgraded version of it. It has the following characteristics: fast persistence, with messages persisted at O(1) system overhead; high throughput, reaching on the order of 100,000 messages per second on an ordinary server; a fully distributed design, in which Broker, Producer, and Consumer all natively support distribution and automatic load balancing; and support for parallel data loading into Hadoop, making it a feasible solution for systems that, like Hadoop, run offline analysis over the same log data but also require real-time processing. Kafka unifies online and offline message processing through Hadoop's parallel loading mechanism. Compared with ActiveMQ, Apache Kafka is a very lightweight messaging system; besides very good performance, it is also a well-functioning distributed system.

3. It has been demonstrated that sequential writes to disk can be more efficient than random writes to memory, which is an important guarantee of Kafka's high throughput.

4. When a message is sent to the broker, the partition it is stored in is selected according to the partition rules. If the partition rules are set sensibly, messages are distributed evenly across partitions, achieving horizontal scaling. If a topic corresponded to a single file, the machine holding that file would become the topic's performance bottleneck; partitioning solves this problem. A concrete sketch is shown below.
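To make this concrete, here is a minimal sketch using the Java producer client, assuming a hypothetical topic named page_views with several partitions and a broker at localhost:9092. With a non-null key, the default partitioner hashes the key to choose a partition, so a sensible key (for example a user ID) spreads messages across partitions while keeping related messages together.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitionedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With a non-null key, the default partitioner hashes the key,
            // so all messages with the same key land in the same partition.
            producer.send(new ProducerRecord<>("page_views", "user-42", "clicked /home"));

            // A partition can also be chosen explicitly (here: partition 0).
            producer.send(new ProducerRecord<>("page_views", 0, "user-42", "clicked /cart"));
        }
    }
}
```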

5. In a traditional message queue, messages that have already been consumed are generally deleted, whereas a Kafka cluster retains all messages whether they have been consumed or not. Of course, because of disk limits it is neither possible nor necessary to keep all data forever, so Kafka provides two strategies for deleting old data: one based on time and one based on partition file size. For example, you can configure $KAFKA_HOME/config/server.properties so that Kafka deletes data older than one week, or so that it deletes old data once a partition file exceeds 1 GB.
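As a sketch of what those two strategies can look like in $KAFKA_HOME/config/server.properties (property names as used in recent Kafka releases; the values here are only examples):

```properties
# Time-based retention: delete segments older than one week.
log.retention.hours=168

# Size-based retention: cap each partition at roughly 1 GB.
log.retention.bytes=1073741824

# How often the broker checks for segments eligible for deletion.
log.retention.check.interval.ms=300000
```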

6. The time complexity for Kafka to read a specific message is O(1), i.e. independent of file size, so deleting files has no impact on Kafka's performance; the choice of deletion strategy depends only on disk space and the specific requirements. In addition, Kafka retains a little metadata per consumer group: the position of the currently consumed message, i.e. the offset. This offset is controlled by the consumer. Normally the consumer advances the offset linearly as it consumes messages, but it can also set the offset back to a smaller value to re-consume earlier messages. Because the offset is controlled by the consumer, the Kafka broker is stateless: it does not need to track which messages have been consumed by which consumers, nor does it need to ensure that only one consumer in a consumer group consumes a given message, so no locking mechanism is required. This is another strong guarantee behind Kafka's high throughput.
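The following sketch, again assuming a hypothetical page_views topic and a local broker, shows the consumer side of this: the consumer assigns itself a partition and rewinds the offset to re-consume earlier messages.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RewindConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("group.id", "rewind-demo");              // hypothetical group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("page_views", 0);
            consumer.assign(Collections.singletonList(tp));

            // The consumer, not the broker, decides where to read from:
            // rewinding the offset re-consumes earlier messages.
            consumer.seek(tp, 0L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```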

7. A message is considered committed only when all followers in the "in sync" list have copied it from the leader. This prevents the data loss that would occur if some data were written to the leader and the leader went down before any follower had replicated it (the consumer would be unable to consume that data). On the producer side, you can choose whether to wait for the message to be committed, which is set via request.required.acks. This mechanism ensures that as long as the "in sync" list contains one or more followers, a committed message will not be lost.
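request.required.acks belongs to the older Scala producer; in the current Java client the equivalent setting is simply acks. A minimal sketch of a producer that waits until the full "in sync" list has the message before treating the send as committed (topic name and broker address are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class AckAwareProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // acks=0   : do not wait for any acknowledgement (fastest, least safe)
        // acks=1   : wait only for the leader to write the message
        // acks=all : wait until all in-sync replicas have the message
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata meta =
                producer.send(new ProducerRecord<>("page_views", "k", "v")).get();
            System.out.println("committed at offset " + meta.offset());
        }
    }
}
```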

8. The replication mechanism here is neither pure synchronous replication nor simple asynchronous replication. Synchronous replication requires all "live" followers to have replicated a message before it is considered committed, which greatly hurts throughput (and high throughput is a very important feature of Kafka). With asynchronous replication, followers replicate data from the leader asynchronously and data is considered committed as soon as the leader writes it to its log; in that case, if a follower lags behind the leader and the leader suddenly goes down, data is lost. Kafka's use of the "in sync" list strikes a good balance between not losing data and keeping throughput high. Followers can copy data from the leader in batches, which greatly improves replication performance (bulk writes to disk) and greatly reduces the gap between follower and leader (as mentioned earlier, a follower is considered part of the "in sync" list as long as it does not fall too far behind the leader).
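The behaviour described above is governed by broker-side settings; an illustrative (not exhaustive) sketch with example values:

```properties
# A follower that has not caught up with the leader within this window
# is removed from the "in sync" replica (ISR) list.
replica.lag.time.max.ms=30000

# With acks=all on the producer, a write is only committed once at least
# this many ISR members (including the leader) have it.
min.insync.replicas=2
```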

9. Each consumer instance belongs to a consumer group, and each message is consumed by exactly one consumer instance within a given consumer group (different consumer groups can consume the same message simultaneously).

10. In fact, one of Kafka's design goals is to support both offline and real-time processing. Accordingly, a real-time streaming system such as Storm can process messages online in real time, a batch system such as Hadoop can process them offline, and the data can simultaneously be backed up to another data center in real time. You only need to make sure that the consumers used by these three operations belong to different consumer groups, as in the sketch below.
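A minimal sketch of that idea with the Java consumer client: two consumers join different, hypothetical consumer groups, so each group independently receives the complete message stream (topic name and broker address are again placeholders).

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupedConsumer {
    // Helper: build a consumer that joins the given consumer group.
    static KafkaConsumer<String, String> consumerInGroup(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("group.id", groupId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        // Two different groups: each group independently receives every message,
        // while within a group each message goes to only one instance.
        try (KafkaConsumer<String, String> realtime = consumerInGroup("realtime-processing");
             KafkaConsumer<String, String> backup = consumerInGroup("offsite-backup")) {
            realtime.subscribe(Collections.singletonList("page_views"));
            backup.subscribe(Collections.singletonList("page_views"));

            for (ConsumerRecord<String, String> r : realtime.poll(Duration.ofSeconds(1))) {
                System.out.println("realtime group saw: " + r.value());
            }
            for (ConsumerRecord<String, String> r : backup.poll(Duration.ofSeconds(1))) {
                System.out.println("backup group saw: " + r.value());
            }
        }
    }
}
```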

11. Kafka guarantees at-least-once delivery by default; at-most-once can be achieved by configuring the producer to send asynchronously without waiting for the commit.
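Point 11 is phrased in terms of the producer; the same distinction also shows up on the consumer side in the order of processing versus offset commits, which is sketched below with hypothetical topic and group names (auto-commit disabled, offsets committed manually).

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DeliverySemanticsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("group.id", "semantics-demo");           // hypothetical group
        props.put("enable.auto.commit", "false");          // commit offsets manually
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page_views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));

            // At-most-once: commit offsets first, then process; if processing
            // crashes afterwards, the messages are never redelivered (possible loss).
            //   consumer.commitSync();
            //   ...process(records)...   // process() is a hypothetical handler

            // At-least-once: process first, commit afterwards; if the commit is
            // missed, the same messages are redelivered (possible duplicates).
            for (ConsumerRecord<String, String> r : records) {
                System.out.println("processing " + r.value());
            }
            consumer.commitSync();
        }
    }
}
```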

12. Running multiple instances on a single machine does not help throughput much, because the network card is already close to saturation.

13. Note that the replication factor does not affect consumer throughput tests, because the consumer only reads data from the leader of each partition, regardless of the replication factor. Likewise, consumer throughput is independent of whether replication is synchronous or asynchronous.

14. All of the tests above used short messages (100-byte payloads), which, as mentioned earlier, is the more demanding way to use Kafka. It is to be expected that as message length increases, records/second decreases while MB/second improves (the original post illustrates both relationships with charts of records/second and MB/second versus message length).

As expected, the number of messages that can be sent per second falls as the message length grows. But the total volume of data sent per second increases with message length.

When messages are only 10 bytes long, frequent enqueueing means too much time is spent acquiring locks, so the CPU becomes the bottleneck and the bandwidth cannot be fully utilized. From about 100 bytes onward, bandwidth usage approaches saturation (MB/second still increases with message length, but by smaller and smaller amounts).

15. Kafka supports compression of message sets: the producer can compress a message set in GZIP or Snappy format. Data compressed on the Producer side must be decompressed on the Consumer side. The advantage of compression is that it reduces the amount of data transmitted and lowers the pressure on the network; in big-data processing the bottleneck is usually the network rather than the CPU (compression and decompression do consume some CPU). So how can one tell whether a message is compressed or not? Kafka adds a compression-attribute byte to the message header; the last two bits of this byte indicate the encoding used to compress the message, and if they are 0 the message is not compressed.
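With the Java producer client, enabling compression is a single configuration property; a minimal sketch (topic and broker address are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompressingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Compress message batches on the producer side; the consumer
        // decompresses transparently. "snappy" (and in newer clients "lz4"
        // or "zstd") are alternative values.
        props.put("compression.type", "gzip");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page_views", "user-42", "a long log line..."));
        }
    }
}
```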

This concludes the summary of Kafka knowledge points. Thank you for reading!
