2025-02-14 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report
This article explains the terminology and design of Kafka. These concepts come up constantly in real-world use, so it is worth reading them carefully.
Terminology in Kafka
Broker: the Kafka cluster in the middle, which stores messages; it is made up of multiple servers.
Topic: the way Kafka categorizes messages; brokers store the message data of different topics.
Producer: produces data into a topic on a broker.
Consumer: fetches data from a topic on a broker.
Term design in Kafka:
1. Broker
The Kafka cluster in the middle, which stores messages; it is made up of multiple servers.
2. Topics and messages
Kafka organizes all messages into multiple topics. Each topic can be split into multiple partitions, and each partition consists of a sequence of messages. Each message is marked with an incremental sequence number representing the order in which it arrived, and is stored sequentially in its partition.
In this way, messages are organized and addressed by id.
A producer selects a topic to produce to, and each message is appended to the end of a partition chosen by an allocation policy.
A consumer selects a topic and uses an id to specify where to start consuming. After consuming, it keeps the id, so next time it can continue from that position, or from any other position.
This id is called the offset in Kafka, and this organization and processing strategy provides the following benefits:
Consumers can flexibly specify the offset at which to consume, according to their needs.
It makes messages immutable, which provides a thread-safety guarantee for concurrent consumption. Each consumer keeps its own offset; consumers do not interfere with each other, so there are no thread-safety problems.
Parallel message access. The messages in each topic are organized into multiple partitions, and the partitions are distributed evenly across the cluster's servers. When messages are produced and consumed, they are routed to a designated partition, which reduces contention and increases the parallelism of the program.
Scalability of the messaging system. The messages retained in a topic can be very large; partitioning splits them into multiple parts, and a load-balancing strategy assigns the partitions to different servers. When the machines become fully loaded, messages can be redistributed evenly by adding capacity.
Message reliability. Messages are not deleted after being consumed; by resetting the offset, messages can be re-consumed, so they are not lost.
Flexible persistence strategy. Messages can be retained for a configurable period, such as the most recent day, to save broker storage space.
Backup for high availability. Messages are assigned to multiple servers and replicated in units of partitions. The replication strategy is one leader and N followers: the leader accepts read and write requests, while the followers passively replicate the leader. Leaders and followers are spread across the cluster to keep each partition highly available.
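The offset model described above can be sketched as a toy, in-memory partition (not real Kafka, names are illustrative): an append-only list where each message receives an incremental id, messages are never deleted on read, and each consumer keeps its own offset:

```python
class PartitionLog:
    """Toy model of one Kafka partition: an append-only message log."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message; return the offset assigned to it."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_count=10):
        """Read from any offset. Messages are never mutated or deleted,
        so any consumer can re-read from any offset it has saved."""
        return self._messages[offset:offset + max_count]

log = PartitionLog()
for msg in ["m0", "m1", "m2"]:
    log.append(msg)

# A consumer keeps its own offset and resumes from it later.
assert log.read(offset=1) == ["m1", "m2"]
# Resetting the offset re-consumes history (nothing was deleted).
assert log.read(offset=0) == ["m0", "m1", "m2"]
```

Because reads never modify the log, two consumers with different saved offsets cannot interfere with each other, which is the thread-safety point made above.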
3. Partitions
Each topic is divided into one or more partitions. Each message in a partition is marked with a sequential id, the offset, and stored data is retained for a configurable time.
4. Producer
A producer needs the following parameters to produce messages:
Topic: which topic to produce to.
Partition: which partition to produce to.
Key: messages are routed to different partitions based on this key.
Message: the message payload itself.
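Key-based routing can be sketched as follows. Note this is an illustration, not Kafka's actual implementation: Kafka's default partitioner hashes the serialized key with murmur2, while the sketch below uses CRC32 as a stand-in hash. The property that matters is the same: a given key always lands on the same partition.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Route a key to a partition. CRC32 is a stand-in for the
    murmur2 hash Kafka's default partitioner uses."""
    return zlib.crc32(key) % num_partitions

p1 = choose_partition(b"user-42", 6)
p2 = choose_partition(b"user-42", 6)
assert p1 == p2          # same key -> same partition, so per-key order holds
assert 0 <= p1 < 6
```

This is why choosing a good key matters: all messages with the same key share one partition, which gives per-key ordering but can skew load if a few keys dominate.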
5. Consumer
Traditional messaging systems have two modes:
Queue
Publish-subscribe
Kafka unifies the two modes through the consumer group: each consumer labels itself with a consumer group name; the system groups consumers by that name, delivers a copy of each message to every group, and within each group only one consumer consumes the message.
From this, two extreme cases follow:
When all consumers share the same consumer group, the system behaves as a queue.
When every consumer has its own consumer group, the system behaves as publish-subscribe.
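The two extreme cases can be demonstrated with a toy delivery model (a sketch, not real Kafka: within a group it assigns messages round-robin, standing in for Kafka's partition assignment):

```python
from collections import defaultdict

def deliver(messages, consumers):
    """consumers: list of (name, group). Every group receives a copy of
    every message; within a group, each message goes to exactly one
    member (round-robin here, standing in for partition assignment)."""
    groups = defaultdict(list)
    for name, group in consumers:
        groups[group].append(name)
    received = defaultdict(list)
    for i, msg in enumerate(messages):
        for members in groups.values():
            received[members[i % len(members)]].append(msg)
    return received

msgs = ["m0", "m1", "m2", "m3"]

# All consumers in ONE group -> queue mode: messages are split among them.
queue = deliver(msgs, [("c1", "g"), ("c2", "g")])
assert queue["c1"] == ["m0", "m2"] and queue["c2"] == ["m1", "m3"]

# Each consumer in its OWN group -> publish-subscribe: everyone gets everything.
pubsub = deliver(msgs, [("c1", "g1"), ("c2", "g2")])
assert pubsub["c1"] == msgs and pubsub["c2"] == msgs
```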
Note:
1. Consumer groups provide isolation between topics and partitions, plus rebalancing. For example, if consumer-C2 in Consumer Group A above dies, consumer-C1 will receive P1 and P2; that is, when one consumer in a group goes down, the other consumers in the group are rebalanced onto its partitions.
2. When multiple consumers consume concurrently, messages can easily get out of order. Order could be kept by forcing consumers to synchronize, but that would greatly reduce the program's concurrency.
Through the concept of partitions, Kafka guarantees message order within a partition, which alleviates this problem. The messages within a partition are copied and delivered to every group, and only one consumer in each group may consume them; this semantics ensures that a group consumes a given partition synchronously rather than concurrently. If a topic has only one partition, consumption of that topic is fully ordered; otherwise, order holds only within each single partition.
In general messaging systems, there are two consumption models for consumers:
Push: the advantage is low message latency. The disadvantage is that it ignores the consumer's capacity and saturation, so the producer can easily overwhelm the consumer.
Pull: the advantage is that the speed and amount of consumption can be controlled, so the consumer is never saturated. The disadvantage is that when there is no data, empty polling wastes CPU.
Kafka uses pull, with configurable parameters ensuring that a pull is performed on the consumer side only when data exists and its amount reaches a threshold; otherwise the consumer blocks. Kafka uses an integer value, the consumer position, to record the consumption state of a single partition. Since a single message in a partition can be consumed by only one consumer within a consumer group, the maintenance cost is small. When consumption completes and the broker receives confirmation, the position advances to the offset of the next message to consume. Since messages are not deleted, even after consumption completes and the position has been updated, a consumer can still reset the offset and re-consume historical messages.
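The blocking pull described above can be sketched as follows (illustrative names, not Kafka's API: `min_records` stands in for broker settings like minimum fetch size, and the timeout bounds how long the consumer waits before returning whatever is available):

```python
import time

def poll(log, offset, min_records=2, timeout_s=0.2, interval_s=0.01):
    """Block until either enough records are available or the timeout
    expires, then return what is available. This avoids hot empty
    polling while still bounding latency."""
    deadline = time.monotonic() + timeout_s
    while True:
        available = log[offset:]
        if len(available) >= min_records or time.monotonic() >= deadline:
            return available
        time.sleep(interval_s)

log = ["m0", "m1", "m2"]
assert poll(log, offset=1) == ["m1", "m2"]
# With no new data, poll returns empty only after the timeout.
assert poll(log, offset=3, timeout_s=0.05) == []
```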
Message delivery semantics
Producer perspective
At most once: the producer sends messages asynchronously, or synchronously with 0 retries.
At least once: the producer sends messages synchronously, retrying on failure or timeout.
Exactly once: supported in later versions.
Consumer perspective
At most once: the consumer reads the message, confirms the position first, and processes the message last.
At least once: the consumer reads the message, processes it first, and confirms the position last.
Exactly once: achievable with the approaches in the note below.
Note:
If the output of message processing (such as a db) guarantees idempotent updates, then even multiple consumptions yield exactly-once semantics.
If the output supports a two-phase commit protocol, confirming the position and writing the output can be guaranteed to succeed or fail together.
Store the updated position together with the output of message processing, which guarantees atomicity of confirming the position and writing the output (simple and general).
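The third approach (store the position with the output) can be sketched with sqlite3 standing in for the output store: the processed result and the new position are written in one transaction, so they commit or roll back together, and a replayed message cannot be double-applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (offset INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE position (id INTEGER PRIMARY KEY CHECK (id = 0), "
             "offset INTEGER)")
conn.execute("INSERT INTO position VALUES (0, 0)")

def process(offset, message):
    # Output write and position update commit (or roll back) together.
    with conn:
        conn.execute("INSERT INTO results VALUES (?, ?)",
                     (offset, message.upper()))
        conn.execute("UPDATE position SET offset = ?", (offset + 1,))

process(0, "hello")
assert conn.execute("SELECT offset FROM position").fetchone()[0] == 1

# Replaying offset 0 violates the PRIMARY KEY: the whole transaction
# rolls back, so the message cannot be applied twice.
try:
    process(0, "hello")
    assert False, "duplicate apply should have failed"
except sqlite3.IntegrityError:
    pass
```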
Availability
In Kafka, under normal circumstances all nodes are in sync; when a node falls out of sync, it means something is wrong with the system and fault tolerance is needed.
Being in sync means:
The node can maintain its connection to ZooKeeper.
If the node is a follower, the gap between its replicated position and the leader's cannot be too large (the allowed lag is configurable).
The in-sync nodes within a partition form a set: that partition's ISR (in-sync replicas).
Kafka is fault tolerant in two ways:
Data backup: replication is done per partition, with a configurable replica count. With N replicas there are one leader and N-1 followers; a follower can be regarded as a consumer of the leader, pulling the leader's messages and appending them to its own log.
Failover:
1. When the leader falls out of sync, the system elects a new leader from the followers.
2. When a follower falls out of sync, the leader removes it from the ISR; after the follower recovers and finishes synchronizing data, it rejoins the ISR.
In addition, Kafka makes a guarantee: when a producer produces a message, the message is committed only after it has been confirmed by all replicas in the ISR, and only committed messages can be consumed by consumers.
Therefore, with N replicas all in the ISR, the system can still provide service even if N-1 replicas fail.
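The ISR commit rule just stated can be modeled as a toy (a sketch, not Kafka's replication protocol): a message is committed only once every replica currently in the ISR has acknowledged it, and dropping a lagging follower from the ISR lets commits proceed.

```python
class Partition:
    """Toy model of one partition's ISR-based commit rule."""

    def __init__(self, replicas):
        self.isr = set(replicas)   # in-sync replica set
        self.acks = {}             # offset -> replicas that have acked

    def produce(self, offset):
        self.acks[offset] = set()

    def ack(self, replica, offset):
        self.acks[offset].add(replica)

    def committed(self, offset):
        # Consumable only when every current ISR member has the message.
        return self.isr <= self.acks[offset]

p = Partition(["leader", "f1", "f2"])
p.produce(0)
p.ack("leader", 0)
p.ack("f1", 0)
assert not p.committed(0)      # f2 has not acked yet

p.isr.discard("f2")            # leader drops the lagging follower
assert p.committed(0)          # all remaining ISR members have it
```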
Suppose all N replicas die. After recovery, each node faces a data synchronization process, during which the ISR contains no nodes and the partition is unavailable. Kafka handles this with a degraded measure: the first node to recover is elected leader and its data is taken as the baseline. This measure is called dirty (unclean) leader election. Because the leader does most of the serving, Kafka distributes the leaders of different partitions evenly across brokers to spread the risk. Every partition has a leader, and if each partition ran its own leader-election process, the number of election processes would be very large. Kafka takes a lightweight approach: one broker in the cluster is elected as the controller, and this controller monitors for dead brokers and elects leaders in batches for the partitions on them.
Consistency
The above scheme makes the data highly available, sometimes at the expense of consistency. To achieve strong consistency, take the following measures:
Disable dirty leader election: when the ISR has no nodes, it is better to refuse service than to elect a node that is not fully synchronized.
Set a minimum ISR size min_isr: a message must be confirmed by at least min_isr nodes before it can be committed.
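These two measures correspond to real Kafka configuration keys; the values below are illustrative, not recommendations:

```properties
# Broker/topic settings:
unclean.leader.election.enable=false   # disable dirty (unclean) leader election
min.insync.replicas=2                  # a write needs at least 2 in-sync replicas

# Producer setting that pairs with min.insync.replicas:
acks=all                               # wait for all in-sync replicas to confirm
```

With `acks=all` and `min.insync.replicas=2`, a produce request fails rather than committing when fewer than two replicas are in sync, trading availability for consistency as described above.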
Persistence
Based on the following facts, Kafka relies on disk rather than memory to store messages:
Hard drives are cheap; memory is expensive.
Sequential reads plus read-ahead improve the cache hit rate.
The operating system uses spare memory as a page cache; combined with read-ahead and write-back, data is read from the cache, and writes go to the cache and return immediately (the OS flushes in the background), improving the response time of user processes.
The actual size of a Java object is larger than its ideal size, making it expensive to store messages in memory.
As heap memory usage grows, GC jitter grows.
Designing around sequential file reads and writes keeps the code simple.
For its persistent data structure, Kafka uses a queue (an append-only log) instead of a B-tree:
Kafka only reads by offset and appends, so queue-based operations are O(1), while B-tree operations are O(log N).
Under heavy file I/O, a queue-based read or append needs only one disk seek, while a B-tree involves several; disk seeks greatly degrade read and write performance.
This concludes the detailed explanation of terminology design in Kafka. Thank you for reading.