What are the key knowledge points of Kafka? This article analyzes and answers the question in detail, in the hope of helping readers who want to solve this problem find a simpler, more practical approach.
1. What is Kafka?
Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a distributed, partitioned, replicated, persistent log service, used mainly for processing streaming data.
2. Why use Kafka? Why use a message queue?
Buffering and peak shaving: upstream data can arrive in sudden bursts that the downstream may not be able to handle, or the downstream may not have enough machines to guarantee redundancy. Kafka acts as a buffer in the middle: messages are temporarily stored in Kafka, and downstream services process them at their own pace.
Decoupling and extensibility: at the start of a project, the full requirements cannot be pinned down. A message queue can act as an interface layer that decouples important business processes. You only need to follow the agreed contract and program against the data to gain extensibility.
Redundancy: in a one-to-many fashion, one producer publishes a message that can be consumed by multiple services subscribing to the topic, serving multiple unrelated businesses.
Robustness: a message queue can absorb a backlog of requests, so even if the consuming business dies briefly, the main business keeps operating normally.
Asynchronous communication: in many cases, users do not want or need to process messages immediately. A message queue provides an asynchronous processing mechanism: put a message on the queue without processing it right away, put as many messages on the queue as needed, and process them later when required.
3. What do ISR and AR represent in Kafka? What does it mean for the ISR to shrink and expand?
ISR: In-Sync Replicas, the set of replicas that are in sync with the leader. AR: Assigned Replicas, all replicas of a partition. The ISR is maintained by the leader. Followers lag somewhat when synchronizing data from the leader; the lag used to be measured along two dimensions, delay time (replica.lag.time.max.ms) and delayed message count (replica.lag.max.messages), but since version 0.10.x only replica.lag.time.max.ms is supported. If a follower exceeds the threshold, it is removed from the ISR and placed in the OSR (Out-of-Sync Replicas) list; newly added followers also start out in the OSR. AR = ISR + OSR.
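To make AR and ISR concrete, here is a minimal sketch using the Java AdminClient to print both sets per partition. The broker address localhost:9092 and the topic name my-topic are assumptions for illustration, not anything from the original article.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Collections;
import java.util.Properties;

public class IsrInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("my-topic"))
                    .all().get().get("my-topic");
            for (TopicPartitionInfo p : desc.partitions()) {
                // replicas() is the AR (full assignment); isr() is the in-sync subset.
                System.out.printf("partition=%d leader=%s AR=%s ISR=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}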
4. What does a broker do in Kafka?
A broker is the message agent. Producers write messages to a specified topic on the brokers, and consumers pull messages for a specified topic from the brokers and then carry out their business processing. In the middle, the broker acts as a relay station that stores the messages.
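As an illustration of the producer side of this flow, a minimal Java producer sketch follows; the broker address localhost:9092 and the topic my-topic are assumptions.

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "hello"),
                    (metadata, e) -> {
                        if (e != null) e.printStackTrace();
                        else System.out.printf("stored at %s-%d@%d%n",
                                metadata.topic(), metadata.partition(), metadata.offset());
                    });
        }
    }
}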
5. What is the role of ZooKeeper in Kafka? Can Kafka run without it?
ZooKeeper is a distributed coordination component. Early versions of Kafka used it to store metadata, consumer consumption state, group membership, and offset values. Given certain limitations of ZooKeeper itself and the high probability of it becoming a single point of failure in the overall architecture, its role has been gradually weakened in newer versions: the new consumer uses Kafka's internal group coordination protocol and so depends less on ZooKeeper.
Brokers, however, still depend on ZooKeeper, which Kafka also uses to elect the controller, detect broker liveness, and so on.
6. How does a Kafka follower synchronize data with the leader?
Kafka's replication mechanism is neither full synchronous replication nor simple asynchronous replication. Full synchronous replication requires that all alive followers have replicated a message before it is considered committed, which severely limits throughput. In asynchronous replication, followers replicate data from the leader asynchronously, and data is considered committed as soon as the leader writes it to its log; in that case, if the leader dies, data is lost. Kafka uses the ISR to balance data loss against throughput. Followers can copy data from the leader in batches, and the leader makes full use of sequential disk reads and the sendfile (zero-copy) mechanism, which greatly improves replication performance; disks are written in batches, greatly reducing the message gap between followers and the leader.
7. Under what circumstances is a replica kicked out of the ISR?
The leader maintains a list of replicas that are basically in sync with it, called the ISR (In-Sync Replicas). Each partition has an ISR, dynamically maintained by its leader. If a follower lags far behind the leader, or does not initiate a data replication request within a certain period, the leader removes it from the ISR.
8. Why is Kafka so fast?
Page cache: Kafka leans on the operating system's filesystem page cache rather than maintaining its own in-process cache.
Sequential writes: because modern operating systems provide read-ahead and write-behind techniques, sequential writes to disk are in most cases faster than random writes to memory.
Zero copy: zero-copy technology reduces the number of data copies.
Batching of messages: small requests are merged and then sent in a streaming fashion, driving throughput toward the network's upper limit.
Pull mode: messages are fetched and consumed in pull mode, which matches the consumer's own processing capacity.
9. How can the write throughput of a Kafka producer be improved?
Add threads.
Increase batch.size.
Add more producer instances.
Increase the number of partitions.
When acks=-1 and latency increases: you can raise num.replica.fetchers (the number of threads a follower uses to synchronize data) to compensate.
For transfers across data centers: increase the socket buffer settings and the OS TCP buffer settings. (A producer tuning sketch follows this list.)
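The following sketch covers the client-side knobs above (batch.size and related settings); num.replica.fetchers and the TCP buffers are broker- and OS-side settings and are not shown. The broker address localhost:9092, the topic my-topic, and the specific values are assumptions for illustration.

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Larger batches amortize per-request overhead (the default is 16384 bytes).
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        // Wait up to 10 ms for a batch to fill before sending it.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        // Compression shrinks batches on the wire.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        // More accumulator memory when many partitions are in flight.
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024);
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), "payload-" + i));
            }
        }
    }
}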
10. When a Kafka producer sends data, what do the acks values 0, 1, and -1 stand for, and with acks=-1, under what circumstances does the leader consider a message committed?
1 (the default): after the data is sent to Kafka, a send is considered successful once the leader confirms it received the message. In this mode, if the leader goes down, data can be lost.
0: the producer just sends the data out and does not wait for any acknowledgment. This gives the highest transmission efficiency but the lowest reliability.
-1: the producer waits until all followers in the ISR confirm they have received the data before a send is considered complete, which gives the highest reliability. Only when every replica in the ISR has sent an ACK to the leader does the leader commit the message; only then can the producer regard the messages in a request as committed. (A sketch using this setting follows.)
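A minimal sketch of a producer running with acks=-1 ("all" in the Java client), blocking until the leader reports the record committed; the broker address and topic name are assumptions.

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class AcksDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all" is equivalent to -1: wait for every in-sync replica to acknowledge.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata md = producer
                    .send(new ProducerRecord<>("my-topic", "k", "v"))
                    .get(); // blocks until the record is committed or the send fails
            System.out.println("committed at offset " + md.offset());
        }
    }
}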
11. What does Kafka's unclean leader election configuration mean, and what impact does it have on Spark Streaming consumption?
If unclean.leader.election.enable is true, brokers outside the ISR can also take part in leader election, so data may be lost; the end offsets that Spark Streaming obtains during consumption may suddenly become smaller, causing the Spark Streaming job to fail. With unclean.leader.election.enable set to true, data loss and data inconsistency can occur and Kafka's reliability is reduced; with it set to false, Kafka's availability is reduced.
12. What if the ISR is empty when the leader crashes?
Kafka provides a broker-side configuration parameter, unclean.leader.election.enable, with two values. true (the default before version 0.11): an out-of-sync replica is allowed to become leader; because its messages lag behind, this may lead to message inconsistency. false: out-of-sync replicas are not allowed to become leader; if the ISR list is empty, the partition keeps waiting for the old leader to recover, which reduces availability.
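unclean.leader.election.enable can also be set per topic. As a minimal sketch, the Java AdminClient can flip it like this; the broker address localhost:9092 and the topic my-topic are assumptions.

import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.config.ConfigResource;
import java.util.Collections;
import java.util.Properties;

public class DisableUncleanElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "false"),
                    AlterConfigOp.OpType.SET);
            // Prefer consistency over availability for this topic.
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singletonList(op)))
                 .all().get();
        }
    }
}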
13. What is Kafka's message format?
A Kafka Message consists of a fixed-length header and a variable-length message body.
The header consists of a one-byte magic (format version) and a four-byte CRC32 (used to verify that the message body is intact).
When magic has the value 1, there is an extra byte of data between the magic and the CRC32: attributes (storing properties such as whether the message is compressed and in which format); when magic is 0, the attributes field is absent. The body is made up of N bytes and contains the actual key/value payload.
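Purely to illustrate the layout described above (this is a hypothetical sketch following the article's description, not Kafka's actual wire-format code), reading such a message might look like this:

import java.nio.ByteBuffer;

// Layout per the description above: magic(1) [attributes(1) if magic==1] crc32(4) body(N).
public class MessageHeaderSketch {
    public static void parse(ByteBuffer buf) {
        byte magic = buf.get();                         // format version
        byte attributes = (magic == 1) ? buf.get() : 0; // compression flags etc., v1 only
        int crc32 = buf.getInt();                       // checksum over the body
        byte[] body = new byte[buf.remaining()];        // key/value payload
        buf.get(body);
        System.out.printf("magic=%d attributes=%d crc=%08x bodyLength=%d%n",
                magic, attributes, crc32, body.length);
    }

    public static void main(String[] args) {
        ByteBuffer example = ByteBuffer.allocate(8);    // 1 + 4 + 3 bytes for a v0 message
        example.put((byte) 0).putInt(0xDEADBEEF).put("k=v".getBytes());
        example.flip();
        parse(example);
    }
}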
14. What is the concept of a consumer group in Kafka?
A consumer group is also a logical concept; it is the means by which Kafka implements both the unicast and the broadcast message models. Data from the same topic is broadcast to the different groups, but within a single group only one worker can consume a given message. In other words, for the same topic, every group receives all of the same data, but once the data enters a group it can be consumed by only one of its workers. Workers in a group can be implemented as multiple threads or multiple processes, and the processes can be spread across machines. The number of workers usually should not exceed the number of partitions, and it is best to keep the two in an integer-multiple relationship, because Kafka is designed under the assumption that a partition can be consumed by only one worker (within the same group).
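A minimal consumer-group sketch: every consumer started with the same group.id shares the topic's partitions, so each message is handled by only one of them. The broker address, the group id analytics-group, and the topic my-topic are assumptions.

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-group");         // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}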
15. Can Kafka messages be lost or consumed more than once?
To determine whether Kafka messages can be lost or duplicated, look at two sides: message sending and message consumption.
1. Message sending
There are two ways to send Kafka messages: synchronous (sync) and asynchronous (async). The default is synchronous, configurable via the producer.type property. Kafka confirms message production via the request.required.acks property:
0 — does not confirm whether the message was received successfully; 1 — confirms success once the leader has received it; -1 — confirms success only when both the leader and the followers have received it.
Combining the two send modes with the three acks values gives six message-production configurations. The following analyzes, case by case, the scenarios in which messages are lost:
(1) With acks=0, receipt is not confirmed with the Kafka cluster, so messages may be lost when the network misbehaves, the buffer fills up, and so on.
(2) With acks=1 in synchronous mode, only the leader confirms successful receipt; if the leader then goes down before the replicas have synchronized, data may be lost.
2. Message consumption
Kafka message consumption offers two consumer interfaces, the Low-level API and the High-level API:
Low-level API: consumers maintain offsets and the like themselves, achieving complete control over Kafka.
High-level API: encapsulates partition and offset management; easy to use.
With the High-level API there is a potential problem: if a consumer fetches messages from the cluster and commits the new offset before it has had time to consume them, then dies, the messages that were never successfully consumed will mysteriously "disappear" on the next consumption.
Solution:
Against message loss: in synchronous mode, set the acknowledgment mechanism to -1, i.e. a message counts as successfully sent only after it has been written to the leader and the followers; in asynchronous mode, to keep the buffer from overflowing and dropping data, configure no limit on the blocking timeout so the producer stays blocked when the buffer is full.
Against message duplication: save a unique identifier for each message to external media and, on every consumption, check whether it has already been processed.
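Putting the consumer-side remedies together, here is a minimal at-least-once sketch: offsets are committed only after processing, and a unique message identifier is checked before handling. An in-memory set stands in for the external media; the broker address, group id, and topic are assumptions.

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

public class AtLeastOnceConsumer {
    // Stand-in for external media (e.g. a database) that records processed message IDs.
    private static final Set<String> processedIds = new HashSet<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "safe-group");              // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit offsets only after processing, so a crash replays rather than drops messages.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    String id = r.topic() + "-" + r.partition() + "-" + r.offset();
                    if (processedIds.add(id)) {   // skip messages already handled
                        process(r.value());
                    }
                }
                consumer.commitSync();            // commit only after the batch is processed
            }
        }
    }

    private static void process(String value) { System.out.println("handled: " + value); }
}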
16. Why doesn't Kafka support read-write separation?
In Kafka, both the producer's writes and the consumer's reads interact with the leader replica, implementing a master-write, master-read production and consumption model.
Kafka does not support master-write, slave-read because that model has two obvious disadvantages:
(1) Data consistency. When data is transferred from the master node to a slave node, there is inevitably a delay window, which leads to inconsistency between master and slave. At some moment the value of entry A is X on both the master and the slave; then A is changed to Y on the master, and until that change has been propagated to the slave, an application reading A from the slave sees a value that is not the latest Y. That is the data inconsistency problem.
(2) Latency. For a component like Redis, writing data and syncing it to a slave node goes through the stages network → master node memory → network → slave node memory, and the whole process takes some time. In Kafka, master-slave synchronization is more expensive than in Redis: it goes through network → master node memory → master node disk → network → slave node memory → slave node disk. For latency-sensitive applications, master-write, slave-read is not a good fit.
17. How is message ordering reflected in Kafka?
Messages within each Kafka partition are ordered as they are written. When consuming, each partition can be consumed by only one consumer in each group, which keeps consumption in order. Ordering across the whole topic is not guaranteed; to make a topic fully ordered, set its number of partitions to 1.
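A common way to get ordering without a single-partition topic is per-key ordering: records with the same key hash to the same partition. A minimal sketch follows; the broker address, the topic orders, and the key user-42 are assumptions, and max.in.flight.requests.per.connection is bounded so retries cannot reorder writes.

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class OrderedByKey {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // One in-flight request per connection: retries cannot leapfrog earlier sends.
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition -> consumed in send order.
            producer.send(new ProducerRecord<>("orders", "user-42", "created"));
            producer.send(new ProducerRecord<>("orders", "user-42", "paid"));
        }
    }
}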
18. When a consumer commits a consumption offset, does it commit the offset of the latest message it has consumed, or offset+1?
offset+1, i.e. the position of the next message to be fetched.
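A minimal sketch of committing offset+1 explicitly with the Java consumer; the broker address, group id, and topic are assumptions.

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class CommitPlusOne {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo");             // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                // Commit r.offset() + 1: the position of the NEXT record to consume.
                consumer.commitSync(Collections.singletonMap(
                        new TopicPartition(r.topic(), r.partition()),
                        new OffsetAndMetadata(r.offset() + 1)));
            }
        }
    }
}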
19. How does Kafka implement its delay queue?
Kafka does not use the JDK's built-in Timer or DelayQueue to implement its delay functionality; it implements a custom timer (SystemTimer) based on a timing wheel instead. Insert and delete operations on the JDK's Timer and DelayQueue have an average time complexity of O(log n), which cannot satisfy Kafka's high performance requirements, whereas a timing wheel reduces both insertion and deletion to O(1). The timing wheel is not unique to Kafka; it has many applications, appearing in components such as Netty, Akka, Quartz, and ZooKeeper.
The underlying implementation is an array, each element of which holds a TimerTaskList. A TimerTaskList is a circular doubly linked list, and each list item, a TimerTaskEntry, wraps an actual scheduled task, a TimerTask.
How exactly does time advance in Kafka? The timer uses the JDK's DelayQueue to help advance the timing wheel: every TimerTaskList in use is added to the DelayQueue. The TimingWheel in Kafka is dedicated to inserting and deleting TimerTaskEntry items, while the DelayQueue is responsible for advancing time. Suppose the first task list in the DelayQueue expires at 200 ms and the second at 840 ms; fetching the head of the DelayQueue takes only O(1). If instead time were pushed forward one millisecond at a time, 199 of the 200 advances made before reaching the first expiring task list would be "empty advances", and another 639 empty advances would be needed to reach the second, pointlessly consuming the machine's resources. The DelayQueue thus trades a small amount of space for time, achieving "precise advancement". Kafka's timer plays to each structure's strengths: the TimingWheel does what it is best at, adding and deleting tasks, while the DelayQueue does what it is best at, advancing time; the two complement each other.
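To make the O(1) insertion concrete, here is a toy single-level timing wheel in Java. It is a sketch of the idea only: Kafka's real SystemTimer adds hierarchical overflow wheels and drives the advance with a DelayQueue of buckets rather than ticking every interval.

import java.util.ArrayDeque;
import java.util.Deque;

public class TinyTimingWheel {
    private final long tickMs;               // duration represented by one slot
    private final Deque<Runnable>[] buckets; // one task list per slot
    private long currentTick;                // number of ticks elapsed so far

    @SuppressWarnings("unchecked")
    public TinyTimingWheel(long tickMs, int wheelSize) {
        this.tickMs = tickMs;
        this.buckets = new Deque[wheelSize];
        for (int i = 0; i < wheelSize; i++) buckets[i] = new ArrayDeque<>();
    }

    // O(1) insertion: the target slot is computed directly from the delay.
    public void schedule(Runnable task, long delayMs) {
        long ticks = Math.max(1, delayMs / tickMs);
        if (ticks >= buckets.length)
            throw new IllegalArgumentException("delay exceeds wheel span");
        buckets[(int) ((currentTick + ticks) % buckets.length)].add(task);
    }

    // Advance one tick and run every task whose slot just expired.
    public void tick() {
        currentTick++;
        Deque<Runnable> expired = buckets[(int) (currentTick % buckets.length)];
        while (!expired.isEmpty()) expired.poll().run();
    }

    public static void main(String[] args) {
        TinyTimingWheel wheel = new TinyTimingWheel(1, 256);
        wheel.schedule(() -> System.out.println("fired after ~200 ticks"), 200);
        for (int i = 0; i < 256; i++) wheel.tick();
    }
}

Kafka avoids calling the equivalent of tick() once per millisecond by putting each non-empty bucket into a DelayQueue and sleeping until the earliest bucket expires, which is exactly the "precise advancement" described above.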
Those are the answers to these Kafka questions. I hope the content above has been of some help to you. If you still have many doubts to resolve, you can follow the industry information channel to learn more.