This article walks through common Kafka interview questions and the reasoning behind the answers. Many people run into these questions in practice, so I hope you read it carefully and get something out of it!
1. Why use Kafka?
Buffering and peak shaving: upstream data can arrive in sudden bursts that the downstream may not be able to absorb, or there may not be enough downstream machines to guarantee redundancy. Kafka acts as a buffer in the middle: messages are temporarily stored in Kafka, and the downstream service processes them at its own pace.
Decoupling and extensibility: at the start of a project the full requirements are rarely known. A message queue acts as an interface layer that decouples the important business processes; you only need to follow the agreed conventions and program against the data to gain extensibility.
Redundancy: in a one-to-many fashion, one producer publishes a message that can be consumed by multiple services subscribed to the topic, serving multiple unrelated businesses.
Robustness: a message queue can absorb a backlog of requests, so even if the consumer-side business dies for a short time, the main business keeps running normally.
Asynchronous communication: in many cases users do not want or need to process a message immediately. A message queue provides an asynchronous mechanism: put a message on the queue without processing it right away, put in as many messages as needed, and process them later.
2. How can messages Kafka has already consumed be consumed again?
With the old consumer, the offset of consumed messages is kept in ZooKeeper. To re-consume messages, record checkpoints of the offset in Redis; when you want to replay, reset the offset in ZooKeeper to the checkpoint read back from Redis, and consumption restarts from that position.
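With the modern Java consumer the same idea needs no ZooKeeper: save an offset checkpoint wherever you like (Redis or elsewhere) and seek() back to it. A minimal sketch, assuming a placeholder broker, topic, partition, and checkpoint value:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromCheckpoint {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "replay-group");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        long checkpointOffset = 42L; // in practice, read this back from Redis or another store

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder topic/partition
            consumer.assign(Collections.singletonList(tp));        // manual assignment, no rebalance
            consumer.seek(tp, checkpointOffset);                   // rewind to the checkpoint
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```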
3. Is Kafka's data stored on disk or in memory, and why is it fast?
Kafka stores data on disk.
It is fast because:
Sequential writes: a hard disk is a mechanical device, and every read or write must seek before it can transfer data; seeking is a mechanical action that costs time. Disks therefore "hate" random I/O and "like" sequential I/O, so to maximize read/write speed Kafka uses sequential I/O, appending to the end of its log files.
Memory-mapped files: on a 64-bit operating system a process can map a file as large as roughly 20 GB into its address space. The mechanism uses the operating system's pages to map the file directly onto physical memory; once the mapping is established, operations on that memory are synchronized to disk by the OS (a minimal Java sketch of this technique follows the list below).
Efficient file storage design: Kafka splits a topic's large partition file into many small segment files. The small segments make it easy to periodically clean up or delete files that have already been consumed, reducing disk usage, and the index information makes it possible to locate a message quickly and determine the size of the response. Because all of the index metadata is mapped into memory (memory-mapped files), index lookups avoid disk I/O on the segment files, and because the index files are stored sparsely, their metadata footprint stays very small.
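Memory mapping is an operating-system feature rather than anything Kafka-specific; a minimal Java sketch of the technique using a MappedByteBuffer (the file name and sizes are arbitrary):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MmapWrite {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("segment.log", "rw");
             FileChannel channel = file.getChannel()) {
            // Map 4 KB of the file directly into memory; writes land in the
            // page cache and the OS synchronizes them to disk.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.put("hello kafka".getBytes(StandardCharsets.UTF_8));
            buf.force(); // ask the OS to flush the dirty pages now
        }
    }
}
```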
Note:
One of the ways Kafka achieves query efficiency is by segmenting its data files. Suppose a partition holds 100 messages whose offsets range from 0 to 99, and the data file is split into five segments: the first covers offsets 0-19, the second 20-39, and so on. Each segment is a separate data file named after the smallest offset in that segment, so to find the message with a given offset, a binary search quickly locates the segment the message is in.
Segmenting the data files narrows the search for a given offset to one smaller file, but within that file a sequential scan would still be needed. To improve lookup efficiency further, Kafka builds an index file for each segment data file, with the same name as the data file but the extension .index.
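The segment lookup itself is just a binary search over the segments' base offsets. A toy sketch of the idea, with an invented base-offset array matching the example above:

```java
import java.util.Arrays;

public class SegmentLookup {
    // Base offsets of the five segment files from the example: 0-19, 20-39, ...
    static final long[] BASE_OFFSETS = {0, 20, 40, 60, 80};

    /** Returns the base offset of the segment that contains targetOffset. */
    static long findSegment(long targetOffset) {
        int i = Arrays.binarySearch(BASE_OFFSETS, targetOffset);
        // Exact hit: the target is a segment's first offset. Otherwise
        // binarySearch returns -(insertionPoint) - 1, and the segment we
        // want is the one just before the insertion point.
        return i >= 0 ? BASE_OFFSETS[i] : BASE_OFFSETS[-i - 2];
    }

    public static void main(String[] args) {
        System.out.println(findSegment(57)); // 40: offset 57 lives in segment 40-59
    }
}
```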
4. How do you ensure that Kafka data is not lost?
There are three angles: the producer side, the consumer side, and the broker side.
No loss of producer data
Kafka's ack mechanism: every time a message is sent there is an acknowledgement from the broker, confirming that the message was received.
In synchronous mode:
acks=0 is very risky and generally not recommended; even acks=1 loses data if the leader goes down. So to strictly guarantee no data loss on the producer side, set acks to -1.
In asynchronous mode:
The acks setting still applies. In addition, asynchronous mode buffers messages and controls sending with two thresholds, a time threshold and a message-count threshold. If the buffer fills before the data can be sent, a configuration option decides whether to empty the buffer immediately or block; set to -1 it blocks permanently, i.e. no more data is produced rather than any being dropped. Even set to -1, data can still be lost through unsound operations such as killing the process with kill -9, but that is an exceptional special case.
Note:
acks=0: the producer sends the next (batch of) messages without waiting for confirmation that broker synchronization is complete.
acks=1 (default): the producer waits for the leader to receive the data successfully and confirm it before sending the next message.
acks=-1: the producer waits for confirmation from the followers as well before sending the next piece of data.
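A minimal sketch of a loss-averse producer configuration, assuming a placeholder broker and topic (for acks=all to be meaningful, min.insync.replicas must also be set appropriately on the topic or broker):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("acks", "all");                // same as -1: wait for all in-sync replicas
        props.put("retries", Integer.MAX_VALUE); // retry transient failures instead of dropping
        props.put("enable.idempotence", "true"); // retries must not introduce duplicates
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                    (metadata, e) -> { if (e != null) e.printStackTrace(); });
            producer.flush(); // drain the send buffer before shutting down
        }
    }
}
```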
No loss of consumer data
Loss is prevented through offset commits: Kafka records the offset of each consumed message, and the next time the consumer runs it continues from the last committed offset.
The offset information is stored in ZooKeeper before Kafka 0.8 and in an internal topic after 0.8. Even if a consumer dies mid-run, on restart it looks up the committed offset, finds where the previous consumption stopped, and resumes from there. Because the offset is not written after every single message is consumed, this can lead to duplicate consumption, but messages will not be lost.
The one exception is when two consumer groups that are meant to do different jobs in the program are given the same group id (for example via KafkaSpoutConfig.Builder.setGroupId). With the same group id they behave as a single group and share the data: group A ends up consuming the messages of partition1 and partition2 while group B consumes only partition3, so the messages seen by each group are lost and incomplete. To ensure every group gets its own full message data, group ids must not be duplicated.
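A minimal sketch of at-least-once consumption with manual commits, processing first and committing after (broker, topic, and group id are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-group");                // must be unique per logical group
        props.put("enable.auto.commit", "false");         // commit manually, after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    process(r); // a crash here re-reads the uncommitted batch: duplicates, not loss
                }
                consumer.commitSync(); // commit only after the whole batch is processed
            }
        }
    }

    static void process(ConsumerRecord<String, String> r) {
        System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
    }
}
```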
No loss of broker data in the Kafka cluster
We usually configure a replication factor (number of copies) for the partitions on each broker. When the producer writes, it first writes to the leader according to the distribution strategy (to a specified partition if one is given, by key if a key is given, otherwise round-robin), and the followers (replicas) then synchronize the data from the leader. With these backups, the message data is not lost.
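A minimal sketch of creating a replicated topic with the AdminClient (the topic name, partition count, and replication factor are arbitrary choices):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("my-topic", 3, (short) 3) // 3 partitions, 3 replicas
                    .configs(Map.of("min.insync.replicas", "2"));   // pairs with producer acks=all
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```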
5. Why choose Kafka to collect data?
The collection layer can mainly use technologies such as Flume and Kafka.
Flume: a pipeline-flow tool that provides many default implementations, letting users deploy through configuration parameters and extend through its API.
Kafka: a persistent distributed message queue and a very general-purpose system: you can have many producers and many consumers sharing multiple topics.
By contrast, Flume is a dedicated tool designed to send data to HDFS and HBase; it has special optimizations for HDFS and integrates Hadoop's security features.
Therefore, Cloudera recommends Kafka if the data will be consumed by multiple systems, and Flume if the data is destined for Hadoop.
6. Will restarting Kafka result in data loss?
Kafka writes its data to disk, so a restart generally does not lose data.
However, if consumers are consuming messages while Kafka restarts and their offsets are not committed in time, the results can be inaccurate (messages lost to the consumer or consumed repeatedly).
7. How do you handle a Kafka outage?
First, consider whether the business is affected.
When Kafka goes down, the first thing to consider is whether the services it provides are affected by the dead machine. If service provision is fine and the cluster's disaster-recovery mechanism is in place, there is nothing to worry about in this respect.
Node troubleshooting and recovery
To restore the failed node to the cluster, the main step is to analyze the logs to find the cause of the crash, fix the underlying problem, and then bring the node back.
8. Why doesn't Kafka support read-write separation?
In Kafka, both the producer's writes and the consumer's reads go to the leader replica; the followers do not serve clients, so Kafka implements a write-to-leader, read-from-leader production and consumption model.
Kafka does not support writing to the master while reading from slaves because that design has two obvious drawbacks:
Data consistency: there is inevitably a delay window while data travels from the master node to a slave node, during which the two are inconsistent. Suppose at some moment the value of entry A is X on both master and slave, and the master then changes A to Y. Until that change reaches the slave, an application reading A from the slave still sees the stale X rather than the latest Y, a data-inconsistency problem.
Latency: even in a component like Redis, a write must travel network → master memory → network → slave memory before it is visible on the slave, and the whole journey takes time. In Kafka, master-slave synchronization is more expensive still: network → master memory → master disk → network → slave memory → slave disk. For latency-sensitive applications, write-master/read-slave is not a good fit.
On the other hand, Kafka's write-to-leader, read-from-leader design has many advantages:
It simplifies the implementation logic of the code and reduces the chance of errors.
Load is spread at a fine, even granularity (partition leaders are distributed across brokers), so compared with write-master/read-slave, not only is load performance better, it is also controllable by the user.
Reads are not affected by replication lag.
When the replicas are stable, there is no data inconsistency.