What are the relevant knowledge points of Kafka

2025-02-27 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article walks through common Kafka interview questions and the knowledge points behind them, from message middleware basics to Kafka's architecture, usage, high availability, and performance.

Let's start with distributed message middleware

Interview questions:

What is distributed message middleware?

What is the function of message middleware?

What is the usage scenario of message middleware?

Selection of message middleware?

Message queue

Distributed messaging is a communication mechanism distinct from RPC, HTTP, RMI, and similar direct-call approaches: with message middleware, systems communicate indirectly through a distributed intermediary.

As shown in the figure, after introducing message middleware, the upstream business system sends messages that are first stored in the middleware and then distributed to the corresponding downstream business applications (a distributed producer-consumer model).

This asynchronous approach reduces the degree of coupling between services.
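This decoupling can be sketched with an in-process queue. The sketch is illustrative only: queue.Queue stands in for the message middleware, while a real broker adds persistence, replication, and delivery guarantees. The point is that the upstream side returns immediately after enqueueing, and the downstream side consumes at its own pace.

```python
import queue
import threading

broker = queue.Queue()   # stands in for the message middleware
processed = []

def upstream_send(order_id):
    # Returns immediately: the sender has no coupling to downstream
    # availability or processing speed.
    broker.put({"order_id": order_id})

def downstream_worker():
    while True:
        msg = broker.get()
        if msg is None:              # sentinel to stop the worker
            break
        processed.append(msg["order_id"])

worker = threading.Thread(target=downstream_worker)
worker.start()

for oid in range(3):
    upstream_send(oid)

broker.put(None)
worker.join()
print(processed)  # [0, 1, 2]
```

If the downstream worker is slow or temporarily down, messages simply accumulate in the buffer instead of failing the upstream call, which is the peak-clipping and recoverability benefit listed below.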

Architecture

Defining message middleware:

It uses an efficient and reliable message delivery mechanism for platform-independent data exchange.

It integrates distributed systems on the basis of data communication.

By providing a message passing and message queuing model, it extends inter-process communication into a distributed environment.

Introducing an additional component into a system architecture inevitably increases its complexity and operational burden, so what advantages does distributed message middleware bring?

What is the role of message middleware in the system?

Decoupling

Redundancy (storage)

Expansibility

Peak clipping

Recoverability

Sequence guarantee

Buffer

Asynchronous communication

During interviews, interviewers often probe a candidate's ability to select open-source components. This tests both the breadth of the candidate's knowledge and the depth of their understanding of a particular class of system, and also reveals their ability to grasp a system as a whole and design its architecture.

There are many open-source distributed messaging systems, each with different characteristics. Choosing one requires both a working knowledge of each messaging system and a clear understanding of your own system's requirements.

The following is a comparison of several common distributed messaging systems:

Message queue selection

Answer keyword:

What is distributed message middleware? Communication, queuing, distributed, production-consumer model.

What is the function of message middleware? Decoupling, peak processing, asynchronous communication, buffering.

What is the usage scenario of message middleware? Asynchronous communication, message storage processing.

Selection of message middleware? Language, protocol, HA, data reliability, performance, transaction, ecology, simplicity, push-pull mode.

Basic concepts and architecture of Kafka

Interview questions:

Briefly talk about the architecture of Kafka?

Does Kafka use push mode or pull mode? What's the difference between push and pull?

How does Kafka broadcast messages?

Is the message from Kafka orderly?

Does Kafka support read-write separation?

How does Kafka ensure the high availability of data?

What is the role of Zookeeper in Kafka?

Do you support transactions?

Can the number of partitions be reduced?

Architecture diagram

General concepts in the Kafka architecture:

Producer: the producer, i.e. the party that sends messages. The producer creates messages and sends them to Kafka.

Consumer: the consumer, i.e. the party that receives messages. The consumer connects to Kafka, receives messages, and processes them with the corresponding business logic.

Consumer Group: a consumer group contains one or more consumers. Combining multiple partitions with multiple consumers can greatly improve downstream processing throughput.

Consumers within the same consumer group do not consume the same message more than once, and consumers in different groups do not affect each other. Kafka implements both the point-to-point (P2P) model and the broadcast model through consumer groups.

Broker: service proxy node. Broker is the service node of Kafka, that is, the server of Kafka.

Topic: messages in Kafka are organized by Topic. Producers send messages to a specific Topic, and consumers subscribe to a Topic and consume its messages.

Partition: a Topic is a logical concept that can be subdivided into multiple partitions, each of which belongs to a single Topic.

Different partitions under the same Topic contain different messages. At the storage level, a partition can be regarded as an appendable log file; a message is assigned a specific Offset when it is appended to the partition's log.

Offset: the Offset is the unique identifier of a message within its partition. Kafka uses it to guarantee ordering within a partition, but Offsets do not span partitions; that is, Kafka guarantees partition-level ordering, not Topic-level ordering.

Replication: replicas are how Kafka ensures high availability of data. The same Partition can have multiple replicas on multiple Brokers, of which normally only the Leader replica serves reads and writes. When the Broker hosting the Leader replica crashes or a network failure occurs, Kafka elects a new Leader replica, under the management of the Controller, to continue serving.

Record: a message actually written to Kafka that can be read. Each Record contains a key, a value, and a timestamp.

Kafka Topic Partitions Layout, as shown below:

Theme

Kafka partitions the Topic, and the partition can read and write concurrently.

Kafka Consumer Offset, as shown below:

Consumer Offset

The Zookeeper architecture is shown below:

Broker registration: Brokers are distributed and independent of each other; Zookeeper manages all Broker nodes registered with the cluster.

Topic registration: in Kafka, messages of the same Topic are divided into multiple partitions distributed across multiple Brokers. The partition information and its mapping to Brokers are also maintained by Zookeeper.

Producer load balancing: since a Topic's messages are partitioned across multiple Brokers, the producer needs to distribute messages to these Brokers sensibly.

Consumer load balancing: like producers, consumers in Kafka also need load balancing, so that multiple consumers receive messages from the right Brokers. Each consumer group contains several consumers, and each message is delivered to only one consumer in the group. Different consumer groups consume their own subscribed Topics without interfering with each other.

Answer keyword:

Briefly talk about the architecture of Kafka? Producer, Consumer, Consumer Group, Topic, Partition.

Does Kafka use push mode or pull mode? What's the difference? Kafka's Producer pushes messages to the Broker, while the Consumer pulls them. Pull mode lets the consumer manage its own offset, which improves read performance.

How does Kafka broadcast messages? Consumer group.

Is the message of Kafka ordered? It is unordered at the Topic level, and ordered within a Partition.

Does Kafka support read-write separation? No, only Leader provides read and write services.

How does Kafka ensure the high availability of data? Replicas, Ack, HW.

What is the role of zookeeper in Kafka? Cluster management, metadata management.

Do you support transactions? Transactions are supported since 0.11, enabling "exactly once" semantics.

Can the number of partitions be reduced? No, data will be lost.

Kafka usage

Interview questions:

What command line tools does Kafka have? What have you used?

The implementation process of Kafka Producer?

What are the common configurations of Kafka Producer?

How to keep Kafka messages in order?

How does Producer ensure that data transmission is not lost?

How to improve the performance of Producer?

If the number of Consumers under the same Group is greater than the number of partitions, what does Kafka do?

Is Kafka Consumer thread-safe?

Let's talk about the threading model when you use Kafka Consumer to consume messages. Why is it so designed?

What is the common configuration of Kafka Consumer?

When will Consumer be kicked out of the cluster?

How does Kafka react when Consumer joins or quits?

What is Rebalance and when will Rebalance occur?

Command line tool

Kafka's command line tools live in the /bin directory of the Kafka distribution and mainly include service and cluster management scripts, configuration scripts, information viewing scripts, Topic scripts, client scripts, and so on.

kafka-configs.sh: configuration management script.

kafka-console-consumer.sh: Kafka consumer console.

kafka-console-producer.sh: Kafka producer console.

kafka-consumer-groups.sh: Kafka consumer group information.

kafka-delete-records.sh: deletes log records below a given offset (low watermark).

kafka-log-dirs.sh: Kafka message log directory information.

kafka-mirror-maker.sh: Kafka cross-datacenter cluster replication tool.

kafka-preferred-replica-election.sh: triggers the preferred replica election.

kafka-producer-perf-test.sh: Kafka producer performance test script.

kafka-reassign-partitions.sh: partition reassignment script.

kafka-replica-verification.sh: replica synchronization progress verification script.

kafka-server-start.sh: starts the Kafka service.

kafka-server-stop.sh: stops the Kafka service.

kafka-topics.sh: Topic management script.

kafka-verifiable-consumer.sh: verifiable Kafka consumer.

kafka-verifiable-producer.sh: verifiable Kafka producer.

zookeeper-server-start.sh: starts the ZooKeeper service.

zookeeper-server-stop.sh: stops the ZooKeeper service.

zookeeper-shell.sh: ZooKeeper client.

We can usually use the kafka-console-consumer.sh and kafka-console-producer.sh scripts to test Kafka production and consumption.

kafka-topics.sh is used to view and manage the Topics in the cluster, while kafka-consumer-groups.sh is usually used to view Kafka's consumer groups.

Kafka Producer

The normal production logic of Kafka producer consists of the following steps:

Configure the producer client parameters and create the producer instance.

Build the message to be sent.

Send a message.

Close the producer instance.

The process of sending a message with the Producer is shown in the figure below: the message passes through the interceptor, serializer, and partitioner, and is finally sent to the Broker in batches by the accumulator.

Producer

Kafka Producer has the following required parameters:

bootstrap.servers: the addresses of the Kafka Brokers.

key.serializer: the key serializer.

value.serializer: the value serializer.

Common parameters:

batch.num.messages, default: 2000. The number of messages per batch; only applies to async mode.

request.required.acks, default: 0. 0 means the producer does not wait for leader acknowledgment; 1 means the leader acknowledges as soon as the message is written to its local log; -1 means acknowledgment only after all replicas have the message.

It only takes effect in async mode. Tuning this parameter is a tradeoff between avoiding data loss and transmission efficiency: if you are not sensitive to data loss but care about efficiency, consider setting it to 0, which greatly improves the producer's send efficiency.

request.timeout.ms, default: 10000. Acknowledgment timeout.

partitioner.class, default: kafka.producer.DefaultPartitioner. A custom partitioner must implement kafka.producer.Partitioner and provide a partitioning strategy based on the Key.

Sometimes we need messages of the same type to be processed in order, so we have to customize the partitioning strategy to route data of the same type to the same partition.
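Such a key-based strategy can be sketched as follows. This is illustrative only: zlib.crc32 is chosen here as a stable hash, whereas Kafka's Java DefaultPartitioner uses murmur2, so the actual partition numbers a real client computes will differ.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # The same key always hashes to the same partition, so records sharing a
    # key preserve their relative order (Kafka only guarantees ordering
    # within a partition, not across a Topic).
    return zlib.crc32(key) % num_partitions

# All events for the same order land in one partition:
p1 = partition_for(b"order-42", 8)
p2 = partition_for(b"order-42", 8)
print(p1 == p2)  # True
```

The key insight is determinism: any stable hash of the key gives per-key ordering, because every message with that key is appended to the same partition log.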

producer.type, default: sync. Specifies whether messages are sent synchronously or asynchronously: async sends in batches via kafka.producer.AsyncProducer, while sync sends via kafka.producer.SyncProducer. The choice of synchronous vs. asynchronous sending also affects message production efficiency.

compression.codec, default: none, i.e. no compression. Other options are "gzip", "snappy", and "lz4". Compressing messages greatly reduces the amount of data on the network and the network IO, thus improving overall performance.

compressed.topics, default: null. If compression is enabled, this can restrict compression to specific Topics; if unspecified, all Topics are compressed.

message.send.max.retries, default: 3. The maximum number of retries when sending a message.

retry.backoff.ms, default: 300. The additional backoff interval before each retry.

topic.metadata.refresh.interval.ms, default: 600000. The interval at which metadata is refreshed periodically.

The Producer also actively refreshes metadata when a partition is lost or the leader becomes unavailable. If set to 0, metadata is fetched on every send, which is not recommended; if negative, metadata is refreshed only on failure.

queue.buffering.max.ms, default: 5000. The maximum time data may sit in the producer's queue; only applies to async mode.

queue.buffering.max.messages, default: 10000. The maximum number of buffered messages; only applies to async mode.

queue.enqueue.timeout.ms, default: -1. Behavior when the queue is full: 0 discards the message immediately, a negative value blocks indefinitely, and a positive value blocks for that many milliseconds; only applies to async mode.

Kafka Consumer

Kafka has the concept of consumer groups. Each consumer only consumes messages from the partitions assigned to it, and each partition can be consumed by only one consumer within a group. Therefore, if the number of consumers in a group exceeds the number of partitions, some consumers will not be assigned any partition.

The relationship between the consumer group and the consumer is shown in the following figure:

Consumer Group

Kafka Consumer Client consumption messages usually include the following steps:

Configure clients and create consumers

Subscribe to topics

Pull the message and consume it

Commit the consumed offsets

Close the consumer instance

Process

Because Kafka's Consumer client is not thread-safe, to guarantee thread safety while improving consumption performance, we can use a Reactor-like thread model on the Consumer side: a single thread pulls data, and a pool of worker threads processes it.
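A minimal sketch of this poll/process separation, with no broker involved: fake_poll() below stands in for the consumer's poll() call, and is an assumption of this sketch rather than a real client API.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

results = []
lock = threading.Lock()

def fake_poll():
    # Stands in for KafkaConsumer.poll(); yields batches of records.
    yield ["a", "b"]
    yield ["c"]

def process(record):
    # Worker-side business logic; guarded because workers run concurrently.
    with lock:
        results.append(record.upper())

with ThreadPoolExecutor(max_workers=4) as pool:
    # Only the main thread ever touches the (non-thread-safe) consumer;
    # records are handed off to the pool for processing.
    for batch in fake_poll():
        for record in batch:
            pool.submit(process, record)

print(sorted(results))  # ['A', 'B', 'C']
```

Note that handing records to a pool sacrifices per-partition processing order unless you route each partition's records to a dedicated worker, which is exactly why the design question "why separate pull and process?" comes up in interviews.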

Consumption model

Kafka Consumer parameters:

bootstrap.servers: Broker addresses in host:port format.

group.id: the consumer group the consumer belongs to.

key.deserializer: deserializer for the key, corresponding to the producer's key.serializer.

value.deserializer: deserializer for the value, corresponding to the producer's value.serializer.

session.timeout.ms: the time after which the coordinator considers a consumer failed, default 10s. This is the interval within which a group member (consumer) must prove its liveness, similar to a heartbeat expiration time, and is how crashed group members are detected.

auto.offset.reset: what the consumer should do when it reads a partition with no committed offset, or an invalid one (e.g. the consumer was down for so long that the record at the current offset has been deleted). The default, latest, reads from the most recent records (those produced after the consumer started); the alternative, earliest, reads from the beginning of the partition.

enable.auto.commit: whether to commit offsets automatically. If false, offsets must be committed manually in the program. For exactly-once semantics, manual commit is preferable.

fetch.max.bytes: the maximum number of bytes pulled per fetch.

max.poll.records: the maximum number of messages returned by a single poll call, default 500. If the processing logic is light, this can be increased, but the max.poll.records messages must be processed within session.timeout.ms.

request.timeout.ms: the maximum time to wait for a response to a request. If no response arrives within the timeout, Kafka either resends the request or marks it failed once the retry count is exhausted.

Kafka Rebalance

Rebalance is essentially a protocol that specifies how all the Consumers under a Consumer Group agree on the allocation of the subscribed Topic's partitions.

For example, there are 20 Consumer under a Group that subscribes to a Topic with 100 partitions.

Normally, Kafka allocates an average of five partitions per Consumer. This process of allocation is called Rebalance.

① When does Rebalance happen? This is also a frequently asked question.

There are three triggering conditions for Rebalance:

Group membership changes (a new Consumer joins the group, an existing Consumer actively leaves the group, or an existing Consumer crashes; the difference between leaving and crashing is discussed later)

The number of subscription topics has changed

The number of partitions for subscription topics has changed

② How are partitions allocated within a group?

Kafka provides two allocation strategies by default: Range and Round-Robin. The allocation strategy is pluggable, so you can also implement your own allocator with a different strategy.
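The two default strategies can be sketched as follows. This mirrors the simplified single-topic description above; the real RangeAssignor and RoundRobinAssignor in Kafka handle multiple topics and richer membership metadata, so treat this as an illustration of the allocation logic only.

```python
def range_assign(consumers, partitions):
    # Range: sort both lists; each consumer gets a contiguous span, and the
    # first (n_partitions % n_consumers) consumers get one extra partition.
    consumers, partitions = sorted(consumers), sorted(partitions)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        n = per + (1 if i < extra else 0)
        assignment[c] = partitions[start:start + n]
        start += n
    return assignment

def round_robin_assign(consumers, partitions):
    # Round-robin: deal partitions out to consumers one at a time, in order.
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

parts = list(range(7))
print(range_assign(["c1", "c2", "c3"], parts))
# {'c1': [0, 1, 2], 'c2': [3, 4], 'c3': [5, 6]}
print(round_robin_assign(["c1", "c2", "c3"], parts))
# {'c1': [0, 3, 6], 'c2': [1, 4], 'c3': [2, 5]}
```

The sketch also shows why Range can skew load across many topics (the first consumers always take the remainder), which is a common follow-up question.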

Answer keyword:

What command line tools does Kafka have? What have you used? The scripts under /bin: for managing Kafka clusters, managing Topics, and producing and consuming for testing.

The implementation process of Kafka Producer? Interceptor, serializer, partitioner, and accumulator.

What are the common configurations of Kafka Producer? Broker configuration, Ack configuration, network and send parameters, compression parameters, Ack parameters.

How to keep Kafka messages ordered? Kafka is unordered at the Topic level and ordered only within a Partition, so to guarantee processing order, use a custom partitioner to send the data that must be processed in order to the same Partition.

How does Producer ensure that data transmission is not lost? Ack mechanism, retry mechanism.

How to improve the performance of Producer? Batch, asynchronous, compression.

If the number of Consumers under the same Group is greater than the number of partitions, what does Kafka do? The excess Consumers sit idle and do not consume data.

Is Kafka Consumer thread-safe? It is not; consume in a single thread and process with multiple threads.

Let's talk about the threading model you use when consuming messages with Kafka Consumer. Why is it designed this way? Pulling and processing are separated.

Common configuration of Kafka Consumer? Broker, network and pull parameters, heartbeat parameters.

When will a Consumer be kicked out of the cluster? Crash, network exception, or processing taking so long that the offset commit times out.

How does Kafka react when Consumer joins or quits? Conduct Rebalance.

What is Rebalance and when does it happen? On Topic changes and Consumer changes.

High availability and performance

Interview questions:

How does Kafka ensure high availability?

What is the delivery semantics of Kafka?

What is the role of Replica?

What are AR and ISR?

What are Leader and Follower?

What do HW, LEO, LSO, LW, and so on represent in Kafka?

What does Kafka do to ensure superior performance?

Partition and copy

Partition copy

In distributed data systems, partitions are usually used to improve the processing power of the system, and replicas are used to ensure the high availability of data.

Multiple partitions give the system the ability to process concurrently; among a partition's replicas, only one is the Leader while the others are Followers.

Only the Leader replica serves external requests; the Follower replicas are usually stored on different Brokers from the Leader.

This mechanism achieves high availability: when one machine goes down, a Follower replica elsewhere can quickly be promoted and begin serving.

① Why don't Follower replicas serve reads?

This is essentially a trade-off between performance and consistency. Imagine what would happen if Follower replicas also served reads.

Performance would certainly improve, but a series of problems would follow, similar to phantom reads and dirty reads in database transactions.

For example, suppose you write a message to Kafka topic a, and consumer b then reads from topic a but cannot see it, because the partition replica that consumer b reads from has not yet received the latest message.

Meanwhile another consumer, c, can consume the latest message because it reads from the Leader replica.

Kafka uses HW and Offset management to determine which data a Consumer can consume and which data has actually been written.

Watermark
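A simplified sketch of the high-watermark rule: the HW is the minimum LEO (log end offset) across the ISR replicas, and consumers can only read offsets strictly below it. This assumes the log starts at offset 0; real HW advancement involves per-fetch bookkeeping on the leader, so this is an illustration of the invariant, not of the protocol.

```python
def high_watermark(isr_leos):
    # HW = smallest LEO among ISR replicas: a message becomes consumable
    # only once every in-sync replica has it.
    return min(isr_leos.values())

def visible_offsets(isr_leos):
    # Consumers can read offsets in [0, HW), i.e. strictly below the HW.
    return range(0, high_watermark(isr_leos))

isr = {"leader": 10, "follower-1": 8, "follower-2": 9}
hw = high_watermark(isr)
print(hw)                             # 8
print(list(visible_offsets(isr))[-1]) # 7: offsets 8 and 9 are not yet visible
```

This is why a freshly written message can be durable on the leader yet invisible to consumers: visibility waits for the slowest in-sync follower.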

② Since only the Leader serves reads and writes, how is the Leader elected?

Kafka places the replicas that are in sync with the Leader replica into the ISR replica set. The Leader replica is always a member of the ISR, and in some special cases the ISR may contain only the Leader.

When the Leader dies, Kafka detects this via Zookeeper and elects a new Leader from the ISR to continue serving.

But there is another problem: as mentioned, the ISR may contain only the Leader. When that Leader dies, the ISR becomes empty. What then?

In that case, if the unclean.leader.election.enable parameter is set to true, Kafka will elect a Leader from the out-of-sync replicas, i.e. those not in the ISR.

③ Replication introduces the problem of replica synchronization.

Kafka maintains the set of in-sync replicas (ISR) within all assigned replicas (AR). When a Producer sends a message to a Broker, the ACK configuration determines how many replicas must have the message before the send is considered successful.

Inside the Broker, the ReplicaManager service manages data synchronization between Followers and the Leader.

Sync

Performance optimization:

Partition concurrency.

Read and write disks sequentially.

Page Cache: read and write by page.

Pre-read: Kafka reads the messages to be consumed into memory in advance.

High performance serialization (binary).

Memory mapping.

Lock-free Offset management: improve concurrency.

Java NIO model.

Batch: batch read and write.

Compression: message compression, storage compression, reducing network and IO overhead.

Partition concurrency

On the one hand, because different Partitions can live on different machines, we can exploit the cluster to process in parallel across machines.

On the other hand, because each Partition physically corresponds to a folder, even when multiple Partitions sit on the same node, they can be placed on different disk drives of that node, achieving parallelism across disks and exploiting multiple disks.

Sequential reading and writing

Kafka splits each Partition's log directory into data files of equal size (the segment size is configurable; the default log.segment.bytes is 1 GB).

Each data file is called a segment file, and data is only ever appended to the current segment.
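A toy sketch of this append-only segment layout. Real Kafka rolls segments by bytes and time (log.segment.bytes, log.roll.ms); here a record count stands in for size, so this only illustrates the rolling and offset-assignment idea.

```python
class SegmentedLog:
    # Toy model of one partition's log: records are only appended, each gets
    # a monotonically increasing offset, and a new segment is rolled once the
    # active one is full.
    def __init__(self, max_per_segment=3):
        self.max_per_segment = max_per_segment
        self.segments = [[]]          # list of segments; the last is active
        self.next_offset = 0

    def append(self, record):
        if len(self.segments[-1]) >= self.max_per_segment:
            self.segments.append([])  # roll a new segment
        self.segments[-1].append((self.next_offset, record))
        self.next_offset += 1
        return self.next_offset - 1   # offset assigned to this record

log = SegmentedLog(max_per_segment=3)
offsets = [log.append(f"m{i}") for i in range(7)]
print(offsets)             # [0, 1, 2, 3, 4, 5, 6]
print(len(log.segments))   # 3: segment sizes are 3, 3, 1
```

Append-only segments are what make sequential disk writes possible, and fixed-size segments make retention cheap: old data is deleted by dropping whole segment files rather than rewriting the log.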

Additional data

Answer keyword:

How does Kafka ensure high availability? Through replicas, Producer Acks, retries, automatic Leader election, and Consumer rebalancing.

What is the delivery semantics of Kafka? Delivery semantics generally include at least once, at most once, and exactly once. Kafka implements the first two through the configuration of Ack.

What is the role of Replica? Achieving high availability of data.

What are AR and ISR? AR: Assigned Replicas, the set of replicas assigned when the partition is created after topic creation; the number of replicas is determined by the replication factor.

ISR: In-Sync Replicas, a particularly important concept in Kafka, referring to the subset of AR that is in sync with the Leader.

A replica in AR may not be in the ISR, but the Leader replica is always included in the ISR. Another common interview question about the ISR is how to decide whether a replica belongs in it.

The current criterion is whether the Follower replica's LEO lags behind the Leader's LEO for longer than the Broker parameter replica.lag.time.max.ms. If it does, the replica is removed from the ISR.
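That criterion can be sketched as follows. It is simplified: each replica is represented only by the timestamp of its last full catch-up with the leader, which is the essence of the replica.lag.time.max.ms check.

```python
def shrink_isr(replicas, now_ms, max_lag_ms):
    # A replica stays in the ISR only if it has fully caught up with the
    # leader's LEO within the last max_lag_ms milliseconds.
    # `replicas` maps replica id -> timestamp (ms) of its last catch-up.
    return {r for r, last_caught_up in replicas.items()
            if now_ms - last_caught_up <= max_lag_ms}

replicas = {"leader": 10_000, "f1": 9_500, "f2": 1_000}
isr = shrink_isr(replicas, now_ms=10_000, max_lag_ms=3_000)
print(sorted(isr))  # ['f1', 'leader']: f2 lagged too long and is removed
```

Using lag time rather than a message-count threshold (the pre-0.9 replica.lag.max.messages approach) avoids ejecting healthy followers during short traffic bursts.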

What are Leader and Follower? See above.

What does HW stand for in Kafka? High watermark. This important field controls the range of messages visible to consumers.

An ordinary consumer can only "see" the messages on the Leader replica between the Log Start Offset and the HW (exclusive); messages at or above the high watermark are invisible to consumers.

What does Kafka do to achieve its high performance? Partition concurrency, sequential disk reads and writes, Page Cache, compression, high-performance (binary) serialization, memory mapping, lock-free Offset management, and the Java NIO model.

This concludes the study of Kafka's key knowledge points. Theory works best when paired with practice, so go and try it out!
