The underlying principle of Kafka

2025-07-03 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/02 Report--

Many newcomers are not clear about the underlying principles of Kafka. This article explains them in detail for anyone who needs them.

Introduction to Kafka

Apache Kafka is a distributed publish-subscribe messaging system and the dominant message queue in the big data field. Originally developed at LinkedIn in Scala, it was donated to the Apache Software Foundation in 2011 and became a top-level open source project. More than ten years on, it remains an indispensable and increasingly important component in the big data field.

Kafka is suitable for both offline and online message consumption. Messages are kept on disk and replicated within the cluster to prevent data loss. Kafka was built on top of the ZooKeeper synchronization service (newer releases can also run without ZooKeeper, in KRaft mode). It integrates very well with Flink and Spark and is used for real-time streaming data analysis.

Kafka features:

Reliability: replication and fault-tolerance mechanisms.

Scalability: Kafka can add nodes and bring them online without downtime.

Persistence: data is stored on disk and persisted.

Performance: Kafka has high throughput, and performance stays very stable even when data reaches the TB level.

High speed: sequential writes and zero-copy technology keep Kafka latency at the millisecond level.

Basic principle of Kafka

Take a look at the architecture of the Kafka system first.

Kafka architecture

Kafka supports message persistence. The consumer side actively pulls data, and the consumption status and subscription relationship are maintained by the client. A message is not deleted immediately after it is consumed; historical messages are retained. Therefore, even when a topic has multiple subscribers, only one copy of each message is stored.

Broker: a Kafka cluster contains one or more service instances (nodes), each of which is called a broker (a broker is a node / server).

Topic: every message published to the Kafka cluster belongs to a category called a topic.

Partition: a partition is a physical concept. Each topic contains one or more partitions.

Segment: a partition consists of multiple segments, and each segment has two parts, a .log file and an .index file. The .index file is an index used to quickly look up the offset location of data in the .log file.

Producer: the producer of messages, responsible for publishing messages to the Kafka brokers.

Consumer: the consumer of messages, the client that reads messages from the Kafka brokers.

Consumer group: each consumer belongs to a specific consumer group (a groupName can be specified for each consumer).

.log: stores the message data.

.index: stores the index data of the .log file.

Kafka main components

1. Producer

The producer is the component that produces messages in Kafka. The messages it produces are classified by topic and saved to the Kafka brokers.

2. Topic

Kafka classifies messages in units of topics.

A topic refers to one of the different categories of message sources (feeds of messages) processed by Kafka.

A topic is the name under which a series of records is classified and published. Kafka topics always support multiple subscribers; that is, a topic can have zero, one, or many consumers subscribed to the data written to it.

A Kafka cluster can contain a very large number of topics.

Producers produce data and consumers consume data at the topic level in general. Finer granularity is possible at the partition level.

3. Partition

In Kafka, a topic is a classification of messages. A topic can have multiple partitions, and each partition stores part of the topic's data. The data of all partitions combined is all the data of the topic.

Multiple partitions can be created on one broker. The number of brokers has nothing to do with the number of partitions.

In Kafka, each partition has a number, starting at 0.

The data in each partition is ordered, but global ordering across partitions is not guaranteed. (Order refers to the order of production and the order of consumption.)

4. Consumer

The consumer is the component that consumes data in Kafka. Every consumer must belong to a consumer group.

5. Consumer group

A consumer group consists of one or more consumers, and consumers in the same group consume the same message only once.

Each consumer belongs to a consumer group; if none is specified, the consumer belongs to the default group.

Each consumer group has an ID, the group ID. All consumers in the group coordinate to consume all the partitions of a subscribed topic. Each partition can be consumed by only one consumer within a given consumer group, but it can be consumed by several different consumer groups.

The number of partitions determines the maximum number of concurrent consumers in each consumer group. As shown below:

Example 2

As shown in the figure above, different consumer groups consume the same topic, and the topic has four partitions distributed over two nodes. Consumer group 1 on the left has two consumers, so each consumer has to consume two partitions to read all the messages. Consumer group 2 on the right has four consumers, so each consumer consumes one partition.

To summarize the relationship between partitions and consumer groups in Kafka:

Consumer group: consists of one or more consumers, and consumers in the same group consume the same message only once.

The number of consumers in a consumer group that consumes a topic should be less than or equal to the number of partitions of that topic.

For example, if a topic has 4 partitions, then the consumer group should have at most 4 consumers, and preferably a number that divides the partition count evenly (1, 2, or 4). The data in a single partition cannot be consumed by different consumers of the same consumer group at the same time.

Summary: the more partitions there are, the more consumers can consume at the same time, and the faster the data is consumed, which improves consumption performance.
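
The relationship summarized above can be illustrated with a small sketch. This is not Kafka's actual assignment logic; the function name assign_partitions and the round-robin strategy are assumptions chosen only to show that each partition goes to exactly one consumer in a group and that surplus consumers receive nothing.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition numbers over the consumers of one group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        # Each partition is handed to exactly one consumer in the group.
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# A topic with 4 partitions, consumed by groups of 2, 4, and 5 consumers.
group_1 = assign_partitions(4, ["c1", "c2"])
group_2 = assign_partitions(4, ["c1", "c2", "c3", "c4"])
group_3 = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5"])

print(group_1)
print(group_2)
print(group_3)
```

With five consumers and four partitions, the fifth consumer is assigned an empty list, which is why adding consumers beyond the partition count does not increase throughput.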

6. Partition replicas

The partition replicas in Kafka are shown in the following figure:

.index and .log

The left half of the image above is an index file, which stores key-value pairs. The key is the number of the message in the data file (the corresponding log file), such as "1, 3, 6, 8, …", representing the 1st, 3rd, 6th, and 8th messages in the log file.

So why aren't these numbers contiguous in the index file?

This is because the index file does not index every message in the data file. It uses sparse storage and creates an index entry only once every certain number of bytes of data.

This prevents the index file from taking up too much space, so that the index file can be kept in memory.

The disadvantage is that a message without an index entry cannot be located in the data file in a single step. A sequential scan is needed, but the range of that scan is very small.

The value is the physical position of that message in the log data file.

Take the entry (3, 497) in the index file as an example: 3 represents the third message from top to bottom in the log data file on the right,

and 497 indicates that the physical offset address (location) of that message in the file is 497. Because messages are written sequentially, positions only grow as the message number grows.
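
The lookup described in this section can be sketched as follows. It is a simplified model, not Kafka's on-disk format: the entry (3, 497) comes from the text, the other index entries are made-up values, and find_scan_start is a hypothetical helper name. It finds the nearest index entry at or before the wanted message, which is where the short sequential scan of the .log file would begin.

```python
import bisect

# Sparse index: (message number in the segment, byte position in the .log file).
# Only (3, 497) is taken from the article; the other entries are illustrative.
SPARSE_INDEX = [(1, 0), (3, 497), (6, 1407), (8, 1686)]


def find_scan_start(index, target):
    """Return the index entry with the largest message number <= target."""
    keys = [key for key, _ in index]
    # bisect_right returns the insertion point after any equal key, so the
    # entry just before it is the closest one at or before the target.
    pos = bisect.bisect_right(keys, target) - 1
    if pos < 0:
        raise KeyError(f"message {target} is before the first index entry")
    return index[pos]


indexed = find_scan_start(SPARSE_INDEX, 3)      # message 3 has its own entry
not_indexed = find_scan_start(SPARSE_INDEX, 7)  # message 7 does not

print(indexed)
print(not_indexed)
```

For message 7 the search lands on the entry for message 6, so the reader would start at that position in the .log file and scan forward one message.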

Log directory and its composition

Kafka creates folders under the log.dir directory we specify, each named (topic name-partition number). In the (topic name-partition number) directory, there will be two files, as follows:

# Index file
00000000000000000000.index
# Log content
00000000000000000000.log

The files in the directory are split according to the size of the log. When a .log file reaches 1 GB, a new pair of files is started, as follows:

-rw-r--r--. 1 root root 389k Jan 17 18:03 00000000000000000000.index
-rw-r--r--. 1 root root 1.0G Jan 17 18:03 00000000000000000000.log
-rw-r--r--. 1 root root  10M Jan 17 18:03 00000000000000077894.index
-rw-r--r--. 1 root root 127M Jan 17 18:03 00000000000000077894.log

In the design of Kafka, the offset value is included as part of the file name.

Segment file naming convention: the first segment of a partition starts at 0, and each subsequent segment file is named after its base offset (the offset of the first message it holds, which follows directly on the last offset of the previous segment). The value is a 64-bit long, written as 20 digits and left-padded with zeros.

A message can be located quickly through the index information. Mapping all index metadata into memory avoids disk I/O on the segment index files.

Storing the index sparsely greatly reduces the space taken up by index metadata.

Sparse index: an index is built for the data, but entries are created per interval rather than per item.

Benefit: it reduces the number of index entries.

Downside: after finding the index interval, a second step (a scan within the interval) is needed.
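
The naming convention can be sketched in a few lines. This is an illustration under stated assumptions, not Kafka's source code: segment_file_names and find_segment are hypothetical helper names, each file name is assumed to be the base offset of its segment, and an offset equal to a base offset is assumed to belong to that segment.

```python
import bisect


def segment_file_names(base_offset):
    """Build the .index and .log file names for a segment."""
    stem = f"{base_offset:020d}"  # 20 digits, left-padded with zeros
    return stem + ".index", stem + ".log"


def find_segment(base_offsets, offset):
    """Return the base offset of the segment that should contain `offset`.

    `base_offsets` must be sorted in ascending order.
    """
    pos = bisect.bisect_right(base_offsets, offset) - 1
    if pos < 0:
        raise KeyError(f"offset {offset} is before the first segment")
    return base_offsets[pos]


# Base offsets taken from the directory listing in the article.
base_offsets = [0, 77894]

names = segment_file_names(77894)
early = find_segment(base_offsets, 500)
late = find_segment(base_offsets, 80000)

print(names)
print(early, late)
```

Because the file names sort in offset order, finding the right segment is a binary search over the names, after which the sparse index of that segment narrows the position further.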

8. Physical structure of message

Every piece of data a producer sends to Kafka is wrapped by Kafka as a message.

Mechanisms for preventing data loss in Kafka

1. No data loss when the producer produces data

Ways of sending messages

Producers send data to Kafka either synchronously or asynchronously.

Synchronous mode:

After sending a batch of data to Kafka, the producer waits for Kafka to return the result:

The producer waits for 10 seconds, and if the broker does not return an ack response, the send is considered failed.

The producer retries 3 times and reports an error if there is still no response.

Asynchronous mode:

The producer sends a batch of data to Kafka and only provides a callback function:

The data is first saved in a buffer on the producer side. The buffer size is 20,000.

Data is sent as soon as either the data threshold or the quantity threshold is met.

The size of a batch of data sent is 500.

Note: if the broker is slow to return an ack and the buffer fills up, the developer can configure whether to discard the data in the buffer directly.
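
The synchronous behaviour described above can be modelled with a short sketch. It does not use a real Kafka client: FlakyBroker and send_sync are hypothetical names, the broker is a stand-in that simply fails a fixed number of times, and "retries 3 times" is assumed to mean one initial attempt plus up to three retries.

```python
class FlakyBroker:
    """Stand-in broker that gives no ack for the first `failures` requests."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def send(self, message, timeout_s):
        if self.failures > 0:
            self.failures -= 1
            return False  # models "no ack arrived within timeout_s"
        self.received.append(message)
        return True


def send_sync(broker, message, timeout_s=10, retries=3):
    """Send and wait for the ack; return how many attempts were needed."""
    for attempt in range(1 + retries):
        if broker.send(message, timeout_s):
            return attempt + 1
    raise TimeoutError(f"no ack after {1 + retries} attempts")


broker = FlakyBroker(failures=2)
attempts_used = send_sync(broker, "order-created")
print(attempts_used, broker.received)
```

If the broker never answers, send_sync raises TimeoutError after the last retry, which corresponds to the error the producer reports.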
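
The asynchronous buffering described above can be sketched as a small in-memory model. It is not the real producer implementation: ProducerBuffer and its parameters are hypothetical names, only the quantity threshold is modelled (the text does not define the data threshold), and the default numbers are the 20,000 and 500 quoted above.

```python
class ProducerBuffer:
    """Producer-side buffer that releases messages in fixed-size batches."""

    def __init__(self, capacity=20000, batch_size=500, drop_when_full=False):
        self.capacity = capacity
        self.batch_size = batch_size
        self.drop_when_full = drop_when_full
        self.pending = []

    def append(self, message):
        """Buffer a message; return False if the buffer is full and it is rejected."""
        if len(self.pending) >= self.capacity:
            if not self.drop_when_full:
                return False
            self.pending.clear()  # configured to discard the buffered data
        self.pending.append(message)
        return True

    def next_batch(self):
        """Return one batch once the quantity threshold is met, else None."""
        if len(self.pending) < self.batch_size:
            return None
        batch = self.pending[:self.batch_size]
        del self.pending[:self.batch_size]
        return batch


buffer = ProducerBuffer()
for i in range(1200):
    buffer.append(f"msg-{i}")

batches = []
while (batch := buffer.next_batch()) is not None:
    batches.append(batch)

print(len(batches), len(buffer.pending))
```

Messages that do not yet fill a batch stay in the buffer until more data arrives. The drop_when_full flag models the choice mentioned in the note: reject new data, or empty the buffer when it is full.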

Ack mechanism (confirmation mechanism)

When the producer sends data, the server needs to return a confirmation code, that is, the ack response code. The ack response has three status values: 0, 1, and -1.

0: the producer is only responsible for sending data and does not care whether the data is lost. Lost data needs to be sent again.

1: the leader of the partition has received the data. Regardless of whether the followers have synchronized the data, the response status code is 1.

-1: all follower nodes have received the data, and the response status code is -1.

If the broker side does not return the ack status, the producer will never know whether the send succeeded. The producer can set a timeout of 10 s, after which the send is considered failed.
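
The three ack levels can be compared with a small decision function. This is a model of the rules listed above, not broker code: is_acknowledged is a hypothetical name, and the replica states are passed in as plain booleans.

```python
def is_acknowledged(acks, leader_has_data, followers_have_data):
    """Decide whether a send counts as confirmed under the given ack level."""
    if acks == 0:
        # The producer does not wait for any confirmation.
        return True
    if acks == 1:
        # Only the partition leader must have the data.
        return leader_has_data
    if acks == -1:
        # The leader and every follower must have the data.
        return leader_has_data and all(followers_have_data)
    raise ValueError(f"unsupported acks value: {acks!r}")


# The leader has written the message, but one of two followers has not yet.
followers = [True, False]
results = {acks: is_acknowledged(acks, True, followers) for acks in (0, 1, -1)}
print(results)
```

In this scenario levels 0 and 1 report success while -1 does not, which is the trade-off between latency and durability that the ack setting controls.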

2. No data loss in the broker

In the broker, data loss is prevented mainly through the replication factor (redundancy).

3. No data loss when the consumer consumes data

When consumers consume data, as long as each consumer records its offset, the data can be guaranteed not to be lost. That is, we need to maintain the offset ourselves, and it can be saved in Redis.
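
Maintaining the offset ourselves can be sketched as follows. No Kafka or Redis client is used here: OffsetStore is an in-memory stand-in for an external store such as Redis, the partition is modelled as a Python list, and all names are hypothetical. The point shown is the order of operations: handle the message first, then save the next offset.

```python
class OffsetStore:
    """In-memory stand-in for an external offset store such as Redis."""

    def __init__(self):
        self._offsets = {}

    def load(self, group, topic, partition):
        return self._offsets.get((group, topic, partition), 0)

    def save(self, group, topic, partition, next_offset):
        self._offsets[(group, topic, partition)] = next_offset


def consume(log, store, group, topic, partition, handle, max_messages):
    """Process up to max_messages starting from the saved offset."""
    start = store.load(group, topic, partition)
    end = min(start + max_messages, len(log))
    for offset in range(start, end):
        handle(log[offset])
        # Save only after handling, so a crash leads to a re-read, not a skip.
        store.save(group, topic, partition, offset + 1)


partition_log = ["a", "b", "c", "d", "e"]
store = OffsetStore()
seen = []

consume(partition_log, store, "group-1", "demo-topic", 0, seen.append, 2)
# Simulated restart: a new call resumes from the saved offset.
consume(partition_log, store, "group-1", "demo-topic", 0, seen.append, 10)

print(seen, store.load("group-1", "demo-topic", 0))
```

Saving after processing means a message may be handled twice after a crash but is never skipped, so handlers written this way should tolerate repeats.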
