
An Analysis of the Basic Principles of Kafka


This article walks through the basic principles of Kafka in some detail. Interested readers may find it a useful reference; I hope it proves helpful.

Brief introduction

Apache Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a fast, scalable, inherently distributed, partitioned, and replicated commit log service.

Kafka architecture

Its architecture includes the following components:

Topic: a particular stream of messages. A message is a payload of bytes, and a topic is the category or feed name to which messages are published.

Producer: any object that can publish messages to a topic.

Broker: published messages are stored on a set of servers called brokers, which together form a Kafka cluster.

Consumer: subscribes to one or more topics and pulls data from the brokers to consume the published messages.
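To make these roles concrete, here is a minimal sketch of a producer publishing to a topic with the standard kafka-clients Java API. The broker address ("localhost:9092") and topic name ("events") are placeholders, not anything prescribed by the article.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The record key (here a user id) influences which partition the record lands in.
            producer.send(new ProducerRecord<>("events", "user-42", "hello kafka"));
        } // close() flushes any records still buffered
    }
}
```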

Kafka storage policy

1) Kafka manages messages by topic. Each topic contains multiple partitions, and each partition corresponds to a logical log composed of multiple segment files.

2) Each segment stores multiple messages. A message's id (its offset) is determined by its logical position in the log, so the id maps directly to the message's storage location and no extra id-to-location mapping is needed.

3) Each partition keeps an in-memory index that records the offset of the first message in each segment (see the sketch after this list).

4) Messages published to a topic are distributed evenly across its partitions (or routed according to user-specified rules). When a broker receives a published message, it appends it to the last segment of the corresponding partition. A segment is flushed to disk once the number of messages in it reaches a configured value or a time threshold since publication is exceeded; only messages that have been flushed to disk are visible to subscribers. Once a segment reaches a certain size, no more data is written to it and the broker creates a new segment.
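The id-to-location scheme in points 2) and 3) can be illustrated with a toy model. This is not Kafka's actual code, just a sketch of the idea: each segment is named by the offset of its first message, so locating the segment for any offset is a floor lookup over the sorted base offsets.

```java
import java.util.TreeMap;

// Toy model of a partition's in-memory segment index (not Kafka's real implementation):
// maps each segment's base offset (the offset of its first message) to a segment file.
public class SegmentIndex {
    private final TreeMap<Long, String> segments = new TreeMap<>();

    public void addSegment(long baseOffset, String fileName) {
        segments.put(baseOffset, fileName);
    }

    // The segment containing an offset is the one with the greatest
    // base offset that is <= the requested offset.
    public String segmentFor(long offset) {
        var entry = segments.floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        SegmentIndex index = new SegmentIndex();
        index.addSegment(0L, "00000000000000000000.log");
        index.addSegment(500L, "00000000000000000500.log");
        System.out.println(index.segmentFor(712)); // -> 00000000000000000500.log
    }
}
```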

Kafka deletion Policy

1) Delete messages older than N days.

2) Keep only the most recent M GB of data.
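These two policies correspond to the topic-level configurations retention.ms and retention.bytes. Below is a sketch of setting them with the Java AdminClient; the topic name and the specific values (7 days, 1 GB) are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            List<AlterConfigOp> ops = List.of(
                // "delete messages older than N days": N = 7 days, in milliseconds
                new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"),
                                  AlterConfigOp.OpType.SET),
                // "keep only the most recent M GB": M = 1 GB, in bytes
                new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"),
                                  AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```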

Kafka broker

Unlike other messaging systems, the Kafka broker is stateless with respect to consumption. This means consumers must maintain their own record of what they have consumed. That information is kept by the consumers themselves and is essentially ignored by the broker (it can optionally be stored on a broker acting as the offset manager).

Deleting messages from the broker then becomes tricky, because the broker does not know whether consumers have already read them. Kafka solves this creatively by applying a simple time-based SLA as its retention policy: once a message has been on the broker longer than a configured period, it is automatically deleted.

This design has a major benefit: a consumer can deliberately rewind to an old offset and consume the data again. That violates the usual contract of a queue, but it turns out to be an essential feature for many consumers.

The following is a summary based on the official Kafka documentation:

Kafka Design

Goals

1) High throughput, to support high-volume event stream processing.

2) Support for loading data from offline systems.

3) Low-latency message delivery.

Persistence

1) Relies on the file system and persists data locally.

2) Data is persisted to a log.

Efficiency

1) Solving the "small I/O problem":

A "message set" groups multiple messages together.

The server writes chunks of messages to the log at once.

Consumers fetch large blocks of messages at a time.

2) Avoiding excessive byte copying:

A single binary message format is shared by producer, broker, and consumer.

The operating system's page cache is used.

sendfile is used to transfer log data, avoiding redundant copies.

End-to-end bulk compression (End-to-end Batch Compression)

Kafka supports the GZIP and Snappy compression codecs.
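On the producer side this is a single setting, compression.type. A minimal sketch, meant to be merged into the producer configuration from the earlier example:

```java
import java.util.Properties;

public class CompressionConfig {
    // Settings to merge into the producer Properties from the earlier sketch.
    static Properties compression() {
        Properties p = new Properties();
        // Whole record batches are compressed end to end; "gzip" and "snappy" are
        // the codecs named above (newer clients also accept "lz4" and "zstd").
        p.put("compression.type", "snappy");
        return p;
    }
}
```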

The Producer

Load balancing

1) The producer can customize the routing rule that decides which partition a message is sent to. The default rule is hash(key) % numPartitions; if the key is null, a partition is chosen at random.

2) Custom routing: if the key is a user id, all messages from the same user can be routed to the same partition, so a consumer can read that user's messages from a single partition (see the partitioner sketch below).
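Here is a sketch of the user-id routing idea using the producer's Partitioner interface. The class name is illustrative, and note that the real default partitioner hashes with murmur2 rather than Java's hashCode; the hash-mod-numPartitions idea, though, matches the description above. It would be registered with props.put("partitioner.class", UserIdPartitioner.class.getName()).

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Illustrative partitioner: all messages with the same user-id key land in one partition.
public class UserIdPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            // the "key is null" case described above: pick a partition at random
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        // hash(key) % numPartitions; mask the sign bit so the result is non-negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```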

Asynchronous batch transmission

Batch sending: messages are accumulated and sent together once a fixed number of messages has been gathered or a fixed delay has elapsed, whichever comes first.
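In the Java producer these two knobs are batch.size and linger.ms; note that the current client bounds a batch by bytes rather than message count. The values below are illustrative.

```java
import java.util.Properties;

public class BatchingConfig {
    // Settings to merge into the producer Properties from the earlier sketch.
    static Properties batching() {
        Properties p = new Properties();
        p.put("batch.size", "16384"); // send once a batch reaches this many bytes...
        p.put("linger.ms", "50");     // ...or this much time has passed, whichever is first
        return p;
    }
}
```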

The Consumer

The consumer controls the reading of messages.

Push vs Pull

1) Producers push data to brokers; consumers pull data from brokers.

2) Advantage of consumer pull: the consumer controls its own read rate and how many messages it fetches.

3) Disadvantage of consumer pull: if the broker has no data, the consumer may have to poll repeatedly. Kafka mitigates this with a "long poll": the consumer can be configured to wait until data becomes available (see the configuration sketch below).
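In the Java consumer, the long poll is shaped by fetch.min.bytes and fetch.max.wait.ms: the broker holds the fetch request until enough data has accumulated or the wait expires. The values below are illustrative.

```java
import java.util.Properties;

public class LongPollConfig {
    // Consumer settings approximating the "long poll" described above.
    static Properties longPoll() {
        Properties p = new Properties();
        p.put("fetch.min.bytes", "1024");  // wait until at least 1 KB is available...
        p.put("fetch.max.wait.ms", "500"); // ...or 500 ms have passed, whichever is first
        return p;
    }
}
```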

Consumer Position

1) In most messaging systems the broker records which messages have been consumed; Kafka does not.

2) In Kafka, consumption is tracked by the consumer itself, which can even rewind to an old offset and consume those messages again (see the seek sketch below).
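Rewinding is explicit in the Java consumer API via seek(). A minimal sketch; the topic, partition, and target offset are placeholders.

```java
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RewindExample {
    // 'consumer' is assumed to be an already-configured KafkaConsumer.
    static void rewind(KafkaConsumer<String, String> consumer) {
        TopicPartition tp = new TopicPartition("events", 0); // placeholder topic/partition
        consumer.assign(List.of(tp));
        consumer.seek(tp, 0L); // jump back to an old offset; the next poll() re-reads from there
    }
}
```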

Message Delivery Semantics

Three kinds:

At most once-Messages may be lost but are never redelivered.

At least once-Messages are never lost but may be redelivered.

Exactly once-This is what people actually want: each message is delivered once and only once.

Producer: the "acks" configuration controls when the leader acknowledges a message back to the producer, i.e. under what conditions a write is considered successful (see the snippet below).
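A sketch of the common acks values; which one is appropriate depends on the durability the application needs.

```java
import java.util.Properties;

public class AcksConfig {
    // Setting to merge into the producer Properties from the earlier sketch.
    static Properties acks() {
        Properties p = new Properties();
        // "0": don't wait for the leader at all; "1": the leader has written the message;
        // "all": the leader waits for the full ISR before acknowledging.
        p.put("acks", "all");
        return p;
    }
}
```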

Consumer:

* Read messages, write the log (record the position), then process the messages. If processing fails after the position has been recorded, the failed message cannot be processed again: this is "at most once".

* Read messages, process them, then write the log. If processing succeeds but recording the position fails, the message will be processed twice: this is "at least once".

* Read the message, then process it and write the result and the position together, so that both are updated or both fail atomically: this is "exactly once".

Kafka guarantees at-least-once delivery by default and allows users to implement at-most-once semantics. Exactly-once delivery depends on the cooperation of the destination storage system: since Kafka exposes the read offset, storing it alongside the processed result is straightforward (see the sketch below).
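The ordering above maps directly onto where a consumer commits its offsets. Here is a sketch of an at-least-once loop: process first, commit after. The broker address, group id, and topic name are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "demo-group");              // placeholder
        props.put("enable.auto.commit", "false");         // commit manually, after processing
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    process(r); // if this throws, the offset is never committed: at least once
                }
                consumer.commitSync(); // committing BEFORE processing would give at most once
            }
        }
    }

    static void process(ConsumerRecord<String, String> r) {
        System.out.printf("%s@%d: %s%n", r.topic(), r.offset(), r.value());
    }
}
```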

Replication

1) A partition's number of replicas (the replication factor) includes the partition's leader itself.

2) All reads and writes to a partition go through its leader.

3) Followers fetch the log (messages and offsets) from the leader by pulling.

4) If a follower dies, gets stuck, or falls too far behind, the leader removes it from the "in-sync replicas" (ISR) list.

5) A message is considered "committed" once every follower in the ISR has written it to its own log.

6) If all replicas of a partition die, Kafka elects the first node to come back up as the new leader (that node is not necessarily from the ISR); see the topic-creation sketch after this list.
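The replication factor is fixed when a topic is created. A sketch using the AdminClient; the topic name and counts are placeholders.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, replication factor 3: each partition gets a leader plus 2 followers
            NewTopic topic = new NewTopic("events", 3, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```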

Log Compaction

1) For a topic's partitions, compaction ensures that Kafka retains at least the last value for each key.

2) Compaction does not reorder messages.

3) A message's offset never changes.

4) Message offsets remain sequential (see the compacted-topic sketch below).
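Compaction is enabled per topic through the cleanup.policy config. A sketch; the topic name is a placeholder chosen to suggest a keyed, last-value-wins use case.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("user-profiles", 1, (short) 1)
                // keep at least the latest value per key instead of deleting purely by time/size
                .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```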

Distribution

Consumer Offset Tracking

1) The high-level consumer records the highest offset it has consumed in each partition and periodically commits it to the offset manager (a broker).

2) The simple consumer must manage offsets manually. At the time of writing, the simple consumer's Java API only supports committing offsets to ZooKeeper.

Consumers and Consumer Groups

1) Consumers register themselves in ZooKeeper.

2) Consumers in the same group (sharing a group id) divide the partitions evenly among themselves, and each partition is consumed by exactly one consumer in the group (see the sketch after this list).

3) A consumer rebalance is triggered when the state of a broker, or of another consumer in the same group, changes.
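Group membership is just a shared group.id plus a subscription: starting two copies of this sketch splits the topic's partitions between them, and stopping one triggers a rebalance. Names are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "analytics");               // same id => same group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // partitions are divided among group members
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r ->
                    System.out.printf("partition %d offset %d: %s%n",
                                      r.partition(), r.offset(), r.value()));
            }
        }
    }
}
```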

ZooKeeper coordination and control

1) Manages brokers and consumers joining and leaving dynamically.

2) Triggers load balancing: when a broker or consumer joins or leaves, the load-balancing algorithm redistributes the subscription load among the consumers in a group.

3) Maintains the consumption relationships and the consumption state of each partition.

That covers the basic principles of Kafka. I hope the content above has been helpful and that you have learned something from it. If you found the article worthwhile, please share it so more people can see it.
