The basic principle and function of kafka

2025-02-24 Update From: SLTechnology News&Howtos shulou


Shulou(Shulou.com)06/02 Report--

This article explains the basic principles and functions of Kafka in simple, clear terms that are easy to learn and understand.

What does it mean that Kafka is a distributed streaming platform?

We believe that a stream processing platform has three key capabilities:

Publish and subscribe to messages (streams), in this respect it is similar to a message queue or enterprise messaging system.

Store messages (streams) in a fault-tolerant manner.

Process message flows as they occur.

What are Kafka's advantages? It is mainly used in two broad classes of applications:

Build real-time streaming data pipelines that reliably capture data between systems and applications.

Build real-time streaming applications that transform or react to data streams.

To understand how Kafka does these things, let's explore Kafka's capabilities from the bottom up.

First of all, a few concepts:

Kafka runs as a cluster on one or more servers.

Kafka clusters store streams of messages in categories called topics.

Each message (also known as a record, I am used to calling it a message) consists of a key, a value, and a timestamp.
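The message structure just described can be sketched as a plain data type (a hypothetical in-memory stand-in for illustration, not Kafka's actual wire format):

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Record:
    """In-memory stand-in for a Kafka message: a key, a value, and a timestamp."""
    key: Optional[str]          # may be None; Kafka keys are optional
    value: str
    timestamp: float = field(default_factory=time.time)

r = Record(key="user-42", value="clicked checkout")
```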

Kafka has four core APIs:

An application uses the Producer API to publish messages to one or more topics.

Applications use the Consumer API to subscribe to one or more topics and process the resulting messages.

An application uses the Streams API to act as a stream processor, consuming input streams from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams.

The Connector API lets you build and run reusable producers or consumers that connect topics to existing applications or data systems. For example, a connector to a relational database can capture every change to a table.

Communication between clients and servers uses a simple, high-performance, language-independent TCP protocol, and the protocol stays backward compatible with older versions. Kafka ships a Java client, and clients exist for many other programming languages as well.

Let's start with the basic term Kafka uses: Topic.

Kafka categorizes message feeds; each category of messages is called a topic.

Producer

The object that publishes messages to a topic is called the producer.

Consumer

Objects that subscribe to topics and process the feed of published messages are called consumers.

Broker

Published messages are stored in a set of servers called Kafka clusters. Each server in the cluster is a broker. Consumers can subscribe to one or more topics and consume these published messages by pulling data from brokers.

Topics and logs (Topic and Log)

Let's take a closer look at the Topic in Kafka.

A topic is the category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log, as shown in the example below:

Each partition is an ordered, immutable sequence of messages that is continually appended to. Each message in a partition is assigned a sequential id called the offset, which is unique within that partition.

The Kafka cluster retains all messages until they expire, whether or not they have been consumed. In fact, the only metadata the consumer holds is its offset — its position in the log. The offset is controlled by the consumer: normally it advances linearly as the consumer reads messages, but because the consumer owns it, the consumer can also reset it to an older offset and reread earlier messages. This design makes consumers cheap to operate: one consumer's actions do not affect how other consumers process the same log.

Now a word about partitions. The partition design in Kafka serves several purposes. First, it lets a topic hold more messages than a single server could: a topic with multiple partitions can grow its data without being limited by one machine. Second, partitions act as the unit of parallelism, as we will see later.
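The offset mechanics described above can be illustrated with a toy in-memory model (these class and method names are invented for illustration; this is not the real client API):

```python
class PartitionLog:
    """Toy model of one Kafka partition: an append-only, immutable sequence
    of messages, each assigned the next sequential offset."""
    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1   # the offset just assigned

    def read(self, offset):
        return self._messages[offset]

class Consumer:
    """The consumer owns its offset; resetting it replays old messages
    without affecting any other consumer of the same log."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        msg = self.log.read(self.offset)
        self.offset += 1                 # normally advances linearly
        return msg

    def seek(self, offset):
        self.offset = offset             # rewind (or skip ahead)

log = PartitionLog()
for m in ["m0", "m1", "m2"]:
    log.append(m)

c = Consumer(log)
first = c.poll()       # "m0"
second = c.poll()      # "m1"
c.seek(0)              # reset to an older offset...
replayed = c.poll()    # ...and reread "m0"
```

Note that `seek` changes only this consumer's state; the log itself is never mutated by reads, which is why many consumers can share it independently.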

Distribution

The partitions of the log are distributed across the servers in the cluster, with each server handling its assigned partitions. Depending on configuration, each partition can also be replicated to other servers for fault tolerance. Each partition has one leader and zero or more followers. The leader handles all read and write requests for the partition, while the followers passively replicate its data. If the leader fails, one of the followers is elected as the new leader. A server may be the leader for one partition and a follower for another, which balances load and prevents all requests from being handled by only one or a few servers.

Geo-Replication

Kafka MirrorMaker provides geo-replication support for clusters. With MirrorMaker, messages can be replicated across multiple data centers or cloud regions. You can use it for backup and recovery in active/passive scenarios, to place data closer to your users in active/active scenarios, or to support data locality requirements.

Producers

Producers publish messages to a topic. The producer is also responsible for choosing which partition of the topic each message goes to. The simplest approach is round-robin over the list of partitions; a partition can also be chosen by some algorithm, for example weighted or keyed selection. The developer chooses the partitioning algorithm.
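The two strategies just mentioned can be sketched in a few lines. CRC32 is used here only as an illustrative stand-in for Kafka's actual default hash (murmur2); the function names are invented:

```python
import zlib
from itertools import cycle

NUM_PARTITIONS = 4

def hash_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Key-hash strategy: the same key always lands in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

_rr = cycle(range(NUM_PARTITIONS))

def round_robin_partition() -> int:
    """Round-robin strategy for keyless messages: cycle through partitions."""
    return next(_rr)

same = hash_partition("user-42") == hash_partition("user-42")  # True
spread = [round_robin_partition() for _ in range(5)]           # [0, 1, 2, 3, 0]
```

Hashing by key has a useful side effect: all messages with the same key land in the same partition, and therefore (as discussed below) are processed in order.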

Consumers

In general, messaging models fall into two types: queues and publish-subscribe. In the queue model, a group of consumers reads messages from the server, and each message is processed by only one of them. In the publish-subscribe model, a message is broadcast to all consumers, and every consumer that receives it can process it.

Kafka offers a single consumer abstraction that generalizes both: the consumer group. Consumers identify themselves with a consumer group name, and each message published to a topic is delivered to one consumer within each subscribing group. If all consumers are in the same group, this behaves as a queue model; if every consumer is in its own group, it becomes a full publish-subscribe model.

More commonly, we create consumer groups as logical subscribers, each containing any number of consumers; multiple consumers within a group provide scalability and fault tolerance. As shown in the figure below:

A two-server Kafka cluster hosts four partitions (P0-P3) and two consumer groups: consumer group A has two consumer instances and consumer group B has four.
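The delivery semantics in the figure can be simulated with a small in-memory model (all names here are hypothetical, not the Kafka API): within each group a message goes to exactly one member (queue semantics), while every group receives every message (publish-subscribe semantics):

```python
class Topic:
    """Toy delivery model for consumer groups."""
    def __init__(self):
        self.groups = {}          # group name -> list of consumer inboxes
        self.delivered = 0

    def subscribe(self, group, inbox):
        self.groups.setdefault(group, []).append(inbox)

    def publish(self, message):
        # One consumer per group gets the message (queue inside a group);
        # every group gets a copy (publish-subscribe across groups).
        for members in self.groups.values():
            members[self.delivered % len(members)].append(message)
        self.delivered += 1

topic = Topic()
a1, a2, b1 = [], [], []
topic.subscribe("group-a", a1)    # group-a: two consumers share the work
topic.subscribe("group-a", a2)
topic.subscribe("group-b", b1)    # group-b: one consumer sees everything
for m in ["m0", "m1"]:
    topic.publish(m)
```

After publishing, `a1` and `a2` hold one message each, while `b1` holds both — the queue model and the publish-subscribe model at the same time.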

Like a traditional messaging system, Kafka guarantees message ordering — but with a twist. A traditional queue stores messages in order, yet even though the server hands them out in that order, messages are delivered to consumers asynchronously, so the order in which consumers receive them is not guaranteed. Parallel consumption therefore breaks ordering. Anyone who has used a traditional messaging system knows that ordering is a headache: if only one consumer processes the messages, you give up the point of parallel processing.

Kafka does better here, though it does not solve the problem entirely. It takes a divide-and-conquer approach: partitioning. Because each partition of a topic is consumed by exactly one consumer within a consumer group, the messages in that partition are processed in order. However, this only guarantees ordering within a single partition of a topic; ordering across partitions is not guaranteed. So if you need all messages of a topic processed strictly in order, give the topic only one partition.

Kafka Guarantees

If a producer sends messages to a particular topic partition, the messages are appended in the order they are sent; that is, if messages M1 and M2 are sent by the same producer and M1 is sent first, then M1 has a lower offset than M2 and appears earlier in the log.

Messages received by consumers are also in this order.

If a topic is configured with a replication factor of N, up to N-1 server failures can be tolerated without losing any committed messages.

For more details on these guarantees, see the Design section of the documentation.

Kafka as a messaging system

How does Kafka's concept of flow compare to traditional enterprise messaging systems?

Traditional messaging has two modes: queues and publish-subscribe. In queue mode, a pool of consumers reads from the server and each message goes to only one of them; in publish-subscribe mode, messages are broadcast to all consumers. Each model has trade-offs. The strength of queues is that they let multiple consumers divide the work, which scales processing; the weakness is that a queue is not multi-subscriber — once one process reads a message, it is gone. Publish-subscribe lets you broadcast data to multiple subscribers, but because every subscriber receives every message, there is no way to scale out the processing.

In Kafka, the consumer group generalizes both concepts. Queue: a consumer group lets members of the same group divide the processing among themselves. Publish-subscribe: messages can be broadcast to multiple consumer groups (with different names).

Every topic in Kafka supports both of these modes.

Kafka has stronger order guarantees than traditional messaging systems.

Traditional messaging systems store messages in order, and when multiple consumers consume from a queue, the server sends messages in the stored order. But although the server sends them sequentially, delivery to consumers is asynchronous, so messages may arrive at consumers out of order. Consuming in parallel thus loses ordering. Messaging systems often work around this by allowing only one consumer, at the cost of giving up parallel processing.

Kafka does better. Ordering guarantees and load balancing are provided by the partition — Kafka's unit of parallelism within a topic. Each partition is consumed by exactly one consumer in a given consumer group, and that consumer reads the partition's data sequentially. Since a topic has multiple partitions, load is balanced across multiple consumers. Note, however, that a consumer group cannot usefully have more consumers than there are partitions: the extra consumers will sit idle and receive no messages.
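The constraint above — at most one consumer per partition within a group — falls out of how partitions are assigned. Here is a toy round-robin assignment function (a simplified, hypothetical stand-in for Kafka's real assignment strategies, such as range or sticky assignment):

```python
def assign(partitions, consumers):
    """Spread partitions over the group's consumers round-robin.
    With more consumers than partitions, the extras get nothing."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

four = ["P0", "P1", "P2", "P3"]
balanced = assign(four, ["c1", "c2"])                       # 2 partitions each
oversubscribed = assign(four, ["c1", "c2", "c3", "c4", "c5"])  # c5 sits idle
```

In the second call there are five consumers for four partitions, so one consumer ends up with an empty assignment — exactly the "waiting empty" situation the text warns about.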

Kafka as a storage system

Any system that separates publishing messages from consuming them is effectively acting as a storage system for the in-flight messages. What sets Kafka apart is that it is a very good, high-performance storage system.

Data written to Kafka is written to disk and replicated within the cluster for fault tolerance. Kafka lets the producer wait for an acknowledgment, so a write is not considered complete until the message is fully written.

Kafka's disk structures scale well: whether you have 50 KB or 50 TB of persistent data on the server, Kafka performs the same.

The client controls where it reads from. You can think of Kafka as a special-purpose distributed file system dedicated to high-performance, low-latency commit log storage, replication, and propagation.

Kafka's stream processing

Reading, writing, and storing streams is not enough; Kafka's goal is real-time stream processing.

In Kafka, a stream processor continually takes data from input topics, processes it, and writes the results to output topics. For example, a retail application might take input streams of sales and shipments and output streams of computed reorder quantities and price adjustments.

Simple processing can be done directly with the producer and consumer APIs. For more complex transformations, Kafka provides the more powerful Streams API, with which you can build applications that compute aggregations over streams or join streams together.

The Streams API helps solve the hard problems such applications face: handling out-of-order data, reprocessing input when code changes, performing stateful computations, and so on.

The Streams API builds on Kafka's core primitives: it uses the producer and consumer APIs for input and output, uses Kafka itself for stateful storage, and uses the same group mechanism for fault tolerance among stream processor instances.
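As a toy illustration of the read-process-write loop (plain Python, not the actual Streams API): a running count per key, the kind of simple stateful aggregation a stream processor might maintain between an input topic and an output topic. The topic names and data are invented:

```python
# Input topic: (key, quantity) sale events; output topic: running totals.
input_topic = [("apple", 3), ("pear", 2), ("apple", 1)]
output_topic = []

counts = {}                                  # the processor's local state
for key, qty in input_topic:                 # consume from the input topic
    counts[key] = counts.get(key, 0) + qty   # stateful aggregation per key
    output_topic.append((key, counts[key]))  # produce to the output topic
```

Each input record yields an updated total on the output side; a real Streams application would additionally keep this state fault-tolerant by backing it with a Kafka topic.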

Putting it together

The combination of messaging, storage, and stream processing may seem unusual, but it is critical to Kafka's role as a streaming platform.

Distributed file systems like HDFS store static files for batch processing; such systems can efficiently store and process historical data from the past.

Traditional enterprise messaging systems, by contrast, handle messages that arrive after you subscribe: future data is processed as it arrives.

Kafka combines these two capabilities, and this combination is critical for Kafka as a platform for stream processing applications and stream data pipelines.

This matters for message-driven applications that mix batch and streaming: by combining storage with low-latency subscription, a streaming application can treat past and future data in the same way. A single application can process historical, stored data, and when it reaches the last message, it keeps running and waits for future data to arrive instead of ending.

Similarly, for streaming data pipelines, subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; at the same time, its ability to store data reliably makes it suitable for critical data whose delivery must be guaranteed, and for integration with offline systems that load data only periodically or go down for extended periods of maintenance. The stream processing facilities make it possible to transform data as it arrives.

Thank you for reading. That concludes "the basic principle and function of kafka." Hopefully this article has deepened your understanding of how Kafka works; specific usage still needs to be verified in practice.
