
What is the basic framework of Kafka


This article mainly introduces the basic framework of Kafka. Many people have doubts about how Kafka is put together, so the editor has consulted various materials and organized them into a simple, easy-to-follow explanation. I hope it helps you answer the question "what is the basic framework of Kafka?" Now, please follow along and study!

1. The background of Kafka's birth

Kafka, originally developed by LinkedIn, is a distributed, partitioned, multi-replica messaging system that relies on ZooKeeper for coordination. Its biggest feature is that it can process large amounts of data in real time to meet a wide variety of scenarios: Hadoop-based batch processing systems, low-latency real-time systems, Storm/Spark streaming engines, web/nginx logs, access logs, message services, and so on. It is written in Scala; LinkedIn open-sourced it in 2011 and contributed it to the Apache Foundation, where it became a top-level open source project in 2012.

In today's society, applications for business, social networking, search, browsing, and so on continuously produce all kinds of information, like information factories. In the era of big data, we face the following challenges:

How to collect this huge volume of information

How to analyze it

How to do both of the above in real time

These challenges form a business demand model: producers produce (produce) all kinds of information, consumers consume (consume) it (processing and analysis), and between producers and consumers we need a bridge connecting the two: a message system. At a micro level, this requirement can also be understood as how messages are transmitted between different systems.

Kafka, a distributed messaging system, emerged as the times required:

Kafka was open-sourced by LinkedIn

Kafka is a framework that solves these problems, seamlessly connecting producers and consumers

Kafka is a high-throughput distributed messaging system

2. Why use a message system

Decoupling:

Allows you to extend or modify the processing on either side independently, as long as both sides follow the same interface constraints.

Redundancy:

Message queues persist data until it has been fully processed, avoiding the risk of data loss. In the insert-get-delete paradigm used by many message queues, a message is removed from the queue only after your processing system explicitly indicates that it has been processed, ensuring your data is kept safely until you are finished with it.

Scalability:

Because message queues decouple your processing, it is easy to scale the rate at which messages are enqueued and processed: just add more processors.

Flexibility & peak processing power:

Applications must keep working in the face of a surge in traffic, but such bursts are uncommon. It would be a huge waste to provision resources permanently just to handle peak access. Message queues let key components withstand sudden access pressure instead of collapsing completely under overloaded requests.

Recoverability:

When some components of the system fail, the whole system is not affected. Message queues reduce coupling between processes, so even if a process that handles messages dies, messages added to the queue can still be processed after the system recovers.

Sequence guarantee:

In most usage scenarios, the order of data processing matters. Most message queues are ordered by nature and guarantee that data is processed in a specific order. (Kafka guarantees the ordering of messages within a partition.)

Buffer:

A buffer helps control and optimize the speed of data flowing through the system, smoothing out mismatches between the speed of producing messages and the speed of consuming them.

Asynchronous communication:

In many cases, users do not want or need to process messages immediately. Message queues provide an asynchronous processing mechanism: put a message on the queue without processing it right away, put in as many messages as needed, and process them later when required.

3. Kafka basic architecture

3.1. Topology

3.2. Terminology

Producer: a message producer; a terminal or service that publishes messages to the kafka cluster.

Broker: a server in the kafka cluster.

Topic: the category to which each message published to the kafka cluster belongs; kafka is topic-oriented.

Partition: a physical concept; each topic contains one or more partitions. The partition is the unit of allocation in kafka.

Consumer: a terminal or service that consumes messages from the kafka cluster.

Consumer group: in the high-level consumer API, each consumer belongs to a consumer group. Each message can be consumed by only one consumer within a group, but by multiple consumer groups.

Replica: a copy of a partition, ensuring the partition's high availability.

Leader: a role among a partition's replicas; producers and consumers interact only with the leader.

Follower: a role among a partition's replicas; it replicates data from the leader.

Controller: one of the servers in the kafka cluster, responsible for leader election and various kinds of failover.

Zookeeper: kafka stores the cluster's metadata in zookeeper.
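As a concrete illustration of these terms, here is a minimal, hedged sketch using the official Kafka Java client; the broker address localhost:9092 and the topic name "demo-topic" are placeholder assumptions, not details from this article.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GlossaryDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // "bootstrap.servers" points at one or more brokers in the cluster.
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // The producer publishes a message to a topic; the cluster assigns it
        // to one of the topic's partitions, whose leader handles the write.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "hello kafka"));
        }
    }
}
```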

4. Basic characteristics of Kafka

High throughput, low latency: kafka can process hundreds of thousands of messages per second, with latency as low as a few milliseconds

Scalability: kafka clusters support hot expansion

Persistence, reliability: messages are persisted to the local disk, and data backup is supported to prevent data loss

Fault tolerance: nodes in the cluster are allowed to fail (with n replicas, up to n-1 nodes may fail)

High concurrency: supports thousands of clients reading and writing at the same time

4.1. Design ideas

Consumer group: consumers may form groups, and each message can be consumed by only one consumer within a group. If a message needs to be consumed by multiple consumers, those consumers must belong to different groups.

Message state: in Kafka, the state of a message is kept on the consumer side. The broker does not care which message has been consumed or by whom; it records only an offset value (pointing to the next message to be consumed in the partition), which means that if a consumer mismanages its offset, a message on the broker may be consumed more than once.

Message persistence: Kafka persists messages to the local file system, and does so with high efficiency.

Message validity: Kafka retains messages for a long time so that consumers can consume them multiple times; many of the details are configurable.

Batch sending: Kafka supports sending messages in batches (as message sets) to improve push efficiency.

Push and pull: Kafka uses both. The producer pushes messages to the broker, and the consumer pulls messages from the broker; both the production and the consumption of messages are asynchronous.

Relationship between brokers: there is no master-slave relationship in a Kafka cluster. All brokers have equal status, and broker nodes can be added or removed at will.

Load balancing: Kafka provides a metadata API to manage load across brokers (in Kafka 0.8.x; in 0.7.x, load balancing relied mainly on zookeeper).

Synchronous and asynchronous: the producer pushes asynchronously by default, which greatly improves Kafka's throughput (either mode can be selected via a parameter; see the sketch after this list).

Partition mechanism: the Kafka broker supports message partitioning. The producer can decide which partition a message is sent to, and within a partition messages keep the order in which the producer sent them. A topic can have multiple partitions, and the number of partitions is configurable. Partitioning is very important, as the following sections will gradually show.

Offline data loading: Kafka's support for scalable data persistence also makes it well suited to loading data into Hadoop or a data warehouse.

Plug-in support: the active community has developed many plug-ins that extend Kafka's functionality, such as plug-ins for Storm, Hadoop, and Flume.
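To make the synchronous/asynchronous point concrete, here is a hedged sketch using the Java producer client. The class, topic name, and keys are invented for illustration; only the send/Future/callback API is the client's own.

```java
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SendModes {
    static void demo(KafkaProducer<String, String> producer)
            throws ExecutionException, InterruptedException {
        // Asynchronous: send() returns immediately; the broker's answer is
        // delivered to the callback. This is what keeps throughput high.
        producer.send(new ProducerRecord<>("demo-topic", "k", "async"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                    else System.out.println("async offset=" + metadata.offset());
                });

        // Synchronous: block on the returned Future until the broker acknowledges.
        RecordMetadata meta =
                producer.send(new ProducerRecord<>("demo-topic", "k", "sync")).get();
        System.out.println("sync offset=" + meta.offset());
    }
}
```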

4.2. Application scenarios

Log collection: a company can use Kafka to collect logs from various services and expose them through kafka as a unified interface to various consumers, such as hadoop, HBase, Solr, etc.

Message system: decoupling producers from consumers, caching messages, and so on.

User activity tracking: Kafka is often used to record the various activities of web or app users, such as browsing, searching, and clicking. These activities are published by each server to kafka topics, which subscribers then subscribe to for real-time monitoring and analysis, or load into hadoop or a data warehouse for offline analysis and mining.

Operational metrics: Kafka is also often used to record operational monitoring data, including collecting data from various distributed applications and producing centralized feeds of operational data such as alarms and reports.

Stream processing: for example, spark streaming and storm

5. Push mode vs pull mode

5.1. Point-to-point mode

As shown in the figure above, the point-to-point pattern is usually based on a pull or polling messaging model; its defining characteristic is that each message sent to the queue is processed by one and only one consumer. After the producer puts a message into the queue, the consumer actively pulls it for consumption. The advantage of the point-to-point model is that consumers can control the rate at which they pull messages themselves. However, the consumer side cannot tell whether the queue has messages waiting to be consumed, so it needs an extra thread to monitor the queue.

5.2. Publish-subscribe model

As shown in the figure above, the publish-subscribe model is a push-based messaging model, and in this model there can be many different subscribers. After the producer puts a message into the message queue, the queue pushes the message to the consumers that have subscribed to that kind of message (similar to a WeChat official account). Because consumers passively receive pushes, they do not need to check whether the queue has messages waiting to be consumed! However, consumer1, consumer2, and consumer3 may run on machines with different performance, so their capacity to handle messages also differs, while the queue cannot perceive how fast its consumers consume! The push rate therefore becomes a problem in the publish-subscribe model. Suppose the three consumers process messages at 8M/s, 5M/s, and 2M/s respectively: if the queue pushes at 5M/s, consumer3 cannot keep up; if it pushes at 2M/s, consumer1 and consumer2 waste a great deal of capacity!

5.3. Kafka's choice

As a messaging system, Kafka follows the traditional design: producers push messages to the broker, and consumers pull messages from the broker. Some logging-centric systems, such as Facebook's Scribe and Cloudera's Flume, use push mode instead. In fact, push mode and pull mode each have advantages and disadvantages.

Push mode has trouble adapting to consumers with different consumption rates, because the message delivery rate is determined by the broker. Push aims to deliver messages as fast as possible, which can easily leave the consumer unable to keep up; the typical symptoms are denial of service and network congestion. Pull mode, on the other hand, lets the consumer consume messages at a rate suited to its own capacity.

For Kafka, pull mode is the better fit. Pull simplifies the broker's design, and the consumer can independently control the rate at which it consumes messages. The consumer also controls how it consumes: in batches or one message at a time, and with a choice of commit methods that yield different delivery semantics. A minimal pull-loop sketch follows.
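Here is a minimal pull-loop sketch with the Java consumer client. The group id "demo-group" and topic "demo-topic" are placeholders; disabling auto-commit and committing only after processing is one common way to obtain at-least-once semantics.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PullLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "demo-group");              // placeholder
        props.put("enable.auto.commit", "false");         // commit manually
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));
            while (true) {
                // The consumer pulls at its own pace; poll() returns a batch.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
                // Committing after processing yields at-least-once delivery.
                consumer.commitSync();
            }
        }
    }
}
```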

6. Kafka workflow

6.1. Sending data

The architecture diagram above shows that the producer is the entry point for data. Note the red arrow in the figure: when writing data, the producer always finds the leader and never writes directly to a follower! So how is the leader found, and what does the write process look like? Let's look at the following picture:

First, obtain the partition's leader from the cluster

The producer sends the message to the leader

The leader writes the message to a local file

Followers pull the message from the leader

Followers write the message locally and send an ACK to the leader

After receiving the ACKs from all replicas, the leader sends an ACK to the producer

6.1.1. Ensuring messages are ordered

One thing to note is that after a message is written to the leader, the followers actively go to the leader to synchronize it! The producer publishes data to the broker in push mode, and each message is appended to the partition and written to disk sequentially, which guarantees that data within the same partition is ordered! The write process is illustrated below:

6.1.2. Distributing messages across partitions

As mentioned above, data is written to different partitions. Why does kafka partition at all? As you can probably guess, the main purposes of partitioning are:

Easy scaling: a topic can have multiple partitions, so we can easily cope with growing data volumes by adding machines.

Higher concurrency: with the partition as the unit of reading and writing, multiple consumers can consume data at the same time, which improves message-processing efficiency.

Friends who are familiar with load balancers know that when we send requests to a server, it may balance the load and distribute the traffic across different servers. Similarly, in kafka, if a topic has multiple partitions, how does the producer know which partition to send the data to? kafka follows several rules (see the sketch after this list):

If a partition is specified when writing, the data is written to that partition

If no partition is specified but a key is set, a partition is chosen by hashing the key

If neither a partition is specified nor a key is set, a partition is chosen in round-robin fashion
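A hedged sketch of the three rules as they surface in the Java client's ProducerRecord constructors (the topic name and key are placeholders):

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionRules {
    static void demo() {
        // 1. Explicit partition: the record goes exactly where you say.
        ProducerRecord<String, String> explicit =
                new ProducerRecord<>("demo-topic", 0, "user-42", "to partition 0");

        // 2. No partition, but a key: the partition is derived from hash(key),
        //    so records with the same key land in the same partition, in order.
        ProducerRecord<String, String> keyed =
                new ProducerRecord<>("demo-topic", "user-42", "keyed record");

        // 3. Neither partition nor key: the client picks partitions in turn
        //    (round-robin in older clients; newer ones batch per partition).
        ProducerRecord<String, String> bare =
                new ProducerRecord<>("demo-topic", "no key");
    }
}
```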

6.1.3. Ensuring messages are not lost

Ensuring that messages are not lost is a basic guarantee of message-queue middleware, so how does the producer guarantee this when writing messages to kafka? Through the ACK response mechanism shown in the write flow above! When writing data to the queue, the producer can set a parameter that determines how kafka acknowledges receipt of the data. This parameter can be set to 0, 1, or all (a configuration sketch follows the list below).

0 means the producer sends data to the cluster without waiting for any reply. This does not ensure the message was sent successfully: lowest safety, highest efficiency.

1 means the producer waits for the leader's acknowledgment after sending data to the cluster, and only then sends the next message. This ensures only that the leader received the message successfully.

all means the producer, after sending data to the cluster, waits until all followers have synchronized the message from the leader before sending the next message. This ensures the leader received the message and every replica has a backup: the safest option, but the least efficient.
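In the Java client this parameter is the acks producer setting; a minimal configuration sketch (the broker address is a placeholder):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class AckLevels {
    static Properties withAcks(String acks) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // "0"  : fire-and-forget; fastest, but messages may be lost
        // "1"  : wait for the leader only; lost if the leader dies unreplicated
        // "all": wait for all in-sync replicas; safest, slowest
        props.put(ProducerConfig.ACKS_CONFIG, acks);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }
}
```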

Finally, it is worth noting: if you write data to a topic that does not exist, can the write succeed? Yes: kafka automatically creates the topic, with the number of partitions and the number of replicas both set to 1 under the default configuration.

6.2. Saving data

After the producer writes data to kafka, the cluster needs to save it! kafka saves data on disk, and in our usual intuition writing to disk is time-consuming and ill-suited to a highly concurrent component. Instead, kafka pre-allocates a block of disk space and writes data to it sequentially (sequential writes are far more efficient than random writes).

6.2.1. Partition structure

As mentioned earlier, each topic can be divided into one or more partitions. If the topic feels abstract, the partition is quite concrete! On the server, a partition is a folder. Each partition folder contains multiple groups of segment files, and each group consists of a .index file, a .log file, and a .timeindex file (absent in older versions). The .log file is where messages are actually stored, while the .index and .timeindex files are index files used to retrieve messages.

As shown in the figure above, this partition has three groups of segment files. Each segment file has the same maximum size, but the number of messages stored differs (messages vary in size). Segment files are named after the minimum offset they contain; for example, the 000.index segment stores messages with offsets 0 through 368795. Kafka uses this combination of segmentation and indexing to solve the problem of lookup efficiency.

6.2.2. Message structure

The .log file mentioned above is where messages are actually stored, and what the producer writes into kafka is these messages one by one. So what does a message stored in the log look like? A message mainly consists of the message body, message size, offset, and compression type. The key things to know are the following three (a parsing sketch follows the list):

Offset: an ordered id number occupying 8 bytes, which uniquely determines the position of each message within the partition.

Message size: 4 bytes, describing the size of the message body.

Message body: the actual message data (possibly compressed); the space it occupies varies with the message.
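As a toy sketch only, here is how the simplified offset/size/body layout above could be parsed from a ByteBuffer. The real on-disk record format carries additional fields (CRC, magic byte, attributes, timestamps), so this mirrors the article's description rather than kafka's actual format.

```java
import java.nio.ByteBuffer;

public class LogEntrySketch {
    // Simplified layout: 8-byte offset, 4-byte size, then the message body.
    static void read(ByteBuffer log) {
        while (log.remaining() >= 12) {
            long offset = log.getLong();  // 8 bytes: position within the partition
            int size = log.getInt();      // 4 bytes: length of the message body
            byte[] body = new byte[size]; // the (possibly compressed) payload
            log.get(body);
            System.out.printf("offset=%d size=%d%n", offset, size);
        }
    }
}
```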

6.2.3. Storage strategy

Kafka keeps all messages, whether or not they have been consumed. So what are the deletion strategies for old data?

Time-based: the default configuration is 168 hours (7 days).

Size-based: the default configuration is 1073741824 bytes (1 GB).

It is important to note that the time complexity of locating a specific message in kafka is O(1), so deleting expired files does not improve kafka's read performance!
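For illustration, both retention limits can also be set per topic at creation time. A hedged sketch with the Java Admin client; the topic name, partition count, and replication factor are placeholders, and retention.bytes applies per partition:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("demo-logs", 3, (short) 2)
                    .configs(Map.of(
                            "retention.ms", "604800000",       // 7 days
                            "retention.bytes", "1073741824")); // 1 GB per partition
            admin.createTopics(List.of(topic)).all().get();   // blocks until done
        }
    }
}
```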

6.3. Consuming data

Once messages are stored in log files, consumers can consume them. When we discussed the two modes of message-queue communication, we covered the point-to-point mode and the publish-subscribe mode. Kafka adopts the publish-subscribe model, with consumers actively pulling messages from the kafka cluster. Like the producer, consumers pull messages from the leader.

Multiple consumers can form a consumer group, and each consumer group has a group id! Consumers in the same group can consume data from different partitions of the same topic, but two consumers in the same group cannot consume data from the same partition! Let's look at the following picture:

The figure shows a case where the group has fewer consumers than partitions, so some consumers consume more than one partition and will be slower than consumers handling a single partition each! What if the group has more consumers than partitions: will several consumers share one partition? As mentioned above, no! The extra consumers consume no partition's data at all. Therefore, in practice, it is recommended that the number of consumers in a group equal the number of partitions! (A sketch for inspecting the group's partition assignment follows.)
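To observe this assignment, a consumer can print the partitions the group coordinator handed it. A small sketch, assuming a consumer that has already subscribed with a group id as in the earlier pull loop:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignmentPeek {
    static void printAssignment(KafkaConsumer<String, String> consumer) {
        consumer.poll(Duration.ofMillis(100)); // join the group, get partitions
        for (TopicPartition tp : consumer.assignment()) {
            System.out.println("assigned: " + tp); // e.g. demo-topic-2
        }
    }
}
```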

In the section on saving data, we described how a partition is divided into groups of segments, how each segment contains .log, .index, and .timeindex files, and how each message carries an offset, a message size, and a message body. We have mentioned segments and offsets many times; when looking up a message, how are segment and offset actually used to find it? Suppose we need to find the message with offset 368801. Let's look at the following picture:

1. First, find the segment file that contains the message with offset 368801 (using binary search over the segments' base offsets). Here it is the second segment file.

2. Open that segment's .index file (the 368796.index file; the first message stored in this segment has offset 368796+1, and the message we want, with offset 368801, has a relative offset of 368801 - 368796 = 5 within the segment). Because the file is a sparse index that maps relative offsets to the physical positions of messages, an entry for relative offset 5 may not exist. So, again by binary search, we find the largest index entry whose relative offset is less than or equal to 5: here, the entry with relative offset 4.

3. From the index entry with relative offset 4, the physical position of the stored message is determined to be 256. Open the .log data file and scan sequentially from position 256 until the message with offset 368801 is found.
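Before the summary, here is a toy model of the lookup in Java. The data structures are invented stand-ins (segments keyed by base offset, each holding a sparse index from relative offset to physical position); only the floor-search logic mirrors kafka's mechanism.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class OffsetLookupSketch {
    // segments: base offset -> sparse index (relative offset -> file position).
    static long startPosition(NavigableMap<Long, TreeMap<Long, Long>> segments,
                              long target) {
        // Step 1: the segment whose base offset is the largest one <= target.
        Map.Entry<Long, TreeMap<Long, Long>> seg = segments.floorEntry(target);
        long relative = target - seg.getKey();   // e.g. 368801 - 368796 = 5

        // Step 2: largest sparse-index entry with relative offset <= 5 -> 4.
        Map.Entry<Long, Long> idx = seg.getValue().floorEntry(relative);

        // Step 3 (not shown): scan the .log file sequentially from this
        // position until the message with the target offset is reached.
        return idx.getValue();                   // e.g. physical position 256
    }
}
```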

This mechanism is built on ordered offsets, combining segments, ordered offsets, a sparse index, binary search, and sequential scanning to locate data efficiently. At this point, the consumer has the data it needs to process. So how does each consumer record its own consumption position? In early versions, the consumed offset was maintained in zookeeper and reported by the consumer at fixed intervals; this easily led to repeated consumption and performed poorly! In newer versions, consumer offsets are maintained directly in an internal topic of the kafka cluster, __consumer_offsets.

At this point, the study of "what is the basic framework of Kafka" is over. I hope it has resolved your doubts. Pairing theory with practice is a great way to learn, so go and try it! If you want to continue learning more related knowledge, please keep following the site; the editor will keep working hard to bring you more practical articles!
