2025-02-27 Update From: SLTechnology News&Howtos
0. Introduction
Kafka is an open-source message broker that originated at LinkedIn and is implemented in Scala. The product is reported to handle millions of QPS (presumably with multiple disks mounted); a quick test program I wrote reached about 100,000.
1. Basic concepts of Kafka
In Kafka, messages are classified by Topic: every message published to the Kafka cluster is assigned a category called its Topic. Physically, messages with different Topics are stored separately. Logically, the messages of one Topic may be stored on one or more brokers, but users only need to specify the Topic to produce or consume data, regardless of where the data is actually stored.
Each Topic contains one or more Partitions. A Partition is a unit of physical storage, and the number of Partitions can be specified when the Topic is created. Each Partition corresponds to a storage folder that holds that Partition's message data and index files. The main purpose of partitioning a Topic is performance: Kafka tries to distribute all Partitions evenly across the nodes in the cluster instead of concentrating them on a few nodes, and it also balances the leader/follower relationships as far as possible, so that each node acts as the Leader for a proportionate share of the Partitions. Each Partition is an ordered queue, and each message has an offset within its Partition.
A message publisher publishes messages to a specified Topic, and the Producer can also decide which Partition each message goes to (by a random, hashing, or round-robin strategy, among others).
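The partition-selection strategies above can be sketched in a few lines. This is a toy illustration, not Kafka's actual partitioner (the real Java client uses a murmur2 hash and a sticky strategy for keyless records):

```python
import hashlib
from itertools import count

_rr = count()  # shared round-robin counter for keyless messages

def choose_partition(key, num_partitions):
    """Toy partitioner: hash the key when there is one (so the same key
    always lands on the same partition), otherwise round-robin."""
    if key is not None:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    return next(_rr) % num_partitions
```

The hashing branch gives per-key ordering (all messages with one key go to one partition), while round-robin spreads keyless messages evenly.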
Message consumers actively pull messages from Kafka (pull mode). A message in a Partition can be consumed by any number of consumers, and each consumer is completely independent: Kafka does not delete a message after a Consumer reads it. Instead, messages are deleted on a schedule, and you can configure how long messages are retained. A message at any position can be consumed by specifying its offset, provided that offset still exists. From this point of view, Kafka behaves like a small file management system that is append-only, immutable, and supports random reads.
As mentioned above, each Consumer is completely independent, so there is no built-in way for multiple Consumers to take turns consuming the same Partition of the same Topic. To address this, Kafka introduced the concept of a Consumer group: when each Consumer client is created, it registers its own information with Zookeeper, and the Consumers in one group divide up all the Partitions of a Topic among themselves. In short, the group as a whole is guaranteed to consume all the Partitions of the Topic, and for performance the Partitions are distributed relatively evenly across the Consumers; different Consumer groups are completely independent of each other. The design is appealing, but unfortunately client support is patchy; only the Java client's support seems solid.
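The group behavior described above (every partition owned by exactly one consumer in the group, spread roughly evenly) can be sketched as a simple round-robin assignment. This is an illustration of the idea only; Kafka's real assignors (range, round-robin, sticky) are negotiated by the group coordinator:

```python
def assign_partitions(partitions, consumers):
    """Toy group assignment: deal a topic's partitions out to the
    consumers of one group like cards, so each partition has exactly
    one owner and the load is roughly even."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(p)
    return assignment
```

Note the key invariant: the union of all assignments covers every partition, and no partition appears twice, which is what lets a group consume a whole Topic without duplicating work.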
2. Message sequencing and reliability design
Messages published to Kafka are stored sequentially within a Partition. Publishers can spread messages across multiple Partitions by random, hashing, or round-robin strategies, and consumers read by specifying an offset, so message ordering in Kafka largely depends on how the user uses it.
Kafka supports replicated storage for disaster recovery. Each Partition has a primary (leader) partition and standby (follower) partitions. The leaders of a Topic's Partitions may fall on different physical machines; Kafka tries to spread them across machines to improve system performance. Messages are read and written directly through the leader, so clients connect directly to the physical machine hosting the leader for reads and writes. A standby partition acts like a "Consumer" that consumes the leader's messages and saves them to its local log as a backup. The leader tracks the status of all its standbys; if a standby lags too far behind or fails, the leader removes it from the synchronization list. Leader and standby partitions are managed through Zookeeper.
Publishing reliability depends on two things: the sender's acknowledgement mechanism and Kafka's flush-to-disk policy. The sender supports three acknowledgement modes: no acknowledgement; leader-only acknowledgement (the leader sends a receipt as soon as it has received the message); and leader-plus-standby acknowledgement (the leader sends a receipt only after the standby partitions have synchronized the message). Kafka's flush policy can be configured in two ways: by number of messages, or by time interval between flushes.
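The three acknowledgement modes can be captured in a toy decision function; this is a simplified model of the semantics described above, not the real wire protocol (in the real producer these correspond to the `acks` setting of 0, 1, and all):

```python
def send_acknowledged(acks, leader_ok, synced_replicas, total_replicas):
    """Toy model of when a producer treats a send as successful:
      acks=0     -> fire-and-forget: always considered acknowledged
      acks=1     -> acknowledged once the leader has the message
      acks="all" -> acknowledged only when the leader has it AND
                    every standby replica has synchronized it
    """
    if acks == 0:
        return True
    if acks == 1:
        return leader_ok
    return leader_ok and synced_replicas == total_replicas
```

The trade-off runs in one direction: each stronger mode waits longer before confirming, buying durability at the cost of latency.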
Consumption reliability depends on the consumer's reading logic, since Kafka saves no per-message consumption state. The three delivery modes, at-most-once, at-least-once, and exactly-once, must be implemented by the business side; at-least-once is the easiest. None of them can be achieved completely and absolutely in a distributed system, so the goal is only to get infinitely close and drive the error rate down.
3. Message storage mode
A Partition is stored on the file system as files. For example, creating a Topic called tipocTest with four Partitions produces four folders under Kafka's data directory, named in the pattern Topic-partnum.
(Figure: the contents of each Partition folder.)
Each Message in a Partition uses its offset as its position within that Partition. The offset is not the Message's actual storage location in the Partition's data file but a logical value that uniquely identifies a Message within the Partition; it can be thought of as the Message's id in the Partition. Each Message contains three attributes: Offset, DataSize, and Data. A Partition's data files hold a sequence of Messages in this format, arranged by offset from smallest to largest. When Kafka receives a new message, it appends it to the end of the file, so publishing is very efficient.
Now consider another question: what if a Partition had only one data file? Appending a new message to the end of the file is always O(1), no matter how large the file grows. But reads must scan sequentially to find the Message at a given offset, so lookups in a very large data file are slow. How does Kafka solve this lookup-efficiency problem? (1) Segmentation, and (2) indexing.
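The single-file scenario above can be sketched as a toy in-memory log, using a Python list in place of a real file (each record holding the Offset, DataSize, and Data attributes described earlier):

```python
class SingleFileLog:
    """Toy single-file partition: each record is (offset, size, data).
    Appending is O(1); reading by offset is a sequential scan, which
    is exactly why one huge unindexed file reads slowly."""
    def __init__(self):
        self.records = []
        self.next_offset = 0

    def append(self, data):
        self.records.append((self.next_offset, len(data), data))
        self.next_offset += 1

    def read(self, offset):
        for off, size, data in self.records:  # O(n) scan per lookup
            if off == offset:
                return data
        return None
```

The asymmetry is the point: `append` never gets slower, but every `read` pays for the whole history, motivating the segmentation and indexing that follow.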
4. Segmentation and indexing of data files
Kafka's first answer to the lookup-efficiency problem is to segment the data file. The maximum size of each data file is configurable; each segment lives in its own data file, named after the smallest offset in the segment. When looking for the Message at a given offset, a binary search over these base offsets locates the segment that contains it.
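Because segment files are named by their smallest offset, the segment lookup reduces to a binary search over a sorted list of base offsets, which the standard `bisect` module expresses directly (a sketch of the idea, with illustrative offset values):

```python
import bisect

def locate_segment(base_offsets, target):
    """Find which segment holds `target`: the segment whose base
    offset is the largest one <= target. `base_offsets` is the sorted
    list of smallest-offsets that the segment files are named after."""
    i = bisect.bisect_right(base_offsets, target) - 1
    return base_offsets[i] if i >= 0 else None
```

For example, with segments based at 0, 1000, and 2000, offset 1500 must live in the segment named after 1000.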
Segmentation means the Message for a given offset can be found in a smaller data file, but finding it still requires a sequential scan within that segment. To further improve lookup efficiency, Kafka builds an index file for each segment's data file, with the same name as the data file but the extension .index. The index file contains index entries, each recording the mapping between a Message's Offset and its position (the absolute byte position of the Message in the data file). Rather than indexing every Message, the index file uses sparse storage, creating an entry every fixed number of bytes of data. This keeps the index file small enough to be held in memory. The drawback is that a Message without its own index entry cannot be located in one step; a short sequential scan is still needed, but only over a very small range.
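The two-step lookup described above, binary search in the sparse index and then a short forward scan, can be sketched as follows. For simplicity the "file" is a Python list and a position is a list index rather than a byte offset (an assumption of this toy model):

```python
import bisect

def find_message(sparse_index, log, target_offset):
    """Sparse-index lookup: `sparse_index` is a sorted list of
    (offset, position) pairs covering only SOME messages; `log` is a
    list of (offset, data) records ordered by offset. Binary-search
    the index for the closest entry at or below target_offset, then
    scan forward from that position."""
    offsets = [off for off, _ in sparse_index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    pos = sparse_index[i][1] if i >= 0 else 0
    for off, data in log[pos:]:  # short scan, bounded by index gap
        if off == target_offset:
            return data
        if off > target_offset:  # offsets are ordered: give up early
            break
    return None
```

Indexing only every few records is what keeps the index small enough to stay in memory, while the ordered offsets bound the residual scan to the gap between two index entries.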
Each segment also has a .timeindex index file in the same format as the .index file; it records a sparse mapping from message publication time to offset and is used for scheduled message deletion.
The following figure is an example of a segmented index.
This mechanism relies on offsets being ordered; index files are mapped into memory, so lookups are fast. In a word, Kafka's message storage uses Partitions, segments, and sparse indexes to achieve efficient publishing and fast random reads.
5. Consumer side design
For performance and disaster recovery, the practical requirement is for multiple Consumers to consume one Topic. Since Consumers are mutually independent, they can compete for Partitions and then consume them, with at most one Consumer consuming a given Partition at any time, and with the Consumers periodically synchronizing offset state. If multi-way consumption is needed, separate groups can each compete for the resources of different Partitions and consume accordingly.
Because Kafka reads by offset, clients are generally wrapped like this: given a starting offset, subsequent gets read sequentially. There is no concept of consumption acknowledgement, and Kafka does not maintain per-message, per-Consumer state. In fact, implementing an acknowledgement mechanism is not difficult: add a proxy layer that keeps a circular buffer, removing a message from the buffer only after the business side confirms consumption; if a message goes unconfirmed for some time, it is sent again, similar to a TCP sliding window.
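The proxy-layer acknowledgement idea sketched above might look like this. It is purely illustrative, not part of Kafka, and uses a plain dict as the pending buffer with explicit timestamps instead of a real clock:

```python
class AckWindow:
    """Toy consumption-confirmation proxy: delivered messages sit in
    a pending buffer until the business side acks them; anything
    unacked for `timeout` seconds becomes eligible for redelivery,
    much like unacknowledged segments in a TCP sliding window."""
    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.pending = {}  # offset -> time of last delivery

    def deliver(self, offset, now):
        self.pending[offset] = now

    def ack(self, offset):
        # Confirmed by the consumer: drop it from the window.
        self.pending.pop(offset, None)

    def to_redeliver(self, now):
        # Offsets delivered long enough ago without an ack.
        return sorted(off for off, t in self.pending.items()
                      if now - t >= self.timeout)
```

Redelivery on timeout gives at-least-once semantics; the business side must then tolerate (or deduplicate) repeated messages.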