2025-03-14 Update From: SLTechnology News&Howtos
1. The classic architecture of Kafka
Kafka is a distributed message queue originally developed at LinkedIn for log processing; it supports both offline and online log processing.
Kafka classifies messages by Topic when storing them.
The sender of a message is called the Producer; it is the publisher of messages.
The receiver of a message is called the Consumer; it is the subscriber of messages.
Each Kafka instance is called a Broker. The broker is the intermediate storage array, and each broker is also a node of the Kafka cluster.
2. Introduction to the architecture roles
(1) broker
A Kafka cluster consists of one or more servers, each of which is called a broker.
The broker is the node instance of the intermediate storage queue. We call the publisher of messages the Producer, the subscriber of messages the Consumer, and the intermediate storage array the broker.
(2) topic
Every message published to the Kafka cluster has a category, called a Topic. Physically, messages of different topics are stored separately; logically, a topic's messages may be stored on one or more brokers, but users only need to specify the topic to consume, so clients that produce or consume data do not need to care where the data is stored.
The object of publish/subscribe in Kafka is the topic. A topic is created for each data type; the client that publishes messages to a topic is called the producer, and the client that subscribes to messages from a topic is called the consumer. Producers and consumers can read and write multiple topics at the same time. A Kafka cluster consists of one or more broker servers, which are responsible for persisting and backing up the messages.
A topic is the subject of the data, the place where records are published, and can be used to separate business systems. Topics in Kafka are always multi-subscriber: a topic can have one or more consumers subscribing to its data.
(3) partition
A partition is a physical concept; each topic contains one or more partitions.
Topic partitioning strategies (for writing data):
- Round-robin: sequential distribution, used only when the message has no key.
- Hash partitioning: if the message has a key, the target is (hash of key % number of partitions). Messages already in partitions are not redistributed when partitions are added; new partitions take part in load balancing only as new data continues to be written.
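The two write-side strategies above can be sketched as follows. This is a simplified illustration, not the real producer code: the actual Kafka producer hashes key bytes with murmur2, while here Python's built-in `hash` stands in for it.

```python
import itertools

def make_partitioner(num_partitions):
    """Return a function mapping a message key to a partition index."""
    counter = itertools.count()  # round-robin state for keyless messages
    def partition_for(key):
        if key is None:
            # no key: round-robin (sequential distribution)
            return next(counter) % num_partitions
        # key present: hash partitioning, hash(key) % number of partitions
        return hash(key) % num_partitions
    return partition_for

p = make_partitioner(4)
assert [p(None) for _ in range(5)] == [0, 1, 2, 3, 0]  # keyless: round-robin
assert p("user-1") == p("user-1")                      # same key -> same partition
```

Note that hash partitioning also explains the caveat above: because the partition is a function of the partition count, adding partitions changes where new keyed messages land, but Kafka does not move the messages already written.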
Logical storage layout of a topic's partitions:
A topic is divided into one or more partitions, each of which is effectively a sub-queue. Physically, each partition corresponds to a directory (folder) named after the topic name and the partition's sequence number. A topic can have any number of partitions, set according to business needs and data volume. The default partition count for new topics is controlled by the num.partitions parameter in the Kafka configuration file; the count can also be specified explicitly when a topic is created, and changed afterwards with the tools Kafka provides. A partition stores the data itself along with an index of that data: writes to a partition are sequential, and each write is assigned a subscript (index) that grows as data is written. The partition is also the basic unit of cluster load balancing.
Summary:
- Making the number of partitions in a topic greater than or equal to the number of brokers increases throughput.
- Replicas of the same partition should be distributed to different machines as far as possible, for high availability.
- A common rule of thumb for the partition count: roughly 1 to 3 times the number of brokers.
(4) Producer
Responsible for actively publishing (pushing) messages to the Kafka broker.
Kafka's message retention strategy: each topic is divided into multiple partitions, and the position of each message within a partition is called its offset, a long integer. A message is not deleted immediately after being consumed; it is kept for a period of time and then cleared according to the broker's settings (retention by time or by size). For example, if the log is configured to be retained for two days, it is cleared after two days whether or not it has been consumed.
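The retention behavior described above (deletion by age, independent of consumption) can be sketched with a small simulation. This is an illustration of the policy only; on a real broker the window is set by configuration such as log.retention.hours, and deletion happens at segment granularity, not per message.

```python
# Minimal sketch of time-based log retention: messages older than the
# retention window are pruned regardless of whether they were consumed.
def prune_log(log, now_ms, retention_ms):
    """log: list of (timestamp_ms, message). Keep only messages inside the window."""
    return [(ts, msg) for ts, msg in log if now_ms - ts <= retention_ms]

two_days_ms = 2 * 24 * 3600 * 1000
log = [(0, "old"), (two_days_ms, "fresh")]
# one millisecond past the two-day window: "old" is cleared, "fresh" survives
assert prune_log(log, now_ms=two_days_ms + 1, retention_ms=two_days_ms) == [(two_days_ms, "fresh")]
```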
(5) Consumer
The message consumer, the client that reads (pulls) messages from the Kafka broker.
Message consumption strategy (round-robin assignment): if there are four partitions and three consumer threads, each thread consumes one partition and the fourth partition is assigned, by round-robin, to the first thread. If a fourth thread is added, the fourth partition is reassigned to the new thread. If a thread exits, leaving two threads, the third and fourth partitions are redistributed by round-robin to the first and second threads. (Kafka maintains this load balancing automatically.)
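The assignment walkthrough above can be sketched as a round-robin deal of partitions to consumer threads. This is a simplified model of the idea, not the broker's actual rebalance protocol.

```python
# Sketch of round-robin partition assignment: partitions are dealt out
# to consumer threads in turn.
def assign_round_robin(num_partitions, num_consumers):
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    return assignment

# 4 partitions, 3 threads: thread 0 gets partitions 0 and 3
assert assign_round_robin(4, 3) == {0: [0, 3], 1: [1], 2: [2]}
# a 4th thread joins: the extra partition moves to the new thread
assert assign_round_robin(4, 4) == {0: [0], 1: [1], 2: [2], 3: [3]}
# a thread exits, leaving 2: partitions 2 and 3 are redistributed
assert assign_round_robin(4, 2) == {0: [0, 2], 1: [1, 3]}
```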
Principle of consumption: within a consumer group, each message in a partition needs to be consumed only once. Each consumer group maintains an offset that records how far the group has consumed; each time a message is consumed, the offset is incremented by 1 (data before the offset has already been consumed).
(6) Consumer group
A consumer group contains multiple consumers and is pre-configured in the configuration file. Each consumer belongs to a group. Each message in a partition can be consumed by only one consumer in a group; other consumers in the same group cannot consume the same partition of the same topic. Consumers in different groups can consume data from the same partition of the same topic.
broadcast and unicast:
broadcast: every consumer is placed in its own group, so all consumers receive every message.
unicast: all consumers are placed in one group (only one member of the group consumes each message).
Summary of Kafka consumption:
- A partition can be consumed by only one member of a consumer group.
- A member can consume multiple partitions of a topic.
- Each partition in a topic is consumed by only one consumer within a given consumer group.
- A member can also consume partitions of another topic.
(7) segment
Under the same topic there are multiple partitions, and each partition is a directory. The partition naming rule is: topic name + sequence number; the first partition's sequence number starts at 0, and the largest is (number of partitions - 1). Physically, each partition consists of multiple segments; each segment stores many messages (default maximum size: 1 GB), and each message consists of a key-value pair and a timestamp.
The life cycle of a segment file is determined by server configuration parameters: by default, it is deleted after 168 hours (7 days).
A segment consists of two parts: an index file and a data file. The two files correspond to each other and appear in pairs; the suffixes ".index" and ".log" denote the segment index file and data file respectively.
Segment naming rule: the first segment of a partition starts at 0, and each subsequent segment file is named after the offset of the last message in the previous segment file. The number is a 64-bit long, formatted as 20 decimal digits and left-padded with zeros. (This holds for every partition.)
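The naming rule can be sketched in one line: format the base offset as a zero-padded 20-digit decimal number, matching the example filenames shown below.

```python
# Sketch of the segment naming rule: the filename is the base offset,
# printed as a 64-bit long and left-padded with zeros to 20 digits.
def segment_name(base_offset, suffix):
    return f"{base_offset:020d}.{suffix}"

assert segment_name(0, "log") == "00000000000000000000.log"
assert segment_name(368769, "index") == "00000000000000368769.index"
```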
Index file of segment: the index file stores a large amount of metadata, the data file stores a large number of messages, and the metadata in the index file points to the physical offset address of the message in the corresponding data file.
Data file of segment:
How Kafka locates a message when reading data:
To read the message at offset=368776, two steps are needed.
Step 1: locate the segment file. 00000000000000000000.index is the first file, with starting offset 0. The starting message offset of 00000000000000368769.index is 368770 (= 368769 + 1); the starting offset of 00000000000000737337.index is 737338 (= 737337 + 1); and so on. Because the files are named and sorted by starting offset, a binary search of the file list by offset quickly locates the specific file: offset=368776 falls into 00000000000000368769.index and the corresponding log file.
Step 2: for offset=368776, look up the metadata entry in 00000000000000368769.index to get the physical offset in 00000000000000368769.log, then scan 00000000000000368769.log sequentially until offset=368776 is reached. The lookup uses the relative offset: the .index file has two columns (sequence, address), where sequence is the relative offset, that is, sequence = offset of the message sought - starting offset of the current file; the address stored for that sequence then gives the message's location in the data file.
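The two steps above can be sketched as a binary search over the segment base offsets plus a relative-offset calculation. This follows the article's naming convention (file named after the last offset of the previous segment); it only models the lookup arithmetic, not the on-disk index format.

```python
import bisect

def locate(base_offsets, target_offset):
    """base_offsets: sorted starting offsets of the segment files.
    Returns (segment base offset, relative offset within that segment)."""
    # Step 1: binary search for the last segment whose base <= target
    i = bisect.bisect_right(base_offsets, target_offset) - 1
    base = base_offsets[i]
    # Step 2: sequence = target offset - starting offset of the current file
    return base, target_offset - base

bases = [0, 368769, 737337]
# offset 368776 falls in segment 00000000000000368769 at relative offset 7
assert locate(bases, 368776) == (368769, 7)
```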