What is the Kafka component architecture


This article looks at the Kafka component architecture. The approach introduced here is simple, fast, and practical, so let's walk through what the Kafka component architecture actually is.

Event sourcing, eventual consistency, microservices, CQRS: these concepts are increasingly familiar to modern developers. From fine-grained service composition to complex business-centric application architectures, the most important piece is business decoupling built on middleware. In this article we introduce event streaming, the basic building block of that middleware, centred on Apache Kafka, the de facto standard event streaming platform, and we also introduce Kafdrop, a web interface tool for Kafka.

Overview

An event streaming platform belongs to the broader class of message-oriented middleware (MoM) and is similar to traditional message queues and topics, but because of the immutability of its log structure it provides stronger temporal guarantees and significantly better performance. In short, because writes to an event stream are limited to sequential appends, it is more efficient.

Messages in traditional message queues (MQ) tend to be arbitrarily ordered and are usually independent of one another, while events (or records) in a stream tend to be ordered chronologically or causally. Furthermore, an event stream retains its records, whereas an MQ discards a message once it has been read. As a result, event streams tend to be better suited to event-driven architectures, including event sourcing, eventual consistency, CQRS, and so on. (FIFO message queues exist as well, but the differences between a FIFO queue and a fully fledged event streaming platform are substantial and are not limited to ordering.)

Event streaming platforms are a relatively recent paradigm within MoM. Compared with the hundreds of MQ-style message brokers, only a handful of mainstream streaming platforms exist. And compared with established standards such as AMQP, MQTT, XMPP, and JMS, there is no equivalent standard in the event streaming space.

Event streaming platforms remain an active field of research and experimentation. Nevertheless, they are not merely a commercial product or a complex academic problem: they can be applied broadly to messaging and eventing scenarios, and can routinely replace the traditional use cases of message queues.

Architecture Overview

The following figure provides a brief overview of the Kafka component architecture. For reasons of space, we will not go into detail about how Kafka works internally.

Kafka components

Kafka is a distributed system that includes the following key components:

Broker nodes: responsible for the bulk of the I/O operations and durable persistence within the cluster. Brokers append to the log files that contain the topic partitions hosted by the cluster. Partitions can be replicated across multiple brokers for horizontal scalability and increased durability; the replicated partitions are called replicas. For a given partition, one replica acts as the leader and the remaining replicas are followers. In addition, one broker node is elected as the cluster controller, responsible for the internal management of partition state and for mediating the leader-follower roles for any given partition.

ZooKeeper nodes: behind the scenes, Kafka needs a way to manage the overall controller state in the cluster. If the controller drops out for whatever reason, there is a protocol for electing another controller from the remaining set of brokers. ZooKeeper largely implements the actual mechanics of controller election, heartbeating, and so on. ZooKeeper also acts as a repository for various configurations, maintaining cluster metadata, leader and follower state, quotas, user information, ACLs, and other housekeeping items. Owing to the underlying election and consensus protocol, the number of ZooKeeper nodes must be odd.

Producers: client applications responsible for publishing messages to Kafka topics. Because Kafka topics are log-structured and may be shared across a multi-consumer ecosystem, only producers can modify the data in the underlying log files. The actual I/O is performed by the broker nodes on behalf of the producer clients. Any number of producers may publish to the same Kafka topic, and each can select the partition in which a record is persisted.

Consumers: client applications that read messages from topics. Any number of consumers can read from the same topic; however, depending on the configuration and grouping of the consumers, there are rules governing how records are distributed among them.

Partitions, records, offsets, and topics

A partition is a totally ordered sequence of records, with each partition corresponding to an append-only log; partitions are the foundation of Kafka. Each record has an ID (a 64-bit offset) and a millisecond-precision timestamp. It may also have a key and a value; both are byte arrays and both are optional. The term "totally ordered" simply means that, for any given producer, records will be written in the order they were emitted by the application. If record P is published before Q, then P will precede Q in the partition (assuming P and Q share a partition). Furthermore, all consumers will read them in the same order: for every possible consumer, P will always be read before Q. In most use cases this ordering guarantee is critical. Typically, published records correspond to some real-world events, and it is often essential to preserve their timeline.

The offset of a record is its unique identifier within the partition. Offsets are strictly monotonically increasing integers in a sparse address space: each record's offset is always higher than that of the previous record, and there may be variable-sized gaps between adjacent offsets. Gaps will necessarily appear if compaction is enabled or as a result of transactions, so offsets are not guaranteed to be contiguous.

Applications should not attempt to interpret offsets literally, nor should they guess what the next offset will be. They may, however, infer the relative order of records from their offsets and sort records accordingly.

The following figure shows the internal structure of a partition:

The first offset (also known as the low-water mark) is the first message that will be presented to a consumer. Because of Kafka's bounded retention, it is not necessarily the first message that was ever published. Records may be pruned on the basis of time and/or partition size. When this happens, the low-water mark appears to advance, and records earlier than the low-water mark are truncated.

A topic is a logical aggregation of partitions. A topic may have one or more partitions, while a partition can belong to exactly one topic. Topics are fundamental to Kafka, allowing for both parallelism and load balancing. As we said earlier, partitions exhibit total order. Because the partitions within a topic are mutually independent, the topic is said to exhibit partial order. In short, this means that certain records can be ordered with respect to one another, but are unordered relative to certain other records. Although the concepts of total and partial order may sound academic, they are important when building a performant event streaming pipeline: they enable us to process records in parallel where we can, while maintaining order where we must. Later, we will explore record order, consumer parallelism, and topic sizing.

Example: message publishing

Theory is best tested in practice, so let's put it to work and illustrate the concepts with an example. We will launch a pair of Docker containers, one running Kafka and the other running Kafdrop, and we will use Docker Compose to start them.

Create a docker-compose.yaml file in a directory of your choice, along the following lines.
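A minimal sketch of such a file, assuming the obsidiandynamics/kafka image (which bundles Kafka and ZooKeeper) alongside Kafdrop listening on port 9000; the service names, ports, and listener settings shown here are illustrative:

version: "2"
services:
  kafdrop:
    image: obsidiandynamics/kafdrop
    ports:
      - "9000:9000"
    environment:
      KAFKA_BROKERCONNECT: "kafka:29092"   # internal listener of the kafka service
    depends_on:
      - "kafka"
  kafka:
    image: obsidiandynamics/kafka
    ports:
      - "2181:2181"
      - "9092:9092"
    environment:
      KAFKA_LISTENERS: "INTERNAL://:29092,EXTERNAL://:9092"
      KAFKA_ADVERTISED_LISTENERS: "INTERNAL://kafka:29092,EXTERNAL://localhost:9092"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT"
      KAFKA_INTER_BROKER_LISTENER_NAME: "INTERNAL"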

For convenience, we use the obsidiandynamics/kafka image, which neatly packages Kafka and ZooKeeper into a single image. Then start the containers with docker-compose up. Once startup succeeds, you can browse to localhost:9000 to see the Kafdrop web interface.

Our instance is a single-broker cluster, and it has no topics yet. We can use Kafka's command-line tools to create a topic and publish some messages. The built-in CLI tools are easily invoked by attaching to the kafka container with docker exec:

docker exec -it kafka-kafdrop_kafka_1 bash

The above command gives us a shell inside the container. The CLI tools live in the /opt/kafka/bin directory, so change into it:
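cd /opt/kafka/bin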

Create a topic named streams-intro with three partitions:
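A sketch of the create command, assuming the broker is reachable at localhost:9092 from inside the container and using a replication factor of 1 for our single-broker cluster:

./kafka-topics.sh --bootstrap-server localhost:9092 \
    --create --topic streams-intro --partitions 3 --replication-factor 1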

Switch back to the Kafdrop interface, and we can now see the newly created topic in the list.

Next, we can use the kafka-console-producer tool to publish the messages:
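One way to invoke it, again assuming localhost:9092 and keyed input (parse.key and key.separator are standard console-producer properties):

./kafka-console-producer.sh --broker-list localhost:9092 \
    --topic streams-intro \
    --property "parse.key=true" --property "key.separator=:"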

Note: kafka-topics uses the --bootstrap-server parameter to specify the list of Kafka brokers, while kafka-console-producer uses --broker-list.

Records are separated by newlines. The key and value parts are separated by a colon, as indicated by the key.separator property. In this case, we can enter input along the following lines:
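Five illustrative records under two keys (the exact message text is hypothetical; anything after the colon becomes the value):

foo:first message
foo:second message
foo:third message
bar:first message
bar:second message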

When you are finished, press CTRL+D to complete publishing. Then switch back to Kafdrop and click on the streams-intro topic. You will see an overview of the topic along with a detailed breakdown of the underlying partitions:

We created a topic with three partitions, then published five records using two unique keys, foo and bar. Kafka uses keys to map records to partitions, such that all records with the same key always appear on the same partition. This is both convenient and important, because it lets publishers dictate the exact order of related records. Later, we will discuss key hashing and partition assignment in more detail.

Looking at the partition table, partition #0 has first and last offsets of 0 and 2 respectively, partition #2 has 0 and 3, while partition #1 is empty. Clicking on #0 in the Kafdrop web UI takes you to the topic viewer:

We can see the two records published under the bar key. Note that they are completely unrelated to the foo records.

Consumers and consumer groups

In the example above, we saw a producer publish messages to a stream, with the records organised into partitions. Kafka's publish-subscribe topology follows a flexible many-to-many model, so any number of producers and consumers can interact with a stream simultaneously. Depending on the actual solution, the stream topology can also be one-to-many or many-to-one. Now let's talk about how these records are consumed.

A consumer is a process or thread that attaches to a Kafka cluster via a client library. A consumer is usually (but not necessarily) a member of a consumer group, specified by the group.id property. Consumer groups are effectively Kafka's load-balancing mechanism, responsible for dividing partitions approximately evenly among the consumer instances within a group. When the first consumer in a group subscribes to a topic, it receives all of the topic's partitions. When a second consumer joins, it gets roughly half of the partitions, relieving the first consumer of half of its load. When a consumer leaves (by disconnecting or timing out), the process runs in reverse, and the departing consumer's partitions are redistributed among the remaining consumers.

So a consumer consumes records from a topic by drawing from the share of partitions that Kafka assigned to it alongside the other consumers in its group. As far as load balancing goes, this is fairly straightforward. But there is a key point: consuming a record does not delete it. This may seem contradictory at first, especially if the act of consuming is associated with depletion (if anything, "reader" would have been a better word than "consumer"). The simple fact is that consumers have absolutely no impact on topics and their partitions: a topic is append-only and may only be appended to by producers, or by Kafka itself (as part of compaction or cleanup). Consumers' read-only operations are "cheap", so many of them can tail the logs without overburdening the cluster. This is yet another difference between event streams and traditional message queues, and a crucial one.

A consumer internally maintains an offset that points to the next record in a partition, advancing the offset with every successive read. When a consumer first subscribes to a topic, it may choose to start at either the head or the tail of the topic. This behaviour is controlled by setting the auto.offset.reset property to latest, earliest, or none; in the last case, an exception is thrown if the consumer group has no prior offset.

Consumers retain their offset state vector locally. Since consumers in different consumer groups do not interfere with one another, there may be many of them reading the same topic concurrently. Consumers read at their own pace; a slow or backlogged consumer has no impact on its peers.
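To see this from the command line, a console consumer can be attached with an explicit group; the group name test-group and the property values below are illustrative:

./kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic streams-intro --group test-group \
    --consumer-property auto.offset.reset=earliest \
    --property print.key=true

Starting a second copy of this command with the same --group splits the topic's partitions between the two instances, while starting it with a different group name re-reads the whole topic independently.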

To illustrate this concept, consider a scenario with a topic of two partitions. Two consumer groups, A and B, are subscribed to the topic. Each group has three instances, the consumers being named A1, A2, A3, B1, B2, and B3. The following figure shows how the two groups share the topic, and how the consumers move through the records independently of one another.

If you look closely at the figure above, you will notice that something is missing: consumers A3 and B1 are not shown. That is because Kafka guarantees that a partition may be assigned to at most one consumer within its consumer group. Since there are three consumers in each group but only two partitions, one consumer remains idle, waiting for another consumer in its group to depart before it is assigned a partition. In this way, consumer groups are not only a load-balancing mechanism, but also a fence-like exclusivity control for building high-performance pipelines without sacrificing safety, particularly where it matters that a record is processed by only one thread at any given time.

Consumer groups are also used to ensure availability. By periodically polling records from the topic, a consumer implicitly signals to the cluster that it is "healthy", thereby extending the lease over its partition assignment. However, should the consumer fail to poll again within the allowed deadline, it is deemed faulty and its partitions are reassigned to the remaining "healthy" consumers in the group. The deadline is controlled by the max.poll.interval.ms consumer client property, which is set to five minutes by default.

To use a transportation analogy, a topic is like a highway and a partition is a lane. A record is a car, and its passengers correspond to the record's value. Several cars can travel safely on the same highway as long as they keep to their lanes. Cars sharing the same lane travel in sequence, forming a queue. Now suppose each lane leads to an off-ramp, diverting its traffic to some destination. If one off-ramp backs up, the others may still flow smoothly.

Kafka uses this mechanism to ensure end-to-end throughput, easily reaching millions of records per second. When creating a topic, you choose the partition count, the number of lanes as it were. The partitions are divided roughly evenly among the consumers in a consumer group, and Kafka guarantees that a partition is never assigned to two (or more) consumers at the same time.

Note: after creation, a topic can be resized by increasing the number of partitions. It is not possible, however, to decrease the partition count without recreating the topic.
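For example, the partition count of our demo topic could later be raised with kafka-topics (the target count of 6 is arbitrary):

./kafka-topics.sh --bootstrap-server localhost:9092 \
    --alter --topic streams-intro --partitions 6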

Records correspond to events, messages, commands, or whatever else is being streamed. Exactly how records are partitioned is decided by the producer. A producer may explicitly assign a partition index when publishing a record, although this approach is rarely used. A much more common approach is to assign a key to the record, as we did in our earlier example. The key is completely opaque to Kafka; in other words, Kafka does not interpret the contents of the key, treating it simply as an array of bytes. These bytes are hashed with a consistent hashing technique to derive a partition index.

Records sharing the same hash are guaranteed to occupy the same partition. Assuming a topic with multiple partitions, records with different keys will likely end up in different partitions. However, owing to hash collisions, records with different hashes may also end up in the same partition.

Producers need not care which specific partition their records map to, as long as related records end up in the same partition and their order is preserved. Similarly, consumers do not care which partitions they are assigned, so long as they receive records in the same order they were published, and their partition assignments do not overlap with those of any other consumer in the group.

Case: trading platform

Suppose we are looking for specific price patterns in listed stocks and want to emit a trading signal once a pattern is identified. There is a large number of stocks, and it is understandably desirable to process them in parallel. However, the time series for any given ticker symbol must be processed sequentially, on a single consumer.

Kafka makes this use case, and others like it, almost trivial to implement. We will create two topics: prices, used to store the raw price data, and orders, used to hold any resulting orders. We can be fairly generous with the partition counts, giving ourselves plenty of room for parallel processing.

We can publish a record for each price on the prices topic, keyed by the stock's ticker symbol. Kafka's automatic partitioning ensures that each ticker is processed sequentially by one consumer in its group. Consumer instances are free to scale in and out to match the processing load. Consumer groups should be named meaningfully, ideally reflecting the purpose of the consuming application, for example trading-strategy.abc for a fictitious trading strategy named "ABC".
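As a sketch of that setup using the same CLI tools (the topic names prices and orders, the partition counts, and the sample tickers are illustrative assumptions):

./kafka-topics.sh --bootstrap-server localhost:9092 --create --topic prices --partitions 6 --replication-factor 1
./kafka-topics.sh --bootstrap-server localhost:9092 --create --topic orders --partitions 6 --replication-factor 1
./kafka-console-producer.sh --broker-list localhost:9092 --topic prices \
    --property "parse.key=true" --property "key.separator=:"

You would then type keyed price records such as AAPL:187.45 or MSFT:412.10 (hypothetical values), one per line, so that all prices for a given ticker land on the same partition.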

Once a consumer identifies the price pattern, it can publish another message, an order request, on the orders topic. We then marshal another consumer group, order execution, responsible for reading the orders and forwarding them to a broker.
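A sketch of that consuming stage from the CLI (the group name order-execution is an illustrative choice):

./kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic orders --group order-execution --property print.key=true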

In this simple example, we have created an end-to-end trading pipeline that is fully event-driven and, assuming there are no other bottlenecks, highly scalable. We can dynamically add more processing nodes at each stage to absorb increased load.

Now suppose you need several trading strategies running concurrently, driven by a common data source. Furthermore, the strategies will be developed by different teams; the goal is to decouple the implementations as much as possible, allowing the teams to operate autonomously, and even to develop and deploy at their own cadence, perhaps using different programming languages and tool chains.

Kafka's flexible many-to-many pub-sub architecture combines stateful consumption with broadcast semantics. By using distinct consumer groups, Kafka allows disparate applications to share input topics, processing events at their own pace. A second trading strategy would require its own dedicated consumer group, say trading-strategy.xyz, applying its specific business logic to the common pricing stream and publishing the resulting orders to the same orders topic. In this fashion, Kafka enables modular event-processing pipelines to be built from discrete elements that are readily reusable and composable.

At this point, you should have a deeper understanding of what the Kafka component architecture is. Now go and try it out in practice, and keep learning!
