Features and Components of Kafka


This article introduces the features and components of Kafka. Many readers have questions about this topic, so the editor has consulted a variety of materials and organized them into simple, practical notes. I hope they help resolve your doubts about Kafka's features and components. Now, please follow along and study!

I. Characteristics

MQ was born to solve the mismatch between producers and consumers, so one of the most basic requirements of an MQ system is that writes must be fast, even if dequeuing is slow: a business peak only lasts a limited time, consumers have plenty of time to catch up after the peak, and failing that, you can simply (even crudely) add a few more consumers.

1. High throughput for both publishing and subscribing. It is reported that Kafka can produce about 250,000 messages per second (50 MB/s) and consume about 550,000 messages per second (110 MB/s).

2. Persistence. Messages are persisted to disk, so they can be used for batch consumption (such as ETL) as well as for real-time applications. Persisting data to disk, combined with replication, prevents data loss.

3. A distributed system that is easy to scale out. There can be multiple producers, brokers, and consumers, all distributed, and machines can be added without downtime.

4. The state of message processing is maintained on the client side, not the server side, and the system rebalances automatically when a client fails.

5. Support for both online (real-time) and offline (batch) scenarios.

II. Components

Broker: the intermediary between publishers and subscribers, which accumulates messages.

Zookeeper: manages the Kafka cluster and monitors the status of each broker, so that both publishers and subscribers connect to valid brokers.

Producer: the publisher of messages. A producer publishes messages to a specified Topic and can also decide which partition each message is sent to.
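
To make this concrete, here is a minimal sketch using the Java client's KafkaProducer; the broker address, topic name, and explicit partition number are illustrative assumptions, not from the original article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Let Kafka pick the partition (by key hash, or round-robin without a key)...
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello"));
            // ...or decide the partition explicitly, as the text describes.
            producer.send(new ProducerRecord<>("demo-topic", 0, "key-2", "world"));
        }
    }
}
```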

Consumer: an asynchronous consumer that subscribes to the topics it is interested in and pulls messages from brokers. In Kafka, a message in a partition is consumed by only one consumer within a group (at any given time), and message consumption in each group is independent of the others. You can think of a group as one "subscriber": each partition of a Topic is consumed by only one consumer within that "subscriber", though a single consumer can consume messages from multiple partitions at the same time. Kafka only guarantees that the messages within one partition are consumed in order by one consumer; from the point of view of a whole Topic with multiple partitions, messages are still not globally ordered.

Consumer group: when each consumer client is created, it registers its own information with ZooKeeper. This mechanism exists mainly for load balancing: multiple consumers in one group can consume all the partitions of a topic in an interleaved fashion. In short, it ensures that all partitions of the topic are consumed by the group and, for performance, that partitions are distributed relatively evenly across the consumers.
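
Below is a minimal sketch of these group semantics using the modern Java consumer (group id, topic, and broker address are illustrative). Instances started with the same group.id split the topic's partitions among themselves, while an instance with a different group.id independently receives all messages:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        props.put("group.id", "subscriber-A");            // one "subscriber"
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                // Each record's partition is owned by exactly one consumer in this group.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```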

III. Principles

1. Persistence

Kafka stores messages in files (as an append-only log), which means its performance depends heavily on the characteristics of the file system, and file-system optimization is difficult on any OS. File caching and direct memory mapping are the usual techniques. Because Kafka only appends to log files, the cost of disk seeks is small; at the same time, to reduce the number of disk writes, the broker buffers messages temporarily and flushes them to disk when their number (or size) reaches a threshold, reducing the number of disk I/O calls. For Kafka, faster disks translate directly into better performance.
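
The flush thresholds mentioned above correspond to broker settings. A hedged server.properties sketch (values are illustrative; by default Kafka largely leaves flushing to the OS page cache):

```properties
# Flush a partition log to disk after this many messages accumulate (illustrative value).
log.flush.interval.messages=10000
# ...or after this many milliseconds, whichever comes first (illustrative value).
log.flush.interval.ms=1000
```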

2. Performance

Besides disk I/O, network I/O also needs consideration, since it directly affects Kafka's throughput. Kafka does not rely on exotic tricks here: on the producer side, messages can be buffered and sent to the broker in a batch once their number reaches a threshold; the consumer side is the same, fetching multiple messages per batch. The batch size can be specified through the configuration file. On the broker side, the sendfile system call can improve network I/O performance: the file's data is mapped into system memory and the socket reads that memory region directly, without the process having to copy and exchange data again (this avoids extra copies among "disk I/O data", "kernel memory", "process memory", and the "network buffer").

In fact, for producer, consumer, and broker alike, the CPU cost should be small, so enabling message compression is a good strategy: compression consumes a small amount of CPU, but for Kafka, network I/O is the bigger concern. Any message transmitted on the network can be compressed; Kafka supports gzip, snappy, and other compression codecs.
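
On the producer side, the batching and compression described here are governed by configuration. A minimal sketch with illustrative values, using the Java client:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class TunedProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", 16384);          // batch up to 16 KB of records per partition
        props.put("linger.ms", 5);               // wait up to 5 ms for a batch to fill
        props.put("compression.type", "snappy"); // compress whole batches; gzip also works
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() calls here are buffered and shipped as compressed batches
        }
    }
}
```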

3. Load balancing

Any broker in a Kafka cluster can provide metadata to producers; the metadata contains information such as "the list of servers alive in the cluster" and "the list of partition leaders" (see the node information in ZooKeeper). After a producer obtains the metadata, it keeps socket connections with all partition leaders under the Topic; messages are sent by the producer directly to the broker over these sockets, without passing through any "routing layer".

To send multiple messages asynchronously, the client buffers them temporarily and sends them to the broker in batches; too many small I/Os would otherwise drag down overall network latency, so batching (at the cost of a small delay) actually improves network efficiency. It does carry a hidden risk, however: if the producer fails, messages that have not yet been sent are lost.

4. Topic model

In other JMS implementations, the consumption position of a message is kept by the provider, in order to avoid sending messages repeatedly or resending messages that were not successfully consumed, and to track message state. This imposes a lot of extra work on the JMS broker. In Kafka, a partition's messages are consumed by only one consumer (per group), there is no message-state tracking and no complex acknowledgement mechanism, so the Kafka broker side is quite lightweight. When a consumer receives a message, it can save the offset of the last message locally and intermittently register the offset with ZooKeeper. Moreover, for the "auto.offset.reset" configuration item in Kafka 0.10, none of the options (earliest, latest, none, or anything else) takes effect unless there is no initial offset or the stored offset is missing. So the consumer client is also very lightweight.
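
A sketch of this lightweight offset handling with the modern Java consumer (note that current clients commit offsets to the broker's internal offsets topic rather than to ZooKeeper; names and values are illustrative):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        props.put("group.id", "offset-demo");             // illustrative
        props.put("enable.auto.commit", "false");         // the client controls its own offsets
        // Only applies when the group has no committed offset (or it has expired):
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            consumer.poll(Duration.ofMillis(500));
            consumer.commitSync(); // intermittently record progress, as the text describes
        }
    }
}
```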

In Kafka, the consumer is responsible for maintaining its own consumption records, and the broker does not care about them. This design improves the flexibility of the consumer side and moderately reduces the complexity of the broker side, which differs from many JMS providers. Kafka's message-ACK design is also very different from JMS. Messages are delivered to the consumer in batches (usually bounded by a message count or a chunk size). When messages are consumed successfully, their offsets are committed to ZooKeeper, but no ACK is delivered to the broker. You may have realized that this "loose" design carries the risk of messages being "lost" or "redelivered".

5. Log

Each log entry has the format "a 4-byte number N giving the message length" + "N bytes of message content". Each message has an offset that uniquely identifies it; the offset is an 8-byte number giving the message's starting position within the partition. At the physical storage level, each partition consists of multiple log files (called segments). A segment file is named after its "minimum offset", for example "00000000000.kafka" (in modern Kafka versions the suffix is .log), where the "minimum offset" is the offset of the first message in that segment.

When fetching messages, you specify an offset and a maximum chunk size: the offset marks the starting position of the messages, and the chunk size bounds the total length of the messages returned (indirectly, their number). From the offset you can locate the segment file containing the message, subtract the segment's minimum offset to get the message's relative position within the file, and read it out directly.
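
The segment selection described here can be sketched with a sorted map from base offsets to segment files: take the greatest base offset that is less than or equal to the requested offset. This illustrates only the selection step; the real broker additionally uses index files to map an offset to a byte position. All names and numbers below are hypothetical:

```java
import java.util.Map;
import java.util.TreeMap;

public class SegmentLookup {
    public static void main(String[] args) {
        // Hypothetical segment files, keyed by their minimum (base) offset.
        TreeMap<Long, String> segments = new TreeMap<>();
        segments.put(0L, "00000000000.kafka");
        segments.put(368770L, "00000368770.kafka");
        segments.put(737338L, "00000737338.kafka");

        long wanted = 500000L;
        // The greatest base offset <= the requested offset identifies the segment.
        Map.Entry<Long, String> entry = segments.floorEntry(wanted);
        long relative = wanted - entry.getKey();
        System.out.printf("offset %d -> file %s, relative position %d%n",
                wanted, entry.getValue(), relative);
    }
}
```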

6. Distributed system

Kafka uses ZooKeeper to store metadata and relies on ZooKeeper's watch mechanism to discover metadata changes and react accordingly (for example, a consumer failure triggering load balancing).

Broker node registry: when a Kafka broker starts, it first registers its own node information with ZooKeeper (an ephemeral znode); when the broker disconnects from ZooKeeper, the znode is deleted.

Broker Topic Registry: when a broker starts, it registers its topic and partition information with ZooKeeper, again as an ephemeral znode.

Consumer and Consumer group: when each consumer client is created, it registers its own information with ZooKeeper, mainly for load balancing. Multiple consumers in a group can consume all the partitions of a topic in an interleaved fashion; in short, all partitions of the topic are consumed by the group, and for performance the partitions are distributed relatively evenly across the consumers.

Consumer id Registry: each consumer has a unique ID (host:uuid, either specified in a configuration file or generated by the system) that identifies the consumer.

Consumer offset Tracking: tracks the largest offset each consumer has so far consumed in each partition. This znode is a persistent node. Note that the offset is keyed by group_id, so when one consumer in a group fails, another consumer can continue consuming from where it left off.

Partition Owner registry: marks which consumer is consuming a given partition; an ephemeral znode. It expresses that a partition can be consumed by only one consumer of the group at a time; when a consumer in the group fails, load balancing is triggered (partitions are rebalanced across the remaining consumers, which take over the "orphaned" partitions).

When a consumer starts, the following actions are triggered:

A) It first performs the "Consumer id Registry".

B) It then registers a watch under the "Consumer id Registry" node to listen for other consumers in the current group "leaving" or "joining". Any change to the node list under this znode path triggers load balancing of the consumers in that group (for example, if one consumer fails, the other consumers take over its partitions).

C) Under the "Broker id registry" node, it registers a watch to monitor broker liveness; if the broker list changes, a rebalance is triggered for all consumers in all groups.

Summary:

The Producer side uses ZooKeeper to "discover" the broker list, establishes a socket connection with each partition leader under the Topic, and sends messages.

The Broker side uses ZooKeeper to register broker information and to monitor partition leader liveness.

The Consumer side uses ZooKeeper to register consumer information (including the list of partitions it consumes), to discover the broker list, to establish socket connections with partition leaders, and to fetch messages.

7. Replica management

Kafka tries to distribute all partitions evenly across all nodes in the cluster rather than concentrating them on a few nodes. The leader/follower relationships are also balanced as far as possible, so that each node acts as the leader for a proportional share of partitions.

Optimizing the leader election process is also important, because it determines how long the window of unavailability is when the system fails. Kafka selects one node as the "controller". When the controller detects that a node has gone down, it is responsible for electing a new leader from among the remaining replicas of each affected partition, which lets Kafka manage the leader/follower relationships of many partitions in batches, efficiently. If the controller itself goes down, one of the surviving nodes takes over as the new controller.

8. Leader and replica synchronization

Replicas are created per topic partition: each partition has one leader and zero or more followers. All reads and writes are handled by the leader. The number of partitions is generally much larger than the number of brokers, and the leaders of the partitions are distributed evenly among the brokers. Followers replicate the leader's log, and the messages and their order in a follower's log are identical to the leader's; a follower pulls messages from the leader just like an ordinary consumer and appends them to its own log file.

Many distributed messaging systems handle failures automatically and therefore need a clear definition of whether a node is alive. Kafka uses two conditions to decide whether a node is alive:

The node must maintain its connection to ZooKeeper, which checks each node's connection through its heartbeat mechanism.

If the node is a follower, it must synchronize the leader's writes in a timely manner; its lag must not grow too large.

Nodes that meet both conditions are called "in sync", rather than vaguely "alive" or "failed". The leader tracks all "in sync" nodes and removes any node that goes down, gets stuck, or falls too far behind. How far behind counts as "too far" is determined by the parameter replica.lag.max.messages, and what counts as stuck is determined by replica.lag.time.max.ms.
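
A hedged server.properties sketch of the two parameters just named (values are illustrative; note that replica.lag.max.messages was removed in Kafka 0.9, after which lag is judged purely by time):

```properties
# Drop a follower from the in-sync set if it falls this many messages behind
# (pre-0.9 brokers only; illustrative value).
replica.lag.max.messages=4000
# Drop a follower if it has not caught up with the leader for this many milliseconds.
replica.lag.time.max.ms=10000
```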

IV. Version comparison

(I) 0.8.x vs 0.9

1. Security

Kafka 0.9 provides three security features.

First, it provides Kerberos and TLS authentication.

Second, it provides a Unix-like permission system to control which users can access which data.

Third, it provides encryption of data in transit.

Of course, only the new producer API and the new 0.9 consumer implementation can use these security features; the old APIs still have no security controls.

The security features are backward compatible, so users who do not enable them see no performance degradation.

This is only the first release of these security features; more security controls will be provided in future releases.

2. New Consumer

Kafka 0.8.2 redesigned the Producer, and Kafka 0.9 redesigns the Consumer interface: it no longer distinguishes between the high-level consumer API and the low-level consumer API, providing a single unified consumer API instead. With it (a sketch follows the list below):

1) Kafka can maintain offsets and consumer positions on its own, or developers can maintain offsets themselves to meet business requirements. When consuming, you can consume only specified partitions.

2) You can use external storage to record offsets, such as a database.

3) You can control the position from which the consumer reads messages yourself.

4) You can consume with multiple threads.
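
A minimal sketch of points 1) and 3) with the unified Java consumer: assigning a specific partition and seeking to a self-managed position (broker address, topic, and offset are illustrative assumptions):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class NewConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        props.put("enable.auto.commit", "false");         // we manage offsets ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // 1) Consume only a specified partition.
            TopicPartition tp = new TopicPartition("demo-topic", 0);
            consumer.assign(Collections.singletonList(tp));
            // 3) Control the consumption position yourself, e.g. restored from a database.
            consumer.seek(tp, 42L);
            consumer.poll(Duration.ofMillis(500));
        }
    }
}
```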

3. Define quotas for users

A large Kafka cluster may serve multiple users. Before 0.9, a consumer that processed messages very quickly could monopolize a broker's entire network bandwidth, and the same was true of producers. Kafka 0.9 now provides per-client quota control: for producers, you can cap the number of bytes written per second per client, and for consumers, the number of bytes read per second per client.
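
A hedged sketch of the 0.9-era quota settings in server.properties (these broker-wide defaults are an assumption based on the 0.9 configuration; later releases moved to dynamic per-client overrides; values are illustrative):

```properties
# Default cap on bytes written per second per producer client id (illustrative: 1 MB/s).
quota.producer.default=1048576
# Default cap on bytes read per second per consumer client id (illustrative: 2 MB/s).
quota.consumer.default=2097152
```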

(II) 0.9 vs 0.10

1. Kafka Streams

Kafka Streams includes a high-level API for describing common stream operations (such as joining, filtering, and aggregating records), which lets developers quickly build powerful streaming applications. Kafka Streams provides both stateless and stateful processing and can be deployed on many systems.
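
A minimal Streams sketch of the filtering style the paragraph describes, using the current StreamsBuilder API (the 0.10-era API used a different builder class; topic names and the application id are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sketch");    // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        // Filter records, one of the high-level operations mentioned above.
        input.filter((key, value) -> value != null && !value.isEmpty())
             .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```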

2. Connector status and control REST APIs

Kafka Connect continues to improve in Kafka 0.10.0.0.

Previously, users had to watch the logs to see the status of each connector and its tasks; Kafka now offers a status API that makes monitoring easier.

Control-related APIs have also been added, allowing the user to stop a connector, or manually restart failed tasks, during maintenance. These can be displayed and managed intuitively in a user interface; connectors can currently be viewed in the Control Center.

3. SASL improvements

Earlier releases added security features, including Kerberos support through SASL. Apache Kafka 0.10.0.0 now supports more SASL features, including external authorization servers, multiple types of SASL authentication on one server, and other improvements.

4. Rack Awareness

Kafka now has built-in rack awareness for spreading out replicas, which lets Kafka ensure that replicas span multiple racks or availability zones, significantly improving Kafka's resilience and availability. This feature was contributed by Netflix.

5. Kafka Consumer Max Records

In Kafka 0.9.0.0, developers had little control over the number of messages returned when calling poll() on the new consumer. Happily, this version introduces the max.poll.records parameter, which lets developers bound the number of records returned per poll.
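
A one-line sketch of the new parameter in consumer configuration (all values illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MaxPollSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        props.put("group.id", "bounded-poll");            // illustrative
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Each poll() now returns at most 100 records (parameter added in 0.10.0.0).
        props.put("max.poll.records", 100);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // consumer.poll(...) calls are bounded by max.poll.records
        }
    }
}
```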

6. Protocol version improvement

Kafka brokers now support a request API that returns all supported protocol versions; the advantage of this feature is that it will allow a client to support multiple broker versions in the future.

This concludes the study of the features and components of Kafka. I hope it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it out! If you want to keep learning more, please continue to follow this site; the editor will keep working to bring you more practical articles!
