This article introduces the key knowledge points of Kafka in detail. Interested readers can use it for reference; I hope you find it helpful.
1. Kafka HA
1.1 replication
As shown in Figure 1, a partition may have multiple replicas (controlled by default.replication.factor=N in the server.properties configuration). Without replicas, once a broker goes down, none of the partitions on it can be consumed, and producers can no longer write to those partitions. With replication, each partition has several replicas; one of them must be elected leader, producers and consumers interact only with that leader, and the remaining replicas act as followers that replicate data from the leader.
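Besides the broker-wide default above, the replication factor is usually chosen per topic at creation time. Below is a minimal sketch using the modern Java AdminClient (not the 0.8-era tooling this article was written against); the broker address, topic name, and replication factor of 3 are illustrative assumptions, not values from the original.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical broker address; replace with your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic with 3 partitions, each kept on 3 replicas (one leader + two followers).
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}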
The algorithm Kafka uses to assign replicas is as follows:
1. Sort all brokers (assume there are n brokers) and the partitions to be assigned.
2. Assign the i-th partition to the (i mod n)-th broker.
3. Assign the j-th replica of the i-th partition to the ((i + j) mod n)-th broker.
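A minimal sketch of this round-robin assignment, assuming brokers and partitions are identified simply by their sorted indices; it mirrors the three steps above rather than Kafka's actual source code.

import java.util.ArrayList;
import java.util.List;

public class ReplicaAssignment {
    /**
     * Assigns replicas of each partition to brokers following the rule:
     * replica j of partition i goes to broker (i + j) mod n.
     */
    static List<List<Integer>> assign(int numBrokers, int numPartitions, int replicationFactor) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            List<Integer> replicas = new ArrayList<>();
            for (int j = 0; j < replicationFactor; j++) {
                replicas.add((i + j) % numBrokers);   // broker index of the j-th replica
            }
            assignment.add(replicas);                  // replicas.get(0) is the preferred leader
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 4 brokers, 3 partitions, replication factor 2
        System.out.println(assign(4, 3, 2));          // [[0, 1], [1, 2], [2, 3]]
    }
}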
1.2 leader failover
When the leader of a partition goes down, a new leader must be elected from the followers. The basic principle is that the new leader must hold all the messages committed by the old leader.
Kafka dynamically maintains an ISR (in-sync replicas) set in ZooKeeper (/brokers/.../state). As the write process in Section 3.3 shows, every replica in the ISR keeps up with the leader, and only ISR members can be elected leader. With f+1 replicas, a partition can tolerate the failure of f replicas without losing committed messages.
When none of the replicas is alive, there are two options:
1. Wait for a replica in the ISR to come back to life and elect it as the leader. No data is lost, but the wait may be relatively long.
2. Elect the first replica that comes back to life (not necessarily an ISR member) as the leader. Data loss is possible, but the period of unavailability is relatively short.
Kafka 0.8.* uses the second approach. The leader is elected by the Controller.
1.3 broker failover
Figure: Kafka broker failover sequence
Process description:
1. The controller registers a Watcher on the /brokers/ids/[brokerId] nodes in ZooKeeper; when a broker goes down, ZooKeeper fires the watch.
2. The controller reads the list of available brokers from the /brokers/ids node.
3. The controller determines set_p, the set of all partitions on the failed broker.
4. For each partition in set_p:
4.1 read the ISR from the /brokers/topics/[topic]/partitions/[partition]/state node;
4.2 determine the new leader (as described in the leader failover section above);
4.3 write the new leader, ISR, controller_epoch, and leader_epoch back to the state node.
5. Send a LeaderAndIsrRequest to the relevant brokers via RPC.
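A hedged illustration of steps 1 and 2 only: watching /brokers/ids and re-reading the live broker list when membership changes, using the plain ZooKeeper Java client. The connection string, timeout, and re-registration loop are assumptions for illustration, not the controller's real implementation.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

import java.util.List;

public class BrokerWatcherSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper address.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });

        watchBrokers(zk);
        Thread.sleep(Long.MAX_VALUE); // keep the process alive so watch events can arrive
    }

    static void watchBrokers(ZooKeeper zk) throws Exception {
        // getChildren registers a one-shot watch; it must be re-registered after it fires.
        List<String> brokerIds = zk.getChildren("/brokers/ids", (WatchedEvent event) -> {
            if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
                try {
                    watchBrokers(zk); // re-read the live brokers and set the watch again
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        System.out.println("Live brokers: " + brokerIds);
    }
}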
1.4 controller failover
Controller failover is triggered when the controller goes down. Each broker registers a watcher on the /controller node in ZooKeeper. When the controller fails, this ephemeral node disappears and all surviving brokers are notified; each of them then tries to create the /controller node, but only one succeeds and becomes the new controller.
When a new controller is elected, the KafkaController.onControllerFailover method is triggered and does the following:
1. Read and increment the Controller Epoch.
2. Register a watcher on the reassignedPartitions path (/admin/reassign_partitions).
3. Register a watcher on the preferredReplicaElection path (/admin/preferred_replica_election).
4. Register a watcher on the broker topics path (/brokers/topics) via partitionStateMachine.
5. If delete.topic.enable=true (default false), partitionStateMachine registers a watcher on the delete topics path (/admin/delete_topics).
6. Register a watcher on the broker ids path (/brokers/ids) via replicaStateMachine.
7. Initialize the ControllerContext object, which holds the current topics, the list of "live" brokers, and the leader and ISR of every partition.
8. Start replicaStateMachine and partitionStateMachine.
9. Set brokerState to RunningAsController.
10. Send the leadership information of each partition to all "live" brokers.
11. If auto.leader.rebalance.enable=true (default true), start the partition-rebalance thread.
12. If delete.topic.enable=true and the delete topics path (/admin/delete_topics) contains entries, delete the corresponding topics.

2. Consumer consumption of messages
2.1 consumer API
Kafka provides two consumer APIs:
1. The high-level Consumer API
2. The SimpleConsumer API
The high-level Consumer API provides a high-level abstraction for consuming data from Kafka, while the SimpleConsumer API requires the developer to handle more details.
2.1.1 The high-level consumer API
The high-level Consumer API provides consumer-group semantics: a message is consumed by only one consumer within the group, and the consumer does not need to track offsets itself; the last consumed offset is saved by ZooKeeper.
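As a hedged illustration of consumer-group semantics with the modern Java client (which stores the committed offset in Kafka itself rather than in ZooKeeper as in the 0.8-era API described here); the group name, topic, and broker address are placeholders.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");        // members of this group share the partitions
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");    // offsets are committed automatically
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}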
When using the high-level Consumer API in a multithreaded application, pay attention to the following:
1. If there are more consumer threads than partitions, some threads will not receive any messages.
2. If there are more partitions than threads, some threads will receive messages from more than one partition.
3. If a thread consumes multiple partitions, the ordering of messages across partitions is not guaranteed; only the messages within a single partition are ordered.
2.1.2 The SimpleConsumer API
Use the SimpleConsumer API if you need finer-grained control over partitions, for example to:
1. Read a message multiple times.
2. Consume only a subset of the messages in a partition.
3. Use transactions to ensure that a message is consumed exactly once.
When using this API, however, partitions, offsets, brokers, and leaders are no longer transparent; you have to manage them yourself and do a fair amount of extra work:
1. Offsets must be tracked in the application to determine which message should be consumed next.
2. The application must determine programmatically which broker is the leader of each partition.
3. Leader changes must be handled by the application.
The general workflow when using the SimpleConsumer API is as follows:
1. Find a "alive" broker and find leader 2 for each partition. Find out the follower 3. 0 for each partition. Define the request, which should describe what data the application needs. 4. Fetch data 5. Identify changes in leader and respond to them as necessary
The sections below focus on the high-level Consumer API.
2.2 consumer group
As mentioned earlier, Kafka's unit of allocation is the partition. Each consumer belongs to a group; a partition can be consumed by only one consumer within the same group (which guarantees that a message is handled by only one consumer in that group), but multiple groups can consume the same partition simultaneously.
One of Kafka's design goals is to support offline and real-time processing at the same time. Thanks to this, the same messages can be processed online by a real-time system such as Spark/Storm, processed offline by a Hadoop batch system, and backed up to another data center, as long as the three consumers belong to different consumer groups.
2.3 consumption pattern
Consumers read data from brokers in pull mode.
A push model has difficulty adapting to consumers with different consumption rates, because the delivery rate is decided by the broker: its goal is to deliver messages as fast as possible, which easily overwhelms the consumer, typically showing up as denial of service and network congestion. A pull model, by contrast, lets the consumer fetch messages at a rate that matches its own capacity.
The pull model suits Kafka better. It simplifies the broker design, lets the consumer control its own consumption rate and style (in batches or one message at a time), and lets it choose among different commit strategies to achieve different delivery semantics.
2.4 consumer delivery guarantee
If the consumer is set to autocommit, it commits automatically as soon as data is read. Considering only this read step, Kafka ensures exactly-once delivery.
In practice, however, the application does not stop after the consumer has read the data; it still has to process it, and the ordering of processing and committing largely determines the consumer-side delivery guarantee:
1. Read the message, commit, then process it. If the consumer crashes after the commit but before processing, the message that was just committed but never processed will not be read again on restart. This corresponds to at-most-once.
2. Read the message, process it, then commit. If the consumer crashes after processing but before the commit, the message will be processed again on restart even though it has already been handled. This corresponds to at-least-once.
3. To achieve exactly-once, the offset commit must be coordinated with the actual output. The classic approach is to introduce a two-phase commit; it is simpler and more general to store the offset in the same place as the output itself, which is often preferable because many output systems do not support two-phase commit. For example, if the consumer writes its data to HDFS, writing the latest offset together with the data guarantees that either both the data and the offset are updated or neither is, achieving exactly-once indirectly. (At present, with the high-level API the offset is stored in ZooKeeper and cannot be stored in HDFS, whereas the SimpleConsumer API maintains the offset itself and can store it in HDFS.)
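As a hedged sketch of option 2 (process first, then commit, i.e. at-least-once) using the modern Java client: auto-commit is disabled and the offset is committed only after processing succeeds. The processing step and all names are placeholders.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AtLeastOnceSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "at-least-once-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // we decide when the offset is committed
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);      // if the process crashes here, the batch is re-read and re-processed
                }
                consumer.commitSync();    // committed only after the whole batch has been processed
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // Placeholder for real processing (writing to a database, HDFS, etc.).
        System.out.println(record.value());
    }
}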
In summary, Kafka guarantees at-least-once by default; at-most-once can be achieved by disabling retries on the producer and committing the offset before processing. Exactly-once requires cooperation with an external storage system; fortunately, the offset that Kafka exposes can be used for this quite directly and easily.
2.5 consumer rebalance
A rebalance is triggered when a consumer joins or leaves the group, or when the set of partitions changes (for example, when a broker joins or leaves). The consumer rebalance algorithm is as follows:
1. Sort all partitions of the target topic and store them in PT.
2. Sort all consumers of the consumer group and store them in CG; the i-th consumer is denoted Ci.
3. N = size(PT) / size(CG), rounded up.
4. Revoke Ci's right to consume its previously assigned partitions (i starts from 0).
5. Assign the partitions from position i*N through (i+1)*N - 1 to Ci.
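A minimal sketch of this range assignment, assuming the partitions and consumers are already sorted lists; it mirrors the description above rather than Kafka's actual rebalance code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RangeAssignmentSketch {
    /** Consumer Ci gets partitions [i*N, (i+1)*N) where N = ceil(|PT| / |CG|). */
    static Map<String, List<Integer>> assign(List<Integer> sortedPartitions, List<String> sortedConsumers) {
        int n = (int) Math.ceil((double) sortedPartitions.size() / sortedConsumers.size());
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (int i = 0; i < sortedConsumers.size(); i++) {
            int from = Math.min(i * n, sortedPartitions.size());
            int to = Math.min((i + 1) * n, sortedPartitions.size());
            assignment.put(sortedConsumers.get(i), new ArrayList<>(sortedPartitions.subList(from, to)));
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 5 partitions, 2 consumers -> N = 3: C0 gets [0, 1, 2], C1 gets [3, 4]
        System.out.println(assign(List.of(0, 1, 2, 3, 4), List.of("C0", "C1")));
    }
}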
In version 0.8.*, each consumer only adjusts the partitions that it consumes itself. To keep the whole consumer group consistent, when one consumer triggers a rebalance, every other consumer in the group must rebalance as well. This leads to the following problems:
1. Herd effect: any addition or removal of a broker or consumer triggers a rebalance of all consumers.
2. Split brain: each consumer determines on its own, via ZooKeeper, which brokers and consumers are down, and different consumers may see different views from ZooKeeper at the same moment (a consequence of ZooKeeper's characteristics), leading to incorrect rebalance attempts.
3. The outcome is uncontrollable: no consumer knows whether the rebalance of the other consumers succeeded, which may leave Kafka in an incorrect state.
Because of these problems, the Kafka designers introduced a central coordinator to control consumer rebalancing in version 0.9.*, and, for simplicity and verifiability, further planned to implement the assignment logic in the consumer client.
3. Notes
3.1 the problem that the producer cannot send messages
At first, a Kafka pseudo-cluster was built on the local machine, and the local producer client published messages to the broker successfully. Then a Kafka cluster was set up on a server and connected to from the local machine, but the producer could not publish messages to the broker (strangely, without reporting any error). The first suspicion was that iptables was blocking the port, so the port was opened, but that did not help (a lot of time was then spent chasing code problems, version problems, and so on). In the end there was nothing left to do but go through server.properties item by item, which turned up the following two settings:
# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = security_protocol://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured. Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092
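These two settings explain the symptom: with advertised.listeners commented out, remote producers are told to connect to whatever java.net.InetAddress.getCanonicalHostName() returns on the broker, which is often not resolvable or reachable from other machines. A hedged example of the usual fix (the host name below is a placeholder, not taken from the original):

# Advertise an address that remote clients can actually resolve and reach
advertised.listeners=PLAINTEXT://your.server.public.hostname:9092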
That is all of the Kafka knowledge points shared here. I hope the content above is helpful to you and helps you learn more. If you find the article good, share it so that more people can see it.