Kafka distributed cluster

2025-04-02 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

I. Brief introduction

1. Message transmission process

Kafka is a distributed, partitioned, replicated commit log service. It provides features similar to JMS, but its design and implementation are completely different, and it is not an implementation of the JMS specification. Kafka classifies messages by Topic when saving them; the message sender is called the Producer and the message receiver is called the Consumer. In addition, a kafka cluster is composed of multiple kafka instances, and each instance (server) is called a broker. The kafka cluster, producers, and consumers all rely on zookeeper, which holds some meta information, to guarantee system availability.

The Producer sends messages to the Kafka cluster. Before sending, it classifies the messages, i.e. assigns them a Topic. The figure shows two producers sending messages classified as topic1 and another sending messages of topic2.

The Topic is the subject. Messages can be classified by specifying a topic, so consumers can focus only on the messages in the Topics they need.

The Consumer constantly pulls messages from the cluster over a long-lived connection to the kafka cluster and then processes them.

2. Topics/logs

A Topic can be thought of as a class of messages. Each topic is divided into multiple partitions, and at the storage level each partition is an append-only log file. Any message published to a partition is appended directly to the end of its log file. The position of each message in the file is called its offset, a long number that uniquely marks one message. Kafka provides no additional indexing mechanism for offsets, because there are few "random reads and writes" of messages in kafka.
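As a mental model of the append-only log described above, here is a minimal Python sketch (the `PartitionLog` class is hypothetical, not Kafka's actual storage code): a message's offset is simply its position in the log, which is why no extra index is needed.

```python
# Minimal sketch of a partition as an append-only log (hypothetical
# PartitionLog class, not Kafka's storage code).
class PartitionLog:
    def __init__(self):
        self.messages = []

    def append(self, message):
        """Append to the end of the log; the offset is the position."""
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the newly written message

    def read(self, offset):
        """Positional read by offset; no extra index is required."""
        return self.messages[offset]

log = PartitionLog()
first = log.append("m0")   # offset 0
second = log.append("m1")  # offset 1
```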

When it comes to kafka storage, we have to mention partitions. When creating a topic, you can specify the number of partitions at the same time: more partitions give greater throughput but require more resources, which also leads to higher unavailability. After receiving a message from a producer, kafka stores it in a partition chosen according to the balancing policy.

The kafka server message storage strategy is shown in the figure.

The difference between kafka and JMS (Java Message Service) implementations such as ActiveMQ is that a message is not deleted immediately even after it has been consumed. The log file is deleted after a period of time set in the broker configuration; for example, if log files are retained for 2 days, a file is deleted after 2 days regardless of whether the messages in it have been consumed. With this simple means kafka frees disk space and avoids the disk IO cost of modifying file content after message consumption.

For the consumer, the offset of consumed messages must be saved, and the consumer itself controls how the offset is saved and used. When a consumer consumes messages normally, the offset advances "linearly", i.e. messages are consumed sequentially. In fact, a consumer can consume messages in any order; it only needs to reset the offset to any value. (The offset is saved in zookeeper, see below.)
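The offset mechanics above can be sketched as follows; `SimpleConsumer` is a hypothetical illustration of the idea, not the real kafka client API: the consumer tracks its own position, advances it linearly on normal consumption, and may reset it to any value.

```python
# Hypothetical sketch of consumer-side offset control, not the real
# kafka consumer API: the consumer, not the broker, tracks its position.
class SimpleConsumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0  # saved position (held in zookeeper in this setup)

    def poll(self):
        """Consume the next message sequentially, advancing the offset linearly."""
        msg = self.log[self.offset]
        self.offset += 1
        return msg

    def seek(self, offset):
        """Reset the offset to any value to replay or skip messages."""
        self.offset = offset

log = ["m0", "m1", "m2"]
c = SimpleConsumer(log)
a = c.poll()       # "m0"
b = c.poll()       # "m1"
c.seek(0)          # rewind to the beginning
replay = c.poll()  # "m0" again
```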

The kafka cluster hardly needs to maintain any consumer or producer state information; that is saved by zookeeper. The producer and consumer client implementations are therefore very lightweight, and clients can come and go at will without additional impact on the cluster.

Partitions serve many purposes. The most fundamental is that kafka is based on file storage: partitioning spreads the log content across multiple servers so that a file never reaches the size limit of a single machine's disk, and each partition is held by its current server (kafka instance). A topic can be split into as many partitions as desired for saving/consuming messages. In addition, more partitions means more consumers can be accommodated, effectively improving the capacity for concurrent consumption. (See below for the specific principle.)

3. Distribution

Multiple partitions of a Topic are distributed across multiple servers in a kafka cluster; each server (kafka instance) is responsible for reading and writing messages in its partitions. In addition, kafka can be configured with the number of backups (replicas) each partition needs; each partition is replicated to multiple machines to improve availability.

Since partitions are replicated, the multiple backups need to be coordinated. Each partition has one server acting as "leader"; the leader handles all read and write operations, and if it fails, one of the followers takes over (becomes the new leader). Followers simply follow the leader and synchronize its messages. The leader server thus carries all the request pressure for its partition, so from the perspective of the cluster as a whole, the number of partitions determines the number of "leaders". Kafka distributes the "leaders" evenly across instances to keep overall performance stable.

Producers

The Producer publishes messages to a specified Topic, and the Producer can also decide which partition a message belongs to, for example based on "round-robin" or some other algorithm.

Consumers

In essence, kafka only supports Topics. Each consumer belongs to one consumer group; conversely, each group may contain multiple consumers. A message sent to a Topic is consumed by only one consumer in each group subscribed to that Topic.

If all consumers have the same group, this resembles the queue pattern: messages are load-balanced among the consumers.

If all consumers have different groups, this is publish-subscribe: the message is broadcast to all consumers.
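The two patterns can be illustrated with one toy dispatch function (hypothetical, not a kafka API): every group receives a copy of the message (publish-subscribe across groups), but within each group only one consumer gets it (queue semantics).

```python
# Toy dispatch (hypothetical helper, not a kafka API): one copy of the
# message per group, and within a group exactly one consumer receives it.
def dispatch(message, groups):
    """groups: {group_name: [consumers]} -> {group_name: chosen consumer}"""
    return {g: members[len(message) % len(members)]
            for g, members in groups.items()}

# Same group for everyone -> queue; distinct groups -> broadcast.
groups = {"g1": ["c1"], "g2": ["c2a", "c2b"]}
result = dispatch("hi", groups)
print(result)  # both groups get the message; g2 delivers to only one consumer
```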

In kafka, messages in a partition are consumed by only one consumer in a group, and message consumption in each group is independent of the others. We can think of a group as a "subscriber": each partition in a Topic is consumed by only one consumer within a "subscriber", but one consumer can consume messages from multiple partitions. Kafka can only guarantee that messages within a partition are consumed in order by one consumer; from the point of view of the Topic as a whole, messages are still not ordered.

Kafka's design dictates that, for one topic, the number of consumers in the same group consuming simultaneously cannot exceed the number of partitions; otherwise some consumers will receive no messages.

Guarantees

1) messages sent to a partition are appended to the log in the order in which they are received

2) for consumers, the order in which messages are consumed is the same as their order in the log.

3) if the "replication factor" of a Topic is N, then up to N-1 kafka instances are allowed to fail.

Interaction with producers

When sending messages to the kafka cluster, the producer can either send a message to a specified partition directly, or specify a balancing policy that distributes messages across partitions. If nothing is specified, the default random balancing strategy stores messages in random partitions.
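The balancing policies mentioned above can be sketched in Python; `partition_round_robin` and `partition_by_key` are hypothetical helper names, and the byte-sum "hash" is a toy stand-in for whatever hash a real producer would use.

```python
from itertools import count

# Sketch of two producer-side balancing policies; the function names and
# the byte-sum "hash" are illustrative only.
NUM_PARTITIONS = 3
_counter = count()

def partition_round_robin():
    """Spread keyless messages evenly across partitions in turn."""
    return next(_counter) % NUM_PARTITIONS

def partition_by_key(key):
    """Messages with the same key always land in the same partition."""
    return sum(key.encode()) % NUM_PARTITIONS  # toy stand-in for a real hash

rr = [partition_round_robin() for _ in range(6)]
print(rr)  # [0, 1, 2, 0, 1, 2]
```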

Interaction with consumers

When consumers consume messages, kafka uses the offset to record the current consumption position. In kafka's design, multiple different groups can consume messages under the same topic at the same time. As shown in the figure, two different groups consume simultaneously, and their consumption position offsets do not interfere with each other.

For one group, the number of consumers should not exceed the number of partitions, because within a group each partition can be bound to only one consumer; that is, a consumer can consume multiple partitions, but a partition can be consumed by only one consumer.

Therefore, if the number of consumers in a group exceeds the number of partitions, the extra consumers will not receive any messages.
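A minimal sketch of this assignment rule, assuming a simple round-robin style assignor (the helper `assign` is hypothetical): each partition is given to exactly one consumer in the group, so extra consumers end up with nothing.

```python
# Sketch of assigning partitions to consumers within one group
# (hypothetical round-robin assignor): each partition goes to exactly
# one consumer, so surplus consumers receive no partitions at all.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 2 partitions, 3 consumers in the group: consumer "c2" sits idle.
result = assign(["p0", "p1"], ["c0", "c1", "c2"])
print(result)  # {'c0': ['p0'], 'c1': ['p1'], 'c2': []}
```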

II. Use cases

1. Messaging

For conventional messaging systems kafka is a good choice, and partitions/replication and fault tolerance give kafka good scalability and performance advantages. However, we should be well aware that kafka does not provide enterprise-level JMS features such as "transactions", "message delivery guarantees (message confirmation mechanisms)" or "message grouping"; kafka is only suitable as a "conventional" messaging system, and to a certain extent it does not guarantee absolute reliability of message sending and receiving (for example, message retransmission or message loss in transit).

2. Website activity tracking

Kafka works well as a tool for website activity tracking: information such as web page views and user actions can be sent to kafka for real-time monitoring or offline statistical analysis.

3. Log Aggregation

Kafka's characteristics make it very suitable as a "log collection center". Applications can send operation logs to the kafka cluster in "batches" and "asynchronously" instead of saving them locally or in a DB; kafka can batch-commit and compress messages, at almost no cost to the producer side. On the server side, systematic storage and analysis systems such as hadoop can then be used.

III. Design principles

Kafka was originally designed as a unified information collection platform that can collect feedback information in real time; it needs to support large volumes of data and have good fault tolerance.

1. Persistence

2. Performance

3. Producer

4. Consumers

5. Message transmission mechanism

6. Copy backup

7. Log

8. Distribution

IV. Main configuration

1. Broker configuration

2. Main configuration of Consumer

3. Main configuration of Producer

V. Steps for building a kafka cluster

1. System environment

Hostname    System      Zookeeper version    IP
Master      CentOS7.4   3.4.12               192.168.56.129
Slave1      CentOS7.4   3.4.12               192.168.56.130
Slave2      CentOS7.4   3.4.12               192.168.56.131

2. Temporarily turn off the firewall and selinux

3. Software download

Download address: http://kafka.apache.org/downloads.html

Note: download the latest binary tgz package

4. Set up zookeeper cluster

Note: refer to the previous article.

5. Kafka cluster

5.1. Upload kafka to /home on each server of the zookeeper cluster above

5.2. Decompression

[root@master home]# tar -zxvf kafka_2.12-2.0.0.tgz

[root@master home]# mv kafka_2.12-2.0.0 kafka01

5.3. Configuration file

[root@master home]# cd /home/kafka01/config/

Note: broker.id, log.dirs and zookeeper.connect in the server.properties file must be modified to match the actual environment; other items can be changed as needed. The configuration for master is as follows:

broker.id=1
port=9091
num.network.threads=2
num.io.threads=2
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/var/log/kafka/kafka-logs
num.partitions=2
log.flush.interval.messages=10000
log.flush.interval.ms=1000
log.retention.hours=168
# log.retention.bytes=1073741824
log.segment.bytes=536870912
num.replica.fetchers=2
log.cleanup.interval.mins=10
zookeeper.connect=192.168.56.129:2181,192.168.56.130:2181,192.168.56.131:2181
zookeeper.connection.timeout.ms=1000000
kafka.metrics.polling.interval.secs=5
kafka.metrics.reporters=kafka.metrics.KafkaCSVMetricsReporter
kafka.csv.metrics.dir=/tmp/kafka_metrics
kafka.csv.metrics.reporter.enabled=false

5.4. Start the service (master), provided that the zookeeper on all three nodes has already been started

[root@master kafka01]# ./bin/kafka-server-start.sh config/server.properties &

Add:

Problem: & runs the program in the background, but once the ssh terminal disconnects, the background Java program is terminated as well.

Solution: start with a shell script

[root@master kafka01]# cat start.sh
#!/bin/bash
cd /home/kafka01/
./bin/kafka-server-start.sh config/server.properties &
exit

Grant execute permission, then just run it.

[root@master kafka01]# chmod +x start.sh

5.5. Configure slave1 and slave2

The slave1 configuration is as follows:

broker.id=2
port=9092
log.dirs=/var/log/kafka
zookeeper.connect=192.168.56.129:2181,192.168.56.130:2181,192.168.56.131:2181

Just start it.

The slave2 configuration is as follows:

broker.id=3
port=9093
log.dirs=/var/log/kafka
zookeeper.connect=192.168.56.129:2181,192.168.56.130:2181,192.168.56.131:2181

Just start it.

6. Testing

Kafka manages data of the same kind through topics; using one topic for each kind of data makes processing more convenient.

6.1.Create a Topic

[root@master kafka01]# bin/kafka-topics.sh --create --zookeeper 192.168.56.129:2181 --replication-factor 1 --partitions 1 --topic test

View

[root@master kafka01]# bin/kafka-topics.sh --list --zookeeper 192.168.56.129:2181

6.2. Create a message consumer

[root@master kafka01]# bin/kafka-console-consumer.sh --bootstrap-server 192.168.56.129:9091 --topic test --from-beginning

After the consumer is created, nothing is printed here yet because no data has been sent.

But don't worry, don't close the terminal, open a new terminal, and then let's create the first message producer.

6.3. Create a message producer

Open a new terminal in the kafka decompression directory and type

[root@master kafka01]# bin/kafka-console-producer.sh --broker-list 192.168.56.129:9091 --topic test

After sending the message, we can go back to our message consumer terminal and see that the message we just sent has been printed out in the terminal.

View the topic in Zookeeper (see figure).

That's it. Here's to common progress!
