
How to learn Kafka

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Many newcomers are unsure how to go about learning Kafka. To help with that, this article explains the fundamentals in detail; anyone with this need is welcome to follow along, and I hope you gain something from it.

Apache Kafka

I. Message queuing classification

1.1 Point-to-point

The message producer produces a message and sends it to a queue; a message consumer then takes the message from the queue and consumes it.

Note:

1. Once a message has been consumed, it is no longer stored in the queue, so a consumer cannot consume a message that has already been consumed.

2. A queue supports multiple consumers, but each individual message can be consumed by only one of them.

1.2 Publish / subscribe

The message producer (publisher) publishes a message to a topic, and multiple message consumers (subscribers) consume it. Unlike point-to-point, a message published to a topic is consumed by all subscribers.
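The difference between the two models can be sketched with a minimal in-memory illustration (plain Python, not Kafka's implementation; the class names are made up):

```python
from collections import deque

class PointToPointQueue:
    """Each message is delivered to exactly one consumer, then removed."""
    def __init__(self):
        self._q = deque()

    def send(self, msg):
        self._q.append(msg)

    def receive(self):
        # Consuming removes the message; no consumer can see it again.
        return self._q.popleft() if self._q else None

class PubSubTopic:
    """Every subscriber receives its own copy of each published message."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self):
        inbox = deque()
        self._subscribers.append(inbox)
        return inbox

    def publish(self, msg):
        for inbox in self._subscribers:
            inbox.append(msg)
```

With a point-to-point queue, a second `receive()` after the only message was taken returns nothing; with a topic, every subscriber's inbox holds its own copy.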

2. Message queue comparison

2.1 RabbitMQ

Supports many protocols; a fairly heavyweight message queue with good support for routing, load balancing, and data persistence.

2.2 ZeroMQ

Billed as the fastest message queue system, aimed especially at high-throughput scenarios. It excels at advanced and complex queues, but the technology is correspondingly complex, and it provides only non-persistent queues.

2.3 ActiveMQ

A subproject of Apache. Similar in spirit to ZeroMQ, it implements queues using broker and peer-to-peer techniques.

2.4 Redis

A key-value NoSQL database that also supports MQ functionality. For small payloads its performance beats RabbitMQ, but once messages exceed about 10 KB it becomes unbearably slow.

III. Introduction to Kafka

3.1 Introduction

Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn, written in Scala, and later became an Apache project. Kafka is a distributed, partitioned, multi-subscriber, replicated persistent log service, mainly used to process active streaming data.

3.2 Kafka features

1. High throughput for both publish and subscribe. Reportedly, Kafka can produce about 250,000 messages per second (50 MB/s) and consume about 550,000 messages per second (110 MB/s).

2. Persistence. Messages are persisted to disk, so they can be used for batch consumption (such as ETL) as well as by real-time applications. Persisting data to disk, combined with replication, prevents data loss.

3. A distributed system that is easy to scale out. Producers, brokers, and consumers all run as multiple distributed instances, and machines can be added without downtime.

4. The processing state of a message is maintained on the client side, not on the server side.

5. Supports both online and offline scenarios.

IV. Kafka architecture

Important note:

1. In Kafka there is no standalone Consumer: a Consumer always exists inside a Consumer Group, and a Consumer Group contains multiple Consumers.

2. A Consumer Group can be regarded as one virtual consumer that consumes the data of a particular Topic, but the actual work is carried out by the Consumers inside the group. The group is a logical concept; what really exists are the Consumers within it, and each Consumer in the group corresponds to one or more Partitions of the Topic.

3. What do the multiple Consumers in a Consumer Group correspond to?

They correspond to the different Partitions of the Topic: the data in one Partition is processed by one Consumer, and the data in another Partition by another Consumer in the same group. This achieves parallel consumption (each Consumer handles the data of its assigned Partitions).

4. Why does Kafka have the concept of a Partition?

The advantage is faster processing: different Consumers consume messages from different Partitions, so data consumption becomes parallel.
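The partition-to-consumer relationship described above can be sketched as a simplified assignment function (a hypothetical round-robin assignor, not Kafka's actual rebalancing protocol):

```python
def assign_partitions(partitions, consumers):
    """Assign each partition of a topic to exactly one consumer in the
    group, round-robin. Consumers beyond the partition count stay idle."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # Partition i goes to consumer i mod (number of consumers).
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

With four partitions and two consumers, each consumer handles two partitions in parallel; with more consumers than partitions, the extra consumers receive nothing, which previews the rule in section 5.9.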

5. Core concepts of Kafka

5.1 Producer

Refers specifically to the producer of messages.

5.2 Consumer

Refers specifically to the consumer of messages.

5.3 Consumer Group

A consumer group, which can consume the messages of a Topic's Partitions in parallel.

5.4 Broker

Caching proxy: the one or more servers in a Kafka cluster are collectively called brokers.

1. Messages are persisted in the broker by appending to a log, and are partitioned.

2. To reduce the number of disk writes, the broker buffers messages in memory and flushes them to disk only when their count (or size) reaches a threshold, reducing the number of disk I/O calls.

3. A broker has no recovery mechanism of its own: once a broker goes down, the messages on that broker become unavailable (but since messages have replicas, the replicas can be synchronized to other brokers).

4. The broker does not save subscriber state; subscribers save it themselves.

5. This statelessness makes it hard to know when a message can be deleted (a message marked for deletion might still be being consumed), so Kafka uses a time-based SLA (service level agreement): a message is deleted after a fixed period (typically 7 days).

6. A message subscriber can rewind to any position and re-consume. When a subscriber fails, it can choose the smallest offset (id) and re-read messages from there.
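The buffering behavior in point 2 can be sketched as follows (a toy Python model in which a list stands in for the disk; the threshold value is illustrative):

```python
class BufferedLog:
    """Messages accumulate in an in-memory buffer and are flushed to
    'disk' only once a count threshold is reached, so many appends
    cost only a few write calls."""
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.buffer = []
        self.disk = []          # stands in for the on-disk log file
        self.flush_calls = 0    # number of simulated disk writes

    def append(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.buffer:
            self.disk.extend(self.buffer)
            self.buffer.clear()
            self.flush_calls += 1
```

Appending four messages with a threshold of three triggers only one disk write; the fourth message waits in the buffer for the next flush.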

5.5 Topic

Refers specifically to the different categories of message feeds handled by Kafka.

5.6 Partition

1. A physical grouping of a Topic: one topic can be divided into multiple partitions, and each partition is an ordered queue. Each message in a partition is assigned an ordered id, the offset.

2. Purpose of Kafka's partitions:

2.1 Kafka is based on file storage. Partitioning lets the log content be spread across multiple servers, preventing any single file from growing to the disk's size limit; each partition is held by the server (Kafka instance) that owns it.

2.2 A topic can be split into any number of partitions, improving the efficiency of message storage and consumption.

2.3 More partitions mean more consumers can be accommodated, effectively raising the capacity for concurrent consumption.

5.7 Message

1. A message is the basic unit of communication. Each producer can publish messages to a topic.

2. Messages in Kafka are organized with the topic as the basic unit, and different topics are independent of each other. Each topic has several partitions (the number is specified when the topic is created), and each partition stores a portion of the messages.

3. Each Message in a partition contains the following three attributes:

offset (long): the position of the message in the partition

MessageSize (int32): the size of the message

data: the specific content of the message

5.8 Producers

Producers are the message and data producers: the processes that publish messages to a Kafka topic.

1. A producer publishes messages to a specified topic, and it can also decide which partition each message belongs to, for example via round-robin or some other algorithm.

2. Asynchronous sending: batching can greatly improve sending efficiency. Kafka's asynchronous producer mode supports batch sending: messages are first cached in memory and then sent out in one request as a batch.
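The partition-selection logic in point 1 can be sketched as follows (a hypothetical partitioner, not Kafka's built-in one: keyed messages hash to a fixed partition, keyless messages go round-robin):

```python
import zlib

class PartitionerSketch:
    """Illustrative producer-side partition selection."""
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = 0  # drives round-robin for keyless messages

    def partition(self, key=None):
        if key is not None:
            # The same key always hashes to the same partition.
            return zlib.crc32(key.encode()) % self.num_partitions
        # No key: cycle through partitions in order.
        p = self._counter % self.num_partitions
        self._counter += 1
        return p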

5.9 Consumers

1. Consumers are the message and data consumers: the processes that subscribe to topics and process the published messages.

2. In Kafka, a group can be regarded as one "subscriber". Each partition of a Topic is consumed by only one consumer within a given group, though one consumer may consume messages from multiple partitions (when there are fewer consumers than partitions).

3. Note: Kafka's design means that for one topic, the number of consumers in the same group consuming simultaneously cannot usefully exceed the number of partitions; otherwise some consumers will receive no messages.

6. Persistence in Kafka

6.1 Data persistence

1. It turns out that linear (sequential) access to disk is often much faster than random access to memory.

2. Traditionally, memory is used as a cache for disk.

3. Kafka instead writes data directly to the log file.

6.2 Log data persistence characteristics

1. Write operations: append data to the file.

2. Read operations: simply read from the file.

6.3 Advantages

1. Read operations do not block writes or other operations, and the data size does not affect performance.

2. Disk capacity is effectively unlimited (relative to memory), so a complete message system can be built on it.

3. Linear disk access is fast, and data can be retained for any period of time.

6.4 implementation of persistence

1. A Topic can be thought of as one class of message. Each topic is divided into multiple partitions, and at the storage level each partition is an append-only log file. Any message published to a partition is appended to the end of its log file. Each message's position in the file is called its offset, and partitions are stored as files in the file system.

2. Per the broker's configuration, log files are retained for a fixed period and then deleted to free disk space (7 days by default).

Description: a Partition is a physical grouping of a Topic. A topic can be divided into multiple partitions, each an ordered queue, and each message in a partition is assigned an ordered id, the offset.

6.5 Index

Index data files use sparse storage: an index entry is created only at intervals rather than for every message. The original article shows a schematic diagram of a partition's index here.

Note:

1. Suppose index entries exist at offsets 1, 3, 6, 8. To find offset 7, first locate the bracketing entries 6 and 8, then binary-search between those two index positions to find 7.

2. Log files are also split into segments, dividing and conquering the data.
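The lookup in note 1 can be sketched as follows (a minimal Python illustration of finding the bracketing index entries; the offsets are the example values above):

```python
import bisect

def locate(sparse_index, target):
    """Given a sorted list of indexed offsets, return the (low, high)
    entries that bracket `target`; a binary search between them then
    finds the exact message. None means no bound on that side."""
    i = bisect.bisect_right(sparse_index, target)
    low = sparse_index[i - 1] if i > 0 else None
    high = sparse_index[i] if i < len(sparse_index) else None
    return low, high
```

For the example index [1, 3, 6, 8], looking up 7 narrows the search to the range between entries 6 and 8, so only a small slice of the log needs scanning.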

7. Distributed implementation of Kafka

7.1 Kafka distributed architecture diagram

Note:

1. When a producer sends messages to Kafka, it notifies ZooKeeper; ZooKeeper watches the relevant nodes, and when the watched data changes it notifies consumers to consume the messages.

2. Consumers actively pull messages from Kafka. This reduces the load on the broker, because messages on the broker are stateless and the broker does not know which messages may be consumed.

3. When a consumer consumes a message, it also notifies ZooKeeper. ZooKeeper records what has been consumed so that the system can recover after a failure and know which messages were already consumed.

7.2 deployment architecture diagram for production environment

Description:

1. The Name Server cluster refers to the ZooKeeper cluster.

8. Kafka communication protocol

8.1 Brief introduction

1. Kafka's communication protocol is mainly the protocol the consumer uses to pull data.

2. Kafka's Producer, Broker, and Consumer use a self-designed protocol on top of TCP, customized to the business requirements, rather than a general serialization framework such as Protocol Buffers.

3. Basic data types:

3.1 Fixed-length types: int8, int16, int32, and int64, corresponding to byte, short, int, and long in Java.

3.2 Variable-length types: bytes and string. A variable-length value consists of two parts: a signed integer N (the length of the content) followed by N bytes of content. N = -1 means the content is null. The length of bytes is given as an int32; the length of a string is given as an int16.

3.3 Arrays: an array consists of two parts: a length N given as an int32, followed by N elements.
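The variable-length string encoding in 3.2 can be sketched as follows (a minimal Python illustration of the int16-length-prefix rule, with N = -1 for null):

```python
import struct

def encode_string(s):
    """Kafka-style `string`: big-endian int16 length N, then N bytes;
    N == -1 marks a null value."""
    if s is None:
        return struct.pack(">h", -1)
    data = s.encode("utf-8")
    return struct.pack(">h", len(data)) + data

def decode_string(buf, pos=0):
    """Return (value, next_position) decoded from `buf` at `pos`."""
    (n,) = struct.unpack_from(">h", buf, pos)
    pos += 2
    if n == -1:
        return None, pos
    return buf[pos:pos + n].decode("utf-8"), pos + n
```

A `bytes` field works the same way but with an int32 length prefix (`">i"` in `struct` terms).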

8.2 detailed description of communication protocol

1. The basic unit of Kafka communication is the Request/Response.

2. Basic structure:

RequestOrResponse -> MessageSize (RequestMessage | ResponseMessage)

ApiKey (int16): identifies the API number of the request.

ApiVersion (int16): identifies the API version of the request; carrying a version makes backward compatibility possible.

CorrelationId (int32): a number chosen by the client to uniquely identify the request; the server writes the same CorrelationId into the Response so the client can match responses to requests.

ClientId (string): a client-supplied string describing the client, used for logging and monitoring; uniquely identifies a client.

Request (-): the specific content of the request.

3. Communication process:

3.1 The client opens a socket to the server.

3.2 The client writes an int32 to the socket, indicating how many bytes of Request follow.

3.3 The server first reads an int32 to learn the size of this Request.

3.4 The server then reads that many bytes to obtain the Request content.

3.5 After processing the request, the server sends the Response in the same way.
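The length-prefixed framing in steps 3.1-3.5 can be sketched as follows (a minimal Python illustration over byte strings rather than a real socket):

```python
import struct

def frame_request(payload: bytes) -> bytes:
    """Client side (step 3.2): prepend a big-endian int32 size header."""
    return struct.pack(">i", len(payload)) + payload

def read_request(stream: bytes):
    """Server side (steps 3.3-3.4): read the int32 size, then that many
    bytes of content. Returns (payload, bytes_consumed) so the caller
    can continue with the next frame in the stream."""
    (size,) = struct.unpack_from(">i", stream, 0)
    payload = stream[4:4 + size]
    return payload, 4 + size
```

Because every frame carries its own size, several requests can be concatenated on one connection and peeled off one at a time.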

4. RequestMessage structure:

4.1 RequestMessage -> ApiKey ApiVersion CorrelationId ClientId Request

MessageSize (int32): the length of the RequestMessage or ResponseMessage.

RequestMessage / ResponseMessage (-): the content of the Request or Response.

5. ResponseMessage structure:

5.1 ResponseMessage -> CorrelationId Response

CorrelationId (int32): matches the CorrelationId of the corresponding Request.

Response (-): the Response to the Request; its fields differ for different Request types.

Kafka adopts the classic Reactor (synchronous I/O) pattern: one Acceptor responds to client connection requests, and N Processors read the data. This pattern supports building a high-performance server.

6. Message: the message a Producer produces, a key-value pair.

6.1 Message -> Crc MagicByte Attributes Key Value

Crc (int32): CRC checksum of the message (excluding the CRC field itself).

MagicByte (int8): version of the message format, used for backward compatibility; the current value is 0.

Attributes (int8): metadata of the message; currently the lowest two bits identify the compression format.

Key (bytes): the key of the message; may be null.

Value (bytes): the value of the message. Kafka supports message nesting, i.e. one message can be placed inside another as its Value.

Description:

The CRC is a message checksum. After the Consumer receives the data, it recomputes the CRC over the message content and compares it with the stored value; if they do not match, the message is rejected. This check guards against messages being corrupted or lost in transit.
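The check described above can be sketched as follows (a minimal Python illustration with a hypothetical layout of an int32 CRC followed directly by the body; real Kafka messages carry more fields):

```python
import struct
import zlib

def pack_message(value: bytes) -> bytes:
    """Store a CRC32 over the body, then the body itself."""
    return struct.pack(">I", zlib.crc32(value)) + value

def verify(message: bytes) -> bool:
    """Consumer-side check: recompute the CRC over the body and
    compare it with the stored value."""
    (stored,) = struct.unpack_from(">I", message, 0)
    return stored == zlib.crc32(message[4:])
```

Flipping even one bit of the body changes the recomputed CRC, so the consumer can detect in-transit corruption and discard the message.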

7. MessageSet: combines multiple Messages, adding an Offset and MessageSize to each Message.

7.1 MessageSet -> [Offset MessageSize Message]

Offset (int64): used as the sequence number in the log; the producer does not know the final value when producing the message and can fill in any number.

MessageSize (int32): the size of the Message.

Message (-): the specific content of the Message; for its format, see the Message section above.

8. The relationship between Request/Response and Message/MessageSet:

8.1 Request/Response is the communication-layer structure; in terms of the seven-layer network model it is analogous to TCP.

8.2 Message/MessageSet defines the business-layer structure, analogous to HTTP in the seven-layer model. Message/MessageSet is simply a data structure inside the payload of a Request/Response.

Note: Kafka's communication protocol carries no schema, and the format is relatively simple. The advantage of this design is that the protocol's own overhead is small. In addition, multiple Messages can be compressed together, increasing the compression ratio, so less data is transmitted over the network.

IX. Transactional semantics of data transmission

1. At most once: similar to a "non-persistent" message in JMS. Once sent, whether it succeeds or fails, it is never retransmitted.

The consumer fetches messages, saves the offset, and then processes them. If the client has saved the offset but an exception occurs during processing, some messages can never be processed: the next fetch starts after the saved offset, so the "unprocessed" messages are never fetched again. This is "at most once".

2. At least once: a message is delivered at least once. If it is not received successfully, it may be resent until it is.

The consumer fetches messages, processes them, and then saves the offset. If processing succeeds but a ZooKeeper exception during the offset-save phase prevents the save from completing, the next fetch will return messages that were already processed. This is "at least once": the offset was never committed, so ZooKeeper does not advance to the new position.

Note: "at least once" is usually the first choice (compared with "at most once", receiving data twice is better than losing it).

3. Exactly once: a message is delivered exactly once.

Kafka has no strict implementation of this (it would rest on two-phase commit transactions), and the article argues this strategy is not necessary in Kafka.
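The difference between "at most once" and "at least once" comes down to whether the offset is saved before or after processing, which can be sketched as follows (a toy Python simulation; a raised exception stands in for a crash):

```python
def consume(messages, process, save_offset_first):
    """Process messages, committing the offset either before processing
    (at most once) or after (at least once). Returns the successfully
    processed messages and the last saved offset; on a crash, a restart
    would resume reading at that offset."""
    processed, offset = [], 0
    for i, msg in enumerate(messages):
        try:
            if save_offset_first:
                offset = i + 1          # commit, then process: at most once
                process(msg)
            else:
                process(msg)
                offset = i + 1          # process, then commit: at least once
            processed.append(msg)
        except RuntimeError:
            break                       # simulated crash mid-message
    return processed, offset
```

If processing "m2" crashes: with commit-first, the saved offset already points past "m2", so it is lost on restart; with commit-after, the offset still points at "m2", so it is re-read (and possibly processed twice).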

10. Kafka installation

1. Download Kafka and upload it to the server.

2. Extract it and move it to the /usr/local directory.

3. Start the services:

3.1 Start the ZooKeeper service:

./zookeeper-server-start.sh ../config/zookeeper.properties > /dev/null 2>&1 &

3.2 Start the Kafka service:

./kafka-server-start.sh ../config/server.properties > /dev/null 2>&1 &

3.3 Create a topic:

./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

3.4 List topics:

./kafka-topics.sh --list --zookeeper localhost:2181

3.5 View topic details:

./kafka-topics.sh --describe --zookeeper localhost:2181 --topic test

3.6 Delete a topic:

./kafka-run-class.sh kafka.admin.TopicCommand --delete --topic test --zookeeper 192.168.31.220:2181

11. Kafka client operations

11.1 Create a producer

./kafka-console-producer.sh --broker-list localhost:9092 --topic test1

11.2 Create a consumer

./kafka-console-consumer.sh --zookeeper localhost:2181 --topic test1 --from-beginning

11.3 Viewing parameter help

View producer parameters: ./kafka-console-producer.sh

View consumer parameters: ./kafka-console-consumer.sh

I hope the content above was helpful. If you would like to learn more or read more related articles, please follow the industry information channel; thank you for your support.
