This article introduces the basics of getting started with Kafka: what Kafka is, its core APIs, and key concepts such as topics, partitions, brokers, replicas, and offsets.
1.1.1 Kafka Introduction

Kafka was originally developed at LinkedIn. It is a distributed, partitioned, multi-replica, multi-producer, multi-subscriber log system coordinated through ZooKeeper (it can also be used as an MQ system), and is commonly used for web/nginx logs, access logs, message services, and so on. LinkedIn contributed it to the Apache Foundation in 2010, and it became a top-level open source project. Its main application scenarios are log collection systems and messaging systems.

The main design goals of Kafka are as follows:
1. Provide message persistence with O(1) time complexity, guaranteeing constant-time access performance even for terabytes of data.
2. High throughput: support transmitting 100K messages per second on a single machine, even on very cheap commodity hardware.
3. Support message partitioning across Kafka servers and distributed consumption, while guaranteeing the order of messages within each partition.
4. Support both offline and real-time data processing.
5. Support online horizontal scaling.
Messaging has two main models: point-to-point and publish-subscribe. Most messaging systems use the publish-subscribe model, and Kafka is a publish-subscribe system.
For message middleware, there are two delivery modes: push and pull. Kafka only supports pull, not push: the broker never pushes messages to consumers; instead, consumers fetch messages by polling, as the sketch below shows.
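A minimal sketch of the pull model with the Kafka Java client; the broker address, group id, and topic name are placeholder values:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PullDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "demo-group");               // placeholder group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));     // placeholder topic
            while (true) {
                // The consumer actively pulls; the broker never pushes.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }
}
```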
Each record consists of a key, a value, and a timestamp.
A Kafka cluster organizes records by topic. A topic can have multiple partitions, and a partition can have multiple replica partitions.
Kafka runs as a cluster on one or more servers that can span multiple data centers.
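To make the record structure concrete, here is a minimal producer sketch with the Java client that sets the key, value, and timestamp explicitly; the broker address and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A record carries a key, a value, and a timestamp (explicit here; Kafka
            // assigns one automatically if omitted).
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "demo-topic",               // placeholder topic
                    null,                       // partition: let Kafka choose
                    System.currentTimeMillis(), // timestamp
                    "user-42",                  // key
                    "page-view");               // value
            producer.send(record);
        }
    }
}
```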
Kafka has four core APIs:
Producer API: allows applications to publish a stream of records to one or more Kafka topics.
Consumer API: allows an application to subscribe to one or more topics and process the stream of records produced to them.
Streams API: allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams (see the sketch after this list).
Connector API: allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector for a relational database might capture every change to a table.
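As an example of the Streams API, a minimal sketch that uppercases every value flowing from one topic to another; the application id, broker address, and topic names are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume an input stream, transform each value, produce an output stream.
        KStream<String, String> input = builder.stream("input-topic");  // placeholder
        input.mapValues(v -> v.toUpperCase()).to("output-topic");       // placeholder
        new KafkaStreams(builder.build(), props).start();
    }
}
```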
1.1.2 Kafka Advantages

1. High throughput: a single machine can process tens of millions of messages per second, and performance stays stable even with many terabytes of messages stored.
2. High performance: a single node can serve thousands of clients while guaranteeing zero downtime and zero data loss.
3. Durable storage: messages are persisted to disk, and data loss is prevented through persistence and replication. This is achieved through (1) zero copy, (2) sequential reads and writes, and (3) the Linux page cache.
4. Distributed and easy to scale out: producers, brokers, and consumers all run as multiple distributed instances, and machines can be added without downtime. Different producers and consumers may belong to different applications.
5. Reliability: Kafka is distributed, partitioned, replicated, and fault tolerant.
6. Client-side state maintenance: the processing state of messages is maintained on the consumer side, not the server side, and consumers can rebalance automatically when one fails.
7. Support for both online and offline scenarios.
8. Multiple client languages: Kafka supports Java, .NET, PHP, Python, and other languages.

1.1.3 Kafka Application Scenarios

Log collection: a company can use Kafka to collect logs from various services and expose them to consumers through a unified interface.
Messaging system: decoupling producers from consumers, buffering messages, and so on.
User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. Servers publish this activity to Kafka topics, and consumers subscribe to those topics for real-time monitoring and analysis; the data can also be saved to a database.
Operational metrics: Kafka is also used to record operational monitoring data, collecting metrics from distributed applications and producing centralized feedback such as alerts and reports.
Stream processing: for example, with Spark Streaming and Storm.

1.1.4 Basic Architecture

Messages and batches. The unit of data in Kafka is the message. Think of a message as a "row" or a "record" in a database; the message itself is an array of bytes. A message can have a key, which is also a byte array; keys are used when messages need to be written to particular partitions in a controlled way. For efficiency, messages are written to Kafka in batches. A batch is a set of messages belonging to the same topic and partition. Batching reduces network overhead: the larger the batch, the more messages processed per unit of time, but the longer the delivery latency of an individual message. Batches are also compressed, which improves transmission and storage efficiency at the cost of extra CPU work (see the producer configuration sketch below).

Schemas. Many schema options are available that are easy to understand, such as JSON and XML, but these lack strong typing. Many Kafka developers prefer Apache Avro. Avro provides a compact serialization format in which the schema is separate from the message body, so code does not need to be regenerated when the schema changes; it also supports strong typing and schema evolution, with versions that are both forward and backward compatible. Consistent data formats matter to Kafka because they remove the coupling between message read and write operations.
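A hedged sketch of the standard producer settings that control batching and compression; batch.size, linger.ms, and compression.type are real Kafka producer configs, but the values here are illustrative only:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", 32768);       // max bytes per batch: fewer, larger requests
        props.put("linger.ms", 10);           // wait up to 10 ms to fill a batch before sending
        props.put("compression.type", "lz4"); // whole batches are compressed together

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // Records for the same topic-partition accumulate into batches.
                producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "value-" + i));
            }
        }
    }
}
```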
Topics and partitions. Messages in Kafka are classified by topic. A topic is comparable to a table in a database or a folder in a file system. A topic can be divided into several partitions; through these partitions a topic is distributed across the Kafka cluster, providing the ability to scale out.
Producers create messages, and consumers consume them. A message is published to a specific topic. By default, the producer distributes messages evenly across all partitions of a topic; the partition for a message is chosen in one of three ways:
1. Directly specify the message's partition.
2. Derive the partition by hashing the message key modulo the partition count (a simplified sketch follows below).
3. Poll the partitions in round-robin order.
Consumers use offsets to distinguish messages they have already read and consume from there. Consumers belong to consumer groups, and a consumer group guarantees that each partition is consumed by only one consumer in the group, avoiding duplicate consumption.
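A simplified sketch of the key-hash strategy; Kafka's default Java partitioner actually uses a murmur2 hash, so this plain hashCode version is illustrative only:

```java
// Illustrative only: choose a partition from a key with hash-modulo
// (Kafka's real default partitioner uses murmur2, not hashCode).
public class KeyHashPartitioning {
    static int choosePartition(String key, int numPartitions) {
        // Mask the sign bit so the result is non-negative, then take the modulo.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 6; // assumed partition count
        System.out.println(choosePartition("user-42", partitions)); // same key ->
        System.out.println(choosePartition("user-42", partitions)); // same partition
        System.out.println(choosePartition("user-99", partitions));
    }
}
```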
Broker and Cluster
A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to disk for storage. The broker also serves consumers, responding to partition read requests with messages that have been committed to disk. A single broker can easily handle thousands of partitions and millions of messages per second.
Each cluster has one broker that acts as the cluster controller, elected automatically from the active members of the cluster.
The controller is responsible for administrative work, including assigning partitions to brokers and monitoring brokers.
Within the cluster, each partition belongs to one broker, which is called the leader of that partition.
A partition can be assigned to multiple brokers, in which case partition replication takes place.
Replication provides message redundancy and high availability; the replica (follower) partitions do not handle reads or writes of messages.
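To see leaders and replicas in practice, the Java AdminClient can describe a topic. A sketch assuming a recent client version (allTopicNames() dates from roughly Kafka 3.1) and placeholder broker/topic names:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("demo-topic")) // placeholder
                    .allTopicNames().get().get("demo-topic");
            // Each partition reports its leader broker, full replica set, and ISR.
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```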
1.1.5 Core Concepts

1.1.5.1 Producer

Producers create messages. This role publishes messages to Kafka topics. When a broker receives a message from a producer, it appends the message to the segment file currently used for appending data. In general, a message is published to a specific topic:
1. By default, messages are distributed evenly across all partitions of the topic by polling.
2. In some cases, the producer writes the message directly to a specified partition. This is usually done with the message key and a partitioner: the partitioner hashes the key and maps it to a partition, which guarantees that messages with the same key are written to the same partition.
3. Producers can also use a custom partitioner to map messages to partitions according to business rules (a sketch follows below).

1.1.5.2 Consumer

Consumers read messages:
1. A consumer subscribes to one or more topics and reads messages in the order they were produced.
2. The consumer distinguishes messages it has already read by checking their offsets. An offset is another kind of metadata: a monotonically increasing integer that Kafka assigns to each message as it is created. Within a given partition, each message's offset is unique. The consumer saves the last offset read for each partition in ZooKeeper or in Kafka itself, so its read position is not lost if it shuts down or restarts.
3. Consumers belong to consumer groups. A group guarantees that each partition is consumed by only one consumer in the group.
4. If one consumer fails, the other consumers in the group can take over its work by rebalancing the partition assignments.
1.1.5.3 Broker

A standalone Kafka server is called a broker. The broker serves consumers, responding to partition read requests with messages that have been committed to disk.
1. If a topic has N partitions and the cluster has exactly N brokers, then each broker stores one partition of that topic.
2. If a topic has N partitions and the cluster has N+M brokers, then N brokers each store one partition of that topic and the remaining M brokers store no partitions of it.
3. If a topic has N partitions and the cluster has fewer than N brokers, then some brokers store more than one partition of that topic. Try to avoid this situation in production, as it easily leads to data imbalance in the Kafka cluster.
A broker is part of a cluster. Each cluster has one broker that also acts as the cluster controller, elected automatically from the active members of the cluster. The controller is responsible for administrative work, including assigning partitions to brokers and monitoring brokers. In a cluster, a partition belongs to one broker, which is called the leader of that partition.
1.1.5.4 Topic

Every message published to the Kafka cluster has a category, called a topic. Messages with different topics are stored separately at the physical level. A topic is like a database table, in particular like the logical table that remains after database and table sharding.

1.1.5.5 Partition
Topics can be divided into several partitions, each of which is a commit log.
Messages are written to a partition by appending, and are read in first-in, first-out order.
The order of messages cannot be guaranteed across the entire topic, but the order of messages within a single partition can be guaranteed.
Kafka achieves data redundancy and scalability through partitioning.
In scenarios where the consumption order of messages must be strictly guaranteed, the number of partitions must be set to 1 (see the topic-creation sketch below).
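A sketch of creating a topic with a chosen partition count and replication factor through the Java AdminClient; the broker address, topic name, and counts are placeholders:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for scalability, replication factor 2 for redundancy.
            // For strict global ordering, use 1 partition instead.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```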
1.1.5.6 Replicas

Kafka organizes data by topic; each topic is divided into several partitions, and each partition has multiple replicas. Replicas are stored on brokers, and each broker can hold hundreds or thousands of replicas belonging to different topics and partitions. There are two types of replica:
Leader replica: each partition has one leader replica. To guarantee consistency, all producer and consumer requests go through this replica.
Follower replicas: all replicas other than the leader are followers. Follower replicas do not handle client requests; their only task is to copy messages from the leader and stay in the same state as the leader. If the leader crashes, one of the followers is promoted to be the new leader.

1.1.5.7 Offset

Producer offset: when a message is written, each partition has an offset; this is the producer's offset and is also the latest, largest offset of that partition. If a partition offset is not specified, Kafka assigns one for us.
Consumer Offset
Consider offsets within a single partition. Suppose the producer has written up to offset 12, the latest and largest value. Consumer A consumes from 0 up to 9, so its offset is recorded as 9; Consumer B's offset is recorded as 11. The next time they consume, each can continue from its last recorded offset, start again from the beginning, or skip ahead to the most recent record and consume from "now".

1.1.5.8 Replica sets

1.1.5.8.1 AR: all replicas in a partition are collectively called the AR (Assigned Replicas). AR = ISR + OSR.

1.1.5.8.2 ISR: the replicas that stay synchronized with the leader replica to some degree (including the leader itself) form the ISR (In-Sync Replicas); the ISR set is a subset of the AR set. Messages are first sent to the leader replica, and follower replicas then pull messages from the leader to synchronize. During synchronization, followers lag behind the leader to some extent; "some degree" means a tolerable lag range that can be configured by parameters.

1.1.5.8.3 OSR: replicas (excluding the leader) that lag too far behind the leader form the OSR (Out-of-Sync Replicas). Under normal circumstances, all follower replicas should keep some degree of synchronization with the leader, that is, AR = ISR and the OSR set is empty.

1.1.5.8.4 HW: HW is short for High Watermark. It marks a specific message offset: consumers can only pull messages before this offset.

1.1.5.8.5 LEO: LEO is short for Log End Offset. It marks the offset of the next message to be written in the current log file.
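A sketch of the consumer offset choices described above using the Java client: commit the current position, resume from the last commit, or seek to the beginning or end. The broker address, group id, and topic name are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetControlDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "demo-group");               // placeholder group
        props.put("enable.auto.commit", "false");          // commit offsets manually
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));     // placeholder topic
            // Trigger group join/assignment; a real app would loop until
            // consumer.assignment() is non-empty.
            consumer.poll(Duration.ofMillis(0));

            // Option 1: continue from the last committed offset (default behavior).
            // Option 2: consume again from the very beginning:
            consumer.seekToBeginning(consumer.assignment());
            // Option 3: skip to "now" and read only new messages:
            // consumer.seekToEnd(consumer.assignment());

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            consumer.commitSync(); // record the new position, like Consumer A committing 9
        }
    }
}
```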
This concludes the introduction to the basics of getting started with Kafka. Pairing the theory above with hands-on practice is the best way to learn, so go and try it!