How should we understand the distributed messaging system Kafka? This article addresses that question with a detailed analysis, in the hope of giving readers a simple, practical way to approach it.
Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a distributed, partitioned, replicated persistent log service, mainly used to handle active streaming data.
In big data systems we often face a problem: the overall system is composed of many subsystems, and data must flow between them continuously with high performance and low latency. Traditional enterprise messaging systems are not well suited to large-scale data processing. Kafka emerged to handle online applications (messages) and offline applications (data files, logs) at the same time. Kafka serves two purposes:
1. Reduce the complexity of system networking.
2. Reduce programming complexity. Subsystems no longer negotiate interfaces with one another; each subsystem plugs into Kafka like a plug into a socket, and Kafka acts as a high-speed data bus.
The main features of Kafka:
1. Provides high throughput for both publishing and subscribing. Kafka is reported to produce about 250,000 messages per second (50 MB/s) and process about 550,000 messages per second (110 MB/s).
2. Supports persistence. Messages are persisted to disk, so they can be used for batch consumption, such as ETL, as well as for real-time applications. Persisting data to disk and replicating it prevents data loss.
3. Distributed system, easy to scale out. There can be multiple producers, brokers, and consumers, all distributed, and machines can be added without downtime.
4. The processing state of messages is maintained on the client side, not on the server side, and consumers rebalance automatically on failure.
5. Supports both online and offline scenarios.
The architecture of Kafka:
The overall architecture of Kafka is very simple and explicitly distributed: there can be multiple producers, brokers (Kafka servers), and consumers. Producers and consumers implement Kafka's registration interface. Data is sent from producers to brokers, and the broker acts as an intermediate cache and distribution layer, delivering data to the consumers registered with the system. The broker's role is similar to a cache sitting between active data and the offline processing system. Communication between clients and servers is based on a simple, high-performance TCP protocol that is independent of programming language. A few basic concepts (a topic-creation sketch follows this list):
1.Topic: the category of message feeds handled by Kafka.
2.Partition: a physical grouping of a topic. A topic can be divided into multiple partitions, and each partition is an ordered queue. Each message in a partition is assigned an ordered id, called the offset.
3.Message: the basic unit of communication. Each producer can publish messages to a topic.
4.Producers: producers of messages and data. A client that publishes messages to a Kafka topic is called a producer.
5.Consumers: consumers of messages and data. A client that subscribes to topics and processes the published messages is called a consumer.
6.Broker: cache proxy. One or more servers in a Kafka cluster are collectively called brokers.
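As a rough illustration, and not part of the original article, the sketch below uses the Java AdminClient to create a topic with several partitions. The broker address localhost:9092 and the topic name "page-views" are placeholders.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // "localhost:9092" is a placeholder broker address
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A hypothetical topic named "page-views", split into 3 partitions,
            // with each partition replicated on 2 brokers
            NewTopic topic = new NewTopic("page-views", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```

Each of the three partitions is an independent ordered queue, and its replicas are placed on different brokers.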
The process of sending a message:
1. The producer publishes the message to a partition of the specified topic according to the chosen partitioning method (round-robin, hash, etc.) (see the producer sketch after this list).
2. After receiving a message from a producer, the Kafka cluster persists it to disk and retains it for a configurable length of time, regardless of whether the message has been consumed.
3. The consumer pulls data from the Kafka cluster and controls the offset from which messages are fetched.
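A minimal producer sketch along these lines, assuming the Java client, a broker at localhost:9092, and the hypothetical "page-views" topic from above:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42", hypothetical) is hashed to pick a partition;
            // records with no key are spread across partitions instead.
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /home"));
        }
    }
}
```

Records with the same key always land in the same partition, which preserves per-key ordering.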
The design of Kafka:
1. Throughput
High throughput is one of the core goals Kafka aims to achieve, and it makes the following design choices to that end (a producer configuration sketch follows this list):
1. Disk persistence of data: messages are not cached in memory but written directly to disk, taking full advantage of the disk's sequential read and write performance.
2. Zero-copy: reduces the IO path.
3. Batch sending of data.
4. Data compression.
5. A topic is divided into multiple partitions to improve parallelism.
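As one hedged example of how the batching and compression points surface in the client, the sketch below sets those options on a Java producer; the values are illustrative, not tuned recommendations.

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class ThroughputTunedConfig {
    // Producer settings that exercise the batching and compression ideas above.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder address
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);                     // batch up to 64 KB per partition
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);                         // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");            // compress whole batches
        return props;
    }
}
```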
2. Load balancing
1. The producer sends messages to a specified partition according to a user-specified algorithm (see the partitioner sketch after this list).
2. There are multiple partitions; each partition has its own replicas, and the replicas are distributed across different broker nodes.
3. For each partition, a lead partition must be elected; reads and writes go through the leader, and ZooKeeper is responsible for failover.
4. The dynamic joining and leaving of brokers and consumers is managed through ZooKeeper.
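Point 1 above (a user-specified partitioning algorithm) can be illustrated with a custom Partitioner for the Java client; this is a toy sketch, not Kafka's built-in behavior.

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

import java.util.Arrays;
import java.util.Map;

// A user-specified partitioning algorithm: a simple hash of the key
// modulo the partition count (illustrative only).
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // no key: always partition 0 in this toy example
        }
        return (Arrays.hashCode(keyBytes) & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```

A producer opts into such an algorithm through the partitioner.class configuration property.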
3. Pull system
Because the Kafka broker persists data and has no memory pressure, consumers are well suited to pulling data, which has the following benefits (a consumer sketch follows this list):
1. It simplifies Kafka's design.
2. The consumer independently controls the rate at which messages are pulled, according to its consumption capacity.
3. The consumer chooses its own consumption mode according to its own situation, such as batch consumption, repeated consumption, or consuming only from the latest offset.
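A minimal pull-style consumer sketch in Java, assuming a broker at localhost:9092 and the hypothetical topic and group names shown; auto-commit is disabled so the client decides when its offset advances.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PullConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // hypothetical group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");          // client controls offset commits

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("page-views"));
            while (true) {
                // The consumer pulls at its own pace; slowing this loop simply slows consumption.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // commit only after processing, so failed work can be re-consumed
            }
        }
    }
}
```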
4. Scalability
When broker nodes need to be added, the new broker registers itself with ZooKeeper, and producers and consumers sense the change through the watchers they have registered on ZooKeeper and adjust in time (a watcher sketch follows below).
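To make the mechanism concrete, the sketch below uses the ZooKeeper Java client to watch the /brokers/ids path where Kafka brokers register themselves as ephemeral nodes. This is illustrative only; the connection address is a placeholder, and real applications usually leave this bookkeeping to the Kafka client library rather than talking to ZooKeeper directly.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

import java.util.List;

public class BrokerWatcher {
    public static void main(String[] args) throws Exception {
        // "localhost:2181" is a placeholder ZooKeeper address
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println("Broker list changed: " + event.getType());
                // A real client would re-read /brokers/ids and adjust its connections here.
            }
        };

        // Read the current broker ids and register a one-shot watch for changes;
        // a real client would keep the session open and re-register the watch.
        List<String> brokerIds = zk.getChildren("/brokers/ids", watcher);
        System.out.println("Current brokers: " + brokerIds);
    }
}
```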
Application scenarios of Kafka:
1. Message queue
Compared with most messaging systems, Kafka has better throughput, built-in partitioning, redundancy, and fault tolerance, which makes it a good solution for large-scale message processing applications. Messaging workloads generally need only modest throughput but demand low end-to-end latency and strong durability guarantees, which Kafka provides. In this area, Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.
2. Behavior tracking
Another application scenario for Kafka is tracking users' browsing, searching, and other behaviors, which are published in real time into the corresponding topics. Subscribers that receive these events can process them further in real time, monitor them in real time, or load them into Hadoop or an offline data warehouse for processing.
3. Meta-information monitoring
Kafka is used as a monitoring module for operational records, that is, to collect and record operational information; this can be understood as operations-and-maintenance-oriented data monitoring.
4. Log collection
There are many open source products for log collection, including Scribe and Apache Flume, and many people use Kafka as a replacement for a log aggregation solution. Log aggregation generally collects log files from servers and puts them in a centralized location (a file server or HDFS) for processing. Kafka, however, abstracts away the details of files and models the logs or events more cleanly as a stream of messages. This gives Kafka lower processing latency and makes it easier to support multiple data sources and distributed data processing. Compared with log-centric systems such as Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees thanks to replication, and lower end-to-end latency.
5. Stream processing
This scenario is perhaps the most common and the easiest to understand. Collected stream data is stored and later handed off to Storm or another stream computing framework for processing. Many users periodically process, summarize, expand, or otherwise transform data from an original topic into a new topic before continuing with later processing. For example, a recommendation pipeline for articles might crawl article content from an RSS feed and publish it into a topic called "article"; subsequent steps might clean up the content, for example normalizing the data or removing duplicates, and finally return the matched content to the user. Beyond the individual topics, this forms a series of real-time data processing stages. Storm and Samza are well-known frameworks that implement this type of data transformation.
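A hedged sketch of one such consume-transform-produce step using the plain Java clients, reading from the "article" topic mentioned above and writing to a hypothetical "article-clean" topic; the cleanup itself is a stand-in for real normalization or deduplication.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ArticleCleaner {
    public static void main(String[] args) {
        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "article-cleaner");         // hypothetical group id
        c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);
             KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            consumer.subscribe(Collections.singleton("article"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    String cleaned = record.value().trim(); // stand-in for real cleanup/deduplication
                    producer.send(new ProducerRecord<>("article-clean", record.key(), cleaned));
                }
            }
        }
    }
}
```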
6. Event source
Event sourcing is an application design style in which state changes are recorded as a chronologically ordered sequence of records. Kafka's ability to store large amounts of log data makes it an excellent backend for applications designed this way, such as a dynamic news feed.
7. Persistent log (commit log)
Kafka can serve as an external commit log for a distributed system. Such a log helps replicate data between nodes and acts as a resynchronization mechanism for recovering the data of failed nodes. Kafka's log compaction feature supports this usage. In this role, Kafka is similar to the Apache BookKeeper project.
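One way this surfaces in practice is a topic created with log compaction enabled, so only the latest record per key is retained. The sketch below uses the Java AdminClient; the broker address and the topic name "node-state" are hypothetical.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CompactedLogTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // "node-state" is a hypothetical topic used as an external commit log;
            // log compaction keeps the latest value for each key so a failed node
            // can rebuild its state by replaying the topic.
            NewTopic topic = new NewTopic("node-state", 1, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```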
The main design points of Kafka:
1. It uses the Linux file system cache directly to cache data efficiently.
2. Linux zero-copy is used to improve transfer performance. Traditional data transfer requires 4 context switches; with the sendfile system call, data is exchanged directly in kernel space, and the number of context switches drops to 2. According to test results, data transfer performance can be improved by 60%. For detailed technical background on zero-copy, see: https://www.ibm.com/developerworks/linux/library/j-zerocopy/
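In Java, the same sendfile path is exposed through FileChannel.transferTo, as in the sketch below; the file path and socket address are placeholders.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// FileChannel.transferTo maps to sendfile on Linux, so file bytes move to the
// socket inside the kernel without being copied into user space.
public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Path.of("/tmp/segment.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket); // zero-copy transfer
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```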
3. The cost of accessing data on disk is O(1). Kafka manages messages by topic. Each topic contains multiple partitions, and each partition corresponds to a logical log made up of multiple segments. Each segment stores multiple messages; a message's id is determined by its logical position, so the id can be used to locate the message's storage position directly, avoiding an extra id-to-location mapping. Each partition keeps an index in memory that records the offset of the first message in each segment. Messages sent by a publisher to a topic are distributed evenly across partitions (randomly or according to a user-specified callback function). When a broker receives a published message, it appends it to the last segment of the corresponding partition. When the number of messages in a segment reaches a configured value, or the time since publication exceeds a threshold, the segment is flushed to disk; only messages flushed to disk can be consumed by subscribers. When a segment reaches a certain size, no more data is written to it, and the broker creates a new segment.
4. Explicit distribution: there are multiple producers, brokers, and consumers, all distributed. There is no load balancing mechanism between producers and brokers; ZooKeeper is used for load balancing between brokers and consumers. All brokers and consumers register themselves with ZooKeeper, and ZooKeeper stores some of their metadata; if a broker or consumer changes, all other brokers and consumers are notified.
This concludes the answer to the question of how to understand the distributed messaging system Kafka. I hope the content above has been of some help; if you still have open questions, further reading on the topics above can fill in the details.