2025-03-29 Update From: SLTechnology News&Howtos
This article walks through the basic concepts you should start with when learning Kafka, explaining each one in detail, in the hope of helping readers who want to learn Kafka find a simple and practical path in.
The background of Kafka
Kafka is a messaging system originally developed at LinkedIn, where it served as the basis for LinkedIn's activity stream (Activity Stream) and operational data processing pipeline (Pipeline). It is now used by many different kinds of companies for many types of data pipelines and messaging systems.
Activity stream data is among the most common data that almost all sites record when reporting on their own usage. Activity data includes page views (Page View), information about the content being viewed, and search information. The usual way to handle this kind of data is to write the various activities to files as logs, and then analyze those files periodically. Operational data refers to server performance data (CPU and IO usage, request latency, service logs, and so on). There are many statistical methods for processing operational data.
In recent years, processing activity and operational data has become a critical part of a website's software product features, and this requires a slightly more complex infrastructure to support it.
Introduction to Kafka
Kafka is a distributed, publish/subscribe-based messaging system. Its main design objectives are as follows:
Message persistence with O(1) time complexity, guaranteeing constant-time access performance even for terabyte-scale and larger data.
High throughput: even on cheap commodity machines, a single node can sustain the transmission of more than 100K messages per second.
Support for partitioning messages across Kafka servers and for distributed consumption, while guaranteeing the ordering of messages within each Partition.
Support for both offline and real-time data processing.
Scale out: support for online horizontal scaling.
Basic concepts of Kafka
Concept 1: producers and consumers
Kafka has two basic types of clients: producers (Producer) and consumers (Consumer). There are also higher-level clients, such as the Kafka Connect API used for data integration and Kafka Streams for stream processing, but underneath, these higher-level clients are still built on the producer and consumer APIs; they are simply wrappers on top.
It is easy to understand: producers (also known as publishers) create messages, while consumers (also known as subscribers) are responsible for consuming, that is, reading, messages.
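The producer and consumer roles described above can be sketched in a few lines of Python. This is a minimal in-memory illustration only: the real Kafka clients talk to a broker over the network, whereas here a deque stands in for the broker so the example is self-contained. All class names are illustrative, not Kafka API names.

```python
from collections import deque

class Topic:
    """An in-memory stand-in for a Kafka topic's message queue."""
    def __init__(self, name):
        self.name = name
        self.messages = deque()

class Producer:
    def send(self, topic, message):
        # Publishers create messages and append them to the topic.
        topic.messages.append(message)

class Consumer:
    def poll(self, topic):
        # Subscribers read messages in the order they were produced.
        return topic.messages.popleft() if topic.messages else None

topic = Topic("page-views")
producer = Producer()
consumer = Consumer()
producer.send(topic, "user-1 viewed /home")
producer.send(topic, "user-2 viewed /search")
print(consumer.poll(topic))  # -> user-1 viewed /home
```

Note that the consumer receives messages in the same order the producer sent them, which is the ordering guarantee Kafka provides within a single partition.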
Concept 2: topic (Topic) and partition (Partition)
In Kafka, messages are classified by Topic, and each topic corresponds to a "message queue", somewhat like a table in a database. However, if we crammed all similar messages into a single "central" queue, there would inevitably be a lack of scalability: any increase in the number of producers/consumers, or in the volume of messages, could exhaust the system's performance or storage.
Let's use an everyday example to illustrate. Suppose goods produced in city A need to be transported to city B by highway. A single-lane highway will run into the problem of "insufficient throughput" when "city A has more goods to ship" or when "city C now also has to ship things to city B". So we introduce the concept of Partition, which extends our topic horizontally, much like "allowing more roads to be built".
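The way messages get routed to one of these "roads" can be sketched as a hash of the message key modulo the partition count. Kafka's default Java partitioner uses murmur2 for this; `zlib.crc32` is used below as a stand-in hash so the sketch has no dependencies. The property that matters is the same either way: equal keys always land on the same partition, so per-key ordering is preserved.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Stand-in for Kafka's default partitioner (which uses murmur2);
    crc32 is used here only to keep the sketch dependency-free.
    """
    return zlib.crc32(key) % num_partitions

# Messages with the same key always take the same "road",
# so their relative order is preserved within that partition.
p1 = choose_partition(b"city-A", 4)
p2 = choose_partition(b"city-A", 4)
assert p1 == p2
assert 0 <= p1 < 4
```

Messages without a key are instead spread across partitions (round-robin or sticky batching, depending on the client version), trading per-key ordering for balanced load.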
Concept 3: Broker and Cluster
A Kafka server, also known as a Broker, accepts messages sent by producers and stores them to disk; the Broker also serves consumers' requests to pull partition messages, returning the messages that have been committed so far. With suitable hardware, a single Broker can handle thousands of partitions and millions of messages per second.
Several Brokers form a Cluster, in which one Broker acts as the cluster controller (Cluster Controller), responsible for managing the cluster, including assigning partitions to Brokers, monitoring Broker failures, and so on. Within a cluster, each partition is owned by a single Broker, which is known as the Leader of that partition. Of course, a partition can be replicated to multiple Brokers for redundancy, so that when a Broker fails, its partitions can be reassigned to other Brokers. The following figure is an example:
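The controller behavior described above can be sketched as a small reassignment function: each partition has a replica list whose first entry is the leader, and when a broker fails, leadership moves to a surviving replica. All broker and partition names here are illustrative, and the logic is a simplification of what the real controller does.

```python
def reassign_leaders(assignment, failed_broker):
    """Simulate the controller reassigning partition leadership.

    assignment: {partition: [leader, replica, ...]} mapping each
    partition to its replica list, leader first.
    Returns a new assignment with the failed broker removed; the
    first surviving replica becomes the new leader.
    """
    new_assignment = {}
    for partition, replicas in assignment.items():
        survivors = [b for b in replicas if b != failed_broker]
        if not survivors:
            raise RuntimeError(f"partition {partition} lost all replicas")
        new_assignment[partition] = survivors  # survivors[0] is the new leader
    return new_assignment

assignment = {0: ["broker-1", "broker-2"], 1: ["broker-2", "broker-3"]}
after = reassign_leaders(assignment, "broker-1")
print(after[0][0])  # -> broker-2 (new leader of partition 0)
```

This is why replicating a partition to multiple Brokers matters: a partition with only one replica has nowhere to fail over to.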
One of Kafka's key properties is log retention (retention). We can configure a message retention policy for topics, such as keeping logs only for a period of time, or keeping only logs up to a particular size. When these limits are exceeded, old messages are deleted. We can also set the retention policy for an individual topic, so that it can be tailored to different applications.
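The two retention policies just mentioned can be illustrated with a small in-memory simulation: delete entries older than a time limit, or trim the oldest entries once total size exceeds a byte limit. In real Kafka these correspond to the per-topic `retention.ms` and `retention.bytes` configs and operate on log segments rather than individual messages; the version below only illustrates the behavior.

```python
def apply_retention(log, now, retention_ms=None, retention_bytes=None):
    """Simulate Kafka's time- and size-based log retention.

    log: list of (timestamp_ms, size_bytes) entries, oldest first.
    Returns the entries that survive both retention limits.
    """
    kept = list(log)
    if retention_ms is not None:
        # Time-based policy: drop anything older than the window.
        kept = [(ts, sz) for ts, sz in kept if now - ts <= retention_ms]
    if retention_bytes is not None:
        # Size-based policy: drop the oldest entries until under the cap.
        while kept and sum(sz for _, sz in kept) > retention_bytes:
            kept.pop(0)
    return kept

log = [(1000, 50), (2000, 50), (9000, 50)]
# Keep only messages from the last 5000 ms, as seen at t=10000.
print(apply_retention(log, now=10000, retention_ms=5000))  # -> [(9000, 50)]
```

Both policies always evict from the oldest end of the log, which is what makes a Kafka topic behave like a sliding window over recent history.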
Concept 4: multi-cluster
As our business grows, we often need multiple clusters, usually for the following reasons:
Data-based isolation
Security-based isolation
Multiple data centers (disaster recovery)
When running multiple data centers, it is often necessary to exchange messages between them. For example, if a user modifies their profile, the update must be visible no matter which data center handles subsequent requests. Or, data from multiple data centers may need to be aggregated into a central cluster for analysis.
The partition replication mechanism mentioned above only applies within a single Kafka cluster; for synchronizing messages across multiple Kafka clusters, the MirrorMaker tool that ships with Kafka can be used. In essence, MirrorMaker is just a Kafka consumer and producer connected by a queue: it consumes messages from one cluster and produces them into another.
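The consume-then-produce loop at the heart of MirrorMaker can be sketched as follows. Two deques stand in for the source and target clusters so the loop is self-contained; the real tool uses the ordinary Kafka consumer and producer clients, and the function and variable names here are illustrative only.

```python
from collections import deque

def mirror(source, target, max_messages=None):
    """Drain messages from a source 'cluster' and re-produce them into a target.

    This mimics MirrorMaker's core loop: consume from cluster A,
    produce into cluster B, preserving message order.
    """
    copied = 0
    while source and (max_messages is None or copied < max_messages):
        message = source.popleft()   # consume from the source cluster
        target.append(message)       # produce into the target cluster
        copied += 1
    return copied

cluster_a = deque(["profile-update-1", "profile-update-2"])
cluster_b = deque()
mirror(cluster_a, cluster_b)
print(list(cluster_b))  # -> ['profile-update-1', 'profile-update-2']
```

Because the mirror is itself just a consumer plus a producer, it inherits their properties: it preserves order per partition, and if it falls behind, messages simply wait in the source cluster until it catches up.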
That covers the basic concepts to start with when learning Kafka. I hope the content above is of some help to you. If you still have many questions to resolve, you can follow the industry information channel to learn more.