The Way to Further Study of Kafka (2): Introduction to Kafka and Explanation of Technical Terms

2025-04-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Table of contents:

1. Brief introduction to Kafka: what is Kafka, and what are its design goals?

2. Advantages and disadvantages of Kafka.

3. Explanation of technical terms in Kafka.

Official website: http://kafka.apache.org/intro

Kafka tutorial in Chinese: http://orchome.com/kafka/index

1/ Introduction to Kafka

Kafka, originally developed by LinkedIn, is a distributed, partitioned, multi-replica, multi-subscriber log system coordinated through ZooKeeper (it can also be used as an MQ system). It is commonly used for web/nginx logs, access logs, message services, and so on. LinkedIn open-sourced it in 2011 and contributed it to the Apache Software Foundation, where it became a top-level open source project. // After many years of development, the project is now very mature.

The main application scenarios are log collection systems and messaging systems.

The main design objectives of Kafka are as follows:

(1) Provide message persistence with O(1) time complexity, guaranteeing constant-time access performance even for data at the TB level and above.

(2) High throughput. Even on very cheap commodity machines, a single machine can support the transmission of 100K messages per second.

(3) Support message partitioning across Kafka servers and distributed consumption, while guaranteeing in-order delivery of messages within each partition.

(4) Support both offline data processing and real-time data processing.

(5) Scale out: support online horizontal scaling.
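Goal (1) is easier to picture with a toy append-only log. The sketch below is only an illustration under simplifying assumptions: it keeps records in memory rather than on disk, and the class name `AppendOnlyLog` and its methods are hypothetical, not part of Kafka's API. It shows why appending at the tail and reading by offset do not get slower as the log grows.

```python
class AppendOnlyLog:
    """Toy in-memory stand-in for one partition's log (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Append at the tail (amortized O(1)) and return the new record's offset."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset):
        """Fetch by offset with a direct index lookup, independent of log size."""
        return self._records[offset]


log = AppendOnlyLog()
offsets = [log.append(f"event-{i}") for i in range(5)]
print(offsets)      # [0, 1, 2, 3, 4]
print(log.read(3))  # event-3
```

The offset returned by `append` plays the role of a message's position in the log, which is what lets a reader jump straight to any record.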

Kafka follows the publish-subscribe model. In a publish-subscribe messaging system, messages are persisted into a topic. Unlike point-to-point messaging systems, consumers can subscribe to one or more topics, a consumer can consume all the data in a topic, the same data can be consumed by multiple consumers, and data is not deleted immediately after it is consumed. In a publish-subscribe messaging system, the producer of a message is called the publisher and the consumer is called the subscriber. // Messages sent by a publisher to a topic are received only by the subscribers of that topic.
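The publish-subscribe behaviour described above can be sketched in a few lines. This is a minimal in-memory model, not Kafka client code; `TinyPubSub` and its methods are hypothetical names chosen for the illustration. It shows that every subscriber of a topic sees all of its messages and that reading does not delete them.

```python
from collections import defaultdict


class TinyPubSub:
    """Hypothetical in-memory publish-subscribe bus."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> retained messages
        self.positions = {}              # (subscriber, topic) -> next index to read

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def subscribe(self, subscriber, topic):
        self.positions[(subscriber, topic)] = 0

    def poll(self, subscriber, topic):
        """Return messages this subscriber has not seen yet; nothing is deleted."""
        start = self.positions[(subscriber, topic)]
        messages = self.topics[topic][start:]
        self.positions[(subscriber, topic)] = start + len(messages)
        return messages


bus = TinyPubSub()
bus.subscribe("billing", "orders")
bus.subscribe("audit", "orders")
bus.publish("orders", "order-1")
bus.publish("orders", "order-2")
bus.publish("payments", "pay-1")  # neither subscriber follows this topic

billing_seen = bus.poll("billing", "orders")
audit_seen = bus.poll("audit", "orders")
print(billing_seen)  # ['order-1', 'order-2']
print(audit_seen)    # ['order-1', 'order-2']
```

Each subscriber keeps its own read position, so one slow subscriber does not hold back the others.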

Why use Kafka?

1. As a cache

2. Decoupling

3. Latency under 10 ms, which is basically real-time

It can simplify our system design and improve the company's development speed and efficiency.

2/ Advantages of Kafka

1. Decoupling

// Suppose system S is tightly coupled with systems A, B, and C. When requirements change and system A modifies its code, system S must also adjust its A-related code. A few days later system C is to be removed, and S must immediately delete its C-related code; a few days after that system D is added, and S has to add D-related code. Before long the programmers are going crazy: every system is tightly coupled to the others, which hurts maintenance and extension. With an MQ introduced, when system A changes, A simply modifies its own code; when system C is removed, it just unsubscribes; when system D is added, it subscribes to the relevant messages. By introducing message middleware, each system interacts only with the MQ, which avoids the complicated call relationships between systems.

2. Redundancy (replicas)

// In some cases the process handling the data fails. Unless the data has been persisted, it is lost. Message queues persist data until it has been fully processed, which avoids the risk of data loss. In the insert-get-delete paradigm used by many message queues, your processing system must explicitly indicate that a message has been processed before the message is deleted from the queue, so your data is kept safely until you have finished using it.
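The insert-get-delete paradigm can be sketched as follows. This is a generic illustration of the paradigm named above, with a hypothetical `AckQueue` class; it is not how Kafka is implemented (Kafka tracks consumer offsets rather than deleting individual messages on acknowledgement).

```python
class AckQueue:
    """Hypothetical insert-get-delete queue: a message is removed only once acknowledged."""

    def __init__(self):
        self._pending = {}  # message id -> payload, kept until acknowledged
        self._next_id = 0

    def insert(self, payload):
        msg_id = self._next_id
        self._pending[msg_id] = payload
        self._next_id += 1
        return msg_id

    def get(self):
        """Return the oldest unacknowledged message without deleting it."""
        if not self._pending:
            return None
        msg_id = min(self._pending)
        return msg_id, self._pending[msg_id]

    def delete(self, msg_id):
        """Acknowledge: the consumer confirms that processing has finished."""
        del self._pending[msg_id]


queue = AckQueue()
queue.insert("resize image 42")
msg_id, payload = queue.get()
# Suppose the worker crashes here: delete() was never called, so nothing is lost.
redelivered = queue.get()
queue.delete(msg_id)
after_ack = queue.get()
print(redelivered)  # (0, 'resize image 42')
print(after_ack)    # None
```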

3. Scalability

// Because a message queue decouples your processing steps, it is easy to increase the rate at which messages are enqueued and processed: just add more processing instances. There is no need to change code or adjust parameters. Scaling up is as simple as turning up a power dial.

4. Flexibility & peak handling capacity (peak clipping)

// The processing capacity of the database is limited. During peak periods too many requests reach the backend, and once the system's processing capacity is exceeded, the system may fail. Suppose the system can process 2k requests/s, the MQ can handle 8k/s, and the peak request rate is 5k/s; the MQ's capacity is much larger than the database's. During the peak, requests can pile up in the MQ first, and the system consumes them at 2k/s according to its own capacity. Once the peak is over, the request rate may be only 100/s, and the system can quickly work through the backlog in the MQ.
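The arithmetic behind peak clipping can be checked with a small simulation. The rates (5k/s peak, 2k/s system capacity, 100/s after the peak) come from the text above; the durations of the peak and quiet periods are assumptions made up for the example, and `simulate_backlog` is a hypothetical helper. The peak rate stays below the MQ's 8k/s, so the queue is assumed to accept every request.

```python
def simulate_backlog(arrivals_per_sec, capacity_per_sec):
    """Return the queue backlog at the end of each second."""
    backlog = 0
    history = []
    for arriving in arrivals_per_sec:
        backlog += arriving                         # requests pile up in the MQ
        backlog -= min(backlog, capacity_per_sec)   # system drains at its own pace
        history.append(backlog)
    return history


# Assumed shape: 3 seconds of peak traffic at 5k/s, then 5 quiet seconds at 100/s.
arrivals = [5000] * 3 + [100] * 5
history = simulate_backlog(arrivals, capacity_per_sec=2000)
print(history)  # [3000, 6000, 9000, 7100, 5200, 3300, 1400, 0]
```

The backlog grows by 3k per second during the peak and shrinks by 1.9k per second afterwards, so the system never has to handle more than 2k/s.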

5. Recoverability

// When some component of the system fails, the whole system is not affected. A message queue reduces the coupling between processes, so even if a process that handles messages dies, the messages already added to the queue can still be processed once the system recovers.

6. Sequence guarantee

// In most usage scenarios the order of data processing matters. Most message queues are inherently ordered and can guarantee that data is processed in a specific order. Kafka guarantees the ordering of messages within a Partition.

7. Buffer

// In any significant system there will be elements that require different processing times. For example, loading an image takes less time than applying a filter. A message queue provides a buffer layer that helps tasks run as efficiently as possible: writes to the queue are handled as fast as possible. This buffer helps control and optimize the speed at which data flows through the system. The MQ acts as a buffer between user requests and the database, which is what peak clipping relies on.

8. Asynchronous communication

// In many cases users do not want or need to process messages immediately. A message queue provides an asynchronous processing mechanism that lets users put a message on the queue without processing it right away. Put as many messages as you want on the queue, and process them when needed.

Summary of advantages:

1. Single-machine throughput:

On the order of 100,000 messages per second. High throughput is Kafka's biggest advantage. It is generally used together with big data systems for data computation, log collection, and similar scenarios.

2. The impact of the number of topics on throughput:

When the number of topics grows from dozens to hundreds, throughput drops significantly, so on the same machines try to keep the number of topics from becoming excessive. Supporting a large number of topics requires more cluster resources.

3. Timeliness:

Latency is kept at the millisecond level.

4. Availability:

Very high. Kafka is distributed, and each piece of data has multiple replicas, so the failure of a small number of machines does not lose data or cause unavailability.

5. Message reliability

After parameter tuning, zero message loss can be achieved.

6. Functional support

The feature set is relatively simple and mainly covers basic MQ functions. Kafka is used at large scale for real-time computing and log collection in the big data field, where it is the de facto standard.

Summary of advantages and disadvantages

Kafka's characteristics are very clear: it provides relatively few core functions, but offers high throughput, millisecond-level latency, high availability and reliability, and it is distributed and can be scaled out freely. Kafka also works best with a small number of topics, which keeps its throughput high. The one disadvantage of Kafka is that messages may be consumed more than once (see a later blog post), which affects data accuracy; in the big data field and in log collection this impact is largely negligible. These characteristics make Kafka a natural fit for real-time computing and log collection in big data.

3/ Explanation of technical terms in Kafka (related concepts)

Before studying Kafka in depth, it is necessary to understand the terminology and concepts that appear in Kafka:

1. Broker: a server in the Kafka cluster.

// A broker is a server node on which Kafka is installed. A Kafka cluster contains one or more servers, and each server node is called a broker.

Brokers store the data of topics. If a topic has N partitions and the cluster has N brokers, then each broker stores one partition of that topic.

If a topic has N partitions and the cluster has (N+M) brokers, then N brokers each store one partition of the topic, and the remaining M brokers store no partition data for that topic.

If a topic has N partitions and the cluster has fewer than N brokers, then each broker stores one or more partitions of that topic. Try to avoid this situation in a real production environment, because it can easily lead to unbalanced data in the Kafka cluster.

For example, if we have 5 broker nodes, try to create topics whose partition count is a multiple of 5 (10, 20, 30, or even 50), so that the data can be distributed evenly across Kafka.
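The balancing argument above can be checked with a short sketch. It assumes a simple round-robin placement (partition p goes to broker p % number of brokers); this policy and the helper names are assumptions for illustration, not Kafka's actual assignment algorithm.

```python
from collections import Counter


def assign_partitions(num_partitions, num_brokers):
    """Assumed round-robin placement: partition p is stored on broker p % num_brokers."""
    return {p: p % num_brokers for p in range(num_partitions)}


def partitions_per_broker(num_partitions, num_brokers):
    """Count how many partitions of the topic each broker ends up storing."""
    counts = Counter(assign_partitions(num_partitions, num_brokers).values())
    return [counts.get(b, 0) for b in range(num_brokers)]


even = partitions_per_broker(10, 5)    # partition count is a multiple of 5
uneven = partitions_per_broker(7, 5)   # more partitions than brokers, not a multiple
sparse = partitions_per_broker(3, 5)   # fewer partitions than brokers
print(even)    # [2, 2, 2, 2, 2]
print(uneven)  # [2, 2, 1, 1, 1]
print(sparse)  # [1, 1, 1, 0, 0]
```

Only the first case spreads the topic's data evenly; in the last case two brokers hold none of it.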

2. Producer: the message producer.

// The producer is the publisher of data; this role publishes messages to Kafka topics. After a broker receives a message from a producer, it appends the message to the segment file currently being written. Each message sent by a producer is stored in one partition, and the producer can also specify which partition the data is stored in.
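How a producer might decide which partition a message lands in can be sketched as below. The function `choose_partition` is hypothetical, and the hashing choice (CRC32 from the standard library) and the round-robin fallback for keyless messages are assumptions made for the example; real Kafka clients use their own partitioner.

```python
import zlib
from itertools import count

_round_robin = count()  # shared counter for messages without a key


def choose_partition(num_partitions, key=None, partition=None):
    """Pick a partition: explicit choice first, then key hash, then round-robin."""
    if partition is not None:
        return partition  # the producer specified the partition directly
    if key is not None:
        # The same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % num_partitions
    return next(_round_robin) % num_partitions


explicit = choose_partition(4, partition=2)
first = choose_partition(4, key="user-1")
second = choose_partition(4, key="user-1")
print(explicit)         # 2
print(first == second)  # True
```

Routing by key is what keeps all messages for one key in a single partition, where their relative order is preserved.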

3. Consumer: message consumer.

// Consumers read data from brokers. A consumer can consume data from multiple topics.

4. Consumer Group: each Consumer belongs to a Consumer Group. Each message can be consumed by only one Consumer within a Consumer Group, but it can be consumed by multiple Consumer Groups.

// Each Consumer belongs to a specific Consumer Group (you can specify a group name for each Consumer; a Consumer without a group name belongs to the default group).
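The consumer group rule can be illustrated by assigning partitions to group members. This is a sketch under an assumed round-robin assignment; `assign_to_group` is a hypothetical helper, not a Kafka API. Within one group every partition has exactly one owner, while a second group independently receives all partitions again.

```python
def assign_to_group(partitions, consumers):
    """Give each partition to exactly one consumer in the group (assumed round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
group_a = assign_to_group(partitions, ["a1", "a2"])  # two consumers share the topic
group_b = assign_to_group(partitions, ["b1"])        # another group reads everything too
print(group_a)  # {'a1': [0, 2], 'a2': [1, 3]}
print(group_b)  # {'b1': [0, 1, 2, 3]}
```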

5. Topic: the category of a message. Each message belongs to a Topic, and different Topics are independent of each other; in other words, Kafka is Topic-oriented.

// Every message published to the Kafka cluster has a category, which is called its Topic. (Physically, messages of different Topics are stored separately. Logically, the messages of one Topic are stored on one or more brokers, but users only need to specify the Topic to produce or consume data, without caring where the data is stored.) A Topic is similar to a table name in a database.

6. Partition: each Topic is divided into multiple Partitions, and the Partition is the unit of allocation in Kafka. Physically, a Partition corresponds to a directory, and the log files in that directory make up the Partition.

// The data in a topic is split into one or more partitions. Each topic has at least one partition. The data in each partition is stored in multiple segment files. Data within a partition is ordered, but order is not preserved across different partitions. If a topic has multiple partitions, the overall order of the data cannot be guaranteed when consuming it. In scenarios where the consumption order of messages must be strictly guaranteed, set the number of partitions to 1.
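The ordering caveat is easy to reproduce with plain lists. The sketch below assumes messages are spread round-robin over partitions and that the consumer happens to drain one partition after another; both are simplifying assumptions, and the function names are hypothetical.

```python
def produce(messages, num_partitions):
    """Spread messages round-robin; each partition keeps its own append order."""
    partitions = [[] for _ in range(num_partitions)]
    for i, message in enumerate(messages):
        partitions[i % num_partitions].append(message)
    return partitions


def consume_partition_by_partition(partitions):
    """One possible consumption order: drain partition 0, then 1, and so on."""
    return [message for part in partitions for message in part]


messages = [1, 2, 3, 4, 5, 6]
multi = consume_partition_by_partition(produce(messages, 3))
single = consume_partition_by_partition(produce(messages, 1))
print(multi)   # [1, 4, 2, 5, 3, 6]
print(single)  # [1, 2, 3, 4, 5, 6]
```

With three partitions each partition is still internally ordered (1 before 4, 2 before 5), but the global order is lost; with one partition it is preserved.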

7. Replica: a copy of a Partition, used to ensure the Partition's high availability.

// Each topic is divided into multiple partitions. In addition, Kafka lets you configure the number of copies (replicas) kept for each partition.

8. Leader: a role among the Replicas. Producers and Consumers interact only with the Leader.

// Each partition has multiple replicas, of which exactly one is the Leader. The Leader is the replica currently responsible for reading and writing data.

9. Follower: a role among the Replicas that copies data from the Leader.

// Followers follow the Leader. All write requests are routed through the Leader, and data changes are propagated to all Followers, which keep their data synchronized with the Leader. If the Leader fails, a new Leader is elected from the Followers. When a Follower goes down, gets stuck, or synchronizes too slowly, the Leader removes that Follower from the "in sync replicas" (ISR) list and a new Follower is created.

Replication means that multiple replicas have to be scheduled. Each partition has one server acting as its "leader". The leader handles all read and write operations, and if the leader fails, another follower takes over (becoming the new leader). Followers simply follow the leader and synchronize messages. The server acting as leader therefore carries all the request load, so from the point of view of the whole cluster, the number of partitions equals the number of leaders. Kafka distributes the leaders evenly across the instances to keep overall performance stable.
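The leader, follower, and ISR roles can be modelled with a small class. This is a deliberately simplified sketch: `PartitionReplicas` is hypothetical, replication is modelled as a synchronous copy, and the new leader is simply the first remaining in-sync replica, which is an assumption rather than Kafka's actual election procedure.

```python
class PartitionReplicas:
    """Hypothetical model of one partition's replica set."""

    def __init__(self, brokers):
        self.leader = brokers[0]
        self.isr = list(brokers)              # in-sync replicas, leader included
        self.logs = {b: [] for b in brokers}  # each replica's copy of the log

    def write(self, message):
        """Writes go to the leader; every in-sync replica ends up with a copy."""
        for broker in self.isr:
            self.logs[broker].append(message)

    def drop_from_isr(self, broker):
        """Remove a follower that is down or lagging too far behind."""
        if broker != self.leader:
            self.isr.remove(broker)

    def fail_leader(self):
        """The leader dies; promote the first remaining in-sync follower (assumed rule)."""
        self.isr.remove(self.leader)
        self.leader = self.isr[0]


p = PartitionReplicas(["broker-1", "broker-2", "broker-3"])
p.write("m1")
p.drop_from_isr("broker-3")  # broker-3 falls behind
p.write("m2")
p.fail_leader()              # broker-1 goes down
print(p.leader)              # broker-2
print(p.logs[p.leader])      # ['m1', 'm2']
print(p.logs["broker-3"])    # ['m1']
```

Because only in-sync replicas are eligible, the promoted leader holds every message that was written, while the lagging replica would have been missing "m2".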

The location (host:port) of each partition leader is registered in ZooKeeper.

10. Controller: one of the servers in the Kafka cluster, used for leader election and various kinds of failover.

// Which broker node the Controller runs on, and when the controller was created, are recorded in ZooKeeper. // To view it: get /kafkagroup/controller

11. ZooKeeper: Kafka stores the cluster's metadata in ZooKeeper. (Question: which Kafka metadata is kept in ZooKeeper? See a later blog post.)

12. Coordinator: just as a Controller is selected from the Brokers, consumers also need a Coordinator selected from the Brokers to assign Partitions.

