2025-01-29 Update | Source: SLTechnology News & Howtos > Servers
Shulou(Shulou.com)05/31 Report--
This article walks through "What are the characteristics of Kafka?" in detail. The content is thorough and the steps are laid out clearly; I hope it helps resolve your doubts.
What is Kafka?
Kafka is clustered message middleware combined with storage: a single node can store several terabytes of data!
Why would message middleware need to store data?
It turns out that at an Internet company like LinkedIn, users and the website generate three kinds of data:
1. Transaction data that needs a real-time response: a user submits a form or enters some content. This data ultimately lands in a relational database (Oracle, MySQL), and some of it needs transaction support.
2. Activity-stream data, which is quasi-real-time: page views, user behavior, searches, and so on. What can this data produce? Broadcasts, rankings, personalized recommendations, operational monitoring, etc. Traditionally, front-end servers first write this data to files, which are then loaded in batches into a big-data analyzer such as Hadoop.
3. Logs generated by programs at every level: httpd logs, Tomcat logs, and logs from various other programs. For programmers, this data is used for monitoring, alerting, and analysis.
LinkedIn's insight was that the original ways of handling data types 2 and 3 had problems. For type 2, batch processing every hour or two was no longer good enough: ideally a user sees relevant recommendations immediately after a purchase. For type 3, the traditional syslog approach was awkward; and in many cases types 2 and 3 involve the same batch of data, just with different consumers.
These two kinds of data share some characteristics:
Quasi-real-time: a second-level response is unnecessary; minute-level is fine.
The volume is huge: more than 10 times the transaction data.
There are many consumers of the data: rating, voting, ranking, personalized recommendation, security, operational monitoring, program monitoring, after-the-fact reporting, and so on.
So LinkedIn built a system to handle data with exactly these properties, and that system is Kafka.
So what design choices did LinkedIn make, and what problems did it solve along the way?
First take a look at the data flow diagram:
How multiple data centers manage data:
Architecture diagram of the cluster itself
Kafka's internal architecture, divided into data producers (Producer), data brokers (Broker), and data consumers (Consumer)
Clearly, this is a clustered publish/subscribe system, with the following characteristics:
Producers push data (push) and consumers pull data (pull). Data is reused: on average, a message produced at LinkedIn is consumed 5.5 times.
Producers and consumers run at unequal speeds, so data is buffered in Kafka and processed at each consumer's own pace. LinkedIn typically keeps data in the cluster for 7 days.
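The buffering idea above can be sketched as a toy in-memory log: producers append, and each consumer pulls from its own offset at its own speed. This is an illustrative sketch only (`MiniLog` is a made-up name; real Kafka persists partitions to disk), but it shows why the same stored message can serve both a fast and a slow consumer.

```python
import time

class MiniLog:
    """Toy append-only log: producers push, consumers pull at their own pace.
    Illustrative only -- real Kafka persists each partition to disk."""

    def __init__(self, retention_seconds=7 * 24 * 3600):
        self.entries = []              # list of (timestamp, message)
        self.retention = retention_seconds

    def push(self, message):
        self.entries.append((time.time(), message))

    def pull(self, offset, max_messages=10):
        """Return (messages, new_offset); each consumer tracks its own offset."""
        batch = self.entries[offset:offset + max_messages]
        return [m for _, m in batch], offset + len(batch)

    def expire(self, now=None):
        """Drop entries older than the retention window (LinkedIn kept 7 days)."""
        now = now or time.time()
        self.entries = [(t, m) for t, m in self.entries if now - t < self.retention]

log = MiniLog()
for i in range(5):
    log.push(f"event-{i}")

# Two independent consumers read the same stored data (messages are reused):
fast_msgs, fast_off = log.pull(0, max_messages=5)
slow_msgs, slow_off = log.pull(0, max_messages=2)
print(fast_off, slow_off)  # 5 2
```

Because the broker stores the data and consumers pull, a slow consumer simply lags at a smaller offset instead of forcing the producer to slow down.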
Performance-wise, the design pursues high throughput while keeping latency bounded. Many optimizations serve this goal: no global hashing, batch delivery, cross-data-center compression, and so on.
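Batch delivery, one of the throughput optimizations mentioned above, can be sketched in a few lines: buffer messages and hand them to the transport in groups, trading a little latency for far fewer sends. `BatchingProducer` and `transport` are hypothetical names, not Kafka's API.

```python
class BatchingProducer:
    """Sketch: buffer messages and send them in batches, trading a little
    latency for throughput. `transport` is any callable taking a list of
    messages (a hypothetical stand-in for a network send)."""

    def __init__(self, transport, batch_size=100):
        self.transport = transport
        self.batch_size = batch_size
        self.buffer = []

    def send(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Ship whatever is buffered, even a partial batch."""
        if self.buffer:
            self.transport(self.buffer)
            self.buffer = []

sent_batches = []
producer = BatchingProducer(sent_batches.append, batch_size=3)
for i in range(7):
    producer.send(i)
producer.flush()                 # drain the final partial batch
print(sent_batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Seven messages cost three transport calls instead of seven; with a real network send, that amortizes per-request overhead and is where much of the throughput comes from.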
Fault tolerance uses "at-least-once" delivery semantics: exactly-once is not guaranteed, but at-most-once (i.e., silently losing messages) is avoided.
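A minimal sketch of why at-least-once produces duplicates: the sender retries until it sees an acknowledgement, so if the message arrives but the ack is lost, the message is delivered twice. The names here (`deliver_at_least_once`, `flaky_send`) are made up for illustration.

```python
def deliver_at_least_once(send, message, max_retries=5):
    """Retry until the send is acknowledged. A duplicate occurs when the
    *ack* (not the message) is lost -- hence at-least-once, not exactly-once."""
    for _ in range(max_retries):
        if send(message):
            return True
    return False

received = []
acks = iter([False, True])       # the first acknowledgement is lost

def flaky_send(message):
    received.append(message)     # the broker got the message both times...
    return next(acks)            # ...but only the second ack made it back

deliver_at_least_once(flaky_send, "event-42")
print(received)  # ['event-42', 'event-42'] -- delivered twice, never zero times

# Consumers therefore deduplicate or make processing idempotent:
deduped = list(dict.fromkeys(received))
print(deduped)   # ['event-42']
```

The trade-off is deliberate: retrying guarantees the message is never silently dropped, and the cost (occasional duplicates) is pushed to consumers, which handle it with idempotent processing or dedup keys.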
Data in the cluster is partitioned, which guarantees that a single consumer can read all the data of one partition of a topic (for example, one user's data) without scanning the topic globally.
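The routing rule behind this is simple: hash the message key and take it modulo the partition count, so all messages with the same key land in the same partition. A minimal sketch (CRC-32 is used here for determinism; Kafka's default partitioner actually uses murmur2, and `partition_for` is a made-up name):

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Same key -> same partition, so the consumer assigned that partition
    sees *all* of that key's messages without a global read.
    (Sketch: Kafka's default partitioner hashes with murmur2, not CRC-32.)"""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Every message keyed by user-17 is routed to one and the same partition:
p = partition_for("user-17")
assert all(partition_for("user-17") == p for _ in range(10))
print(f"user-17 -> partition {p}")
```

This is also why adding partitions to a live topic is disruptive: the modulo changes, and existing keys may map to different partitions afterwards.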
Data standardization: all data is divided into several hundred topics, and a Schema regulates the data at its source, the producer. This gives later transmission, serialization, compression, and consumption a unified standard, and it also solves a notoriously troublesome problem in this field: data-version incompatibility, where the producer changes its code and consumers break.
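Enforcing the schema at the producer can be sketched as follows. This is a toy stand-in (a dict of field names to Python types, serialized as JSON); LinkedIn's real pipeline uses proper schema systems such as Avro, and `PAGE_VIEW_SCHEMA`/`serialize` are hypothetical names.

```python
import json

# Hypothetical minimal schema: field name -> required Python type.
PAGE_VIEW_SCHEMA = {"user_id": int, "url": str, "ts": float}

def serialize(record, schema=PAGE_VIEW_SCHEMA):
    """Reject malformed records *at the producer*, so every downstream
    consumer can rely on one agreed format."""
    for field, ftype in schema.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"field {field!r} must be {ftype.__name__}")
    return json.dumps(record, sort_keys=True).encode("utf-8")

good = serialize({"user_id": 7, "url": "/home", "ts": 1.0})
print(good)  # b'{"ts": 1.0, "url": "/home", "user_id": 7}'

try:
    serialize({"user_id": "7", "url": "/home", "ts": 1.0})  # wrong type
except ValueError as e:
    print(e)
```

The point is where the check runs: a bad record dies at the producer with a clear error, instead of being published and blinding every consumer downstream.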
For monitoring, the power of this system is that the data flows of all upstream production systems can be correlated through it, whether for daily operations, data auditing, or ops-level monitoring.
Therefore, Kafka's design is basically the only serious option in this field at present. I have also looked at many other implementations, including:
The numbers in the "Stages" column indicate which pipeline stages each system covers:

1. Data acquisition
2. Data transmission
3. Real-time computation / indexing / search
4. Data storage / persistence
5. Data display / query / alarm interface

Name | Stages | Language | Notes
Scribe (Facebook) | 2 | C++ | No longer updated; not recommended
Flume (Apache, Cloudera) | 1 | Java | Configuration-heavy
Chukwa (Hadoop) | 1, 2 | Java | Last release in 2012; not recommended
Fluentd | 1 | Ruby | Active and looks good
Logstash | 1-5 | JRuby | Full-featured; said to have many small bugs
Splunk | 1-5 | C/Python | Commercial, closed source; powerful; useful as a reference
TimeTunnel (Alibaba) | 2 | Java | Based on Thrift; mature
Kafka (LinkedIn) | 2, 4 | Scala | Strong performance, ingenious design; can serve as infrastructure
Samza (LinkedIn) | 1-5 | | Roughly Kafka + YARN + Hadoop
RabbitMQ / ActiveMQ / Qpid | 2 | Java | Traditional messaging middleware
Storm (Twitter) | 3 | Clojure | Real-time computing system
JStorm (Alibaba) | 3 | Java | Java version of Storm; said to be more stable
S4 (Yahoo) | 3 | Java | Maintenance discontinued in 2013
StreamBase (IBM) | 3 | Java | Commercial product; for reference
HStreaming | 3 | Java | Commercial product; for reference
Spark | 3 | Scala | Based on Hadoop
MongoDB | 4 | C++ | Wastes disk space
MySQL | 4 | C++ | Needless to say
HDFS/HBase | 4 | Java | Needless to say
In terms of data-transmission design philosophy, Kafka is the most advanced.
Among the current implementations, I would guess Splunk is the only one that can compete with Kafka.
That concludes "What are the characteristics of Kafka?". To really master these points, you still need to practice with Kafka yourself; for more related articles, follow the industry information channel.