2025-04-06 Update From: SLTechnology News & Howtos
Shulou (Shulou.com) 06/02 Report --
This article explains in detail how Kafka achieves high concurrency with hundreds of thousands of writes per second. I hope you will come away with a solid understanding of the topic after reading it.
Introduction

There are many popular MQ systems today. Since our company chose Kafka during technology selection, I put together this introduction to it. In our comparison of the mainstream MQs in the industry, Kafka's biggest advantage was its high throughput.

Kafka is a high-concurrency, high-performance message middleware with high throughput and low latency, and it is widely used in the big data field. A well-configured Kafka cluster can achieve hundreds of thousands, or even millions, of concurrent writes per second.

So how does Kafka achieve such throughput and performance? Now that we are past the basics, let's take an in-depth look at Kafka's architectural design principles; mastering them is also an advantage in Internet-company interviews.
Persistence
Kafka relies on the file system for message storage and caching, writing data to disk every time it is received. The common impression that "disks are slow" makes people doubt that a persistence-based architecture can deliver strong performance.

In fact, disks are both much slower and much faster than people expect, depending on how they are used; a well-designed on-disk structure can often be as fast as the network.

Benchmark comparisons show that sequential disk access can in some cases be faster even than random memory access. Kafka exploits exactly this advantage to achieve high-performance disk writes.
See: http://kafka.apachecn.org/documentation.html#persistence
Page caching technology + disk sequential write
To guarantee disk write performance, Kafka first of all writes files through the operating system's page cache.

The operating system itself maintains a layer of in-memory cache called the page cache, also known as the os cache, meaning cache managed by the operating system itself.

When writing a disk file, the data can go straight into the os cache, that is, into memory only; the operating system then decides when to actually flush the data from the os cache to disk.

This alone greatly improves disk file write performance, because in effect it is writing to memory, not to disk.
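To make this concrete, here is a minimal Python sketch of the same idea (the file name is made up for illustration): a plain write() returns as soon as the data reaches the OS page cache, and only fsync() forces the kernel to flush it to the physical disk.

```python
import os
import tempfile

# A write() call normally returns once the data reaches the OS page cache;
# the kernel flushes it to disk later. fsync() forces that flush.
path = os.path.join(tempfile.mkdtemp(), "segment.log")

fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"message-1\n")   # fast: lands in the page cache, not the platter
os.fsync(fd)                   # slower: forces the cached pages out to disk
os.close(fd)

with open(path, "rb") as f:
    print(f.read())            # b'message-1\n'
```

Kafka leans on exactly this behavior: producers see the latency of a memory write, while the operating system batches the physical disk I/O in the background.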
Sequential disk writes
It is also critical that Kafka writes data to disk sequentially: it only appends data to the end of the file (append), rather than modifying data at random positions in the file.

For an ordinary mechanical hard drive, random writes perform very poorly because every write involves disk seek overhead. But if you only append data sequentially to the end of the file, sequential disk writes can approach the performance of writing to memory itself.
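As a rough illustration of the append-only pattern (a toy sketch, not Kafka's actual on-disk format), here is a minimal log in Python where every record is length-prefixed and only ever appended to the end of the file:

```python
import os
import tempfile

class AppendOnlyLog:
    """Toy append-only log: records only ever go to the end of the file,
    never to random positions -- the pattern behind Kafka's segment files."""

    def __init__(self, path):
        self._f = open(path, "ab")          # append mode: writes always hit EOF
        self._offsets = []                  # byte position of each record
        self._end = 0                       # next byte offset to write at

    def append(self, record: bytes) -> int:
        framed = len(record).to_bytes(4, "big") + record  # length-prefixed
        self._offsets.append(self._end)
        self._f.write(framed)
        self._f.flush()                     # hand the bytes to the OS page cache
        self._end += len(framed)
        return len(self._offsets) - 1       # logical offset, Kafka-style

    def read(self, offset: int) -> bytes:
        with open(self._f.name, "rb") as r:
            r.seek(self._offsets[offset])
            size = int.from_bytes(r.read(4), "big")
            return r.read(size)

log = AppendOnlyLog(os.path.join(tempfile.mkdtemp(), "00000000.log"))
log.append(b"hello")
log.append(b"kafka")
print(log.read(1))                          # b'kafka'
```

Because every write lands at the current end of the file, the disk head (on a mechanical drive) never has to seek, which is what makes the sequential pattern so cheap.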
To sum up: Kafka achieves its ultra-high write performance by combining page cache technology with sequential disk writes.

Therefore, the core of writing tens or even hundreds of thousands of records per second is to make each individual write as cheap as possible, so that more data can be written per unit of time and throughput rises.
Zero-copy technology (zero-copy)
Having covered the write path, let's talk about consumption.

As you know, we frequently consume data from Kafka. When consuming, we read a chunk of data from Kafka's disk files and send it to a downstream consumer.

If Kafka read data from disk and sent it to downstream consumers in the straightforward way, the process would go roughly like this:
First, check whether the data to be read is already in the os cache; if not, read it from the disk file into the os cache.

Then copy the data from the operating system's os cache into the application process's buffer, and copy it again from the application buffer into the operating-system-level Socket buffer. Finally, the data is pulled from the Socket buffer to the network card and sent on to the downstream consumer.
Walking through this process, there are two unnecessary copies.
One is the copy from the operating system's cache into the application process's buffer; the other is the copy from the application buffer back into the operating system's Socket buffer.

To perform these two copies, several context switches happen in between: the application executes for a while, then context switches back to the operating system, and so on.

So reading data this way is quite expensive.
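The expensive path above can be sketched in Python with a hypothetical send_traditional helper: each chunk is read() into a user-space buffer and then send()t back to the kernel, performing exactly the two copies just described.

```python
import socket

# Sketch of the traditional read path: every chunk is copied from the kernel's
# page cache into this user-space buffer, then back into the kernel's socket
# buffer. These two hops are exactly what zero copy eliminates.
def send_traditional(path, sock, chunk_size=64 * 1024):
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)   # copy: page cache -> user space
            if not chunk:
                break
            sock.sendall(chunk)          # copy: user space -> socket buffer
```

Every iteration also crosses the user/kernel boundary twice (once for read, once for send), which is where the context-switch overhead comes from.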
In order to solve this problem, Kafka introduces zero-copy technology when reading data.
In other words, data in the operating system's cache is sent directly to the network card and on to the downstream consumer, skipping the two extra copy steps. Only a descriptor is copied into the Socket buffer, not the data itself.

Take a moment to appreciate how elegant this process is.

With zero-copy technology, there is no need to copy data from the os cache into the application buffer, nor from the application buffer into the Socket buffer; both copies are skipped, which is why it is called zero copy.

The Socket buffer only receives a descriptor of the data; the data itself is sent straight from the os cache to the network card. This greatly improves the performance of reading file data during consumption.
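Here is a sketch of the zero-copy path using Python's socket.sendfile(), which wraps the kernel's sendfile(2) where available; Kafka itself does the same thing on the JVM via FileChannel.transferTo(). The helper name is made up for illustration.

```python
import socket

# Zero-copy sketch: socket.sendfile() uses os.sendfile() (the sendfile(2)
# system call) where available, so the file's pages go from the OS page cache
# straight to the socket without ever entering a user-space buffer.
def send_zero_copy(path, sock):
    with open(path, "rb") as f:
        sock.sendfile(f)   # kernel-to-kernel transfer; no app-level copy
```

Compared with the read()/send() loop earlier, the application never touches the bytes at all; it only tells the kernel which file region to ship to which socket.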
You will also notice that when reading from disk, Kafka first checks whether the data is already in the os cache in memory; if so, the data is effectively read directly from memory.

If the Kafka cluster is well tuned, you will find that a large amount of data is written directly to the os cache, and then read back from the os cache as well.

It is as if Kafka served both writes and reads entirely from memory, so the overall performance is extremely high.
Glossary

TPS (throughput): the amount of data successfully transferred per unit of time (measured in bits, bytes, packets, etc.) over a network, device, port, virtual circuit, or other facility.
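As a quick back-of-the-envelope example with assumed numbers (500,000 writes per second at roughly 1 KB per message; both figures are hypothetical, not measurements), the byte throughput works out as follows:

```python
# Back-of-the-envelope: what "hundreds of thousands of writes per second"
# means in bytes. The message rate and size here are assumed, not measured.
messages_per_sec = 500_000            # hypothetical write rate
message_size = 1_024                  # assume ~1 KB per message
throughput_mb = messages_per_sec * message_size / (1024 * 1024)
print(f"{throughput_mb:.2f} MB/s")    # 488.28 MB/s
```

In other words, sustaining that write rate means moving on the order of half a gigabyte per second through the page cache and onto disk.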
That's all on how Kafka achieves high concurrency with hundreds of thousands of writes per second. I hope the content above is helpful and teaches you something new. If you found the article useful, please share it so more people can see it.