In this issue, the editor walks you through the integration of SparkStreaming and Kafka. The article is rich in content and approaches the topic from a practical point of view; I hope you get something out of it.
Why is there an integration of SparkStreaming and Kafka?
First of all, we need to understand why SparkStreaming and Kafka are integrated at all; nothing happens for no reason!
We need to know that Spark, as a real-time computing framework, only handles computation; it does not store data, so it has to connect to external data sources. As a sub-module of Spark, SparkStreaming supports four types of data sources:
1. Socket data source (used for testing; a minimal sketch follows this list)
2. HDFS data source (used occasionally, but not often)
3. Custom data sources (not important; custom sources are rarely seen in practice)
4. Extended data sources (such as the Kafka data source, which is very important and a frequent interview question)
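To make item 1 concrete, here is a minimal sketch of the socket data source, assuming text is being fed to localhost:9999 (for example with `nc -lk 9999`); the host, port, and batch interval are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal socket-source word count, used only for local testing.
object SocketSourceDemo {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketSourceDemo")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```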
The following sections walk through the integration of SparkStreaming with Kafka, focusing on the principles; full production code is easy to find online, so only minimal sketches are included here. Write about what you actually understand!
SparkStreaming integrates Kafka-0.8
How SparkStreaming integrates with Kafka depends on the Kafka version. Let's start with the integration with Kafka-0.8.
With the SparkStreaming and kafka-0.8 integration, the easiest way to keep data from being lost is to rely on the checkpoint mechanism, but checkpointing has a problem: after a code upgrade, the old checkpoint becomes unusable. So if you really want to avoid losing data, you need to manage offsets yourself.
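As a minimal sketch of the checkpoint mechanism just described (the checkpoint path and the DStream graph are placeholders), recovery normally goes through StreamingContext.getOrCreate:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointDemo {
  // Placeholder path; in production this would usually live on HDFS.
  val checkpointDir = "/tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CheckpointDemo")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)
    // The DStream graph built here is serialized into the checkpoint files.
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this calls createContext(); on restart it deserializes
    // the old context from the checkpoint, which fails if the jar has changed.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```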
Code upgrades should not sound strange to anyone, and Lao Liu explains them well!
We often run into two situations in daily development: the code has a problem at first, so we fix it, repackage it, and resubmit it; or the business logic changes, and we also have to modify the code!
When we checkpoint for the first time, state tied to the whole related jar (including the serialized DStream graph) is written into binary files under the checkpoint directory. If SparkStreaming later tries to recover from that checkpoint but the code has changed, even a little, the serialized state can no longer be restored against the new jar, and data is lost!
So we need to manage the offset ourselves!
The approach is to use a ZooKeeper cluster to manage offsets. When the program starts, it reads the last committed offset; SparkStreaming then reads data from Kafka starting at that offset, the batch runs, and only after it finishes is the new offset committed to ZooKeeper. There is one small problem: if the program fails after part of the results have already reached HBase but before the offset was committed, re-reading causes duplicate data. Only one batch is affected, though, and for big data that impact is tiny! A sketch of this flow follows.
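A minimal sketch of this flow with the kafka-0-8 direct stream; readOffsetsFromZk and saveOffsetsToZk are hypothetical helpers standing in for real ZooKeeper persistence, and the broker and topic names are placeholders:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object ZkOffsetDemo {
  // Hypothetical helpers: real code would read/write znodes in ZooKeeper.
  def readOffsetsFromZk(): Map[TopicAndPartition, Long] =
    Map(TopicAndPartition("demo-topic", 0) -> 0L) // placeholder: partition 0, offset 0

  def saveOffsetsToZk(ranges: Array[OffsetRange]): Unit =
    ranges.foreach(r => println(s"commit ${r.topic}-${r.partition} -> ${r.untilOffset}"))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ZkOffsetDemo")
    val ssc = new StreamingContext(conf, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder

    // Resume from the offsets committed by the previous run.
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, readOffsetsFromZk(),
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // 1) process the batch and write results (e.g. to HBase) ...
      // 2) commit only afterwards; a crash between 1) and 2) replays
      //    just this one batch (at-least-once semantics).
      saveOffsetsToZk(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```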
But there is a more serious problem: when many consumers are processing data, offsets must be read and written constantly, and ZooKeeper, as a distributed coordination framework, is not suited to a high volume of read and write operations, especially writes. Hitting ZooKeeper with highly concurrent requests is inappropriate; it can only serve as lightweight metadata storage, not as a data store responsible for heavy concurrent reads and writes.
These limitations are what lead to the integration of SparkStreaming with Kafka-1.0.
SparkStreaming integrates Kafka-1.0
With Kafka-1.0 (the spark-streaming-kafka-0-10 connector), offsets can be committed back to Kafka itself instead of ZooKeeper. On top of that, the monitoring side is just a scheme built around KafkaOffsetMonitor: use it to monitor the jobs, scrape the monitoring information with a crawler, import the data into openfalcon, configure alarms (or build an alarm system) according to openfalcon policies, and finally push the alerts to developers via WeCom or SMS!
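A minimal sketch of the Kafka-1.0-style integration, assuming the spark-streaming-kafka-0-10 connector with placeholder broker, group, and topic names; offsets are committed back to Kafka's own offset storage via commitAsync instead of going through ZooKeeper:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies}

object Kafka010Demo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("Kafka010Demo")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",               // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "demo-group",                          // placeholder
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean))  // commit manually

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("demo-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch ...
      // Commit to Kafka itself; monitoring tools can then track consumer lag.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```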
That is what the integration of SparkStreaming and Kafka looks like. If you happen to have similar doubts, you may find the analysis above helpful. If you want to learn more, you are welcome to follow the industry information channel.