2025-03-31 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article analyzes in detail how Spark Streaming can use Kafka while guaranteeing zero data loss, walking through the relevant mechanisms and their trade-offs. Hopefully it helps readers looking for a simple, practical answer to this problem.
Spark Streaming with Kafka: guaranteeing zero data loss
Spark Streaming has provided a zero-data-loss guarantee since version 1.2. To get this guarantee, the following conditions must be met:
1. The input data comes from reliable sources and reliable receivers
2. The application's metadata is checkpointed by the driver
3. WAL (write-ahead log) is enabled
## Reliable sources and receivers
Spark Streaming can ingest data from a variety of sources, including Kafka. Input data is received through receivers and stored in Spark with replication (for fault tolerance it is copied to two executors by default). Only once the data has been replicated does the receiver acknowledge it (for Kafka, by updating the offsets in ZooKeeper). This way, if a receiver crashes while receiving data, no data is lost: any data the receiver had not yet replicated is simply re-received after the receiver recovers.
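To make the reliable-receiver contract concrete, here is a minimal sketch in plain Python (no real Spark or Kafka; the class, its fields, and the in-memory "storage" are all hypothetical stand-ins). The key invariant is that the offset is advanced only after the data is safely stored, so a crash mid-receive just means the same records are fetched again on restart:

```python
# Sketch of the reliable-receiver contract: acknowledge (advance the
# committed offset) only AFTER the received data is safely stored.

class ReliableReceiver:
    def __init__(self):
        self.store = []            # stands in for replicated executor storage
        self.committed_offset = 0  # stands in for the offset kept in ZooKeeper

    def receive(self, records, crash_before_ack=False):
        self.store.extend(records)             # store (and replicate) first
        if crash_before_ack:
            return                             # crash: offset NOT advanced
        self.committed_offset += len(records)  # ack only after storage

    def restart_and_refetch(self, source):
        # On recovery, re-read everything after the last committed offset.
        missing = source[self.committed_offset:]
        self.store = self.store[:self.committed_offset] + missing
        self.committed_offset = len(source)

source = ["a", "b", "c", "d"]
r = ReliableReceiver()
r.receive(source[:2])                          # "a","b" stored and acked
r.receive(source[2:], crash_before_ack=True)   # crash before the ack
r.restart_and_refetch(source)                  # "c","d" re-received: no loss
```

Because the un-acknowledged records are re-fetched from the source, a crash between storage and acknowledgement costs a re-read, never a loss.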
## Metadata checkpoint
Reliable sources and receivers allow data to be recovered after a receiver failure, but recovery after a driver failure is more complex. One approach is to checkpoint the metadata to HDFS or S3. The metadata includes:
Configuration
Code
RDDs queued for processing but not yet finished (only their metadata, not the data itself)
So when the driver fails, the metadata checkpoint can be used to rebuild the application and determine where to resume execution.
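A minimal sketch of the idea in plain Python (the file name, JSON layout, and field names are illustrative assumptions, not Spark's actual checkpoint format): the driver periodically serializes its configuration and the queue of unfinished batches, so a restarted driver can load that state and resume:

```python
# Sketch of driver metadata checkpointing: serialize configuration and
# pending-batch metadata durably, then restore it after a driver restart.
import json
import os
import tempfile

def checkpoint(path, metadata):
    # Write to a temp file and rename, so a crash mid-write never
    # leaves a corrupted checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(metadata, f)
    os.replace(tmp, path)

def recover(path):
    with open(path) as f:
        return json.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "metadata.json")
checkpoint(ckpt, {
    "configuration": {"batch_interval_s": 2},
    "pending_batches": [17, 18],   # queued but not yet finished
})
restored = recover(ckpt)           # what a restarted driver would load
```

The atomic rename mirrors why real checkpoint directories are written carefully: a half-written checkpoint would be worse than a slightly stale one.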
## Scenarios where data may still be lost
Even with reliable sources and receivers plus metadata checkpointing, data can still be lost. Consider the following sequence:
Two executors fetch the received data and hold it in their memory.
The receivers acknowledge that the data has been received.
The executors start processing the data.
The driver suddenly fails.
When the driver fails, its executors are killed.
Because the executors are killed, all the data in their memory is lost and will never be processed.
Data held only in executor memory is unrecoverable.
## WAL
To avoid the scenario above, Spark Streaming 1.2 introduced the WAL. All data received by the receivers is also written to the checkpoint directory in HDFS or S3, so that when the driver fails and the in-memory data on the executors is lost, it can be recovered from the checkpoint.
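The mechanism can be sketched in a few lines of plain Python (lists stand in for WAL files on HDFS and for executor memory; the function name is hypothetical). The essential ordering is that the durable write happens before the data is handed to the executor:

```python
# Sketch of a write-ahead log: append each record durably BEFORE it is
# handed to the executor, so records lost from executor memory when the
# driver dies can be replayed from the log on recovery.

wal = []              # stands in for the WAL files under the checkpoint dir
executor_memory = []

def receive_with_wal(record):
    wal.append(record)              # 1. durable write first
    executor_memory.append(record)  # 2. then hand to the executor

for rec in ["a", "b", "c"]:
    receive_with_wal(rec)

executor_memory.clear()             # driver failure: executors are killed

recovered = list(wal)               # replay the WAL after restart
```

Nothing is lost because every record that ever reached executor memory also reached the log first.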
## At-least-once
Although the WAL guarantees zero data loss, it cannot guarantee exactly-once semantics. Consider the following scenario:
The receiver receives data and saves it to the WAL on HDFS or S3.
The receiver fails before updating the offset.
Spark Streaming considers the data successfully received, but Kafka considers it not yet consumed, because the offset was never updated in ZooKeeper.
The receiver then recovers.
The data recoverable from the WAL is consumed again, because the Kafka high-level consumer API is used and consumption restarts from the offsets saved in ZooKeeper.
## Disadvantages of WAL
As described above, the WAL has two disadvantages:
It degrades receiver throughput, because every record must also be written to a distributed file system such as HDFS.
For some sources the data ends up stored twice. With Kafka, for example, one copy lives in Kafka itself and another in the Spark Streaming WAL (in a Hadoop-API-compatible file system).
## Kafka direct API
To address the WAL's performance cost and to achieve exactly-once semantics, Spark Streaming 1.3 introduced the Kafka direct API. The idea is elegant: the Spark driver computes the offset ranges for the next batch and instructs the executors to consume the corresponding topics and partitions directly. Consuming Kafka messages becomes like reading files from a file system.
1. Kafka receivers are no longer needed; executors consume data directly through the Kafka API
2. The WAL is no longer needed; on failure recovery, the data can simply be consumed from Kafka again
3. Exactly-once semantics can be guaranteed, since data is no longer re-read from a WAL
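The driver/executor split of the direct approach can be sketched in plain Python (the partition contents, batch size, and function names are illustrative assumptions, not the real API): the driver carves the un-consumed offsets into ranges, and each executor deterministically reads exactly its range, so replaying a failed batch re-reads identical records.

```python
# Sketch of the direct approach: the driver plans offset ranges per
# batch; executors read exactly those ranges straight from Kafka.
# No receiver, no WAL -- a replayed range yields the same data.

kafka_partition = [f"m{i}" for i in range(10)]  # one topic-partition
batch_size = 4

def plan_batches(last_committed, latest):
    # Driver side: carve [last_committed, latest) into offset ranges.
    ranges = []
    start = last_committed
    while start < latest:
        end = min(start + batch_size, latest)
        ranges.append((start, end))
        start = end
    return ranges

def consume(offset_range):
    # Executor side: deterministic read of exactly this range.
    start, end = offset_range
    return kafka_partition[start:end]

ranges = plan_batches(0, len(kafka_partition))
batches = [consume(r) for r in ranges]
# Re-running a failed batch re-reads the identical records:
assert consume(ranges[1]) == batches[1]
```

Determinism is the point: because an offset range fully identifies its data, retrying a batch cannot produce different or duplicated records, which is what makes end-to-end exactly-once achievable.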
In summary, Spark Streaming uses several mechanisms to ensure that data is not lost, and the direct API additionally makes exactly-once semantics achievable; each release of Spark Streaming has grown more stable and closer to production-ready.
This concludes the analysis of how Spark Streaming uses Kafka to ensure zero data loss. Hopefully the above content is of some help; if you still have questions, you can follow the industry information channel to learn more.
© 2024 shulou.com SLNews company. All rights reserved.