

Lesson 4: Complete Mastery of Spark Streaming's Exactly-Once Transaction Handling and Non-Repetitive Output


Prerequisite knowledge:

1. Characteristics of the transaction here: 1) data is processed once and only once; 2) results are output once and only once.

2. Can Spark Streaming's transaction processing fail completely?

This is unlikely. Spark processes streams in batches, the necessary resources are allocated to the application when the Spark Streaming application starts, and more resources can be allocated dynamically during scheduling, so unless the hardware of the entire cluster fails, processing will generally complete.

3. Spark Streaming programs are written in terms of the Driver and the Executors.

Spark Streaming architecture flow:

1. Basic Spark Streaming architecture flow:

1) The Receiver continuously receives data and reports the metadata to the Driver; 2) after receiving the metadata, the Driver performs a CheckPoint for the safety of the data; 3) jobs are executed on the Executors, with scheduling based entirely on Spark Core.

Spark Streaming basic architecture flowchart (figure omitted).

WAL (write ahead log) mechanism: incoming data is first written to the file system through the WAL, and only then stored in the Executor, where it goes to disk or memory depending on the StorageLevel setting. If the WAL write does not succeed, the data is not stored in the Executor; if it is not stored in the Executor, it is not reported to the Driver and will not be processed.
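
A minimal sketch of turning the WAL on, assuming the classic DStream API; the checkpoint path, host, and port are placeholders, not values from this lesson:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalEnabledApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WalEnabledApp")
      // Write every received block to the write ahead log before it is used.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // The WAL is kept under the checkpoint directory, so it must sit on a reliable file system such as HDFS.
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoint")

    // With the WAL enabled, a non-replicated StorageLevel is usually enough for the received blocks.
    val lines = ssc.socketTextStream("some-host", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}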

The Receiver only stores data to memory or disk once it has accumulated a block of a certain size; if the Executor or Receiver crashes before a block has accumulated, that small amount of data can be lost.

Spark Streaming does two things: 1) receive data; 2) generate jobs, which must be executed through the SparkContext.

The data recovery process when a crash occurs:

1) Driver-level recovery reads the data directly from the Driver's checkpoint in the file system; internally, the SparkContext (and StreamingContext) are restarted, the metadata is restored, the RDDs are generated again (recovery is based on the previously generated jobs), and the jobs are submitted to the cluster.

2) Receiver recovery continues receiving new data on top of what was received before, and the data that had already been received is recovered from disk through the WAL mechanism.
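
A minimal sketch of Driver-side recovery through the checkpoint, using the standard StreamingContext.getOrCreate pattern; the checkpoint path, host, and port are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  val checkpointDir = "hdfs://namenode:8020/spark/checkpoint" // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableApp")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // Define the whole DStream graph inside the factory so it can be rebuilt on restart.
    ssc.socketTextStream("some-host", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this calls createContext(); after a Driver crash it rebuilds the
    // StreamingContext (and SparkContext) from the checkpoint metadata and resubmits the unfinished jobs.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}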

Exactly-Once transaction processing:

1) Zero data loss: there must be a reliable data source and a reliable Receiver, the metadata of the entire application must be CheckPointed, and data safety must be ensured through the WAL. (Take reading data from Kafka as an example: when the Receiver running on an Executor receives data from Kafka, it sends an ACK back to Kafka and reads the next message, and Kafka updates the offset to record what the Receiver has received; this ensures zero data loss on the Executor side. A sketch of this path follows.)
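
A minimal sketch of this receiver-based Kafka path, assuming the old spark-streaming-kafka (Kafka 0.8) artifact; the ZooKeeper quorum, consumer group, and topic name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverKafkaApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReceiverKafkaApp")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true") // zero data loss on the Executor side

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoint") // metadata CheckPoint and WAL storage

    // The Receiver ACKs Kafka, and Kafka/ZooKeeper track the offsets that have been received.
    val stream = KafkaUtils.createStream(ssc, "zk1:2181,zk2:2181", "my-group", Map("my-topic" -> 1))
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}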

2) To avoid the performance loss of the WAL and to implement Exactly Once, Spark provides the Kafka Direct API, which treats Kafka as a file-storage-like system. This combines the advantages of streaming with the advantages of a file system, so Spark Streaming + Kafka builds a perfect streaming world: 1) data does not need to be copied and replicated again; 2) there is no performance loss caused by the WAL; 3) Kafka is more efficient than HDFS because it uses ZeroCopy. All Executors consume data directly through the Kafka API and manage the offsets directly, so they do not consume data repeatedly.
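
A minimal sketch of the Direct API, again assuming the spark-streaming-kafka (Kafka 0.8) artifact; the broker list and topic name are placeholders. There is no Receiver and no WAL: each batch reads an exact offset range straight from Kafka, and the application can see and manage those offsets itself:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectKafkaApp {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("DirectKafkaApp"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    stream.foreachRDD { rdd =>
      // Each RDD knows exactly which offset range it covers, so the application can
      // manage offsets itself and avoid repeated consumption.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach(r => println(s"${r.topic} ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}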

Data loss and its specific solutions:

When the Receiver has received data and the Driver starts scheduling Executors to compute on it, if the Driver suddenly crashes, the Executors are killed along with it, and the data inside the Executors is lost (if no WAL operation has been performed).

Solution: at this point the WAL must be used, for example via HDFS, so that all data is first made safe and fault-tolerant. Then, if the data in an Executor is lost, it can be recovered through the WAL. (The drawback of this approach is that the WAL significantly hurts the performance with which Receivers in Spark Streaming accept data.)

Repeated reading of data:

In the Kafka receiver-based case, the Receiver receives data and saves it to a persistence engine such as HDFS, but crashes before it has time to update the offsets. When the Receiver restarts, it reads the data again based on the metadata in the ZooKeeper instance that manages Kafka. At this point Spark Streaming considers the data successfully processed, while Kafka considers it failed (because the offset was never updated in ZooKeeper), and this leads to the data being consumed again.

Solution: in the ZooKeeper/Receiver-based approach, access Kafka's metadata information when reading the data, and when processing it in code such as foreachRDD or transform, write that information to an in-memory database; during computation, read the in-memory database to determine whether the record has already been processed, and skip it if it has. This metadata can be kept in an in-memory data structure, or in memsql or sqllite. (If Kafka is used as the data source, the data already exists in Kafka, and the Receiver makes another copy when it receives it, which is in fact a waste of storage resources.) A sketch of this check is shown below.
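
A minimal sketch of that dedup check inside transform. ProcessedStore and messageId are hypothetical stand-ins: the store plays the role of the external metadata database (e.g. memsql or sqllite) mentioned above, and a JVM-local set like this only deduplicates correctly in local mode; it is here purely to show the shape of the check:

import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.streaming.dstream.DStream

// Stand-in for an external metadata store recording which records were already processed.
object ProcessedStore {
  private val seen = ConcurrentHashMap.newKeySet[String]()
  // Returns true the first time an id is seen, false for every repeat.
  def markIfNew(id: String): Boolean = seen.add(id)
}

object DedupExample {
  // messageId is a hypothetical function that extracts a stable id from each record.
  def dedup(stream: DStream[String], messageId: String => String): DStream[String] =
    stream.transform { rdd =>
      // Drop records whose id was already processed in an earlier batch.
      rdd.filter(record => ProcessedStore.markIfNew(messageId(record)))
    }
}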

Data output is repeated multiple times:

Why is this a problem? Because Spark Streaming is built on Spark Core, and Spark Core naturally does the following things when computing (for example, after the data has been output, an error occurs in a later part of the task, the task fails, and Spark Core then falls into the following behavior):

Task retry; slow-task speculation (two identical tasks may run at the same time); Stage retry; Job retry.

Solutions (a configuration sketch follows this list):

Set spark.task.maxFailures to 1.

Set spark.speculation to off (speculative execution of slow tasks consumes a lot of resources anyway, so turning it off can significantly improve Spark Streaming processing performance).

With Spark Streaming on Kafka, if a Job fails, auto.offset.reset can be set to "largest".
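
A minimal configuration sketch of the three settings above; the values follow the text, while the application name and broker list are placeholders:

import org.apache.spark.SparkConf

object ExactlyOnceOutputConf {
  val conf = new SparkConf()
    .setAppName("ExactlyOnceOutput")
    .set("spark.task.maxFailures", "1")  // do not retry failed tasks
    .set("spark.speculation", "false")   // no speculative (duplicate) copies of slow tasks

  // For the old Kafka 0.8 consumer: after a failed job, start from the newest offsets.
  val kafkaParams = Map(
    "metadata.broker.list" -> "broker1:9092,broker2:9092",
    "auto.offset.reset"    -> "largest")
}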

Finally, it is emphasized again that transform and foreachRDD, combined with business-logic code, can be used to control processing so that data is neither consumed nor output repeatedly. These two operators are like back doors into Spark Streaming, through which almost any control operation you can imagine can be performed.
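
A minimal sketch of using foreachRDD as such a back door, assuming a Direct Kafka stream whose RDDs expose their offset ranges; saveResultsAndOffsets is a hypothetical helper standing in for whatever transactional store is used, so that the batch results and the consumed offsets are committed together:

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

object TransactionalOutput {
  // Hypothetical helper: write results and offsets to an external store in one transaction.
  def saveResultsAndOffsets(results: Array[(String, Long)], offsets: Array[OffsetRange]): Unit = {
    // BEGIN; upsert results; upsert offsets; COMMIT -- placeholder for a real transactional store.
  }

  def writeBatch(stream: DStream[(String, String)]): Unit =
    stream.foreachRDD { rdd =>
      // Only RDDs produced by the Direct API carry their Kafka offset ranges.
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val counts  = rdd.flatMap(_._2.split("\\s+")).map((_, 1L)).reduceByKey(_ + _).collect()
      // Committing results and offsets together means a replayed batch cannot produce duplicate output.
      saveResultsAndOffsets(counts, offsets)
    }
}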

Note:

1. DT Big Data DreamWorks WeChat official account: DT_Spark

2. IMF big data hands-on practice, YY live broadcast at 8:00 p.m., channel number: 68917580

3. Sina Weibo: http://www.weibo.com/ilovepains
