

How to handle Exactly once transactions




This article introduces how to handle Exactly once transactions. Many people run into this problem in real cases, so let the editor walk you through how to deal with these situations. I hope you read it carefully and get something out of it!

1, Exactly once transactions

What is an Exactly once transaction?

The data is processed exactly once and the result is output exactly once; that is a complete transaction.

Spark cannot guarantee that the output is also transactional when a runtime error occurs. If an error occurs in the middle of Task execution, the processing semantics still hold (the data is processed only once), but the output may not: Spark retries failed Tasks, so if the result is written to a database, it can end up being saved to the database multiple times.

The overall flow is as follows: the Receiver running on an Executor receives data and writes it to memory and disk through the BlockManager (or additionally writes a log through the WAL mechanism), then reports the metadata to the Driver. Checkpoint operations are performed periodically on the Driver side, and Jobs are executed as Tasks on Executors following the Spark Core scheduling model.

Processing of Exactly once transactions:

1, Zero data loss: there must be a reliable data source and a reliable Receiver, the metadata of the whole application must be checkpointed, and data safety must be guaranteed through the WAL.

Take Kafka as the data source as an example. When the Receiver running on an Executor receives data from Kafka, it sends an ACK confirmation to Kafka and reads the next message, and Kafka updates the offset to record what the Receiver has received. This ensures zero data loss on the Executor side.

On the Driver side, the checkpoint operation is performed periodically. When an error occurs, the state is read back from the checkpoint file system to recover: the StreamingContext is rebuilt (which also rebuilds the SparkContext) and started, the metadata is restored, the RDDs are regenerated, the last Job is restored, and it is then submitted to the cluster for execution again.
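A minimal sketch of this recovery path, assuming a hypothetical checkpoint directory and a 5-second batch interval (neither comes from the article):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecovery {
  // Assumed checkpoint location; in production this should live on a
  // fault-tolerant file system such as HDFS.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("exactly-once-demo")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir) // persist metadata for Driver recovery
    // ... define the DStream pipeline here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this calls createContext(); after a Driver crash it
    // rebuilds the StreamingContext (and SparkContext) from the checkpoint,
    // restores the metadata, and resubmits the unfinished Jobs.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```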

So where might data be lost, and what are the corresponding solutions?

After the Receiver receives the data and the Driver starts scheduling Executors to compute on it, if the Driver suddenly crashes, the Executors are killed along with it, and the data held in the Executors is lost (if no WAL has been written).

Solution: at this point the WAL must be used, for example on HDFS, to make all the data safe and fault-tolerant first. Then, if the data in an Executor is lost, it can be recovered from the WAL.

The disadvantage of this method is that the WAL significantly hurts the performance of Receivers receiving data in Spark Streaming.
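A sketch of how the WAL is typically switched on for a receiver-based source; the host, port, and checkpoint directory here are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-demo")
  // Turn on the write-ahead log for receiver-based input; received blocks
  // are logged to the checkpoint directory before being acknowledged.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // WAL files live here

// With the WAL enabled, in-memory replication is unnecessary, so a
// non-replicated, serialized storage level is the usual choice.
val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER)
// ... build the rest of the pipeline on `lines` ...
```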

Repeated reading of data:

When the Receiver has received the data and saved it to a persistence engine such as HDFS, but has not had time to update the offsets (taking Kafka as an example), then after the Receiver crashes and restarts, it will read the data again based on the metadata in the ZooKeeper that manages Kafka's offsets. At this point Spark Streaming considers the data successfully processed, but Kafka considers it failed (because the offset was never updated to ZooKeeper), which leads to the data being consumed again.

Solution: with the ZooKeeper-based Receiver approach, access Kafka's metadata information when reading the data, and when processing it in code such as foreachRDD or transform, write a record to an in-memory store; during computation, read that store to determine whether the data has already been processed, and skip the computation if it has, as sketched below. This metadata can be kept in an in-memory data structure, or in MemSQL or SQLite.
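A minimal sketch of that idea, using a Driver-side in-memory set keyed by batch time (a simplification: the MemSQL or SQLite variants mentioned above would survive a Driver restart, while this set would not):

```scala
import scala.collection.mutable
import org.apache.spark.streaming.dstream.DStream

// Hypothetical in-memory "already processed" registry living on the Driver.
val processedBatches = mutable.Set[Long]()

def writeOnce(counts: DStream[(String, Int)]): Unit = {
  counts.foreachRDD { (rdd, batchTime) =>
    // This check runs on the Driver, once per batch.
    if (!processedBatches.contains(batchTime.milliseconds)) {
      rdd.foreachPartition { records =>
        records.foreach { case (k, v) =>
          // ... write (k, v) to the external store here ...
        }
      }
      processedBatches += batchTime.milliseconds // mark the batch as done
    } // already processed: skip the computation entirely
  }
}
```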

If Kafka is used as the data source, the data already exists in Kafka, and the Receiver makes another copy of it when receiving, which is actually a waste of storage resources.

To avoid the performance loss of the WAL and implement Exactly Once, Spark provides the Kafka Direct API, which uses Kafka as a file storage system. This combines the advantages of streaming with the advantages of a file system, so Spark Streaming + Kafka builds a perfect streaming world (1, the data does not need to be copied; 2, there is no performance loss caused by the WAL; 3, Kafka uses zero copy and is more efficient than HDFS). All Executors pull the data directly through the Kafka API and manage the offsets directly, so there is no duplicate consumption of the data.
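A minimal sketch of the Direct approach using the spark-streaming-kafka-0-10 integration; the broker address, topic, and group id are assumptions, and ssc is the StreamingContext from the earlier sketch:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092", // assumed broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "exactly-once-demo",
  "enable.auto.commit" -> (false: java.lang.Boolean) // manage offsets ourselves
)

// No Receiver and no WAL: Executors read the data directly from Kafka.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  // Offsets travel with the RDD, so the application controls exactly when
  // they are recorded.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process and output rdd here ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```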

2, The output is not duplicated

On the repeated writes of Spark Streaming output data and their solution:

1, Why does this problem exist? Because Spark Streaming is built on Spark Core, it naturally does the following things while computing, which causes (part of) the Spark Streaming results to be output repeatedly: Task retry, slow-task speculation, Stage retry, and Job retry.

2. Specific solutions:

Set spark.task.maxFailures to 1, so that there are no Task retries. Set spark.speculation to false (off), so that there is no slow-task speculation; since speculation consumes a lot of performance, turning it off can also significantly improve Spark Streaming's processing performance.
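In SparkConf form (the application name is a placeholder):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("no-retry-demo")
  .set("spark.task.maxFailures", "1") // a Task failure fails the Job at once: no retries
  .set("spark.speculation", "false")  // no speculative duplicates of slow tasks
```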

For Spark Streaming on Kafka, if a Job fails, you can set Kafka's auto.offset.reset parameter to largest mode (resume from the newest offset).
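For reference, in the 0.8-era consumer configuration this looks like the following; the broker address is a placeholder, and note that consumers from Kafka 0.10 onward spell the same value "latest":

```scala
// Old (0.8) Kafka consumer parameters for the direct stream.
val legacyKafkaParams = Map[String, String](
  "metadata.broker.list" -> "localhost:9092", // assumed broker address
  "auto.offset.reset" -> "largest"            // on restart, jump to the newest offset
)
```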

Finally, it is worth emphasizing again: transform and foreachRDD can be used to apply control based on the business logic code, achieving non-repeated consumption and output of the data, as in the sketch below. These two methods are like back doors of Spark Streaming, through which almost any control operation you can imagine can be done.
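One common pattern is to make the output itself idempotent inside foreachRDD, so that a replayed batch leaves the store unchanged. A sketch assuming a DStream[(String, Int)] of word counts and a PostgreSQL table word_counts with word as its unique key (all names here are illustrative):

```scala
import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

def saveIdempotently(wordCounts: DStream[(String, Int)]): Unit = {
  wordCounts.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One connection per partition, opened on the Executor.
      val conn = DriverManager.getConnection("jdbc:postgresql://localhost/demo")
      val stmt = conn.prepareStatement(
        """INSERT INTO word_counts (word, cnt) VALUES (?, ?)
          |ON CONFLICT (word) DO UPDATE SET cnt = EXCLUDED.cnt""".stripMargin)
      records.foreach { case (word, cnt) =>
        stmt.setString(1, word)
        stmt.setInt(2, cnt)
        stmt.executeUpdate() // upsert: a retried batch rewrites the same rows
      }
      stmt.close()
      conn.close()
    }
  }
}
```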

This is the end of "how to handle Exactly once transactions". Thank you for reading. If you want to learn more about the industry, you can follow the website, where the editor will keep publishing high-quality practical articles for you!



