

(Version Customization) Lesson 4: Complete Details of Spark Streaming Transactions


This article covers two main topics:

1. Exactly Once

2. Output is not repeated

Transaction:

Take a bank transfer as an example: user A transfers money to user B. Without a transactional guarantee, B might receive the money more than once. Transaction consistency means the output is produced once and only once: A transfers exactly once and B receives exactly once.

Analyzing the Spark Streaming architecture from a transaction perspective:

When a Spark Streaming application starts, resources are allocated; unless the hardware of the entire cluster fails, this step generally causes no problems. A Spark Streaming program is divided into two parts: the Driver and the Executors. After receiving data, the Receiver continuously sends metadata to the Driver, and the Driver checkpoints the metadata as it arrives. The checkpoint includes: the configuration (Spark Conf, Spark Streaming, and other settings), block metadata, the DStreamGraph, and unprocessed or pending Jobs. Of course, Jobs can execute on multiple Executor nodes, entirely under the Spark Core scheduling model.

The Executors are responsible only for processing logic and data. The external input stream flows into the Receiver, which writes it through the BlockManager to disk, memory, and the WAL (write-ahead log) for fault tolerance. Because the WAL is written to disk before the data is handed to the Executor for processing, loss at that stage is unlikely. However, suppose 1 GB of data needs to be processed and the Receiver accumulates records up to a certain threshold before writing them to the WAL; if the Receiver thread fails before that write, the accumulated data may be lost.

The Driver checkpoints the metadata before processing it. Spark Streaming uses the metadata to generate Jobs, but generating a Job does not execute it; execution must go through the SparkContext. Driver-level recovery requires reading the metadata back from the Driver's checkpoint: the SparkContext, StreamingContext, and pending Jobs are rebuilt internally and then resubmitted to the Spark cluster to run. When a Receiver is restored, its data is recovered from disk through the WAL.
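A minimal sketch of this recovery path (the application name and checkpoint directory below are placeholders, not from the original article): the Driver checkpoints its metadata to a directory, and on restart StreamingContext.getOrCreate rebuilds the context from that checkpoint instead of creating a fresh one.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoveryDemo {
  // Placeholder checkpoint location; in production this is typically an HDFS path
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  // Factory that builds a fresh context; only invoked when no checkpoint exists yet
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointRecoveryDemo")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Driver metadata (configuration, block metadata, DStreamGraph, pending Jobs)
    // is persisted under this directory
    ssc.checkpoint(checkpointDir)
    // ... define the DStream operations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // After a Driver crash, the context is rebuilt from the checkpoint
    // and unfinished Jobs are resubmitted to the cluster
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}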

Combining Spark Streaming with Kafka avoids the WAL data-loss problem, because Kafka itself serves as a reliable external source; Spark Streaming must take such external pipelines into account.

How can we get complete semantics and transaction consistency, that is, guarantee zero data loss and Exactly Once processing?

How do we ensure zero data loss?

You must have a reliable data source and a reliable Receiver. The metadata of the entire application must be checkpointed, and data safety must be ensured through the WAL. (In a production environment, when the Receiver receives data from Kafka, by default two copies are kept across Executors, and both copies must exist before computation begins. If the Receiver fails and no copy exists, the data is fetched from Kafka again based on the metadata in ZooKeeper.)
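A rough sketch of this receiver-based setup, assuming the older spark-streaming-kafka (Kafka 0.8) integration; the ZooKeeper address, consumer group, topic name, and checkpoint path are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("ReliableReceiverDemo")
  // Persist received blocks to the write-ahead log before they are acknowledged
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // WAL files live under the checkpoint directory

// Receiver-based stream: the Kafka high-level consumer tracks offsets in ZooKeeper.
// With the WAL enabled, in-memory replication can be reduced to a single serialized copy.
val lines = KafkaUtils.createStream(
  ssc, "zk1:2181", "demo-group", Map("events" -> 1), StorageLevel.MEMORY_AND_DISK_SER)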

You can think of Kafka as a simple file storage system. In the Executor, the Receiver confirms that each record from Kafka has been received; only after the record has been replicated to another Executor does the Receiver send an acknowledgment (ack) back to Kafka and go on to read the next message.

Think again about where data may be lost.

The main scenarios for data loss are as follows:

After the Receiver has received data and the Driver has scheduled the work, Executors begin computing. If the Driver suddenly crashes, its Executors are killed along with it, and any data held only in those Executors is lost. To handle this, the WAL mechanism must be used so that all data is fault-tolerantly persisted, in a manner similar to HDFS; data lost when Executors are killed can then be recovered through the WAL.

Here are two important scenarios to consider:

How can data be processed once and only once?

Zero data loss does not guarantee Exactly Once: if the Receiver has received and saved the data but crashes before it has time to update the offsets, the data will be reprocessed.

To explain the scenario of repeated data reads in more detail:

The Receiver receives data and persists it to HDFS (through the WAL), but crashes before updating the offsets. When the Receiver restarts, it reads the offsets again from the ZooKeeper instance that manages Kafka. From Spark Streaming's point of view the data was received successfully, but Kafka considers the read a failure (because the offsets were not updated in ZooKeeper before the crash), so the data is consumed again on recovery, which leads to duplicate consumption.

A supplementary note on performance:

The WAL guarantees that data is not lost, but its drawback is that it significantly hurts the throughput of Receivers in Spark Streaming (which is why production environments today usually go through the Kafka direct API instead).

Note that if Kafka is used as the data source, the data already sits in Kafka, so keeping another copy when the Receiver accepts it is a waste of storage. As for the problem of repeated reads: when data is read, its metadata can be placed in an in-memory database, which is then checked before computation to see whether that data has already been processed.
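A hedged sketch of that check, using a local concurrent set as a stand-in for the in-memory database (in practice this would be a shared store such as Redis); DedupRegistry and the key format are illustrative names, not from the original text.

import java.util.concurrent.ConcurrentHashMap

object DedupRegistry {
  // Stand-in for the in-memory database; holds keys of records already processed
  private val processedKeys = ConcurrentHashMap.newKeySet[String]()

  // Returns true only the first time a key such as "topic-partition-offset" is seen,
  // so data that is read again after a Receiver restart is skipped instead of recomputed
  def markIfNew(recordKey: String): Boolean = processedKeys.add(recordKey)
}

// Usage inside a batch, keeping only records that have not been seen before:
// batch.filter { case (recordKey, _) => DedupRegistry.markIfNew(recordKey) }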

Spark 1.3 introduced the Kafka direct API, which avoids the WAL's performance cost and makes Exactly Once achievable. Kafka is used as a file-storage-like system, so it offers the advantages of streaming and of a file system at the same time. At this point, Spark Streaming plus Kafka builds a complete streaming world!
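A minimal sketch of the direct API (spark-streaming-kafka, available since Spark 1.3); the broker address and topic name are placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DirectApiDemo")
val ssc = new StreamingContext(conf, Seconds(5))

// No Receiver and no WAL: each batch computes its Kafka offset ranges up front
// and reads them straight from the brokers, so a failed batch is simply re-read
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))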

Data needs no extra copy and no WAL performance penalty, and no Receiver is required: all Executors consume data directly through the Kafka direct API and manage the offsets themselves, so data is not consumed repeatedly and transaction consistency is achieved.
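Continuing with the directStream created in the sketch above: each batch RDD produced by the direct API exposes the exact offset ranges it covers, so the application can track offsets itself. saveOffsets below is a hypothetical helper standing in for whatever offset store (ZooKeeper, a database, and so on) the application uses.

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Hypothetical offset store; a real application would write to ZooKeeper, a database, etc.
def saveOffsets(ranges: Array[OffsetRange]): Unit = ()

directStream.foreachRDD { rdd =>
  // RDDs coming straight from createDirectStream carry their Kafka offset ranges
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch here ...

  // Persist offsets only after processing succeeds (this block runs on the Driver);
  // a crash before this point means the same ranges are re-read, never skipped
  saveOffsets(ranges)
}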

The last question concerns Spark Streaming output being written multiple times, and how to solve it:

Why is this a problem? Because Spark Streaming is built on Spark Core, and the following Spark Core behaviors during computation can cause Spark Streaming results to be output repeatedly (in part):

1. Task retry

2. Speculative execution of slow tasks

3. Stage retry

4. Job retry

Any of these can cause results to be lost or output more than once.

Corresponding solutions:

1. Make a failed task fail the whole Job: set spark.task.maxFailures to 1.

2. Set spark.speculation to false (speculative execution consumes considerable resources, so turning it off can noticeably improve Spark Streaming processing performance).

3. For Spark Streaming on Kafka, if a Job fails, set Kafka's auto.offset.reset to largest so that the Job automatically resumes from the latest offsets.
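A minimal sketch of those three settings together; the broker address is a placeholder.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("NoRepeatedOutput")
  .set("spark.task.maxFailures", "1") // (1) a failed task fails its Job instead of being retried
  .set("spark.speculation", "false")  // (2) no speculative re-execution of slow tasks

// (3) for Spark Streaming on Kafka, resume from the latest offsets after a Job failure
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "auto.offset.reset"    -> "largest"
)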

Finally, to emphasize once more:

Based on your business logic, you can use transform and foreachRDD to take control of processing and achieve non-repeated consumption and output of data. These two methods are like back doors into Spark Streaming: they let you perform almost any control operation you can imagine.
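A hedged sketch of that idea, reusing the directStream from the direct-API sketch above: transform exposes each batch's raw RDD so business-level dedup can be applied before processing, and foreachRDD performs the output through an idempotent write, so a Task, Stage, or Job retry overwrites its earlier output instead of duplicating it. alreadyProcessed and upsertByKey are hypothetical helpers, not Spark APIs.

// Hypothetical helpers: alreadyProcessed consults a business-level dedup store,
// and upsertByKey does an idempotent write keyed by the record key
def alreadyProcessed(key: String): Boolean = false
def upsertByKey(key: String, value: String): Unit = ()

directStream
  .transform { rdd =>
    // transform exposes the raw batch RDD, so records that were already handled
    // can be filtered out before any further computation
    rdd.filter { case (key, _) => !alreadyProcessed(key) }
  }
  .foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Idempotent output: writing the same key twice leaves a single row,
      // so retries do not duplicate the results
      records.foreach { case (key, value) => upsertByKey(key, value) }
    }
  }

The essential design point is that the output operation itself is idempotent: whatever Spark Core retries underneath, the externally visible result is written only once.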
