
How to understand the data reliability and consistency of Spark Streaming


Many readers who are new to Spark Streaming are not sure how to reason about the reliability and consistency of its data. This article summarizes where the problem comes from and how it is solved, and hopefully it will help you work through it.

At present, one of the hottest topics in the big data field is stream computing, and the most prominent project there is undoubtedly Spark Streaming from the Spark community. It has received wide attention and developed rapidly since its birth, and now shows a tendency to catch up with and even surpass Storm.

For stream computing, the core feature is undoubtedly low latency, which mainly comes from computing data without landing it on disk. But this also raises the question of data reliability: when a node fails or the network misbehaves, how should nodes coordinate to retransmit data properly? Going further, if unplanned retransmission does happen, how do we guarantee that no duplicate data is produced, that is, that every record is processed Exactly Once? If these problems are not solved, stream computing over big data cannot meet the reliability requirements of most enterprises and the effort is wasted.

The following focuses on how Spark Streaming designs its reliability mechanisms and achieves data consistency.

Driver HA

Because a stream computing system runs for a long time while data keeps flowing in, the reliability of the Spark Driver process is very important; it determines whether the Streaming program can keep running correctly.

Figure 1 Driver data persistence

The solution for Driver to implement HA is to persist the metadata so that the state can be restored after a restart. As shown in figure 1, the metadata for Driver persistence includes:

Block metadata (green arrow in figure 1): metadata of the Blocks that the Receiver assembles from the data it receives over the network

Checkpoint data (orange arrow in figure 1): includes configuration items, DStream operations, incomplete Batch status, and generated RDD data, etc.

Figure 2 Driver fault recovery

After the Driver restarts following a failure:

Restore computation (orange arrow in figure 2): use the Checkpoint data to restart the Driver, reconstruct the context and restart the Receiver.

Restore metadata blocks (green arrow in figure 2): restore block metadata.

Restore unfinished jobs (red arrow in figure 2): use the recovered metadata to regenerate the RDDs and the corresponding jobs, then submit them to the Spark cluster for execution.

Through the above backup and recovery mechanism, the Driver can restart after a failure and resume the Streaming tasks without losing data, thus providing system-level high reliability for the data.
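As a minimal sketch of how an application opts into this recovery path (the application name and checkpoint directory below are illustrative, not from the article), a Streaming job typically builds its context through StreamingContext.getOrCreate, so that a restarted Driver reconstructs its state from the checkpointed data:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DriverHAExample {
  // Illustrative checkpoint path; in practice this should be a reliable store such as HDFS.
  val checkpointDir = "hdfs:///user/spark/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("driver-ha-example")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir) // persist Checkpoint data (configuration, DStream operations, Batch state)
    // ... define the DStream operations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On the first start this calls createContext(); after a Driver failure
    // it rebuilds the context from the checkpoint directory instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}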

Reliable upstream and downstream IO system

Stream computing exchanges data with external IO systems mainly through network socket communication. Because network communication is unreliable, the sender and receiver need a protocol that provides acknowledgement of received packets and retransmission on failure.

Not every IO system supports retransmission; it requires at least persistent data streams, high throughput and low latency. Among the data sources officially supported by Spark Streaming, only Kafka meets all of these requirements at the same time, which is why recent Spark Streaming releases treat Kafka as the recommended external data system.

Besides serving as an input (inbound) data source, Kafka is usually also used as an output (outbound) data source. All real-time systems subscribe to and publish data through Kafka as a message queue, which decouples the producers and consumers of streaming data.

The data flow view of a typical enterprise big data center is as follows:

Figure 3 Data flow view of an enterprise big data center

Besides guaranteeing that data can be retransmitted from the source, Kafka is an important building block for the Exactly Once semantics of streaming data. Kafka provides a set of low-level APIs through which a client can read a topic's data stream as well as its metadata. Each receiving task in Spark Streaming can fetch the data stream from a specified Kafka topic, partition and offset, so the data boundary of every task is very clear. After a task fails, it can receive exactly this part of the data again without producing "overlapping" data, thus ensuring that the stream data is "processed once and only once".
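For illustration only (the OffsetRange class comes from the spark-streaming-kafka module and the topic, partition and offsets below are made-up values, not from the article's code), such a task boundary can be written down as a per-partition offset range:

import org.apache.spark.streaming.kafka.OffsetRange

// A task's data boundary: topic, partition, and the half-open offset interval it covers.
val range = OffsetRange.create("user-events", 0, 1000L, 2000L)

// If the task fails, re-reading exactly [fromOffset, untilOffset) from Kafka
// reproduces the same records without overlapping neighbouring tasks.
println(s"${range.topic}-${range.partition}: ${range.fromOffset} to ${range.untilOffset}")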

A reliable receiver

Prior to Spark 1.3, Spark Streaming pulled data streams from the Kafka cluster by starting a dedicated Receiver task.

After the Receiver task starts, it creates the topicMessageStreams object using Kafka's high-level API and reads and caches the data stream record by record; at every batchInterval the JobGenerator submits the cached data to generate a Spark computation task.
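From the application side, this receiver-based path is what KafkaUtils.createStream sets up; a minimal usage sketch (the ZooKeeper address, group id and topic map are assumptions, and ssc is an existing StreamingContext) looks like this:

import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based input stream built on Kafka's high-level consumer API.
val lines = KafkaUtils.createStream(
  ssc,                       // an existing StreamingContext
  "zk1:2181",                // ZooKeeper quorum used by the high-level consumer
  "spark-streaming-group",   // consumer group id
  Map("user-events" -> 1)    // topic -> number of receiver threads
).map(_._2)                  // keep only the message value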

Because Receiver tasks run the risk of going down, Spark provides a more reliable receiver, ReliableKafkaReceiver, for reliable data collection. It uses the WAL (Write Ahead Log) feature introduced in Spark 1.2 to persist each batch of received data to disk and update the offset information of the topic-partition before receiving the next batch of Kafka data. If the Receiver fails, the already received data can be recovered from the WAL after a restart, which avoids data loss caused by a Receiver node going down (the following code omits the detailed logic):

class ReliableKafkaReceiver {
  private var topicPartitionOffsetMap: mutable.HashMap[TopicAndPartition, Long] = null
  private var blockOffsetMap: ConcurrentHashMap[StreamBlockId, Map[TopicAndPartition, Long]] = null

  override def onStart(): Unit = {
    // Initialize the topic-partition / offset hash map.
    topicPartitionOffsetMap = new mutable.HashMap[TopicAndPartition, Long]
    // Initialize the block generator for storing Kafka message.
    blockGenerator = new BlockGenerator(new GeneratedBlockHandler, streamId, conf)
    messageHandlerThreadPool = Utils.newDaemonFixedThreadPool(topics.values.sum, "KafkaMessageHandler")
    blockGenerator.start()
    val topicMessageStreams = consumerConnector.createMessageStreams(topics, keyDecoder, valueDecoder)
    topicMessageStreams.values.foreach { streams =>
      streams.foreach { stream =>
        messageHandlerThreadPool.submit(new MessageHandler(stream))
      }
    }
  }
}
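In user code, this WAL-backed path is normally switched on through configuration rather than by instantiating ReliableKafkaReceiver directly. A hedged sketch under typical assumptions (the property name is the one documented for Spark 1.2+, the application name and checkpoint directory are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("reliable-receiver-example")
  // Ask receivers to write received blocks to the write ahead log before acknowledging them.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// WAL files live under the checkpoint directory, so a reliable directory (e.g. on HDFS) is required.
ssc.checkpoint("hdfs:///user/spark/streaming-checkpoint")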

Although enabling WAL reduces the data reliability risk of the Receiver, the overhead of disk persistence noticeably lowers the overall system throughput. Therefore, the latest release, Spark 1.3, adds a Direct API for accessing Kafka data sources in Spark Streaming.

With the Direct API, Spark Streaming no longer starts a resident Receiver task; instead it directly assigns each Batch's RDD the latest topic partition offsets. When the job runs, the Executors use Kafka's simple consumer API to fetch the data in that offset range.
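A minimal usage sketch of this Direct approach (broker address and topic are illustrative, ssc is an existing StreamingContext, and the API shown is the one shipped with the spark-streaming-kafka module in Spark 1.3):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct stream: no resident receiver; each batch's RDD is defined by an offset range per partition.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // illustrative broker list
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("user-events"))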

This approach not only avoids the data reliability risk of a Receiver going down, but also avoids tracking offsets through ZooKeeper, thereby achieving exactly-once processing of the data (the following code omits the detailed logic):

class DirectKafkaInputDStream {
  protected val kc = new KafkaCluster(kafkaParams)
  protected var currentOffsets = fromOffsets

  override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
    val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
    val rdd = KafkaRDD[K, V, U, T, R](
      context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)
    currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
    Some(rdd)
  }
}

Write Ahead Log (WAL)

Starting with Spark 1.2, a write ahead log capability is provided for persisting and recovering Receiver data and Driver metadata. The WAL achieves persistence because it stores the data on reliable HDFS.

The core API of the Spark Streaming write ahead log mechanism includes:

WriteAheadLogManager, which manages the WAL files

WriteAheadLogWriter and WriteAheadLogReader, which write and read the WAL

WriteAheadLogBackedBlockRDD, the RDD backed by the WAL

WriteAheadLogBackedBlockRDDPartition, the Partition backed by the WAL

Figure 4 shows how these core APIs interact during the data receiving and recovery phases.

Figure 4 Schematic diagram of WAL-based data reception and recovery

As the WriteAheadLogWriter source code clearly shows, every time a data buffer is written to HDFS, the flush method is called to force it onto disk before the next record is fetched. The data received by the Receiver is therefore guaranteed to be persisted to disk, giving better data reliability.

private[streaming] class WriteAheadLogWriter {
  private lazy val stream = HdfsUtils.getOutputStream(path, hadoopConf)

  def write(data: ByteBuffer): WriteAheadLogFileSegment = synchronized {
    data.rewind() // Rewind to ensure all data in the buffer is retrieved
    val lengthToWrite = data.remaining()
    val segment = new WriteAheadLogFileSegment(path, nextOffset, lengthToWrite)
    stream.writeInt(lengthToWrite)
    if (data.hasArray) {
      stream.write(data.array())
    } else {
      while (data.hasRemaining) {
        val array = new Array[Byte](data.remaining)
        data.get(array)
        stream.write(array)
      }
    }
    flush()
    nextOffset = stream.getPos()
    segment
  }
}

After reading the above, have you mastered how to understand the data reliability and consistency of Spark Streaming? If you want to learn more skills or learn more about related content, you are welcome to follow the industry information channel. Thank you for reading!
