This article shares how Spark Streaming works, as part of an introduction to Spark 2.x. The editor finds it very practical and shares it here in the hope that you get something out of it after reading.
A rough translation of the description on the official website is as follows:
Spark Streaming is an extension of the core Spark API that supports scalable, high-throughput, fault-tolerant processing of real-time data streams. Data can be ingested from many sources, such as Kafka, Flume, Kinesis, or TCP sockets, and can be processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards. In fact, you can apply Spark's machine learning and graph processing algorithms to data streams.
How it works: Spark Streaming receives real-time input data streams and divides the data into batches, which are then processed by the Spark engine to produce the final result stream, also in batches.
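This micro-batch model can be sketched with the standard Scala Streaming API. The snippet below is a minimal illustration rather than code from this article; the master setting, application name, host, and port are placeholder assumptions. The batch interval passed to the StreamingContext is what divides the incoming stream into batches.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Local mode with 2 threads: one to receive data, one to process it.
// "local[2]" and the application name are placeholders for illustration.
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")

// The batch interval (1 second here) controls how the input stream
// is divided into batches before the Spark engine processes them.
val ssc = new StreamingContext(conf, Seconds(1))

// An input DStream built from a TCP socket (host and port are placeholders).
val lines = ssc.socketTextStream("localhost", 9999)
```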
DStream (discretized stream) is the basic abstraction provided by Spark Streaming. It represents a continuous data stream, either an input data stream received from a source or the processed data stream generated by transforming an input stream. Internally, a DStream is represented by a series of consecutive RDDs, Spark's abstraction of an immutable distributed dataset. Each RDD in a DStream contains the data from one batch interval.
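To make the "series of RDDs" concrete, the following continues the sketch above: foreachRDD exposes the RDD behind each batch interval together with the batch time.

```scala
// Continuing the sketch above: each batch interval of the "lines" DStream
// is backed by one RDD, which foreachRDD exposes along with the batch time.
lines.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}
```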
Any operation applied to a DStream is translated into operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines into words, the flatMap operation is applied to each RDD of the lines DStream to generate the RDDs of the words DStream.
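In the same sketch, the lines-to-words example looks like this; flatMap, map, and reduceByKey on the DStream are applied, batch by batch, to the underlying RDDs.

```scala
// Still continuing the sketch: transform the lines DStream into a words DStream.
// flatMap is applied to each RDD of "lines" to produce the RDDs of "words".
val words = lines.flatMap(_.split(" "))

// Count the words within each batch.
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Print a few counts for every batch, then start the computation.
wordCounts.print()
ssc.start()
ssc.awaitTermination()
```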
These underlying RDD transformations are computed by the Spark engine. The DStream operations hide most of these details and provide developers with a higher-level API. These operations are discussed in detail in later sections.
Comparative analysis of three stream processing frameworks: Spark Streaming, Flink, and Storm
Throughput: Spark Streaming - high; Flink - high; Storm - low
Real-time latency: Spark Streaming - second-level delay; Flink - low latency, milliseconds (around 100 ms); Storm - low latency, milliseconds (tens of ms)
Out-of-order and late data handling: Spark Streaming - none; Flink - supported via watermarks (which Spark Streaming does not have); Storm - none
Processing guarantee: Spark Streaming - exactly-once; Flink - exactly-once; Storm - at-least-once
Dynamic adjustment of parallelism: Spark Streaming - not supported
Fault tolerance: Spark Streaming - checkpoints based on RDDs; Flink - checkpoints based on distributed snapshots; Storm - ack mechanism on individual records
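For the Spark Streaming row, fault tolerance via RDD-based checkpoints is typically enabled as sketched below. This is an illustrative sketch, not code from this article: the checkpoint directory, host, port, and application name are placeholder assumptions, and the driver-recovery pattern relies on StreamingContext.getOrCreate.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint directory; in practice usually an HDFS or S3 path.
val checkpointDir = "hdfs:///tmp/streaming-checkpoints"

// Build a fresh context and define the full DStream graph inside this function,
// so the graph can be reconstructed from the checkpoint after a driver failure.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setMaster("local[2]").setAppName("CheckpointSketch")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir) // persist streaming metadata (and stateful RDDs) here
  ssc.socketTextStream("localhost", 9999).count().print()
  ssc
}

// Recover from the checkpoint if one exists, otherwise create a new context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```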
This is how Spark Streaming works, as introduced for Spark 2.x. The editor believes it covers knowledge points you may see or use in daily work, and hopes you can learn more from this article.