
Summary of Spark Streaming technical points


Shulou(Shulou.com)06/03 Report--

Spark Streaming supports scalable, high-throughput, fault-tolerant stream processing of real-time data streams.

Architecture diagram

The characteristics are as follows:

Scales linearly to hundreds of nodes or more

Achieves sub-second processing latency

Integrates seamlessly with Spark batch and interactive processing

Provides a simple API for implementing complex algorithms

Supports many streaming sources, including Kafka, Flume, Kinesis, Twitter, ZeroMQ, etc.

001. Principle

After receiving a real-time input data stream, Spark Streaming divides the data into batches, hands them to the Spark engine for processing, and generates the final stream of results, also in batches.

002. API

DStream:

DStream (Discretized Stream) is the high-level abstraction that Spark Streaming provides to represent a continuous data stream.

Composition: a DStream can be regarded as a sequence of RDDs.

Core idea: the streaming computation is treated as a series of small-interval, mutually state-independent, deterministic batch tasks, and the input data received in each interval is reliably stored across the cluster as an input dataset.

Features: a high-level functional programming API, strong consistency, and efficient fault recovery.

Application template:

Template 1

Template 2
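
Neither template's code survived in this copy. Below is a minimal sketch of the basic structure, assuming a local two-core master and a placeholder app name (the second template in write-ups of this kind usually adds checkpoint-based recovery, which is sketched later under Driver recovery):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Template: create a StreamingContext with a batch interval, define the
    // DStream pipeline, then start the computation and wait for termination.
    val conf = new SparkConf().setMaster("local[2]").setAppName("MyStreamingApp")
    val ssc = new StreamingContext(conf, Seconds(1))

    // ... define input DStreams and transformations here ...

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // block until stopped or an error occurs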

WordCount example
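
The example code is also missing; a sketch along the lines of the official NetworkWordCount example, assuming a text source on localhost:9999:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Split each batch of lines into words and count them.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()  // output operation: print the first elements of each batch

    ssc.start()
    ssc.awaitTermination()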

Input DStream:

An Input DStream is a DStream that obtains the raw data stream from a streaming source. Sources fall into basic input sources (file systems, sockets, Akka actors, custom sources) and advanced input sources (Kafka, Flume, etc.).

Receiver:

Each Input DStream (except file streams) corresponds to a single Receiver object, which receives data from the source and stores it in Spark's memory for processing. An application can create multiple Input DStreams to receive several data streams in parallel.

Each Receiver is a long-running Task on a Worker/Executor, so it permanently occupies one of the cores allocated to the application. If the number of cores allocated to a Spark Streaming application is less than or equal to the number of Input DStreams (that is, the number of Receivers), the application can receive data but has no capacity left to process it (file streams are the exception, since they need no Receiver).

Spark Streaming ships with wrappers for a variety of data sources; refer to the official documentation as needed.

Transformation Operations

Commonly used Transformations

updateStateByKey(func)

updateStateByKey maintains a running state per key: the data in each batch of the DStream is reduced by key and then merged into the state accumulated from previous batches.

UpdateStateByKey version of WordCount
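
The snippet is missing here; a sketch of the stateful variant, reusing the pairs stream from the WordCount sketch above (the checkpoint directory path is a placeholder; updateStateByKey requires one):

    // Checkpointing must be enabled for stateful transformations.
    ssc.checkpoint("checkpoint")  // placeholder path

    // Merge this batch's counts for a key into its running total.
    def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
      Some(newValues.sum + runningCount.getOrElse(0))

    val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
    runningCounts.print()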

transform(func)

Creates a new DStream by applying an RDD-to-RDD function to each RDD of the source DStream.

Official documentation code example
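
The referenced snippet is not in this copy; a sketch in the spirit of the official example, assuming a made-up static blacklist RDD and the pairs stream from the WordCount sketch:

    // transform exposes each RDD of the stream, so arbitrary RDD-to-RDD
    // operations (here a join plus a filter) can be applied per batch.
    val blacklistRDD = ssc.sparkContext.parallelize(Seq(("spam", true)))

    val cleanedDStream = pairs.transform { rdd =>
      rdd.leftOuterJoin(blacklistRDD)                    // (word, (count, Option[flag]))
         .filter { case (_, (_, flag)) => flag.isEmpty } // keep words not blacklisted
         .map { case (word, (count, _)) => (word, count) }
    }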

Window operations

Window operations: data transformations computed over a sliding window (to my mind similar to Storm's tick mechanism, but more powerful).

Parameters: the window length and the slide interval, both of which must be multiples of the source DStream's batch interval.

For example, with a window length of 3 and a sliding interval of 2, each windowed RDD covers the last 3 batches of the source DStream and a new one is produced every 2 batches (in the original figure, the upper row is the source DStream and the lower row is the windowed DStream).

Common window operations

Official documentation code example
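
The referenced snippet is missing; a sketch based on the official example, counting words over the last 30 seconds of data every 10 seconds (reusing the pairs stream from above):

    // Both durations must be multiples of the batch interval.
    val windowedWordCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,  // associative reduce function
      Seconds(30),                // window length
      Seconds(10)                 // sliding interval
    )
    windowedWordCounts.print()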

join(otherStream, [numTasks])

Joins two data streams.

Official document code example 1

Official document code example 2
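
Neither snippet survived; sketches along the lines of the official documentation, assuming two key-value DStreams named stream1 and stream2:

    import org.apache.spark.streaming.{Seconds, Minutes}

    // Stream-stream join: in each batch interval, the RDD generated by
    // stream1 is joined with the RDD generated by stream2 on their keys.
    val joinedStream = stream1.join(stream2)

    // Joins over windowed streams work the same way:
    val windowedJoin = stream1.window(Seconds(20)).join(stream2.window(Minutes(1)))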

Output Operations
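
The original list of output operations did not survive; the common ones in the official API are print(), saveAsTextFiles(), and foreachRDD(). A sketch of foreachRDD(), the most general one, reusing wordCounts from above:

    // foreachRDD runs an arbitrary function on each generated RDD, typically
    // to push data to an external system.
    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // In real code, create one connection per partition, not per record.
        partition.foreach(record => println(record))  // placeholder sink
      }
    }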

Caching and persistence:

Each RDD in a DStream can be kept in memory by calling persist().

DStreams produced by window operations are automatically persisted in memory, without an explicit call to persist().

For data streams received over the network (e.g., Kafka, Flume, sockets, ZeroMQ, RocketMQ), the default persistence level replicates the serialized data to two nodes for fault tolerance.
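
For illustration, this default corresponds to StorageLevel.MEMORY_AND_DISK_SER_2, which can also be passed explicitly when creating a receiver-based stream:

    import org.apache.spark.storage.StorageLevel

    // Serialized, replicated to two nodes (the default for receiver streams).
    val lines = ssc.socketTextStream("localhost", 9999,
      StorageLevel.MEMORY_AND_DISK_SER_2)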

Checkpoint:

Purpose: Spark relies on fault-tolerant storage systems such as HDFS or S3 for failure recovery.

Classification:

Metadata checkpointing: saves the information defining the streaming computation so that the node running the Driver can be recovered after a failure; this includes the configuration that created the application, the DStream operations defined by the application, and batches that were queued but not completed.

Data checkpointing: saves the generated RDDs. Because stateful transformations merge data across multiple batches, a generated RDD depends on the RDDs of previous batches (a dependency chain); to shorten the chain and thus reduce failure-recovery time, intermediate RDDs are periodically saved to reliable storage (such as HDFS).

When to use:

Stateful transformations: updateStateByKey() and window operations.

Applications that require Driver recovery.

003. Usage

Stateful transformations: enable checkpointing by calling ssc.checkpoint(directory) before using updateStateByKey() or window operations (see the updateStateByKey sketch above).

Applications that require Driver recovery (take WordCount as an example): if the checkpoint directory exists, rebuild the StreamingContext from the checkpoint data; otherwise (for example, on first run), create a new StreamingContext.
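
A sketch of the getOrCreate pattern from the official documentation (checkpointDirectory, conf, and the pipeline body are placeholders):

    // Only invoked when no checkpoint exists: build and fully define a fresh
    // StreamingContext, including its checkpoint directory.
    // conf and checkpointDirectory are assumed to be defined elsewhere.
    def functionToCreateContext(): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(1))
      // ... define the WordCount pipeline here ...
      ssc.checkpoint(checkpointDirectory)
      ssc
    }

    // Rebuild from checkpoint data if the directory exists; otherwise create anew.
    val context = StreamingContext.getOrCreate(checkpointDirectory,
      functionToCreateContext _)
    context.start()
    context.awaitTermination()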

Checkpoint interval

Method: call checkpoint(interval) on the DStream (a sketch follows).

Principle: generally set the interval to 5-10 times the sliding interval.

Analysis: checkpointing adds storage overhead and increases batch processing time. When the batch interval is small (for example, 1 second), checkpointing every batch may reduce operation throughput; conversely, checkpointing too infrequently causes the lineage and task sizes to grow.
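
A sketch of setting the interval (dstream stands for any stateful DStream; the value is illustrative):

    // Checkpoint roughly every 5-10 sliding intervals.
    dstream.checkpoint(Seconds(10))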

004. Performance tuning

Reduce batch processing time:

Data reception parallelism

Add DStreams: when receiving network data (e.g., from Kafka, Flume, or sockets), the data is deserialized and then stored in Spark. Since each DStream has only one Receiver object, consider creating multiple Input DStreams and unioning them if data reception becomes a bottleneck (see the sketch after this list).

Set the "spark.streaming.blockInterval" parameter: the received data is stored in Spark memory and will be merged into block, and the number of block determines the number of Task; for example, when the batch interval is 2 seconds and the block interval is 200ms, the number of Task is about 10; if the number of Task is too low, CPU resources are wasted; the recommended minimum block interval is 50ms.

Explicitly repartition the Input DStream: repartition the received data before any deeper processing.
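
A sketch of the multiple-receiver pattern mentioned above (the stream count and source are illustrative):

    // Several receivers ingest in parallel; union merges them for processing.
    val numStreams = 4
    val streams = (1 to numStreams).map(_ => ssc.socketTextStream("localhost", 9999))
    val unifiedStream = ssc.union(streams)
    val repartitioned = unifiedStream.repartition(8)  // explicit repartition before heavy work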

Data processing parallelism: for operations such as reduceByKey and reduceByKeyAndWindow, parallelism can be controlled with the "spark.default.parallelism" parameter or by passing the level of parallelism explicitly as a method argument.

Data serialization: the more efficient Kryo serialization can be configured (see the combined sketch below).
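
A sketch combining the knobs above in one SparkConf (all values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.blockInterval", "200ms")  // receiver block interval
      .set("spark.default.parallelism", "8")          // default shuffle parallelism
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")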

Set reasonable batch interval

Principle: the rate at which data is processed must keep up with the rate at which it arrives; that is, the batch processing time should be less than or equal to the batch interval.

Methods:

First set the batch interval to a conservative 5-10 seconds and use a slow data input rate

Then, by checking the "Total delay" reported in the log4j logs, gradually adjust the batch interval, ensuring that "Total delay" stays below the batch interval.

Memory tuning

Persistence level: enable compression of persisted data via the "spark.rdd.compress" parameter.

GC policy: enable the CMS garbage collector on both the Driver and the Executors.
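
A sketch of these two settings (flags are illustrative; CMS availability depends on the JVM version):

    val conf = new SparkConf()
      .set("spark.rdd.compress", "true")  // compress persisted RDDs
      .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
    // For the Driver, pass the flag at submit time, e.g.:
    //   spark-submit --driver-java-options "-XX:+UseConcMarkSweepGC" ...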
