
Lesson 2: A thorough understanding of Spark Streaming through a case


The contents of this issue:

1. Decrypting the Spark Streaming operation mechanism

2. Decrypting the Spark Streaming architecture

In the stream-processing era, data that cannot be processed in real time loses its value, which gives Spark Streaming strong appeal and broad prospects. Backed by the Spark ecosystem, Spark Streaming can easily call on other powerful frameworks such as Spark SQL and MLlib, putting it in a commanding position.

At runtime, Spark Streaming is not so much a streaming framework built on Spark Core as one of the most complex applications running on Spark Core. If you can master an application as complex as Spark Streaming, no other complex application will pose a problem. Choosing Spark Streaming as the entry point for version customization is also the general trend.

We know that every step Spark Core performs is based on RDDs, and there are dependencies between RDDs. The RDD DAG in the figure above shows three actions, which trigger three jobs; the RDD dependencies are resolved bottom-up, and the jobs generated from the RDDs are then executed. As can be seen from the DStream Graph, the logic of DStream is basically the same as that of RDD: it adds a time dependency on top of RDD. The RDD DAG can be called the space dimension, which is to say that the whole of Spark Streaming adds one more dimension, time, so it can also be called the space-time dimension.
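To make this concrete, here is a minimal sketch (with made-up data, not the figure's actual pipeline) in which three actions on the same RDD lineage trigger three separate jobs:

    import org.apache.spark.{SparkConf, SparkContext}

    object ThreeActionsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("ThreeActions").setMaster("local[2]"))

        val words  = sc.parallelize(Seq("spark", "streaming", "spark")) // made-up data
        val counts = words.map((_, 1)).reduceByKey(_ + _) // transformations only build the DAG

        counts.count()                        // action 1 -> job 1
        counts.collect()                      // action 2 -> job 2
        counts.saveAsTextFile("/tmp/counts")  // action 3 -> job 3 (hypothetical output path)

        sc.stop()
      }
    }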

From this point of view, Spark Streaming can be placed in a coordinate system: the Y axis is the RDD operations, whose dependencies constitute the logic of the whole job, and the X axis is time. As time passes, each fixed time interval (the Batch Interval) generates a job instance that runs in the cluster.

For Spark Streaming, as data flows in from different data sources (for example, Flume and Kafka), it is cut into a series of fixed data sets, or event collections, based on fixed time intervals. This matches the fact that an RDD is defined over a fixed data set. In fact, the RDD Graph that a DStream generates at each fixed time interval is based on the data set of that particular batch.
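A minimal sketch of this on the time axis, with a socket source standing in for Kafka or Flume (the host, port, and 5-second interval are illustrative assumptions): each Batch Interval slices the input into one fixed data set and triggers one job.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BatchIntervalSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("BatchIntervalSketch").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5)) // Batch Interval = 5s: the step on the X axis

        // Every 5-second window of lines becomes one fixed data set (one batch).
        val lines  = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print() // output operation: one job per batch

        ssc.start()
        ssc.awaitTermination()
      }
    }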

As can be seen from the figure above, the RDD dependency structure of the spatial dimension is the same for each batch; what differs is the size and content of the data flowing into each of the five batches, so a different RDD dependency instance is generated for each. In other words, the RDD Graph is born from the DStream Graph: the DStream is the template of the RDD, and a different RDD Graph instance is generated at each time interval.

Starting from Spark Streaming itself, it needs:

1. A generation template for the RDD DAG: the DStream Graph

2. A timeline-based job controller

3. inputStreamings and outputStreamings to represent the input and output of data

4. The specific jobs to run on the Spark Cluster; because the stream keeps flowing in whether or not the cluster can digest it, system fault tolerance is very important

5. Transaction processing: we want each piece of incoming data to be processed, and processed only once, i.e. how to guarantee Exactly-once transaction semantics in the event of a crash (see the checkpoint sketch after this list)
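Here is a minimal sketch of the standard checkpoint-recovery pattern (the checkpoint directory, host, and port are hypothetical): StreamingContext.getOrCreate restores the DStream graph and pending batches from the checkpoint after a crash, or builds a fresh context on first start. Note that end-to-end Exactly-once additionally requires a replayable source and idempotent or transactional output, which this sketch omits.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointSketch {
      val checkpointDir = "/tmp/streaming-checkpoint" // hypothetical path

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("CheckpointSketch").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint(checkpointDir) // persist the DStream graph and batch metadata

        val lines = ssc.socketTextStream("localhost", 9999) // input stream
        lines.count().print()                               // output stream
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Restore from the checkpoint after a crash, or build a fresh context on first start.
        val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
        ssc.start()
        ssc.awaitTermination()
      }
    }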

Interpreting DStream from the source code

As the source shows, DStream is the core of Spark Streaming, just as RDD is the core of Spark Core, and like RDD it has dependencies and a compute method. More crucial is the following code:
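(The referenced code block did not survive here. Judging from the description below, it is the generatedRDDs field of DStream.scala in the Spark source, paraphrased roughly as follows; the exact wording varies between Spark versions.)

    // In org.apache.spark.streaming.dstream.DStream (paraphrased):

    // RDDs generated so far, keyed by batch time.
    @transient
    private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

    // The members that mirror RDD's lineage API:
    def dependencies: List[DStream[_]]            // upstream DStreams this one depends on
    def compute(validTime: Time): Option[RDD[T]]  // generate the RDD for the batch at validTime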

This is a HashMap with time as the key and RDD as the value, which shows that, as time passes, RDDs are constantly generated, jobs are generated according to their dependencies, and the JobScheduler runs them on the cluster. This again confirms that the DStream is the template of the RDD.

DStream can be said to sit at the logical level while RDD sits at the physical level; what a DStream expresses is ultimately realized through transformations of RDDs. The former is the higher-level abstraction, the latter the underlying implementation. A DStream is, in effect, the encapsulation of a collection of RDDs in the time dimension: the relationship between DStream and RDD is that RDDs are continuously generated as time passes, and an operation on a DStream is an operation on the RDD of each fixed time interval.
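A minimal sketch of this logical/physical split (source, host, and port are again illustrative): the pipeline below is declared once on the DStream, and foreachRDD exposes the concrete RDD that the template instantiates for each batch.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TemplateSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("TemplateSketch").setMaster("local[2]"), Seconds(5))

        ssc.socketTextStream("localhost", 9999) // declared once: the logical level
          .map(_.toUpperCase)
          .foreachRDD { (rdd, time) =>
            // A different RDD instance arrives here every Batch Interval: the physical level.
            println(s"batch $time -> RDD id ${rdd.id}, ${rdd.count()} records")
          }

        ssc.start()
        ssc.awaitTermination()
      }
    }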

Summary:

The business logic of the spatial dimension is applied to the DStream. As time passes, each Batch Interval forms a specific data set and generates an RDD; transformations on that RDD form the RDD dependencies, the RDD DAG, and hence the job. The JobScheduler then, according to the time schedule and based on the RDD dependencies, publishes the jobs to the Spark Cluster to run, continuously generating Spark jobs.
