How does the Spark Streaming operating mechanism work? Many newcomers are not clear about this, so to help solve that problem, this article explains it in detail. Anyone who needs it is welcome to follow along, and hopefully you will gain something.
1: The relationship between Spark's sub-frameworks
In the last lesson we took a dimensionality-reduction approach to survey the overall running process of Spark Streaming. To recap: Spark Streaming is in fact an application built on top of Spark Core, and if you want to build powerful Spark applications of your own, Spark Streaming is a reference worth studying. Spark Streaming involves the cross-coordination of multiple jobs and touches all of Spark's core components, so if you are proficient in Spark Streaming, it can be said that you have mastered the whole of Spark. That is why being proficient in Spark Streaming is so important.
All of Spark's sub-frameworks are built on Spark Core. Spark Streaming's internal processing mechanism is to receive real-time streaming data, divide it into batches at a fixed time interval, process each batch with the Spark engine, and finally output the processed results batch by batch.
Each batch of data corresponds to an RDD in the Spark core and to a DStream in Spark Streaming: a DStream is effectively a template for RDDs, a set of RDDs (a sequence of RDDs).
Put simply, after the data is divided into batches it passes through a queue, and the Spark engine takes the batches from the queue one by one; each batch is wrapped as a DStream. Because a DStream is a template for RDDs, a logical-level abstraction over RDDs, the data is in essence encapsulated into physical RDDs.
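To make this concrete, here is a minimal sketch of a Spark Streaming application (the socket source on localhost:9999, the 10-second interval, and the local[2] master are illustrative assumptions, not anything this article prescribes). The interval passed to the StreamingContext is what cuts the continuous stream into batches, and each batch is materialized as one RDD:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalDemo {
  def main(args: Array[String]): Unit = {
    // local[2]: one core for the receiver, one for processing.
    val conf = new SparkConf().setAppName("BatchIntervalDemo").setMaster("local[2]")
    // Every 10 seconds of input becomes one batch, i.e. one RDD.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical text source; any InputDStream would do.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()             // begin receiving and processing
    ssc.awaitTermination()  // block until the job is stopped
  }
}
```

Note that local[2] reserves one core for the Receiver, which matches concept 7 in the list below: each Receiver occupies one core/slot.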
2: Understanding the basic concepts of Spark Streaming
To understand Spark Streaming better, let's briefly go over the relevant concepts.
1. Discretized stream (DStream): Spark Streaming's abstract description of the continuous, real-time data stream it processes internally; each real-time stream we handle corresponds to one DStream in Spark Streaming.
2. Batch data: the real-time stream is divided into batches by time, converting stream processing into batch processing of time-slice data.
3. Time slice (batch processing interval): the standard for quantifying data at the logical level; the time slice is the basis on which the streaming data is split.
4. Window length: the span of stream data that one window covers. For example, to count the data of the past 30 minutes every 5 minutes, the window length is 30 minutes, which is 6 batch intervals if the batch interval is 5 minutes (the window length must be a multiple of the batch interval); see the sketch after this list.
5. Sliding interval: how often the window computation is performed, likewise a multiple of the batch interval. In the same example, counting the past 30 minutes of data every 5 minutes means the sliding interval is 5 minutes.
6. InputDStream: a special DStream that connects Spark Streaming to an external data source in order to read data.
7. Receiver: runs long-term on an Executor (potentially around the clock, 7x24), with each Receiver responsible for one InputDStream (for example, an input stream that reads Kafka messages). Each Receiver, together with its InputDStream, occupies one core/slot.
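As promised above, here is a sketch of the window parameters from concepts 4 and 5, using the article's own numbers (30-minute window, 5-minute slide) and assuming a 5-minute batch interval; the socket source and local master are again illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowDemo").setMaster("local[2]")
    // Batch interval: 5 minutes. Window length and sliding interval must
    // both be multiples of this value.
    val ssc = new StreamingContext(conf, Minutes(5))

    // Hypothetical socket source; its Receiver occupies one core/slot.
    val lines = ssc.socketTextStream("localhost", 9999)

    // "Count the data of the past 30 minutes every 5 minutes":
    // window length = Minutes(30), i.e. 6 batch intervals;
    // sliding interval = Minutes(5).
    val counts = lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKeyAndWindow(_ + _, Minutes(30), Minutes(5))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```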
Here comes the key point! We use the space-time dimension and the spatial dimension to understand DStream and RDD respectively, to gain a deeper, alternative understanding of Spark Streaming and of the relationship between the two.
DStream corresponds to the space-time dimension, i.e. space plus time (one of the subtleties of Spark Streaming is that it uses time to decouple, which is currently the best way to decouple); RDD corresponds to the spatial dimension; and Spark Streaming as a whole is the space-time dimension.
The vertical axis is the spatial dimension: it represents the concrete processing-logic steps formed by RDD dependencies, expressed as a DStream.
The horizontal axis is the time dimension: job objects are continuously generated at a fixed interval and run on the cluster.
As time passes, RDD graphs are continuously generated from the DStream graph; that is, jobs are generated from the DAG and, through the Job Scheduler's thread pool, submitted to the Spark cluster for continuous execution. (Spark Streaming itself only concerns itself with the time dimension, not the spatial dimension.)
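A small sketch of this time-dimension behaviour (reusing the illustrative socket source from before): foreachRDD exposes, at every batch interval, the concrete RDD that the DStream template has just been instantiated into:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PerBatchRDDDemo {
  def main(args: Array[String]): Unit = {
    val conf  = new SparkConf().setAppName("PerBatchRDDDemo").setMaster("local[2]")
    val ssc   = new StreamingContext(conf, Seconds(10))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // At each batch interval, the DStream template is instantiated as one
    // concrete RDD; foreachRDD hands us that physical RDD plus its batch time.
    words.foreachRDD { (rdd, time) =>
      println(s"Batch at $time became an RDD with ${rdd.partitions.length} partitions")
      rdd.take(5).foreach(println)  // an ordinary RDD: the full Spark core API applies
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```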
As can be seen from the above, the relationship between RDD and DStream is as follows:
1. RDD is physical, while DStream is logical.
2. DStream is a wrapper class for RDD, a further abstraction over RDD.
3. DStream is the template for RDD; DStream relies on RDD for the concrete data computation.
(Note: the vertical-axis dimension requires the generation templates for the RDDs and the DAG, plus the job controller for the timeline; the horizontal-axis dimension, i.e. the time dimension, covers the batch interval, window length, window sliding interval, and so on.)
4. inputStream and outputStream represent the input and output of the data, respectively.
5. The concrete jobs run on the Spark cluster, and at that point the system's fault tolerance becomes very important. Spark Streaming's fault tolerance is very ingenious: it cleverly borrows the fault tolerance of Spark core's RDDs (an RDD can specify a StorageLevel that stores multiple replicas for fault tolerance); see the sketch after this list.
6. Transaction processing: the data must be processed, and processed exactly once, which is very important for applications such as billing systems.
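As referenced in point 5, here is a minimal sketch of two fault-tolerance building blocks Spark Streaming exposes: checkpoint-based recovery of the streaming context, and a replicated StorageLevel for received data. The checkpoint directory, host, and port are illustrative assumptions; note also that checkpointing and replication alone do not guarantee the end-to-end exactly-once semantics of point 6, which additionally requires idempotent or transactional output.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FaultToleranceDemo {
  // Hypothetical checkpoint directory; in production this should live on a
  // fault-tolerant store such as HDFS.
  val checkpointDir = "/tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("FaultToleranceDemo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)  // persist DStream metadata and lineage

    // MEMORY_AND_DISK_SER_2: the trailing _2 keeps two replicas of each
    // received block, borrowing RDD StorageLevel-based fault tolerance.
    val lines = ssc.socketTextStream("localhost", 9999,
      StorageLevel.MEMORY_AND_DISK_SER_2)
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // After a failure, the context is rebuilt from the checkpoint rather
    // than created from scratch, so processing resumes where it left off.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```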
Was the above content helpful to you? Hopefully it has given you a clearer picture of the Spark Streaming operating mechanism.