This article covers the architecture and principles of Spark Streaming for big data processing. The material is practical, and hopefully you will take something useful away from it.
Stream computing
Timeliness of data
In our daily work, we usually store data in a table first and then process and analyze the data in that table. Because the data has to be landed in a table before it can be used, the question of timeliness arises.
If we are dealing with year-level data, such as demographic analysis or macroeconomic analysis, it does not matter much if the most recent data is a week or two, or even a month or two, old.
If we are dealing with day-level data, such as user preference analysis or retail supply-and-sales analysis for major websites, a lag of a few days is generally acceptable, that is, T+N updates.
If it is hour-level data, the timeliness requirements are higher still: in financial risk control, for example, where the safety of funds is at stake, the data must be no more than an hour old.
Are there scenarios with even stricter requirements? Of course. Take risk monitoring: a website must have a real-time monitoring system so that measures can be taken the moment an attack occurs. During Singles' Day or anniversary sales, major e-commerce platforms undergo severe traffic tests, and their systems must also be monitored in real time. Real-time personalized recommendation and search on websites likewise have high real-time requirements.
In these scenarios, the traditional data processing flow of collecting data first, loading it into a database, and then taking it out for analysis cannot meet such stringent real-time requirements.
Stream computing emerged precisely for these real-time and near-real-time scenarios.
(1) Unlike batch computing, which slowly accumulates data, stream computing spreads a large amount of data across points in time and transmits it continuously in small batches. The data keeps flowing and is discarded after it has been computed.
(2) Batch computing maintains a table and applies various computation logic to that table. Stream computing, in contrast, requires the computation logic to be defined first and submitted to the stream computing system, and it cannot be changed while the job runs.
(3) In terms of results, batch computing delivers the result after all the data has been processed. In stream computing, as soon as each small batch has been computed, its result can be delivered immediately to an online system for real-time display.
(1) Stream computing process
① Submit a stream computing job.
② Wait for streaming data to trigger the stream computing job.
③ Continuously write the computation results out to external systems.
(2) Characteristics of stream computing
① Real-time, low latency.
② Unbounded: the data never ends.
③ Continuous: the computation keeps running, and data is discarded once it has been computed.
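To make this lifecycle concrete, here is a minimal Spark Streaming sketch (the socket source, host/port, and batch interval are illustrative assumptions): the computation logic is defined before the job starts, the job then waits for streaming data to trigger batches, and results are written out continuously until the job is stopped.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLifecycle {
  def main(args: Array[String]): Unit = {
    // (1) Define the computation logic up front; it cannot change once the job is running.
    val conf = new SparkConf().setAppName("StreamingLifecycle").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))          // batch interval: 1 second
    val lines = ssc.socketTextStream("localhost", 9999)        // unbounded input stream
    lines.count().print()                                      // result written out every batch

    // (2) Start the job; incoming data now triggers one small batch per interval.
    ssc.start()
    // (3) Run continuously until stopped; each batch's data is discarded after computation.
    ssc.awaitTermination()
  }
}
```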
Apache Storm
In Storm, real-time computation requires designing a graph structure called a topology. The topology is submitted to the cluster, where the master node distributes the code and assigns tasks to the worker nodes for execution. A topology contains two kinds of components, spouts and bolts: a spout emits messages and is responsible for sending the data stream out as tuples, while bolts transform those streams; computation and filtering are done inside bolts, and a bolt can itself emit data to other bolts. The tuples emitted by a spout are immutable arrays corresponding to fixed key-value pairs.
Apache Flink
Flink is a distributed processing engine for both streaming and batch data, implemented mainly in Java. For Flink, the primary scenario is stream data, and batch data is merely a bounded special case of a stream. In other words, Flink treats every task as a stream, which is its most distinctive feature. Flink supports fast local iteration as well as certain cyclic iterative tasks, and it manages its own memory: unlike Spark, it does not hand memory entirely over to the application layer, which is why Spark is more prone to OOM (out of memory) errors than Flink. In terms of the framework itself and its application scenarios, Flink is closer to Storm.
Apache Spark Streaming
Spark Streaming is an extension of the core Spark API. It does not process the data stream one record at a time as Storm does; instead, it splits the stream into segments at fixed intervals and processes each segment as a batch job. Spark's abstraction for a continuous data stream is the DStream (Discretized Stream): a DStream is a sequence of small-batch RDDs (Resilient Distributed Datasets), and an RDD is a distributed dataset that can be operated on in parallel. DStreams support two kinds of operations: transformations that apply arbitrary functions, and operations over sliding windows of data.
Comparison of Storm, Flink and Spark Streaming
The choice of Storm, Flink and Spark Streaming
If you want a high-speed event processing system that allows incremental computing, Storm will be the best choice.
If you need stateful computation with exactly-once delivery and do not mind higher latency, consider Spark Streaming, especially if you also plan to use graph computation, machine learning, or SQL: Apache Spark's stack lets you combine libraries (Spark SQL, MLlib, GraphX) with data streams, providing a convenient all-in-one programming model. In particular, streaming algorithms such as streaming K-means allow Spark to support real-time decision making.
Flink supports incremental iteration and can automatically optimize iterative jobs; for iterative data processing, Flink stands out over Spark. Flink processes each event as it arrives, which is true stream processing, and its performance is comparable to Storm's, supporting millisecond-level latency, whereas Spark Streaming only supports second-level latency.
Introduction to Spark Streaming
Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant processing of real-time data streams. It can ingest data from many sources, including Kafka, Flume, ZeroMQ, Kinesis, and TCP sockets. Once the data has been obtained, complex algorithms can be expressed with high-level functions such as map, reduce, join, and window, and the results can finally be stored in file systems, databases, and live dashboards.
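As a sketch of that API (the socket source, host/port, and output path are illustrative assumptions; Kafka, Flume, and the other sources use their own connectors), a streaming word count might look like this:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(2))

    // One of the supported sources; a TCP socket is the simplest to demonstrate.
    val lines = ssc.socketTextStream("localhost", 9999)

    // High-level functions (flatMap, map, reduceByKey) applied directly to the stream.
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()                                        // e.g. a console "dashboard"
    counts.saveAsTextFiles("/tmp/streaming/wordcount")    // or a file system / database sink

    ssc.start()
    ssc.awaitTermination()
  }
}
```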
Because of the "One Stack to rule them all" design, Spark's other sub-frameworks, such as machine learning and graph computation, can also be applied to streaming data.
Each Spark sub-framework is built on Spark Core. Spark Streaming's internal mechanism is to receive the real-time data stream, split it into batches according to a fixed time interval, process those batches with the Spark engine, and produce the corresponding batches of results. Each batch corresponds to one RDD instance inside the Spark kernel, so the DStream representing the stream can be viewed as a set of RDDs, that is, a sequence of RDDs. Put simply, after the stream is split into batches, the batches pass through a first-in, first-out queue; the Spark engine takes the batches off the queue one by one, wraps each batch in an RDD, and processes it. This is a classic producer/consumer model, and it carries the corresponding problem: how to coordinate the production rate with the consumption rate.
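A minimal sketch of that producer/consumer picture, using the queueStream source so the batch queue is explicit (the queue contents and batch interval are illustrative assumptions):

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QueueStreamDemo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Producer side: a FIFO queue of RDDs, one enqueued per batch interval.
    val rddQueue = new mutable.Queue[RDD[Int]]()

    // Consumer side: the Spark engine dequeues one RDD per batch and processes it.
    val stream = ssc.queueStream(rddQueue)
    stream.map(n => (n % 10, 1)).reduceByKey(_ + _).print()

    ssc.start()

    // Keep producing batches; in a real job a receiver fills this role.
    for (_ <- 1 to 30) {
      rddQueue.synchronized {
        rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 4)
      }
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}
```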
Discretized stream (DStream)
This is Spark Streaming's abstraction for the continuous real-time data stream it works on: a real-time data stream being processed corresponds to one DStream instance in Spark Streaming.
Batch data (batch data)
The real-time stream data is split along time slices into batches of data, which turns stream processing into batch processing of per-time-slice data. As time passes, these processing results form the corresponding result data stream.
Time slice or batch interval (batch interval)
This is the artificially chosen yardstick for dividing the data stream: the stream is split along time-slice boundaries, and the data of one time slice corresponds to one RDD instance.
Window length (window length)
The length of time of stream data covered by a window. Must be a multiple of the batch interval.
Sliding time interval
The length of time between the previous window and the next window. Must be a multiple of the batch interval.
Input DStream
An input DStream is a special DStream that connects the Spark Streaming to an external data source to read the data.
Spark Streaming architecture
In Spark Streaming, data is processed in batches while data collection happens record by record, so a batch processing interval (batch duration) is set in advance. Whenever a batch interval elapses, the data collected during that interval is bundled into one batch and handed to the system for processing.
For window operations, there are N batches of data inside each window. The size of the window is determined by the window duration, which is how long the window covers; a window operation is only triggered once the window length has been reached. Besides the window duration, the other important parameter is the slide duration, which is how long it takes for the window to slide forward and form a new window. By default the slide duration equals the batch interval, and the window duration is generally set larger than both. Note that the slide duration and the window duration must both be set to integral multiples of the batch interval.
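A sketch of a window operation under those constraints (the 10-second batch interval, 30-second window, and 20-second slide are illustrative, chosen only so that both are integral multiples of the batch interval):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedCounts").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))   // batch interval: 10 s
    ssc.checkpoint("/tmp/streaming-checkpoint")          // required by the incremental variant below

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))

    // Window duration 30 s, slide duration 20 s: both integral multiples of the batch interval.
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // add counts for batches entering the window
      (a: Int, b: Int) => a - b,   // subtract counts for batches leaving the window
      Seconds(30),                 // window duration
      Seconds(20))                 // slide duration

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```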
Spark Streaming is a high-throughput, fault-tolerant stream processing system for real-time data streams. It can perform complex operations such as map, reduce, and join on data from a variety of sources (such as Kafka, Flume, ZeroMQ, and TCP sockets) and save the results to external file systems, databases, or real-time dashboards.
Calculation flow
Spark Streaming decomposes stream computing into a series of short batch jobs. The batch engine is Spark Core: the input data is divided into segments (a Discretized Stream) according to the batch size (for example, 1 second), each segment of data is converted into an RDD (Resilient Distributed Dataset) in Spark, and each Transformation on the DStream in Spark Streaming becomes a Transformation on the corresponding RDD, with the resulting RDDs held in memory. The overall computation can then, depending on the needs of the business, overlay the intermediate results or write them out to external storage.
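A short sketch of that mapping from DStream operations to per-batch RDD operations (the source, filter logic, and output path are illustrative assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchResultWriter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchResultWriter").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Each DStream transformation below is executed as an RDD transformation on every batch.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // transform exposes the underlying batch RDD, so arbitrary RDD code can be mixed in.
    val nonEmpty = counts.transform(rdd => rdd.filter { case (word, _) => word.nonEmpty })

    // foreachRDD writes each batch's result RDD to external storage (path is hypothetical).
    nonEmpty.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) rdd.saveAsTextFile(s"/tmp/streaming/out-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```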
Fault tolerance
For stream computing, fault tolerance is crucial. First, be clear about the fault-tolerance mechanism of RDDs in Spark: every RDD is an immutable, re-computable distributed dataset that records its deterministic operation lineage, so as long as the input data is fault tolerant, any Partition of any RDD can be recomputed from the original input data by replaying the transformations.
For Spark Streaming, the lineage of its RDDs is shown in the figure below: each oval represents an RDD, each circle inside an oval represents a Partition of that RDD, the multiple RDDs in each column form one DStream (there are three DStreams in the figure), and the last RDD in each row is the intermediate result RDD produced for each batch. Every RDD in the figure is connected through lineage. Because Spark Streaming's input data can come from disk, such as HDFS (with multiple replicas), or from the network (Spark Streaming replicates each received network data stream to other machines), fault tolerance of the input is guaranteed, so if any Partition of any RDD is lost, the missing Partition can be recomputed in parallel on other machines. This fault-recovery approach is more efficient than that of continuous-operator models such as Storm.
Real-time performance
Any discussion of real-time behavior involves the streaming framework's application scenarios. Spark Streaming decomposes stream computing into multiple Spark jobs, and processing each batch of data goes through Spark's DAG decomposition and task-set scheduling. For the current version of Spark Streaming, the minimum batch size is in the range of 0.5 to 2 seconds (Storm's minimum latency is around 100 ms), so Spark Streaming can satisfy all near-real-time streaming scenarios except those with extremely demanding real-time requirements (such as high-frequency trading).
Scalability and Throughput
On EC2, Spark has been shown to scale linearly to 100 nodes (4 cores per node), processing 6 GB/s (60M records/s) of data with a latency of a few seconds, and its throughput is 2 to 5 times that of the popular Storm. These figures come from Berkeley's tests using the WordCount and Grep workloads.
Spark Streaming persistence
Like RDDs, DStreams can keep the data stream in memory through the persist() method. The default persistence level is MEMORY_ONLY_SER, which stores the data in memory in serialized form. The benefit is that for programs requiring many iterations, the speed advantage is very noticeable. For window-based operations such as reduceByWindow and reduceByKeyAndWindow, and state-based operations such as updateStateByKey, persistence in memory is enabled by default.
For data sources from the network (Kafka, Flume, sockets, etc.), the default persistence strategy is to store the data on two machines, which is also designed for fault tolerance.
In addition, window-based and stateful operations must be checkpointed: the checkpoint directory is specified through StreamingContext's checkpoint method, and the checkpoint interval through DStream's checkpoint method; the interval must be a multiple of the slide interval.
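A sketch of persistence plus checkpointing for a stateful stream, with illustrative paths and intervals (the checkpoint interval here is a multiple of the 5-second batch/slide interval):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWithCheckpoint {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWithCheckpoint").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Checkpoint directory for stateful/windowed operations (hypothetical path).
    ssc.checkpoint("/tmp/streaming/checkpoints")

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))

    // Explicit persistence; MEMORY_ONLY_SER is already the default for DStreams.
    pairs.persist(StorageLevel.MEMORY_ONLY_SER)

    // Stateful operation: a running count per key, carried across batches.
    val running = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newValues.sum)
    }

    // Checkpoint interval must be a multiple of the slide (here, batch) interval.
    running.checkpoint(Seconds(25))

    running.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```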
Spark Streaming performance optimization
1. Optimize the running time
Increase parallelism
Be sure to use the resources of the entire cluster, rather than centralizing tasks on a few specific nodes. For operations that include shuffle, increase their parallelism to ensure that cluster resources are more fully utilized.
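A minimal sketch of raising parallelism, shown as a fragment with illustrative numbers (the partition count of 64 and the socket source are assumptions, not recommendations):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ParallelismTuning")
  .set("spark.default.parallelism", "64")   // default partition count for shuffles
val ssc = new StreamingContext(conf, Seconds(2))

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))
  .repartition(64)                           // spread received data across the whole cluster

val counts = pairs.reduceByKey(_ + _, 64)    // explicit partition count for the shuffle stage
counts.print()
```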
Reduce the burden of data serialization and deserialization
By default, Spark Streaming serializes the data it receives before storing it, to reduce memory usage. But serialization and deserialization cost extra CPU time, so a more efficient serialization format, or a custom serializer, makes better use of the CPU.
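For example, switching to Kryo serialization and registering the record classes (the case classes here are hypothetical stand-ins for whatever flows through the stream):

```scala
import org.apache.spark.SparkConf

case class MyEvent(id: Long, payload: String)    // hypothetical record type
case class MyResult(id: Long, score: Double)     // hypothetical record type

val conf = new SparkConf()
  .setAppName("KryoStreaming")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent], classOf[MyResult]))
```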
Set a reasonable batch duration (batch time)
In Spark Streaming there can be dependencies between jobs: a later job can only be submitted after the previous one has finished. If the execution time of a job exceeds the batch interval, the next job cannot be submitted on time, which delays the jobs after it and eventually blocks them. It is therefore necessary to set a batch interval long enough that each job can finish within it.
2. Optimize memory usage
Control batch size (the amount of data in the batch interval)
Spark Streaming keeps all data received during a batch interval in Spark's available memory, so you must make sure the node's available Spark memory can hold at least all the data received within one batch interval; otherwise, resources must be added to increase the cluster's processing capacity.
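One way to keep a batch within available memory is to cap or adapt the ingestion rate; a sketch, assuming a receiver-based source and a Spark version that supports these settings (backpressure arrived in Spark 1.5):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("RateLimitedStreaming")
  // Upper bound on records per second per receiver, so one batch never outgrows memory.
  .set("spark.streaming.receiver.maxRate", "10000")
  // Let Spark adapt the ingestion rate to the observed processing speed.
  .set("spark.streaming.backpressure.enabled", "true")
```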
Clean up data that is no longer used in a timely manner
As mentioned earlier, Spark Streaming cleans up received data promptly to ensure it keeps spare memory available. Setting a reasonable spark.cleaner.ttl lets timed-out, no-longer-needed data be cleaned up in time; this parameter must be set carefully to prevent data still needed by later operations from being removed prematurely.
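As a sketch, the TTL is set as an ordinary Spark property (the one-hour value is illustrative; it must exceed the longest window or stateful dependency in the job):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("TtlCleanup")
  // Clean metadata and persisted data older than this many seconds.
  .set("spark.cleaner.ttl", "3600")
```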
Observe and adjust GC strategy appropriately
GC can affect the normal execution of jobs, lengthen job execution time, and cause a series of unpredictable problems. Observe how GC behaves and adopt different GC strategies to further reduce the impact of memory reclamation on job execution.
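A typical adjustment is to run executors with a low-pause collector and GC logging, set through executor JVM options (the specific flags are illustrative, not a recommendation for every workload):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("GcTunedStreaming")
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```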
That covers the architecture and principles of Spark Streaming for big data. Some of these points are likely to come up in everyday work; hopefully this article serves as a useful reference.