
How to Carry Out Spark Streaming Computing Models and Monitoring


Many newcomers are not clear about how to work with Spark Streaming's computing models and monitoring. To help with this, the following explains both in detail; readers with this need are welcome to learn from it, and I hope you gain something.

Abstract

Spark Streaming is an excellent real-time computing framework. Its good scalability, high throughput and fault-tolerance mechanism satisfy many of our application scenarios. Drawing on those scenarios, this article introduces our technical architecture around Spark Streaming and focuses on its two computing models, the stateless and the stateful model, together with the points to note for each; it then describes what we do around Spark Streaming monitoring, and finally summarizes the advantages and disadvantages of Spark Streaming.

I. Overview

Data is a very valuable resource and is of great value to enterprises of all sizes. However, with the explosion of data volume, single-machine data processing can no longer meet the needs of business scenarios, and excellent distributed computing frameworks such as Hadoop and Spark have emerged on this basis. Although offline distributed processing frameworks can handle very large amounts of data, their latency makes it difficult to satisfy certain scenarios, such as push feedback, real-time recommendation and real-time user behavior analysis. To give data processing real-time response and feedback in these scenarios, real-time computing frameworks emerged in turn. Current real-time processing frameworks include Apache Storm, Apache Flink, Spark Streaming and so on. Among them, Spark Streaming has become a popular streaming framework because of its scalability, high throughput and fault tolerance, and because it integrates effectively with various offline frameworks.

According to its official documentation, Spark Streaming features high scalability, high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ and plain TCP sockets. Once data is ingested, you can use Spark's high-level primitives such as map, reduce, join and window, and the results can be stored in many places, such as HDFS and databases. In addition, Spark Streaming can be integrated with MLlib and GraphX. Its architecture is shown in the following figure:

These excellent features open up many application scenarios for us, such as website and network monitoring, anomaly monitoring, web page clicks, user behavior and user migration. Below we describe in detail the technical architecture of Spark Streaming in our application scenarios, its two state models, and Spark Streaming monitoring.

II. Application Scenarios

In Spark Streaming, the unit of processing is a batch rather than a single record, but data collection happens record by record, so the system needs an interval during which data accumulates before being processed together; this interval is the batch interval. The batch interval is a core concept and key parameter of Spark Streaming: it determines the frequency of job submission and the latency of data processing, and it also affects throughput and performance.
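For illustration, a minimal sketch of fixing the batch interval when the StreamingContext is created (the application name and the 10-second interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The batch interval is fixed when the StreamingContext is created;
// here, every 10 seconds of collected data forms one batch.
val conf = new SparkConf().setAppName("StreamingApp")
val ssc = new StreamingContext(conf, Seconds(10))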

2.1 Framework

At present, our Spark Streaming business scenarios include anomaly detection, web page clicks, user behavior, user map migration and others. By computing model they fall into two kinds: the stateless model and the stateful model. In practice we use Kafka as the real-time input source and Spark Streaming as the compute engine; after processing, the data is persisted to storage, including MySQL, HDFS, ElasticSearch and MongoDB. At the same time, the cleaned Spark Streaming output is also written back to Kafka and then persisted to HDFS via Flume, and some UI presentation is built on the persisted content. The architecture is shown in the following figure:

2.2 Stateless Model

The stateless model focuses only on the DStream data newly generated in the current batch, so the computation logic operates on that batch's data alone. The stateless model adapts well to scenarios such as real-time website click rankings, or user visits and clicks within a given batch interval. Because the model holds no state, there is no stateful bookkeeping to consider; you only need to ensure, according to the business scenario, that no data is lost. For this, the Direct approach is generally used to read Kafka data, and a listener is used to persist the offsets. The specific process is as follows:

The model framework above includes the following processing steps (a minimal sketch follows the list):

1. Read real-time data from Kafka.

2. Apply Spark Streaming transformations and actions.

3. Persist the results to storage, then return to step 1.
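A hedged sketch of this stateless path, assuming the Spark 1.x direct Kafka API; the broker list and topic name are hypothetical, and the per-batch computation is a simple click count:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("StatelessModel")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical broker list and topic name.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("click_events")

// Direct approach: no receiver; each batch maps 1:1 to a Kafka offset range.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

// Stateless computation: count clicks per page within this batch only.
stream.map { case (_, page) => (page, 1L) }
  .reduceByKey(_ + _)
  .foreachRDD { rdd =>
    // Persist the batch result to storage (MySQL/HDFS/ES/MongoDB in our stack).
    rdd.take(10).foreach(println)
  }

ssc.start()
ssc.awaitTermination()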

Due to network, cluster and other factors, a real-time job may fail and stay down for a long time, causing data to accumulate. In that case, whether the accumulated data should be consumed from the largest Kafka offset or from the previously saved Kafka offsets depends on the specific business scenario.
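When resuming from previously saved offsets rather than the largest offset, the direct API accepts explicit starting offsets. A hedged sketch, where loadOffsets is a hypothetical helper that reads back the offsets persisted by the listener:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical helper: read the offsets our listener persisted
// (e.g. to ZooKeeper or MySQL).
val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))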

2.3 Stateful Model

The stateful model means that DStreams have dependencies within a specified time range; the range is defined by the business scenario and may span the RDDs of two or more batch intervals. Spark Streaming provides the updateStateByKey method to satisfy such scenarios. Because state is involved, the computation's state must be saved as processing proceeds; in Spark Streaming, the metadata and progress of the computation are saved through checkpointing. Application scenarios of the stateful model include cumulative visit statistics for specific website modules, website visits over the most recent N batch intervals, cumulative statistics of newly added app users, and so on. The specific process is as follows:

In the above process, computing each batch depends on the data of the last two batch intervals; after transformation and the related statistics, the result is finally persisted to MySQL. However, to ensure that each computation covers only the data of the last two batch intervals, the state must be maintained and expired data cleared. Let's first look at the two signatures of updateStateByKey:

The first overload exposes the key type of the global state data to the update function:

def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}

The second overload hides the key type of the global state data and lets the custom function operate only on the values:

def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}

These two overloads each suggest a way to clean up expired data:

Filtering on the key. In the first overload, K is the key in the global state data; keys that no longer meet the specified conditions are simply filtered out of the returned iterator.

Filtering on the return value. The custom function in the second overload returns Option[S]; returning None for expired data removes it from the global state.
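Putting the second idea into practice, here is a hedged sketch of value-based expiry: counts accumulate per key, and a key whose state has not been updated for more than two batch intervals is expired by returning None. The field names, checkpoint path and two-batch window are assumptions for illustration:

case class Acc(count: Long, lastUpdated: Long)

val batchIntervalMs = 10 * 1000L
ssc.checkpoint("/tmp/streaming-checkpoint")  // required for stateful operations

val updateFunc = (newValues: Seq[Long], state: Option[Acc]) => {
  val now = System.currentTimeMillis()
  if (newValues.isEmpty && state.exists(now - _.lastUpdated > 2 * batchIntervalMs)) {
    None  // returning None removes this key from the global state
  } else {
    val total = state.map(_.count).getOrElse(0L) + newValues.sum
    val last = if (newValues.nonEmpty) now else state.get.lastUpdated
    Some(Acc(total, last))
  }
}

// pairs is a DStream[(String, Long)] produced upstream, e.g. (page, 1L).
val stateDStream = pairs.updateStateByKey[Acc](updateFunc)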

III. Spark Streaming Monitoring

Like Spark itself, Spark Streaming provides monitoring of Jobs, Stages, Storage, Environment and Executors, plus a dedicated Streaming tab. The Streaming monitoring page is shown below:

The figure above shows the trends the Spark UI provides for monitoring data, including the real-time input rate, scheduling delay, processing time and total delay. Beyond these, the Spark UI also provides information on Active Batches and Completed Batches. Active Batches contains the batches currently being processed along with the queued (accumulated) batches; Completed Batches provides the detailed data of each processed batch, including batch time, input size, scheduling delay, processing time, total delay and so on, as shown in the figure below:

Spark Streaming can provide such elegant monitoring because it uses the listener design pattern. If the Spark UI cannot meet your monitoring needs, you can customize your own monitoring. Spark Streaming provides the StreamingListener trait; by extending it you can implement whatever monitoring you require, as follows:

@DeveloperApi
trait StreamingListener {

  /** Called when a receiver has been started */
  def onReceiverStarted(receiverStarted: StreamingListenerReceiverStarted) { }

  /** Called when a receiver has reported an error */
  def onReceiverError(receiverError: StreamingListenerReceiverError) { }

  /** Called when a receiver has been stopped */
  def onReceiverStopped(receiverStopped: StreamingListenerReceiverStopped) { }

  /** Called when a batch of jobs has been submitted for processing. */
  def onBatchSubmitted(batchSubmitted: StreamingListenerBatchSubmitted) { }

  /** Called when processing of a batch of jobs has started. */
  def onBatchStarted(batchStarted: StreamingListenerBatchStarted) { }

  /** Called when processing of a batch of jobs has completed. */
  def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) { }

  /** Called when processing of a job of a batch has started. */
  def onOutputOperationStarted(outputOperationStarted: StreamingListenerOutputOperationStarted) { }

  /** Called when processing of a job of a batch has completed. */
  def onOutputOperationCompleted(outputOperationCompleted: StreamingListenerOutputOperationCompleted) { }
}

Saving offsets, as mentioned above, is one scenario where we extend StreamingListener. It can also be used, for example, to monitor batch accumulation in a real-time job and send an alert email when a threshold is reached; the specific listener customization depends on the application scenario. A sketch of such a listener follows.
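A hedged sketch of a custom listener that warns when a batch's scheduling delay exceeds a threshold; the threshold value and the println in place of a real alert mail are assumptions:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class DelayAlertListener(maxDelayMs: Long) extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    val info = batchCompleted.batchInfo
    info.schedulingDelay.foreach { delay =>
      if (delay > maxDelayMs) {
        // Replace with a real alert channel (mail, IM, webhook) in production.
        println(s"Batch ${info.batchTime} scheduling delay ${delay} ms exceeds ${maxDelayMs} ms")
      }
    }
  }
}

// Register the listener on the StreamingContext.
ssc.addStreamingListener(new DelayAlertListener(60 * 1000L))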

IV. Advantages and Disadvantages of Spark Streaming

Unlike Storm, Spark Streaming is not a true record-at-a-time streaming framework; it processes one batch of data at a time. It is precisely this design that lets it integrate well with Spark's other modules, including MLlib (machine learning), GraphX and Spark SQL. This brings great convenience to real-time computing, but it also sacrifices some of the real-time character of pure streaming.

4.1 Advantages

Spark Streaming is built on the Spark Core API, so it maintains good compatibility with the other Spark modules and offers good extensibility for programming.

Spark Streaming is a coarse-grained, quasi-real-time processing framework: it ingests a batch of data and then processes it, and its computation can run largely in memory, so it achieves high throughput.

Spark Streaming adopts Spark's unified DAG scheduling and the RDD abstraction, so it can use the RDD lineage mechanism, giving real-time computation good fault-tolerance support.

Spark Streaming's DStream is an abstraction over RDDs for stream processing, and its transformations and actions are very similar to those of RDDs, which lowers the barrier to entry: users already familiar with Spark can quickly get started with Spark Streaming.

4.2 Disadvantages

Spark Streaming is a quasi-real-time framework with coarse-grained processing: computation is triggered only when the batch interval arrives, unlike Storm's pure record-at-a-time streaming. This inevitably introduces a corresponding processing latency.

At present, Spark Streaming still has some stability issues; occasionally a job exits because of an obscure exception, in which case you must ensure data consistency yourself and provide automatic restart on failure.
