2025-01-17 Update. From: SLTechnology News&Howtos (shulou.com)
I. Introduction to Spark Streaming
(1) Why Spark Streaming?
Hadoop MapReduce and Spark SQL can only perform offline (batch) computation and cannot meet real-time business requirements such as real-time recommendation or real-time website performance analysis. Streaming computation solves these problems. Three streaming frameworks are in common use today: Storm, Spark Streaming, and Flink.
(2) What is Spark Streaming?
Spark Streaming is the framework Spark provides for real-time computation over big data. It is built on top of Spark Core, which we covered earlier, and keeps the same memory-based computation model. Its core abstraction wraps the RDD we use throughout Spark Core: to suit the characteristics of real-time computation, a layer of encapsulation called DStream is added on top of RDD, so the underlying data structure is still the RDD. In this sense, RDD is the core of the whole Spark technology ecosystem.
Spark Streaming supports many data input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data has been ingested, it can be processed with Spark's high-level operators such as map, reduce, join, and window, and the results can be stored in many destinations, such as HDFS or a database. In addition, Spark Streaming integrates seamlessly with MLlib (machine learning) and GraphX (graph processing).
(3) Advantages of Spark Streaming
Easy to use
Fault-tolerant
Seamless integration with the rest of the Spark ecosystem
II. The core concepts of Spark Streaming
Spark Streaming receives a real-time input data stream and splits the data into batches: for example, the data collected in each second is packaged into one batch, and each batch is then handed to Spark's computation engine for processing. The output is itself a result stream, whose data is likewise made up of batches.
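The micro-batch model described above can be sketched in plain Python. This is only an illustration of the idea, not the Spark API; the function name and the (timestamp, value) event format are invented for the example:

```python
def micro_batches(events, batch_interval_ms):
    """Group (timestamp_ms, value) events into fixed-interval batches,
    mimicking how Spark Streaming discretizes a stream into batches."""
    if not events:
        return []
    batch_start = events[0][0]
    batches, current = [], []
    for ts, value in events:
        # Close out batches until this event falls inside the current interval.
        while ts >= batch_start + batch_interval_ms:
            batches.append(current)
            current = []
            batch_start += batch_interval_ms
        current.append(value)
    batches.append(current)
    return batches

# Events arriving every 100 ms, packaged into 1-second batches.
events = [(i * 100, "e%d" % i) for i in range(25)]
print([len(b) for b in micro_batches(events, 1000)])  # [10, 10, 5]
```

Each inner list plays the role that one RDD plays inside a DStream: the batch for one interval, handed to the engine as soon as the interval closes.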
(1) Related terms:
Discretized stream (DStream): Spark Streaming's abstraction for a continuous real-time data stream. Each real-time stream being processed corresponds to one DStream instance.
Batch data: the first step is to discretize the real-time data, splitting it into batches along time-slice boundaries so that stream processing becomes a series of small batch jobs. As time passes, the results of these jobs form the corresponding result stream.
Time slice (batch interval): the user-chosen unit of time by which the stream is split. The data of one time slice corresponds to one RDD instance.
Window length: the span of time in the stream covered by one window; it must be a multiple of the batch interval.
Sliding interval: the span of time between one window and the next; it must also be a multiple of the batch interval.
InputDStream: a special DStream that represents the raw data first loaded into the stream from an input source.
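To make the window length and sliding interval concrete, here is a small pure-Python sketch. It is not the Spark API, and the function name is invented; both quantities are measured in whole batches, reflecting the rule that each must be a multiple of the batch interval:

```python
def windowed_counts(batches, window_length, slide_interval):
    """Count the elements in each window over a list of per-batch element
    lists. window_length and slide_interval are expressed in whole batches."""
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        results.append(sum(len(b) for b in window))
    return results

batches = [[1, 2], [3], [4, 5, 6], [7], [8, 9, 10]]
# Each window spans 3 batches; windows slide forward 2 batches at a time.
print(windowed_counts(batches, 3, 2))  # [6, 7]
```

The first window covers batches 0-2 (6 elements) and the second covers batches 2-4 (7 elements); because the slide is shorter than the window, the two windows overlap on batch 2.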
(2) Introduction to DStream:
Discretized Stream (DStream) is Spark Streaming's basic abstraction. It represents a continuous stream of data, either as ingested or as produced by applying Spark operators. Internally, a DStream is represented by a sequence of consecutive RDDs: a DStream is a discretized representation of continuous data, each discrete segment is one RDD, and applying an operator to a DStream yields another DStream.
Operations on a DStream are therefore ultimately operations on its underlying RDDs:
1) DStream operations:
The primitives on a DStream are similar to those on an RDD: Transformations and Output Operations (the analogue of RDD actions).
Because DStream operations closely mirror RDD operations, and a DStream is an encapsulation of RDDs, only a brief introduction to Transformations is given here.
Note: several Transformations deserve special attention: updateStateByKey(), transform(), window(), and foreachRDD(). Later posts will cover them in detail.
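As a preview of the idea behind updateStateByKey(), the following pure-Python sketch shows its semantics: carry a running state across batches and fold each new batch of (key, count) pairs into it. The helper name is invented and this is not the Spark API:

```python
def update_state(state, batch):
    """Fold one batch of (key, count) pairs into the running state,
    mirroring the semantics of DStream.updateStateByKey."""
    for key, n in batch:
        state[key] = state.get(key, 0) + n
    return state

batches = [
    [("spark", 1), ("flink", 1)],
    [("spark", 2)],
    [("storm", 1), ("spark", 1)],
]
state = {}
for batch in batches:
    state = update_state(state, batch)
print(state)  # {'spark': 4, 'flink': 1, 'storm': 1}
```

The key point is that state outlives any single batch: each micro-batch only contributes its increment, while the totals accumulate across the whole stream.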