
How to Use Spark for Real-Time Stream Computing

2025-01-28 Update From: SLTechnology News&Howtos


In this issue, the editor brings you an introduction to how to use Spark for real-time stream computing. The article is rich in content and analyzed from a professional point of view; we hope you gain something from reading it.

Spark Streaming vs. Structured Streaming

Spark Streaming is Spark's original stream processing framework; it implements streaming in the form of micro-batches.

It provides the RDD-based DStream API: the data in each batch interval forms one RDD, and stream computing is realized by processing these RDDs one after another, as the sketch below illustrates.
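A minimal DStream sketch of this micro-batch model: each 5-second interval of lines from a socket becomes one RDD, and word counts are computed batch by batch. The host, port, and batch interval are illustrative values, not anything prescribed by the article.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    // Local master and a 5s batch interval are placeholder choices
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each 5-second micro-batch of socket lines arrives as one RDD
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```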

Apache Spark launched the Structured Streaming project in 2016: a new stream computing engine built on Spark SQL that lets users write high-performance streaming programs as easily as batch programs.

Structured Streaming is the real-time streaming framework introduced in Spark 2.0 (2.0 and 2.1 were experimental releases; it has been considered stable since Spark 2.2).

Since the Spark 2.x line, Spark Streaming has been in maintenance mode: the project has devoted most of its energy to the new Structured Streaming, and some new features are available only there, giving Spark the ability to compete with Flink.

1. Shortcomings of Spark Streaming

Processing Time instead of Event Time

First of all, Processing Time is the time at which data arrives at Spark and is processed, while Event Time is an attribute of the data itself, generally the time at which the data was generated at the source. For example, in IoT, if a sensor produces a reading at 12:00:00 and it reaches Spark at 12:00:05, the Event Time is 12:00:00 and the Processing Time is 12:00:05.

Spark Streaming is a micro-batch model based on DStream: it simply processes whatever stream data arrived during the current small batch interval, such as 1 s. If we want to aggregate data over some time period, there is no doubt we should use Event Time; but because Spark Streaming cuts the data into batches by Processing Time, using Event Time is particularly difficult, as the windowing sketch below illustrates.
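To make the limitation concrete, here is a hedged continuation of the DStream sketch above: windowing in Spark Streaming is defined purely over processing time, so this counts whatever happened to arrive in the last 30 seconds of wall-clock time, sliding every 10 seconds, regardless of when the events were actually generated.

```scala
// Window (30s) and slide (10s) are multiples of the 5s batch interval;
// both are measured in processing time, not event time
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()
```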

Complex, low-level API

This is easy to understand: the API provided by DStream (Spark Streaming's data model) is similar to the RDD API and is very low level. When we write a Spark Streaming program, we are essentially constructing the RDD DAG and then running it on the Spark engine. The problem is that the execution efficiency of that DAG can vary widely with the skill of the developer, which makes for a very poor developer experience, something no foundational framework wants (the slogan of such frameworks is: just focus on your own business logic and leave the rest to me). This is one reason why many foundational systems emphasize being declarative.

Hard to reason about end-to-end applications

End-to-end here means from input straight through to output, for example Kafka feeding Spark Streaming, which then writes to HDFS. DStream can only guarantee exactly-once semantics for its own processing; the semantics of ingesting data into Spark Streaming and of writing from Spark Streaming to external storage usually have to be guaranteed by users themselves. Such guarantees are very challenging to write: to make the output exactly-once, for instance, the target storage system must either be idempotent or support transactional writes, which is not an easy task for developers.
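For contrast, here is a sketch of what end-to-end exactly-once looks like in Structured Streaming: a replayable source such as Kafka, an idempotent sink such as the built-in file sink, and a checkpoint directory that persists offsets together give a pipeline that can restart without losing or duplicating data. The broker address, topic, and paths are placeholders, and the Kafka source assumes the spark-sql-kafka connector is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ExactlyOnceSketch").getOrCreate()

// Kafka is replayable: offsets recorded in the checkpoint can be re-read
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "events")                    // placeholder topic
  .load()

// The file sink is idempotent, so replayed batches do not duplicate output
events.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  .option("path", "/tmp/out")                       // placeholder output path
  .option("checkpointLocation", "/tmp/checkpoint")  // offsets + sink log
  .start()
  .awaitTermination()
```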

Batch and stream code are not unified

Although batch and streaming originally grew up as two separate systems, unifying them is genuinely necessary; we do sometimes need to run our stream processing logic over batch data. On this point, when Google proposed the Dataflow computing service in 2014, it criticized the streaming/batch terminology and put forward the concepts of unbounded/bounded data instead. Although DStream is a wrapper around RDD, completely converting DStream code into RDD code still takes some work, to say nothing of the fact that Spark batch jobs now use the DataSet/DataFrame API.

2. Advantages of Structured Streaming

In contrast, take a look at the advantages of Structured Streaming:

A concise model. Structured Streaming's model is concise and easy to understand: users can simply think of a stream as an infinitely growing table, as the sketch below shows.
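A minimal sketch of that model, reusing the spark session from the sketch above: lines from a socket are rows appended to an unbounded table, and a running word count is just a groupBy over that table.

```scala
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost") // placeholder host/port
  .option("port", 9999)
  .load()

// The socket source yields a single string column named "value"
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

wordCounts.writeStream
  .outputMode("complete") // emit the full updated counts each trigger
  .format("console")
  .start()
```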

Consistent API. Because most of the API is shared with Spark SQL, users familiar with Spark SQL find it easy to pick up, and the code is very concise. Batch and stream programs can also share code, with no need to develop two separate codebases, which significantly improves development efficiency (see the sketch below).
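A hedged illustration of that code sharing, again reusing the spark session from earlier: the same transformation function runs over a static DataFrame or a streaming one, and only the read boundary changes. The directory paths are placeholders.

```scala
import org.apache.spark.sql.DataFrame

// One definition of the logic, independent of batch vs. stream
def wordCounts(df: DataFrame): DataFrame =
  df.selectExpr("explode(split(value, ' ')) AS word")
    .groupBy("word")
    .count()

// Batch: a static directory of text files
val historical = wordCounts(spark.read.text("/tmp/history"))

// Stream: the identical logic over files as they arrive
val live = wordCounts(spark.readStream.text("/tmp/incoming"))
```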

Excellent performance. Structured Streaming not only shares its API with Spark SQL, it also directly uses Spark SQL's Catalyst optimizer and the Tungsten execution engine, giving it excellent data processing performance. In addition, it benefits directly from future performance optimizations to Spark SQL.

Multilingual support. Structured Streaming directly supports the languages currently supported by Spark SQL, including Scala, Java, Python, R, and SQL. Users can choose their favorite language for development.

It also supports input from and output to multiple data sources, such as Kafka, sockets, and JSON files; see the sketch below.
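For example, a file-based JSON source is a one-liner once a schema is supplied (file sources require an explicit schema). The column names and directory below are illustrative only, chosen to match the IoT example used earlier.

```scala
import org.apache.spark.sql.types.{DoubleType, StringType, StructType, TimestampType}

val schema = new StructType()
  .add("device", StringType)
  .add("temperature", DoubleType)
  .add("eventTime", TimestampType)

// Watches the directory and treats each new JSON file as new rows
val jsonStream = spark.readStream
  .schema(schema)
  .json("/tmp/incoming-json") // placeholder directory

jsonStream.writeStream
  .format("console")
  .start()
```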

Based on Event Time. Compared with Spark Streaming's Processing Time, event-time processing is more accurate and better matched to business scenarios.

Event time: the time when the data actually occurred. For example, a user browsing a page generates a browsing log stamped with that point in time.

Processing time: the point in time at which the log data actually reaches the computing framework and is processed; in short, when your Spark program reads the log.

Event time is the time embedded in the data itself. For many applications, users want to operate on this event time. For example, to get the number of events generated by an IoT device per minute, you want to use the time the data was generated (the event time in the data), rather than the time Spark received it. Event time is naturally expressed in this model: each event from a device is a row in the table, and the event time is a column value in that row, as the sketch below shows.
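A sketch of event-time windowing over the hypothetical jsonStream from the JSON-source sketch above: the window is computed from the eventTime column embedded in the data, not from arrival time, and the watermark bounds how long state is kept around for late records.

```scala
import org.apache.spark.sql.functions.{col, window}

val perMinute = jsonStream
  .withWatermark("eventTime", "10 minutes")      // tolerate 10 min of lateness
  .groupBy(window(col("eventTime"), "1 minute"), // 1-minute event-time windows
           col("device"))
  .count()

perMinute.writeStream
  .outputMode("update") // emit only windows updated in each trigger
  .format("console")
  .start()
```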

DataFrame processing, as in Spark 2.x batch jobs, is supported.

It solves a problem Spark Streaming had with code upgrades: a changed DAG would cause task failures and make it impossible to resume from the previous checkpoint.

As a scalable, fault-tolerant stream processing engine built on Spark SQL, it lets real-time stream computation use the same processing approach (DataFrame & SQL) as offline computation.

Stream computation can be expressed in the same way as batch computation over static data.

The underlying principles of the two engines, however, are completely different.

Spark Streaming adopts micro-batch processing: each batch interval produces one batch, i.e., one RDD, and we operate on that RDD while the framework continues to receive and process new data.

Structured Streaming treats real-time data as a continuously appended table: each record on the stream is like a new row appended to that table.

The above is the editor's overview of how to use Spark for real-time stream computing. If you happen to have similar doubts, you can refer to the analysis above. To learn more, you are welcome to follow the industry information channel.
