2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com) 06/01 Report--
Many newcomers are unsure how to use the RunOnce trigger efficiently in Spark Structured Streaming. This article summarizes the problem and its solution; I hope it helps you resolve it.
Traditionally, when people think of stream processing, words such as "real-time", "24/7", or "always on" come to mind. In production, however, you may encounter situations where data only arrives at regular intervals, such as every hour or every day. Incremental processing is still beneficial in these cases, but running a 24/7 Streaming job on a cluster is wasteful when only a small amount of processing per day is actually needed.
Fortunately, by using Structured Streaming's Run Once trigger, introduced in Spark 2.2, you gain both the benefits of the Catalyst Optimizer and the cost savings of not keeping an idle job running on the cluster.
I. Structured Streaming's Triggers
In Structured Streaming, a Trigger specifies how often a Streaming query produces results. When a trigger fires, Spark checks whether new data is available. If there is new data, the query is executed incrementally from the point where it last left off. If there is no new data, the stream sleeps until the trigger fires again.
By default, Structured Streaming runs with as little latency as possible: each trigger fires immediately after the previous one finishes. For use cases with relaxed latency requirements, Structured Streaming supports a ProcessingTime trigger, which fires the query at a user-provided interval, such as every minute.
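As a sketch of the ProcessingTime option (assuming an active SparkSession named `spark`; the paths, schema variable `mySchema`, and checkpoint location are placeholders):

```scala
import org.apache.spark.sql.streaming.Trigger

// Hypothetical input path and schema; adjust to your environment.
val df = spark.readStream.format("json").schema(mySchema).load("/in/path")

// Fire the trigger at most once per minute instead of as fast as possible.
df.writeStream
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "/checkpoints/my-query")
  .format("parquet")
  .start("/out/path")
```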
This is useful, but it still requires the cluster to run 24/7. In contrast, the RunOnce trigger executes the query exactly once and then stops it.
The trigger is specified when you start the stream:

import org.apache.spark.sql.streaming.Trigger

// Load your Streaming DataFrame
val sdf = spark.readStream.format("json").schema(my_schema).load("/in/path")

// Perform transformations and then write...
sdf.writeStream.trigger(Trigger.Once).format("parquet").start("/out/path")
II. The Efficiency of RunOnce Compared with Batch
1. Bookkeeping
When running a batch job that performs incremental updates, you usually have to track which data is new, which should be processed, and which has already been processed. Structured Streaming does all of this for you: as with any streaming application, you only need to care about business logic, not low-level bookkeeping.
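The bookkeeping lives in the checkpoint directory. A minimal sketch (paths and the `mySchema` variable are hypothetical) where Structured Streaming records which input files have already been processed, so a later Trigger.Once run picks up only new data:

```scala
import org.apache.spark.sql.streaming.Trigger

// The checkpoint directory (hypothetical path) stores the offset and commit
// logs, so each Trigger.Once run resumes exactly where the last one stopped.
spark.readStream
  .format("json")
  .schema(mySchema)
  .load("/in/path")
  .writeStream
  .trigger(Trigger.Once)
  .option("checkpointLocation", "/checkpoints/incremental-etl")
  .format("parquet")
  .start("/out/path")
```

Re-running this same job later processes only the files that arrived since the previous run, with no hand-written tracking logic.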
2. Table Atomicity
One of the most important properties of a big data processing engine is how it tolerates errors and failures. ETL jobs may (and in practice often do) fail. If your job fails, you need to make sure its output is cleaned up; otherwise the next successful run will leave you with duplicate or junk data. When writing file-based tables with Structured Streaming, the engine commits all files created by each run to a log after each successful trigger. When Spark reads the table back, it uses this log to identify which files are valid. This ensures that garbage introduced by failures is never consumed by downstream applications.
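When Spark reads such a file-based output back, it consults the `_spark_metadata` log in the output directory and lists only committed files. A short sketch (the path is a placeholder):

```scala
// Reading the streaming sink's output as a batch table: Spark uses the
// _spark_metadata log under /out/path to enumerate only committed files,
// so partial output from failed runs is invisible to this reader.
val results = spark.read.format("parquet").load("/out/path")
results.show()
```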
3. Stateful Operations Across Runs
If your data stream may produce duplicate records but you want exactly-once semantics, how would you do that in batch processing? With Structured Streaming, you can use dropDuplicates() to deduplicate. Configure the watermark to be long enough to span several Streaming job runs, and you are guaranteed not to process duplicate data across runs.
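A minimal sketch of cross-run deduplication, assuming the records carry an event-time column `eventTime` and a unique key `id` (both hypothetical names). The watermark bounds how long deduplication state is kept, so it should exceed the interval between scheduled runs:

```scala
// Keep deduplication state for 7 days, longer than a daily run interval,
// so duplicates that arrive in different runs are still dropped.
val deduped = spark.readStream
  .format("json")
  .schema(mySchema)
  .load("/in/path")
  .withWatermark("eventTime", "7 days")
  .dropDuplicates("id", "eventTime")
```

Including the watermarked column in dropDuplicates lets Spark expire old state instead of holding it forever.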
4. Cost Savings
Running a 24/7 Streaming job is wasteful when some delay in computation is acceptable, or when the data itself is only generated hourly or daily. To get all the benefits of Structured Streaming described above, you used to have to keep a cluster occupied running the program continuously; with the Run Once trigger, you no longer need to.
After reading the above, have you mastered how to use the RunOnce trigger efficiently in Spark Structured Streaming? If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!