What are the characteristics of Spark Structured Streaming

2025-03-01 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article describes the characteristics of Spark Structured Streaming. The content is concise and easy to follow, and we hope you find the detailed introduction useful.

The following introduces the basic concepts of Structured Streaming and its characteristics in storage, streaming execution, fault tolerance, and performance, as well as its event-time processing mechanism, and finally presents some practical application scenarios.

First, TD (Tathagata Das) lays out the problems and concepts of stream processing. TD noted that stream processing is difficult to make robust because it has the following significant sources of complexity:

First, data arrives in different formats (JSON, Avro, binary), may be dirty, and may be late and out of order.

Second, loading is complex: event-time-based processing needs to support interactive queries and integrate with machine learning.

Third, there are different storage systems and formats (SQL, NoSQL, Parquet, etc.), and fault tolerance must be considered for each.

Because it runs on the Spark SQL engine, Spark Structured Streaming naturally inherits Spark's strengths: good performance, scalability, and fault tolerance. It also offers a rich, unified, high-level API, which makes complex data and workflows easy to handle. In addition, Spark itself and the many storage systems it integrates with form a rich ecosystem. These advantages have helped Spark Structured Streaming gain wide development and adoption.

A stream is modeled as an unbounded table: new data from the stream is appended to this infinite table, and a query over it can be decomposed into steps such as reading JSON data from Kafka, parsing the JSON, storing it in a structured Parquet table, and guaranteeing end-to-end fault tolerance. The features include:
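The unbounded-table model can be illustrated with a small plain-Python sketch (this is not the Spark API; the `UnboundedTable` class and its methods are invented for illustration): each trigger appends new records to the table, and a "streaming query" is simply a query over everything seen so far.

```python
import json

class UnboundedTable:
    """Toy model of Structured Streaming's unbounded input table."""
    def __init__(self):
        self.rows = []

    def append_batch(self, raw_records):
        # Each trigger appends the new stream data to the table.
        self.rows.extend(json.loads(r) for r in raw_records)

    def query_count_by_key(self, key):
        # A "streaming query" is just a query over all rows seen so far.
        counts = {}
        for row in self.rows:
            counts[row[key]] = counts.get(row[key], 0) + 1
        return counts

table = UnboundedTable()
table.append_batch(['{"user": "a"}', '{"user": "b"}'])
table.append_batch(['{"user": "a"}'])
print(table.query_count_by_key("user"))  # {'a': 2, 'b': 1}
```

In real Structured Streaming the engine never materializes the whole table; it maintains just enough state to produce each incremental result.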

Support for a variety of sources, such as Files, Kafka, and Kinesis.

join() and union() can be used to combine multiple data sources of different types.

Reads return a DataFrame with the structure of an unbounded table.

You can choose SQL (for BI analysts), DataFrame (for data scientists), or Dataset (for data engineers) as needed, and they all have almost the same semantics and performance.

Kafka records whose values are JSON strings can be converted from String into nested columns; many optimized built-in functions, such as from_json(), handle this step, and various custom functions (lambdas, flatMap) can assist in processing.
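A minimal plain-Python analogue of the JSON-parsing step (the real Spark `from_json()` takes a schema and produces nested columns; here `from_json_like` and `extract_field` are invented helpers that just parse a string and walk a dotted path):

```python
import json

def from_json_like(value: str) -> dict:
    """Parse a Kafka record's String value into a nested structure."""
    return json.loads(value)

def extract_field(record: dict, path: str):
    """Walk a dotted path into nested columns, e.g. 'device.id'."""
    for part in path.split("."):
        record = record[part]
    return record

raw = '{"device": {"id": "d1", "temp": 21.5}, "ts": "2018-01-01T00:00:00"}'
parsed = from_json_like(raw)
print(extract_field(parsed, "device.temp"))  # 21.5
```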

In the Sink step you can write to an external storage system, such as Parquet. For the Kafka sink, foreach is supported for arbitrary processing of the output data, and transactions and exactly-once semantics are supported.
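One way exactly-once output can be achieved is with an idempotent sink: if a batch is replayed after a failure, writing it again has no effect. A toy sketch (the `IdempotentSink` class is invented for illustration; Spark's real foreach sink exposes per-partition open/process/close hooks and passes batch identifiers for exactly this kind of deduplication):

```python
class IdempotentSink:
    """Dedup writes by batch_id so replays after failure are no-ops."""
    def __init__(self):
        self.committed = set()
        self.storage = []

    def write_batch(self, batch_id, rows):
        if batch_id in self.committed:
            return  # replayed batch: skip, preserving exactly-once output
        self.storage.extend(rows)
        self.committed.add(batch_id)

sink = IdempotentSink()
sink.write_batch(0, ["a", "b"])
sink.write_batch(0, ["a", "b"])  # replay after a simulated failure
print(sink.storage)  # ['a', 'b']
```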

Supports fixed-interval micro-batch processing with high throughput, supports low-latency continuous processing (Spark 2.3), and supports a checkpoint mechanism.

Structured source data from Kafka can be processed within seconds and made ready for querying.

Spark SQL converts a batch query into a series of incremental execution plans, so the data can be processed batch by batch.
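The incremental-execution idea can be sketched as keeping running state between micro-batches, so each batch only scans its own delta instead of re-reading the whole stream (the `IncrementalCount` class is invented; Spark maintains equivalent state internally in its state store):

```python
class IncrementalCount:
    """Incrementally maintained aggregate across micro-batches."""
    def __init__(self):
        self.state = {}  # key -> running count

    def process_batch(self, batch):
        # Only the new batch is scanned; prior results live in self.state.
        for key in batch:
            self.state[key] = self.state.get(key, 0) + 1
        return dict(self.state)

agg = IncrementalCount()
agg.process_batch(["x", "y"])
result = agg.process_batch(["x"])
print(result)  # {'x': 2, 'y': 1}
```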

For fault tolerance, Structured Streaming uses a checkpoint mechanism that writes the progress offsets to stable storage, saved as JSON to support backward compatibility. This allows recovery from any point of failure (for example, automatically adding a filter to handle corrupted data) and guarantees end-to-end exactly-once semantics.
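The offset checkpoint can be sketched as a small JSON file recording the last committed batch: on restart, the engine reads it back and resumes from those offsets. The file layout below is invented for illustration (real checkpoint directories also contain state and commit logs), but the JSON-on-stable-storage idea is the same:

```python
import json
import os
import tempfile

def save_offsets(path, batch_id, offsets):
    # Progress is persisted as JSON, which stays readable across versions.
    with open(path, "w") as f:
        json.dump({"batchId": batch_id, "offsets": offsets}, f)

def recover_offsets(path):
    # On restart, resume from the last committed offsets (or from scratch).
    if not os.path.exists(path):
        return {"batchId": -1, "offsets": {}}
    with open(path) as f:
        return json.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "offsets.json")
save_offsets(ckpt, 7, {"topic-0": 1042})
print(recover_offsets(ckpt))  # {'batchId': 7, 'offsets': {'topic-0': 1042}}
```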

In terms of performance, Structured Streaming reuses the Spark SQL optimizer and the Tungsten engine, cutting cost by a factor of three. More information can be found in the author's blog.

Structured Streaming's processing logic is isolated and configurable (for example, customizing the JSON input data format), and it is easy to tell whether an execution is a batch query or a streaming query. TD also compared the latency, throughput, and resource allocation of batch processing, micro-batch stream processing, and continuous stream processing.

For time windows, Structured Streaming supports aggregation based on event time (event-time), which makes it easier to reason about what happened within a given interval, and it supports user-defined aggregate functions (User Defined Aggregate Function, UDAF). Structured Streaming can also maintain distributed state across triggers: the state is kept in memory and persisted using HDFS's Write-Ahead Log (WAL) mechanism. It can automatically handle late data and update previously saved state. Because historical state can grow without bound and cause performance problems, Spark uses watermarking to drop old aggregate state that will no longer be updated, limiting the size of the state records. Custom stateful functions are also supported, for example with timeouts based on event time or processing time, in both Scala and Java.
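Watermarking can be sketched as: track the maximum event time seen, subtract the allowed lateness to get the watermark, drop events for windows already finalized, and evict window state that can no longer receive data. This is a toy model with invented names (`WatermarkedWindows`); real Spark expresses the same idea with `withWatermark` on a timestamp column:

```python
class WatermarkedWindows:
    """Evict per-window state once it falls behind the watermark."""
    def __init__(self, lateness):
        self.lateness = lateness
        self.max_event_time = 0
        self.windows = {}  # window_start -> event count

    def add_event(self, event_time, window_size=10):
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.lateness
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            return  # event too late: its window was already finalized
        self.windows[start] = self.windows.get(start, 0) + 1
        # Evict windows that can no longer receive data.
        for s in list(self.windows):
            if s + window_size <= watermark:
                del self.windows[s]

w = WatermarkedWindows(lateness=15)
w.add_event(5)
w.add_event(42)   # advances the watermark to 27; window [0, 10) is evicted
print(sorted(w.windows))  # [40]
```

Bounding state this way is what keeps long-running event-time aggregations from growing indefinitely.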

TD also gave concrete examples of stream processing applications in his talk. Apple's information security platform generates millions of events per second, and Structured Streaming can be used there for anomaly detection. The following figure shows the architecture of the platform:

In this architecture: first, any raw log can be loaded into the structured log store through ETL, and batch control enables fast disaster recovery; second, many other data sources (DHCP sessions, slowly changing data) can be joined in; third, it offers several mixed working modes, including real-time alerting, historical reports, and ad-hoc analysis, with a unified API that supports multiple kinds of analysis (such as a real-time alerting system) and rapid deployment; fourth, it achieves the performance of processing millions of events within seconds.

The above has covered the characteristics of Spark Structured Streaming. We hope you have picked up some knowledge or skills from it; to learn more or enrich your knowledge, you are welcome to follow the industry information channel.
