
How Databricks uses Spark Streaming and Delta Lake to monitor the data quality of streaming data


This article explains how Databricks uses Spark Streaming and Delta Lake to monitor the quality of streaming data. It describes a data management architecture that detects corrupted or bad records through active monitoring and analysis as the data arrives, without creating a bottleneck.

Build a streaming data analysis and monitoring process

At Databricks, we see a growing variety of data processing patterns among our customers, and these new patterns push the limits of what is possible in both speed and quality. To resolve this tension, we looked for tools that not only sustain the required data velocity but also deliver an acceptable level of data quality. Structured Streaming and Delta Lake are well suited for the ingestion and storage layers because together they create a scalable, fault-tolerant, near-real-time system with exactly-once processing guarantees.
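As a rough illustration of that ingestion-and-storage pairing, the sketch below reads a Kafka stream and writes it straight into a Delta table; the broker address, topic name, and checkpoint path are hypothetical, and toTable assumes Spark 3.1+. The checkpoint plus Delta's transaction log is what provides the fault tolerance and exactly-once behavior:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("trades-ingest").getOrCreate()

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "trades")                    // hypothetical topic
  .load()                                           // key/value arrive as binary; parsing omitted here
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/trades") // enables exactly-once recovery
  .toTable("trades_delta")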

An acceptable tool for enterprise-grade data quality analysis is harder to find, especially because such a tool needs to aggregate the state of data quality metrics. In addition, checks that span the entire dataset (for example, what percentage of records are null) grow more computationally expensive as the volume of ingested data grows. A streaming system needs this capability regardless, a requirement that rules out many of the available tools.

In our initial solution, we chose Amazon's data quality inspection tool Deequ because it provides a simple yet powerful API, the ability to aggregate data quality metrics, and support for Scala. Going forward, other Spark-native tools will offer additional options.

Implementing streaming data quality monitoring

We simulate the data stream by running a small Kafka producer on an EC2 instance that writes simulated stock trade records to a Kafka topic, and we ingest the data into a Delta Lake table using the native Databricks connector. To demonstrate data quality checking in Spark Streaming, we exercise several capabilities of Deequ at different points in the pipeline:

Generate constraints based on historical data (see the constraint-suggestion sketch after this list)

Perform incremental quality analysis on arriving data using the foreachBatch operator

Use the foreachBatch operator to run (smaller) unit tests on arriving data, and quarantine poor-quality batches into a bad records table

For each arriving batch, write the latest metric state to a Delta table

Periodically run (larger) unit tests on the entire dataset and track the results in MLflow

Send a notification based on the verification result (e.g. by email or Slack)

Capture metrics in MLflow for visualization and record keeping.
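For the first item above, Deequ's constraint-suggestion runner can profile historical data and propose checks automatically. A minimal sketch, where the table name matches this example and the inspection loop is illustrative:

import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

// profile the historical data already in the Delta table
val historical = spark.read.table("trades_delta")

val suggestions = ConstraintSuggestionRunner()
  .onData(historical)
  .addConstraintRules(Rules.DEFAULT)
  .run()

// inspect the proposed constraints (e.g. completeness, non-negativity)
suggestions.constraintSuggestions.foreach { case (column, proposals) =>
  proposals.foreach(p => println(s"$column: ${p.description}"))
}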

We use MLflow to track data quality performance metrics over time and across Delta table versions, and a Slack connector for notifications and alerts.
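On the Slack side, the notification can be as simple as posting to an incoming webhook when a check fails. A minimal sketch, assuming a Java 11+ runtime and a hypothetical webhook URL (in practice, store the URL as a secret):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// hypothetical incoming-webhook URL
val webhookUrl = "https://hooks.slack.com/services/T000/B000/XXXX"
val payload = """{"text": "Deequ verification failed for table trades_delta"}"""

val request = HttpRequest.newBuilder(URI.create(webhookUrl))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(payload))
  .build()

HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())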

Because Spark offers a unified batch/streaming interface, we can extract reports, alerts, and metrics at any point in the pipeline, either as real-time updates or as batch snapshots. This is particularly useful for setting triggers or limits, so that if a metric crosses a threshold, a data quality remediation action can be taken. Note also that we do not alter the raw data as it first arrives: it is committed to our Delta table immediately, which means we never throttle the rate of data ingestion. Downstream systems can read directly from this table, and can be paused if any of the triggers or quality thresholds above are violated. Alternatively, we can easily create a view that excludes poor-quality records to provide a clean table.
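As an illustration of that flexibility, the same Delta metrics table can be read as a batch snapshot or as a live stream, and a view can hide quarantined records. A sketch, assuming Spark 3.1+ and hypothetical trades_bad / trade_id names:

// batch snapshot of the metrics vs. a live feed of the same table
val metricsSnapshot = spark.read.table("deequ_metrics")
val metricsFeed     = spark.readStream.table("deequ_metrics")

// a clean view that excludes records quarantined in a bad-records table
spark.sql("""
  CREATE OR REPLACE VIEW trades_clean AS
  SELECT t.* FROM trades_delta t
  LEFT ANTI JOIN trades_bad b ON t.trade_id = b.trade_id
""")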

At a high level, the code that performs our data quality tracking and validation looks like this:

spark.readStream.table("trades_delta")
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>

    // reassign our current state to the previous "next" state
    val stateStoreCurr = stateStoreNext

    // run analysis on the current batch, aggregated with saved state
    val metricsResult = AnalysisRunner.run(data = batchDF, ...)

    // verify the validity of our current micro-batch
    val verificationResult = VerificationSuite()
      .onData(batchDF)
      .addCheck(...)
      .run()

    // if verification fails, write the batch to the bad records table
    if (verificationResult.status != CheckStatus.Success) { ... }

    // write the current results into the metrics table
    Metric_results.write
      .format("delta")
      .mode("overwrite")
      .saveAsTable("deequ_metrics")
  }
  .start()
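The failure branch elided above could, for example, quarantine the whole micro-batch into a bad-records table. A sketch of one way to do that (the quarantine table name and the added batch_id column are assumptions for illustration):

import org.apache.spark.sql.functions.lit

// inside foreachBatch: quarantine the failing micro-batch
if (verificationResult.status != CheckStatus.Success) {
  batchDF
    .withColumn("batch_id", lit(batchId))   // tag records with the offending batch
    .write
    .format("delta")
    .mode("append")
    .saveAsTable("trades_bad_records")      // hypothetical quarantine table
}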

Using the data quality tool Deequ

Using Deequ in Databricks is straightforward: you first define an analyzer and then run it on a DataFrame. For example, we can track several relevant metrics with checks Deequ provides natively, including whether quantities and prices are non-negative, whether the originating IP address is non-null, and whether the symbol field is unique across all trades. Deequ's StateProvider objects are particularly useful in streaming configurations: they let us persist the state of our metrics in memory or on disk and aggregate those metrics later. This means each batch analyzes only the records in that batch rather than the entire table. Performance stays relatively stable even as data volumes grow, which matters in long-running production environments that must behave consistently at any data size.
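A minimal sketch of that pattern, using Deequ's stateful AnalysisRunner API inside foreachBatch; the column names (quantity, price, orig_ip, symbol) are assumed from the trade schema described above:

import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.AnalysisRunner

// analyzers matching the checks described above
val analysis = Analysis()
  .addAnalyzer(Compliance("quantity non-negative", "quantity >= 0"))
  .addAnalyzer(Compliance("price non-negative", "price >= 0"))
  .addAnalyzer(Completeness("orig_ip"))   // originating IP address is non-null
  .addAnalyzer(Uniqueness("symbol"))      // symbol unique across trades seen so far

// state providers let each run scan only the new batch and merge the rest
val stateStoreCurr = InMemoryStateProvider()
val stateStoreNext = InMemoryStateProvider()

val metricsResult = AnalysisRunner.run(
  data = batchDF,                       // the current micro-batch only
  analysis = analysis,
  aggregateWith = Some(stateStoreCurr), // merge with state from prior batches
  saveStatesWith = Some(stateStoreNext) // persist updated state for the next batch
)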

MLflow also works well for tracking metrics over time. In our notebook, we log every Deequ constraint analyzed in the foreachBatch code as an MLflow metric, using the Delta table's version ID and timestamp as run parameters. The MLflow tracking service integrated into Databricks notebooks makes this especially convenient.
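One way to do that logging from Scala is the MLflow Java client. This is a sketch only: the experiment ID, parameter values, and metric names are placeholders, not values from this pipeline:

import org.mlflow.tracking.MlflowClient
import org.mlflow.api.proto.Service.RunStatus

val client = new MlflowClient()            // uses the configured tracking URI
val experimentId = "0"                     // hypothetical experiment ID
val runId = client.createRun(experimentId).getRunId

// Delta version and timestamp as parameters, Deequ results as metrics
client.logParam(runId, "delta_version", "42")            // e.g. Delta table version
client.logParam(runId, "batch_timestamp", "2020-05-31T12:00:00Z")
client.logMetric(runId, "completeness_orig_ip", 0.998)   // e.g. a Deequ metric value
client.setTerminated(runId, RunStatus.FINISHED)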

By using Structured Streaming, Delta Lake, and Deequ together, we can eliminate the traditional tradeoff between data quality and speed and instead target an acceptable level of both. Flexibility is particularly important here, not only in how bad records are handled (quarantine, error reporting, alerts, and so on), but also architecturally (when and where are checks performed?) and in the surrounding ecosystem (how will the data be used?). Open-source technologies such as Delta Lake, Structured Streaming, and Deequ are key to this flexibility: as the technology evolves, being able to adopt the latest and most capable solutions is a driver of competitive advantage. Most importantly, the speed and quality of your data must not be set against each other but should move in step, especially as stream processing draws closer to core business operations. Soon this will not be a choice but an expectation and a requirement, and we are moving step by step toward that future.

That is how Databricks uses Spark Streaming and Delta Lake to monitor the quality of streaming data.
