This article shares an analysis of how to work with the Delta Lake data lake. The content is quite practical, so it is shared here for reference; hopefully you will get something out of it after reading. Let's take a look.
1. Introduction to Delta Lake features
Delta Lake is a storage middle layer, carrying schema information, that sits between the Spark computing framework and the storage system. It brings three main capabilities to Spark:
First, Delta Lake enables Spark to support data updates and deletes (see the sketch after this list).
Second, Delta Lake enables Spark to support transactions.
Third, it supports data version management, allowing users to query historical snapshots of the data.
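As a minimal sketch of the update and delete capability, assuming the DeltaTable API from delta-core 0.7 and a table already written to tmp/delta-table (the path used by the examples later in this article):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.expr

// Load the Delta table by path.
val deltaTable = DeltaTable.forPath(spark, "tmp/delta-table")

// Update: add 100 to every even id.
deltaTable.update(
  condition = expr("id % 2 == 0"),
  set = Map("id" -> expr("id + 100")))

// Delete: remove every remaining odd id.
deltaTable.delete(condition = expr("id % 2 == 1"))

Each of these calls is recorded as a new commit in the transaction log, which is what makes the versioning and audit features described below possible.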
Core characteristics
ACID transactions: provides ACID transactions for the data lake, ensuring data integrity when multiple pipelines read and write data concurrently.
Data versioning and time travel: provides snapshots of the data, allowing developers to access and restore earlier versions for audits, rollbacks, or reproducing experiments (see the time-travel sketch after this list).
Scalable metadata management: stores the metadata of tables and files, treats metadata as data, and records the mapping between metadata and data in the transaction log.
Unified streaming and batch processing: a Delta table can serve as a batch table as well as a streaming source and sink.
Data operation auditing: the transaction log records the details of every change made to the data, providing a complete audit trail of the changes.
Schema management: automatically verifies that the schema of data being written is compatible with the table's schema, and supports explicitly adding new columns and automatically evolving the schema.
Table operations (similar to SQL in traditional databases): merge, update, delete, and so on, provided through a Java/Scala API fully compatible with Spark.
Unified format: all data in Delta Lake is stored in the Apache Parquet format.
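A minimal time-travel sketch, assuming the versionAsOf and timestampAsOf reader options available in delta-core 0.7 (the timestamp value is only an example):

// Read the table as of an earlier version (version 0 is the first commit).
val v0 = spark.read.format("delta").option("versionAsOf", 0).load("tmp/delta-table")
v0.show()

// Alternatively, read the table state as of a point in time.
val byTime = spark.read.format("delta")
  .option("timestampAsOf", "2020-06-01 00:00:00")
  .load("tmp/delta-table")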
Delta's features, such as ACID transaction management, data atomicity, metadata handling, and time travel, are all implemented on top of the transaction log.
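To make this concrete, the transaction log lives in a _delta_log directory alongside the data files; each commit appends a zero-padded, monotonically numbered JSON file. An illustrative layout (file names are examples):

tmp/delta-table/
├── _delta_log/
│   ├── 00000000000000000000.json   <- commit 0 (the initial write)
│   └── 00000000000000000001.json   <- commit 1 (e.g. an update or delete)
├── part-00000-....snappy.parquet
└── part-00001-....snappy.parquet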
To put it bluntly, Delta Lake is just a library.
Delta Lake is a lib rather than a service, and unlike HBase it does not need to be deployed separately; it attaches directly to the compute engine. Currently only the Spark engine is supported. What does that mean? Operationally there is little difference between Delta Lake and ordinary Parquet files: as long as you add the delta package to your Spark project and follow the standard Spark data source API, you can use it, so the deployment and usage cost is extremely low.
What Delta Lake really consists of
Parquet files + meta files + a set of APIs = Delta Lake.
So there is no mystery to Delta; on disk it is no different from Parquet. What it adds are the many features and capabilities provided through the meta files and the corresponding APIs. The only difference between using it and using Parquet in Spark is replacing the format string "parquet" with "delta".
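A minimal illustration of that one-string difference (df stands for any DataFrame; the paths are examples):

// Writing plain Parquet:
df.write.format("parquet").save("tmp/parquet-table")

// Writing Delta Lake: the identical call, only the format string changes.
df.write.format("delta").save("tmp/delta-table")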
[Figure: the data lake provides a variety of data services in a one-stop fashion.]
2. Delta test
The tests below use Spark 3.0 with Delta 0.7. The first step is to import the dependency:
<dependency>
    <groupId>io.delta</groupId>
    <artifactId>delta-core_2.12</artifactId>
    <version>0.7.0</version>
</dependency>
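For sbt projects, the equivalent coordinate would be the following (an assumption derived from the Maven coordinates above):

libraryDependencies += "io.delta" %% "delta-core" % "0.7.0"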
Using Delta from Spark is as easy as using data formats such as JSON or CSV: simply pass the string "delta" to the format function. For example, to create a table, the Scala code is as follows:
val data = spark.range(0, 5)
data.write.format("delta").save("tmp/delta-table")
The schema information is inferred from the DataFrame.
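As a sketch of how this plays out on later writes (assuming the table created above), appending data with the same schema succeeds, while a mismatched schema is rejected by Delta's schema enforcement:

// Append five more rows; the schema (a single long column "id") matches, so this succeeds.
spark.range(5, 10).write.format("delta").mode("append").save("tmp/delta-table")

// Appending a DataFrame with an incompatible schema would fail with an AnalysisException.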
Reading the table back:
spark.read.format("delta").load("tmp/delta-table").show()
Delta Lake's API is basically the same as Spark's, with little change. Delta is built entirely on top of Spark and supports both real-time and offline processing. It also suits scenarios with many reads and few updates, or with updates arriving in multiple batches.
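A minimal streaming sketch of that unified real-time/offline support, assuming the standard Structured Streaming integration in delta-core 0.7 (the checkpoint and sink paths are examples):

// Treat the Delta table as a streaming source...
val stream = spark.readStream.format("delta").load("tmp/delta-table")

// ...and continuously write it to another Delta table as a sink.
stream.writeStream
  .format("delta")
  .option("checkpointLocation", "tmp/checkpoints/delta-sink")
  .start("tmp/delta-sink")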
The above is an analysis of how to work with the Delta Lake data lake. There are likely some points here that you will see or use in your daily work; hopefully this article helps you learn more about them.