This article describes how to perform time travel and version management in the data lake Delta Lake. It is very practical, so it is shared here; I hope you get something out of it after reading.
Delta Lake supports data versioning and time travel: it provides data snapshots that let developers access and restore earlier versions of the data for auditing, rollback, or recomputation.
1. Use cases
Delta Lake's time travel uses a multi-version management mechanism to query historical snapshots of a Delta table. Time travel has the following use cases:
1) Reproducing data analyses, reports, or other outputs (for example, machine learning models). This mainly helps with debugging and audit reviews, especially in regulated industries.
2) Writing complex time-based queries.
3) Fixing erroneous records in the data.
4) Providing snapshot isolation for a set of queries against a rapidly changing table.
2. Configuration
DataFrame reads support specifying the version of a Delta Lake table when creating the DataFrame:
val df1 = spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/delta/events")
val df2 = spark.read.format("delta").option("versionAsOf", version).load("/delta/events")
For the version number, you can pass a version value directly, as follows:
val df2 = spark.read.format("delta").option("versionAsOf", 0).table(tableName)
For timestamp strings, the value must be in date or timestamp format. For example:
val df1 = spark.read.format("delta").option("timestampAsOf", "2020-06-28").load("/delta/events")
val df2 = spark.read.format("delta").option("timestampAsOf", "2020-06-28T00:00:00.000Z").load("/delta/events")
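The same time-travel reads can also be expressed with SQL clauses (a sketch against the same /delta/events path; whether VERSION AS OF / TIMESTAMP AS OF are accepted this way depends on your Spark and Delta versions):

// SQL time-travel equivalents of the DataFrame reads above
val byVersion = spark.sql("SELECT * FROM delta.`/delta/events` VERSION AS OF 0")
val byTimestamp = spark.sql("SELECT * FROM delta.`/delta/events` TIMESTAMP AS OF '2020-06-28'")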
Because a Delta Lake table can be updated between reads, two DataFrames produced by reading the same data twice may differ: one read may happen before an update and the other after it. With time travel, you can pin a fixed version of the data across multiple calls:
val latest_version = spark.sql("SELECT max(version) FROM (DESCRIBE HISTORY delta.`/delta/events`)").collect()
val df = spark.read.format("delta").option("versionAsOf", latest_version(0).getLong(0)).load("/delta/events")
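Equivalently (a sketch, assuming the same /delta/events path), the DeltaTable Scala API exposes the commit history as a DataFrame, which avoids the SQL round trip:

import io.delta.tables._

// history(1) returns a one-row DataFrame describing the most recent commit
// (columns include version, timestamp, operation, ...).
val latestVersion = DeltaTable.forPath(spark, "/delta/events")
  .history(1)
  .select("version")
  .collect()(0)
  .getLong(0)
val pinned = spark.read.format("delta").option("versionAsOf", latestVersion).load("/delta/events")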
3. Data retention
By default, Delta Lake keeps commit history for the last 30 days. This means you can read the data as of any version from the last 30 days, with the following considerations:
3.1 VACUUM has not been run on the Delta table. VACUUM deletes data files that are no longer referenced by the table and that are older than the retention threshold; it is available in both SQL and API forms.
SQL syntax:
VACUUM eventsTable -- vacuum files not required by versions older than the default retention period
VACUUM '/data/events' -- vacuum files in a path-based table
VACUUM delta.`/data/events/`
VACUUM delta.`/data/events/` RETAIN 100 HOURS -- vacuum files not required by versions more than 100 hours old
VACUUM eventsTable DRY RUN -- do a dry run to get the list of files to be deleted
Scala API:
import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)

deltaTable.vacuum() // vacuum files not required by versions older than the default retention period
deltaTable.vacuum(100) // vacuum files not required by versions more than 100 hours old
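Note that the vacuum(100) call above uses a 100-hour horizon, which is shorter than the default 7-day (168-hour) retention threshold; Delta rejects such calls unless the safety check is disabled first. A minimal sketch, reusing the deltaTable from above:

// Assumption: the retention-duration safety check must be disabled before
// vacuuming with a horizon shorter than the default 7 days.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
deltaTable.vacuum(100) // permanently removes data files older than 100 hours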
Retention can be configured through the following two Delta table properties (see the sketch after the list):
delta.logRetentionDuration = "interval <interval>": controls how long the table's commit history is retained. Log entries older than the retention interval are automatically cleaned up each time a checkpoint is written. Setting this to a large value retains more history; this does not hurt performance, because operations on the log are constant-time, and operations on history are parallelized (though they become more expensive as the log grows). The default is interval 30 days.
delta.deletedFileRetentionDuration = "interval <interval>": data files within this window are not deleted by the VACUUM command. The default is interval 7 days. To be able to access 30 days of historical data, set delta.deletedFileRetentionDuration = "interval 30 days"; note that this setting may increase your storage costs.
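For example (a sketch, reusing the path-based events table from above), both properties can be set together with ALTER TABLE ... SET TBLPROPERTIES:

// Keep 60 days of commit history and protect 30 days of data files from VACUUM.
// (Raising deletedFileRetentionDuration increases storage costs, as noted above.)
spark.sql("""
  ALTER TABLE delta.`/delta/events` SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 60 days',
    'delta.deletedFileRetentionDuration' = 'interval 30 days'
  )
""")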
Note: the VACUUM command does not delete log files; log files are deleted automatically after checkpoints are written. To read data from a previous version, both the log files and the data files for that version must still be retained.
4. Examples
Restore accidentally deleted data for user 111:
INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
WHERE userId = 111
Fix erroneously updated data:
MERGE INTO my_table target
USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source
ON source.userId = target.userId
WHEN MATCHED THEN UPDATE SET *
Query the number of new users added over the past seven days:

SELECT count(distinct userId) - (
  SELECT count(distinct userId)
  FROM my_table TIMESTAMP AS OF date_sub(current_date(), 7))
FROM my_table

The above is how to do time travel and version management in the data lake Delta Lake. Some of these points may well come up in your daily work; I hope you can learn more from this article.