How to Perform DDL Operations in the Data Lake Delta Lake

2025-04-07 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article shows you how to perform DDL operations in the data lake Delta Lake. The content is concise and easy to understand, and I hope you get something out of this detailed introduction.

I covered a brief introduction to Delta Lake, its features, and its basic operations earlier. This article focuses on Delta Lake's DDL operations, which depend on Spark's DataSource V2 and catalog APIs (Spark 3.0+). So when integrating Delta Lake with Spark, it is best to start with Spark 3.0, which was recently released.

Creating the SparkSession has some requirements: you need to add two configurations:

val spark = SparkSession
  .builder()
  .appName(this.getClass.getCanonicalName)
  .master("local[2]")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

1. Create a table

Delta Lake supports creating a table in two ways:

1). With DataFrameWriter, which should already be familiar: Spark's default way of writing files.

df.write.format("delta").saveAsTable("events") // create table in the metastore

df.write.format("delta").save("/delta/events") // create table by path

2). Delta Lake also supports creating tables with Spark SQL's DDL operation, CREATE TABLE.

-- Create table in the metastore
CREATE TABLE events (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING)
USING DELTA

When you create a table in the metastore with Delta Lake, the location of the data is recorded in the metastore. The benefit is obvious: others can find the table easily and do not have to know the physical location of the data. Note, however, that the metastore does not track whether the data content is valid.

2. Data partitioning

When building a data warehouse in production, data is usually partitioned to speed up queries and optimize DML operations. To create a partitioned table with Delta Lake, you only need to specify the partition column. Here is a common example of partitioning by date:

1). DDL operation

-- Create table in the metastore
CREATE TABLE events (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING)
USING DELTA
PARTITIONED BY (date)
LOCATION '/delta/events'

2). Scala API

df.write.format("delta").partitionBy("date").saveAsTable("events") // create table in the metastore
df.write.format("delta").partitionBy("date").save("/delta/events") // create table by path

3. Specify storage location

We can control where a Delta Lake table's data files are stored by specifying a path in the DDL.

This is very similar to Hive's external tables: a Delta Lake table at a specified location can be regarded as not managed by the metastore. When such a table is dropped, the data files are not actually deleted.
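The drop behavior described above can be sketched as follows (a minimal example, assuming a table named events was created with LOCATION '/delta/events'):

```sql
-- Dropping a path-based ("external"-style) Delta table removes only the
-- metastore entry; the data files under /delta/events remain on storage.
DROP TABLE events;

-- The data can still be queried directly by path afterwards.
SELECT * FROM delta.`/delta/events`;
```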

Suppose the data files already exist at the specified path when you create the Delta Lake table. Delta Lake then does the following at table creation time:

1). If you specify only the table name and path when creating it, as follows:

CREATE TABLE events
USING DELTA
LOCATION '/delta/events'

then the Hive metastore table automatically infers its schema, partitioning, and table properties from the existing data. This feature can be used to import existing data into the metastore.

2). If you do specify some of the configuration (schema, partitioning, or table properties), Delta Lake checks the configuration you specify against the existing data rather than re-inferring everything. If a configuration you specify does not match the existing data, an inconsistency exception is thrown.

4. Read data

Data can be queried directly with SQL; experienced Spark users can also use the DataFrame API to query the data.

SQL query

SELECT * FROM events -- query table in the metastore

SELECT * FROM delta.`/delta/events` -- query table by path

DataFrame query

spark.table("events") // query table in the metastore

spark.read.format("delta").load("/delta/events") // query table by path

DataFrames automatically read the latest snapshot of the data; users do not need to refresh the table. When predicates can be pushed down, Delta Lake automatically uses partitioning and statistics to optimize the query, thereby reducing the amount of data loaded.
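As an illustration of that pruning (a sketch, assuming the partitioned events table created earlier; a predicate on the partition column date lets Delta Lake skip irrelevant partitions and files):

```sql
-- Only partitions/files whose statistics can match this date range are scanned
SELECT * FROM events
WHERE date >= '2017-01-01' AND date <= '2017-01-31';
```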

5. Write data

A). Append

Spark's built-in append mode can be used to append data to an existing table:

df.write.format("delta").mode("append").save("/delta/events")
df.write.format("delta").mode("append").saveAsTable("events")

Delta Lake also supports INSERT INTO:

INSERT INTO events SELECT * FROM newEvents

B). Overwrite

Delta Lake also supports overwriting an entire table directly, simply by using overwrite mode.

The DataFrame API is as follows:

df.write.format("delta").mode("overwrite").save("/delta/events")
df.write.format("delta").mode("overwrite").saveAsTable("events")

The SQL API format is as follows:

INSERT OVERWRITE events SELECT * FROM newEvents

With the DataFrame API, you can also overwrite only the data in specified partitions. The following example overwrites only January 2017 data:

df.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "date >= '2017-01-01' AND date <= '2017-01-31'")
  .save("/delta/events")
