

How to analyze Delta Lake schema evolution




Many newcomers are not clear about how Delta Lake table schema evolution works. To help with that, this article explains it in detail; anyone who needs it is welcome to read along, and hopefully you will get something out of it.

The goal of what follows is to take a close look at Delta Lake schema evolution.

Data, like our experience, is always changing and accumulating. To keep up, our mental models must adapt to new data, some of which brings new dimensions: new ways of seeing things we had never considered before. These mental models are not so different from a table's schema; both define how we categorize and process new information.

As business problems and requirements evolve, so does the structure of the data. With Delta Lake, incorporating new dimensions as the data changes becomes easier, and users can control their table schemas with simple semantics. These tools include schema validation, which prevents users from accidentally polluting their tables with mistakes or garbage data, and schema evolution, which lets them add new columns to enrich the data.

Understanding table schemas

Every DataFrame in Apache Spark™ contains a schema that defines the shape of the data, such as data types, columns, and metadata. With Delta Lake, the table's schema is saved in JSON format inside the transaction log.
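For instance, here is a minimal sketch of inspecting a Delta table's schema, assuming a running SparkSession named spark and a hypothetical table path (the path is illustrative, not from the original example):

# A minimal sketch: inspect a Delta table's schema, assuming a running
# SparkSession `spark` and a hypothetical table path (illustrative only)
df = spark.read.format("delta").load("/tmp/delta/loan_by_state_delta")

# Tree view of the schema: column names, data types, nullability
df.printSchema()

# The same schema serialized as JSON, the representation Delta Lake
# keeps in the transaction log
print(df.schema.json())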

What is schema validation?

Schema validation is a safeguard in Delta Lake that ensures data quality by rejecting writes that do not match the table's schema. Like the front-desk manager of a busy restaurant who only seats guests with a reservation, it checks whether each column in the data being inserted appears in its list of expected columns (in other words, whether each column has a "reservation"), and rejects any write containing columns that are not on the list.

How does schema validation work?

Delta Lake performs schema validation on write, meaning every new write to a table is checked for compatibility with the target table's schema. If the schemas are not compatible, Delta Lake cancels the transaction entirely (no data is written) and raises an exception so the user knows about the mismatch.

To determine whether a write is compatible, Delta Lake applies the following rules. The DataFrame being written:

Cannot contain any columns that do not exist in the target table's schema. It is fine, however, for the incoming data to be missing some of the table's columns; those columns are simply filled with null values (as illustrated in the sketch after this list).

Cannot have column data types that differ from those in the target table. If a target table column contains StringType data but the corresponding DataFrame column contains IntegerType data, schema validation raises an exception and prevents the write.

Cannot contain column names that differ only in case. That means columns such as "Foo" and "foo" cannot both be defined in the same table. While Spark can run in case-sensitive or case-insensitive (the default) mode, Delta Lake is case-preserving but case-insensitive when storing the schema, whereas Parquet is case-sensitive when storing and returning column information. This restriction exists to avoid potential mistakes, data corruption, or data loss.
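As a quick illustration of the null-fill behavior in the first rule, here is a minimal sketch, assuming a hypothetical Delta table at DELTALAKE_PATH with columns addr_state (string) and count (long):

# A minimal sketch of the "missing columns become null" rule, assuming a
# hypothetical Delta table at DELTALAKE_PATH with columns
# addr_state (string) and count (long)
DELTALAKE_PATH = "/tmp/delta/loan_by_state_delta"   # assumed path

# This DataFrame omits the `count` column entirely
partial = spark.createDataFrame([("CA",), ("NY",)], ["addr_state"])

# The append succeeds: no unknown columns are introduced, and the
# missing `count` column is written as null for the new rows
partial.write.format("delta").mode("append").save(DELTALAKE_PATH)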

To illustrate, take a look at the code below and see what happens when you try to append some newly calculated columns to a Delta Lake table whose schema is not compatible with them.

# Generate a DataFrame of loans that we'll append to our Delta Lake table
loans = sql("""
    SELECT addr_state,
           CAST(rand(10) * count AS bigint) AS count,
           CAST(rand(10) * 10000 * count AS double) AS amount
    FROM loan_by_state_delta
""")

# Show original DataFrame's schema
original_loans.printSchema()

"""
root
 |-- addr_state: string (nullable = true)
 |-- count: integer (nullable = true)
"""

# Show new DataFrame's schema
loans.printSchema()

"""
root
 |-- addr_state: string (nullable = true)
 |-- count: integer (nullable = true)
 |-- amount: double (nullable = true)    # new column
"""

# Attempt to append new DataFrame (with new column) to existing table
loans.write.format("delta") \
    .mode("append") \
    .save(DELTALAKE_PATH)

"" Returns:

A schema mismatch detected when writing to the Delta table.

To enable schema migration, please set:'.option ("mergeSchema", "true")\'

Table schema:root-- addr_state: string (nullable = true)-- count: long (nullable = true)

Data schema:root-- addr_state: string (nullable = true)-count: long (nullable = true)-amount: double (nullable = true)

If Table ACLs are enabled, these options will be ignored. Please use the ALTER TABLE command for changing the schema.

"

Rather than automatically adding the new column, Delta Lake enforces the schema and stops the write. To help identify which column is causing the mismatch, Spark prints both schemas in the stack trace for comparison.
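If you want to handle the failure programmatically rather than just read the stack trace, here is a minimal sketch, assuming the loans DataFrame and DELTALAKE_PATH from the example above (the exact exception class may vary by Spark version):

# A minimal sketch: catch the schema-mismatch failure on append
from pyspark.sql.utils import AnalysisException

try:
    loans.write.format("delta").mode("append").save(DELTALAKE_PATH)
except AnalysisException as e:
    # The message contains both the table schema and the data schema,
    # which makes it easy to spot the offending column (`amount`)
    print(str(e))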

What is schema validation useful for?

Because this check is strict, tables that pass it can be fed straight into production. Common use cases include:

Machine learning algorithm

BI dashboard

Data analysis and visualization tools

Any production system that requires highly structured, strongly typed semantic schema

Preventing sparse data

Mandatory schema validation can make writing Spark jobs feel more constrained, and a job that hits an incompatible schema simply fails, which can be a headache.

However, if the schema were never checked, new columns could be added at any time and the table would grow sparser and sparser; in practice that sparsity is itself a performance cost.

So schema validation also keeps the data from becoming more and more sparse.

What is schema evolution?

Schema evolution simply means that a table's schema changes as the data changes. Most commonly it is used to automatically adjust the schema to include one or more new columns during an append or overwrite operation.

How does schema evolution work?

The configuration is simple: enable schema evolution by adding .option('mergeSchema', 'true') to your .write or .writeStream Spark command.

# Add the mergeSchema option
loans.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(DELTALAKE_SILVER_PATH)

Then execute the following SQL expression:

# Create a plot with the new column to confirm the write was successful
%sql
SELECT addr_state, sum(`amount`) AS amount
FROM loan_by_state_delta
GROUP BY addr_state
ORDER BY sum(`amount`) DESC
LIMIT 10

The query results can then be plotted to confirm that the new column was written.

With mergeSchema set to true, any columns that exist in the DataFrame but not in the target table are automatically added to the end of the schema as part of the write transaction. Nested fields can also be added, and they are appended to the end of their respective struct columns.
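Here is a minimal sketch of the nested case, assuming a hypothetical Delta table at NESTED_PATH whose borrower column is a struct with a single field name (string); neither the path nor the column names come from the original example:

# A minimal sketch of nested-field merging with mergeSchema, assuming a
# hypothetical table at NESTED_PATH whose borrower struct currently has
# only a `name` field
from pyspark.sql import Row

NESTED_PATH = "/tmp/delta/loans_nested"   # assumed path

# The borrower struct carries an extra field, credit_score
new_rows = spark.createDataFrame([
    Row(addr_state="CA", borrower=Row(name="Alice", credit_score=720)),
])

new_rows.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(NESTED_PATH)

# borrower is now struct<name: string, credit_score: long>
spark.read.format("delta").load(NESTED_PATH).printSchema()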

Data engineers and scientists can use this option to add new columns (perhaps newly tracked metrics, or columns of this month's sales figures) to their existing machine learning production tables without breaking existing models that rely on old columns.

The following types of schema change are eligible for schema evolution during a table append or overwrite:

Add a new column (this is the most common case)

Changing a column's data type from NullType -> any other type, or upcasting ByteType -> ShortType -> IntegerType

Other changes, which are not eligible for schema evolution, require that both the schema and the data be overwritten by adding .option("overwriteSchema", "true"). For example, if the column "Foo" was originally an integer data type and the new schema makes it a string data type, all of the Parquet (data) files need to be rewritten. These changes include (see the sketch after this list):

Delete column

Change the data type of an existing column

Rename column names that vary only by case (for example, "Foo" and "foo")
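Here is a minimal sketch of such a rewrite, assuming the DELTALAKE_PATH table used earlier; the specific changes (casting count to string and dropping amount) are only illustrative:

# A minimal sketch of a full schema overwrite with overwriteSchema
from pyspark.sql.functions import col

# Change an existing column's type and drop a column: both changes are
# out of scope for mergeSchema, so the whole table (schema and data)
# is overwritten
reshaped = spark.read.format("delta").load(DELTALAKE_PATH) \
    .withColumn("count", col("count").cast("string")) \
    .drop("amount")

reshaped.write.format("delta") \
    .option("overwriteSchema", "true") \
    .mode("overwrite") \
    .save(DELTALAKE_PATH)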

Finally, with Spark 3.0, explicit DDL via ALTER TABLE is fully supported, allowing users to perform the following operations on a table's schema (see the sketch after this list):

Add column

Change column comments

Set table properties that define table behavior, such as setting the retention period for transaction logs
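Here is a minimal sketch of the explicit DDL route, assuming the loan_by_state_delta table is registered in the metastore; the exact statements supported depend on the Spark and Delta Lake versions in use:

# Add a column
spark.sql("ALTER TABLE loan_by_state_delta ADD COLUMNS (amount double)")

# Change a column comment
spark.sql("ALTER TABLE loan_by_state_delta "
          "CHANGE COLUMN count count bigint COMMENT 'number of loans'")

# Set a table property, e.g. the transaction log retention period
spark.sql("ALTER TABLE loan_by_state_delta "
          "SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days')")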

What is schema evolution useful for?

You can use schema evolution whenever you intend to change a table's schema. It is the easiest way to migrate a schema, because it adds the correct column names and data types automatically, without you having to declare them explicitly.

Schema validation rejects any new columns or other schema changes that are incompatible with the table. By setting and upholding these high standards, analysts and engineers can trust that their data has the highest level of integrity and can reason about it clearly, enabling them to make better business decisions.

Schema evolution, on the other hand, complements schema validation by letting intended schema changes happen automatically. After all, adding a column should not be hard.

Schema validation and schema evolution are two sides of the same coin; used together, these features make it easier than ever to keep the noise out of your tables.
