An Example Analysis of Lake House Integration on AWS Using Data Lake Formats

2025-01-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Many newcomers are unsure how data lake table formats fit into the AWS lake house architecture. This article walks through an example: querying Apache Hudi tables stored in Amazon S3 from Amazon Redshift.

Readers have long asked whether Amazon Redshift (the data warehouse) can query Hudi tables. Now it can.

You can now use Amazon Redshift to query Apache Hudi and Delta Lake tables in an Amazon S3 data lake. Amazon Redshift Spectrum, a feature of Amazon Redshift, lets you query the S3 data lake directly from a Redshift cluster without first loading the data into it, which shortens the time it takes to gain insight from the data.

Redshift Spectrum supports the lake house architecture: it can query data across Redshift, the data lake, and operational databases without ETL or data loading. It supports open data formats such as Parquet, ORC, JSON, and CSV, and it can query complex nested data types such as struct, array, and map.

Redshift Spectrum can read the latest snapshot of Apache Hudi version 0.5.2 Copy-on-Write (CoW) tables. It can also read the latest Delta Lake version 0.5.0 tables through their manifest files.

To query Apache Hudi data in Copy-on-Write (CoW) format, use a Redshift Spectrum external table. A Hudi CoW table is a collection of Apache Parquet files stored in Amazon S3. For more information, see the Copy-on-Write table description in the open source Apache Hudi documentation.

When you create an external table that references Hudi CoW data, map each column in the external table to a column in the Hudi data. The mapping is done by column.

Data definition language (DDL) statements for partitioned and unpartitioned Hudi tables are similar to those for other Apache Parquet tables. For Hudi tables, define INPUTFORMAT as org.apache.hudi.hadoop.HoodieParquetInputFormat. The LOCATION parameter must point to the base folder of the Hudi table, the one that contains the .hoodie folder, which is required to establish the Hudi commit timeline. In some cases, a SELECT on a Hudi table may fail with the message "No valid Hudi commit timeline found". If so, check that the .hoodie folder is in the correct location and contains a valid Hudi commit timeline.

Note that the Apache Hudi format is supported only when you use the AWS Glue Data Catalog. It is not supported when you use an Apache Hive metastore as the external catalog.
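
Because the catalog requirement applies to the external schema that holds the table, it can help to see how such a schema might be declared. The following is a minimal sketch, not taken from the original article: the schema name spectrum_hudi, the Glue database hudi_db, and the IAM role ARN are all hypothetical placeholders.

```sql
-- Hypothetical names: spectrum_hudi, hudi_db, and the role ARN are placeholders.
-- FROM DATA CATALOG registers the schema against the AWS Glue Data Catalog
-- (as opposed to FROM HIVE METASTORE, which Hudi tables do not support).
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_hudi
FROM DATA CATALOG
DATABASE 'hudi_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```

The role must allow Redshift to read both the Glue Data Catalog and the S3 location of the Hudi table.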

Use the following command to define a non-partitioned table:

CREATE EXTERNAL TABLE tbl_name (columns)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://s3-bucket/prefix'
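
As an illustration of the template for a non-partitioned table, here is a filled-in sketch. It assumes an external schema named spectrum_hudi already exists in the AWS Glue Data Catalog. The table name, column list, and S3 path are hypothetical and would need to match your own Hudi CoW table.

```sql
-- Hypothetical example: schema, table, columns, and S3 path are placeholders.
-- LOCATION is the Hudi table's base folder, i.e. the folder containing .hoodie.
CREATE EXTERNAL TABLE spectrum_hudi.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    status      VARCHAR(32),
    amount      DOUBLE PRECISION
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://my-bucket/hudi/orders/';

-- Once defined, the table is queried like any other Redshift Spectrum table.
SELECT order_id, status, amount
FROM spectrum_hudi.orders
LIMIT 10;
```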

Use the following command to define a partitioned table:

CREATE EXTERNAL TABLE tbl_name (columns)
PARTITIONED BY (pcolumn1 pcolumn1-type [,...])
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://s3-bucket/prefix'

To add a partition to a partitioned Hudi table, use the ALTER TABLE ADD PARTITION command, with the LOCATION parameter pointing to the Amazon S3 subfolder that holds the partition's files.

Add a partition using the following command:

ALTER TABLE tbl_name
ADD IF NOT EXISTS PARTITION (pcolumn1=pvalue1 [,...])
LOCATION 's3://s3-bucket/prefix/partition-path'
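
To see the partitioned workflow in one place, the sketch below defines a partitioned Hudi table, registers one partition, and queries it. It assumes an external schema named spectrum_hudi already exists in the AWS Glue Data Catalog. The table name, columns, partition value, and S3 paths are hypothetical.

```sql
-- Hypothetical example: all names, values, and S3 paths are placeholders.
-- The partition column is declared in PARTITIONED BY, not in the column list.
CREATE EXTERNAL TABLE spectrum_hudi.orders_by_day (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DOUBLE PRECISION
)
PARTITIONED BY (order_date VARCHAR(10))
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://my-bucket/hudi/orders_by_day/';

-- Register one partition; LOCATION is the S3 subfolder holding its files.
ALTER TABLE spectrum_hudi.orders_by_day
ADD IF NOT EXISTS PARTITION (order_date='2020-09-01')
LOCATION 's3://my-bucket/hudi/orders_by_day/2020-09-01/';

-- Filtering on the partition column limits the scan to that subfolder.
SELECT COUNT(*) AS orders, SUM(amount) AS revenue
FROM spectrum_hudi.orders_by_day
WHERE order_date = '2020-09-01';
```

Partitions that have not been registered with ALTER TABLE are not visible to queries, so each new partition written by Hudi needs its own ADD PARTITION call.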
