How to Integrate Apache Hudi with Vertica

This article explains how to integrate Vertica with Apache Hudi. The method described is simple, fast, and practical, so readers who are interested can follow along and try it out.

1. Abstract

This article demonstrates the use of external tables to integrate Vertica and Apache Hudi. In the demonstration, we used Apache Hudi on Spark to ingest the data into S3 and used the Vertica external table to access the data.

2. Apache Hudi introduction

Apache Hudi, short for Hadoop Upserts Deletes and Incrementals, is an open source framework often used as a change data capture (CDC) tool: it records every transaction on a table along a timeline. Hudi provides ACID transactions, scalable metadata handling, and unified streaming and batch data processing.

The overall flow is as follows: Hudi running on Apache Spark processes and writes the data to S3, and a Vertica external table reads the data changes from S3.

3. Environment preparation

Apache Spark environment. Tested with a 4-node cluster (1 master and 3 workers). Follow the instructions in "Setting up Apache Spark on a multi-node cluster" to install the Spark cluster environment, then start the multi-node cluster.

Vertica analytical database. Tested with Vertica Enterprise 11.0.0.

AWS S3 or S3 compatible object storage. Tested using MinIO as the S3 bucket.

The following jar files are required. Copy them to any desired location on the Spark machines and place them in /opt/spark/jars.

hadoop-aws-2.7.3.jar

aws-java-sdk-1.7.4.jar

Run the following commands in the Vertica database to set the S3 parameters for accessing the bucket:

SELECT SET_CONFIG_PARAMETER('AWSAuth', 'accesskey:secretkey');
SELECT SET_CONFIG_PARAMETER('AWSRegion', 'us-east-1');
SELECT SET_CONFIG_PARAMETER('AWSEndpoint', ':9000');
SELECT SET_CONFIG_PARAMETER('AWSEnableHttps', '0');

The endpoint may be different, depending on the S3 object store selected for the S3 bucket location.
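For example, a local MinIO service and AWS S3 would use different endpoint values; the host names below are only illustrative:

-- MinIO or another S3-compatible object store
SELECT SET_CONFIG_PARAMETER('AWSEndpoint', 'minio.example.local:9000');
-- AWS S3 in the us-east-1 region
SELECT SET_CONFIG_PARAMETER('AWSEndpoint', 's3.us-east-1.amazonaws.com');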

4. Vertica and Apache Hudi integration

To integrate Vertica with Apache Hudi, first integrate Apache Spark with Apache Hudi, configure the jars, and set up the connection to AWS S3. Next, connect Vertica to Apache Hudi. Then perform Insert, Append, Update, and other operations on the S3 bucket.

Follow the steps in the following sections to write the data and make it readable from Vertica.

Configure Apache Hudi and AWS S3 on Apache Spark

Configure Vertica and Apache Hudi integration

4.1 Configure Apache Hudi and AWS S3 on Apache Spark

Run the following command on the Apache Spark machine.

This starts spark-shell and downloads the Apache Hudi package and the jar files it requires:

/opt/spark/bin/spark-shell \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1

Import the packages required for reading from and writing to Hudi:

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

Use the following commands to configure the MinIO access key, secret key, and endpoint, and adjust the S3A signing algorithm and path-style access as needed.

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "*")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "*")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http://XXXX:9000")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm", "S3SignerType")

Create variables to store the table name and S3 path of MinIO.

val tableName = "Trips"
val basePath = "s3a://apachehudi/vertica/"

Prepare the data: use Scala to create sample data in Apache Spark.

val df = Seq(
  ("aaa", "R1", "D1", 10, "US", "20211001"),
  ("bbb", "R2", "D2", 20, "Europe", "20211002"),
  ("ccc", "R3", "d3", 30, "India", "20211003"),
  ("ddd", "R4", "D4", 40, "Europe", "20211004"),
  ("eee", "R5", "D5", 50, "India", "20211005")
).toDF("uuid", "rider", "driver", "fare", "partitionpath", "ts")

Write data to AWS S3 and verify this data

df.write.format("org.apache.hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)

Run the following command using Scala to verify that the data is read correctly from the S3 bucket.

spark.read.format("hudi").load(basePath).createOrReplaceTempView("dta")
spark.sql("select _hoodie_commit_time, uuid, rider, driver, fare, ts, partitionpath from dta order by uuid").show()

4.2 Configure Vertica and Apache Hudi integration

Create an external table in Vertica that contains the data from the Hudi table on S3. In this example, we create the "Trips" table.

CREATE EXTERNAL TABLE Trips
(
    _hoodie_commit_time TimestampTz,
    uuid varchar,
    rider varchar,
    driver varchar,
    fare int,
    ts varchar,
    partitionpath varchar
)
AS COPY FROM 's3a://apachehudi/parquet/vertica/*/*.parquet' PARQUET;

Run a query against the external table to verify that it can be read.
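A simple SELECT like the following (a minimal sketch against the Trips table defined above) confirms that Vertica can read the Hudi data in S3:

SELECT * FROM Trips ORDER BY uuid;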

4.3 Viewing changed data from Vertica

The following sections contain examples of actions performed to view changed data in Vertica.

4.3.1 Write data

In this example, we use Scala in Apache Spark to append some additional data:

val df2 = Seq(("fff", "R6", "d6", 50, "India", "20211005")).toDF("uuid", "rider", "driver", "fare", "partitionpath", "ts")

Run the following command to append this data to the Hudi table on S3:

df2.write.format("org.apache.hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)

4.3.2 Update data

In this example, we update records in the Hudi table. The updated data needs to be written to trigger the update:

val df3 = Seq(
  ("aaa", "R1", "D1", 100, "US", "20211001"),
  ("eee", "R5", "D5", 100, "India", "20211001")
).toDF("uuid", "rider", "driver", "fare", "partitionpath", "ts")

Run the following command to update the data in the Hudi table on S3:

df3.write.format("org.apache.hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)

The appended and updated rows are now visible both from Spark, by re-running the spark.sql query above, and from Vertica, by querying the external table.
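For example, a query like the following (a minimal sketch against the Trips external table created in section 4.2) shows the data that Vertica now reads from S3:

SELECT uuid, rider, driver, fare, ts, partitionpath FROM Trips ORDER BY uuid;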

4.3.3 Create and view historical snapshots of data

Execute the following Spark command, pointing to a specific timestamp:

val dd = spark.read.format("hudi").option("as.of.instant", "20211007092600").load(basePath)

Write data to parquet in S3 using the following command:

dd.write.parquet("s3a://apachehudi/parquet/p2")

In this example, we are reading a snapshot of the Hudi table as of the instant "20211007092600".

dd.show

From Vertica, query the snapshot by creating an external table on these Parquet files.
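A minimal sketch, reusing the column definitions from the Trips table above and pointing at the snapshot path written in the previous step (the table name Trips_Snapshot is illustrative):

CREATE EXTERNAL TABLE Trips_Snapshot
(
    _hoodie_commit_time TimestampTz,
    uuid varchar,
    rider varchar,
    driver varchar,
    fare int,
    ts varchar,
    partitionpath varchar
)
AS COPY FROM 's3a://apachehudi/parquet/p2/*.parquet' PARQUET;

SELECT * FROM Trips_Snapshot ORDER BY uuid;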

At this point, you should have a deeper understanding of how to integrate Apache Hudi with Vertica, so why not try it out in practice?
