Today I would like to share some knowledge about how to install and configure Alluxio for Apache Hudi. The content is detailed and the logic is clear. Most readers are probably not very familiar with this topic, so I am sharing this article for reference; I hope you get something out of it.
1. What is Alluxio?
Alluxio builds a bridge between data-driven applications and storage systems, moving data from the storage tier closer to the applications and making it easier to access. It also lets applications connect to many storage systems through a common interface. Alluxio's memory-first tiered architecture makes data access orders of magnitude faster than existing solutions.
For user applications and computing frameworks, Alluxio provides fast storage and enables data sharing and locality between jobs. When the data is local, Alluxio can serve it at memory speed; when the data is in Alluxio, it can serve it at the speed of the compute cluster's network. Data is read from the underlying storage system only on first access. For better performance, it is recommended to deploy Alluxio alongside the compute cluster.
For storage systems, Alluxio bridges the gap between big data applications and traditional storage systems and expands the set of workloads that can use the data. When multiple data sources are mounted at the same time, Alluxio acts as a unified layer over any number of different data sources.
Alluxio can be divided into three components: masters, workers, and clients. A typical deployment consists of a primary master, several standby masters, and multiple workers. Clients are used to communicate with the Alluxio servers, for example from Spark or MapReduce jobs or the Alluxio command line.
2. What is Apache Hudi?
Apache Hudi lets you store large amounts of data on top of Hadoop-compatible storage, and it provides two primitives that enable stream processing on the data lake in addition to classic batch processing. The two primitives are (a short sketch follows the list):
Update/Delete records: Hudi uses fine-grained file/record-level indexes to support updating and deleting records, while also providing transactional guarantees for writes. Queries process the last committed snapshot and produce results based on it.
Change streams: Hudi provides first-class support for obtaining data changes: you can get an incremental stream of all records that were updated/inserted/deleted in a given table from a given point in time, unlocking new kinds of queries.
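To make the two primitives concrete, here is a minimal sketch using the Spark DataSource API. It assumes a running SparkSession named spark with a Hudi Spark bundle on the classpath; the paths, table name, field names, and instant time are illustrative assumptions rather than values from this article, and the option keys follow recent Hudi releases.

import org.apache.spark.sql.SaveMode

// Hypothetical input and base path, for illustration only.
val upserts = spark.read.json("hdfs:///staging/trips.json")
val basePath = "hdfs:///data/hudi_trips"

// Primitive 1: upsert records, with record-level indexing and transactional writes.
upserts.write.format("org.apache.hudi").
  option("hoodie.table.name", "hudi_trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  mode(SaveMode.Append).
  save(basePath)

// Primitive 2: change stream - read only records written after a given instant.
val changes = spark.read.format("org.apache.hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20200531000000").
  load(basePath)
changes.show()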
3. Steps
3.1 Environment preparation
Build the Alluxio environment by following the installation guide on the official website.
3.2 Execution
Add the following configuration to the core-site.xml file that Hudi loads:
<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>
Add the following dependency to the project's pom.xml:
<dependency>
  <groupId>org.alluxio</groupId>
  <artifactId>alluxio-shaded-client</artifactId>
  <version>2.2.1</version>
</dependency>
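With the configuration and the shaded client in place, one way to check that the alluxio:// scheme resolves is a quick listing through the Hadoop FileSystem API. This is a minimal sketch of my own, not a step from the original article; alluxio://localhost:19998 is the default Alluxio master RPC address and should be replaced with your own master host and port.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Picks up core-site.xml (and fs.alluxio.impl) from the Hadoop classpath.
val conf = new Configuration()
val fs = FileSystem.get(new URI("alluxio://localhost:19998/"), conf)
fs.listStatus(new Path("/")).foreach(status => println(status.getPath))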
Users can either place the jar where Spark can load it or pass it explicitly:
--jars alluxio-shaded-client-2.2.1.jar
At this point, all that remains is to write the data to Alluxio. When using DeltaStreamer, the following configuration is required:
--target-base-path alluxio://...
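If you write through the Spark DataSource API instead of DeltaStreamer, pointing the base path at an alluxio:// URI is the only change. A minimal sketch, assuming a SparkSession named spark and a hypothetical master address, source, table name, and fields:

import org.apache.spark.sql.SaveMode

val df = spark.read.json("hdfs:///staging/trips.json")  // hypothetical source DataFrame
df.write.format("org.apache.hudi").
  option("hoodie.table.name", "hudi_trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  mode(SaveMode.Append).
  save("alluxio://localhost:19998/data/hudi_trips")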
Completing the above steps finishes the work of writing Hudi data to Alluxio. However, at this point the data has not actually been loaded from HDFS into Alluxio yet, so you need to run a query once. The Hudi views can be queried in different ways.
You can query with Hive SQL: inspect the Hive table structure and you will find that the location already points to Alluxio. You can also query with Spark SQL:
spark.read.format("org.apache.hudi").option(xxx).load("alluxio://...")
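For example, a minimal Spark SQL sketch, reusing the hypothetical alluxio://localhost:19998/data/hudi_trips base path from above (some Hudi versions require a partition glob such as basePath + "/*/*"):

val tripsDF = spark.read.format("org.apache.hudi").
  load("alluxio://localhost:19998/data/hudi_trips")
tripsDF.createOrReplaceTempView("hudi_trips")
// The first query pulls the data from HDFS into Alluxio.
spark.sql("SELECT count(*) FROM hudi_trips").show()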
3.3 Validation
Before any query is run, the data has not been loaded into Alluxio and the in-Alluxio percentage is 0%; after a query, the data is loaded from HDFS into Alluxio and the in-Alluxio percentage becomes greater than 0%.
That is all the content of the article "How to install and configure Alluxio for Apache Hudi". Thank you for reading! I hope you gained something from it.