This article introduces how to use EMR Spark Relational Cache to synchronize data across clusters. The walkthrough is detailed; interested readers can use it as a reference, and I hope it is helpful to you.
Using Relational Cache to Accelerate EMR Spark Data Analysis
Background
Relational Cache is an important feature of EMR Spark. Through pre-organization and pre-computation, it provides functionality similar to the materialized views of traditional data warehouses and thereby accelerates data analysis (an illustrative example follows below). Beyond speeding up data processing, Relational Cache can also be applied to many other scenarios. This article focuses on how to use Relational Cache to synchronize data tables across clusters.
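As a hedged illustration of that materialized-view style of use (the sales table, its columns, and the Parquet format here are hypothetical; the CACHE TABLE syntax follows the cross-cluster example later in this article):

CACHE TABLE sales_by_region
REFRESH ON COMMIT
USING PARQUET
AS SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region

Queries that aggregate sales by region could then read the pre-computed cache instead of recomputing the aggregation each time.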
Managing all data through a unified data lake is a goal for many companies, but in practice, because of multiple data centers, different network regions, and even different departments, it is inevitable that many separate big data clusters exist, and the need to synchronize data between clusters is widespread. In addition, cluster migration, and keeping old and new data in sync during a move, is a common problem. Data synchronization is usually a painful process: developing migration tools, handling incremental data, coordinating reads and writes during synchronization, comparing data afterwards, and so on all require a great deal of custom development and manual intervention. With Relational Cache, users can simplify this work and achieve cross-cluster data synchronization at a much lower cost.
Let's use a concrete example to show how to achieve cross-cluster data synchronization through EMR Spark Relational Cache.
Using Relational Cache to Synchronize Data
Suppose we have two clusters, A and B, and we need to synchronize the data of the activity_log table from cluster A to cluster B. During the whole process, new data will continue to be inserted into the activity_log table. The statement that creates activity_log in cluster A is as follows:
CREATE TABLE activity_log (user_id STRING, act_type STRING, module_id INT, d_year INT)
USING JSON
PARTITIONED BY (d_year)
Insert two records to represent the historical data:
INSERT INTO TABLE activity_log PARTITION (d_year = 2017) VALUES ("user_001", "NOTIFICATION", 10), ("user_101", "SCAN", 2)
Create a Relational Cache for the activity_log table:
CACHE TABLE activity_log_sync
REFRESH ON COMMIT
DISABLE REWRITE
USING JSON
PARTITIONED BY (d_year)
LOCATION "hdfs://192.168.1.36:9000/user/hive/data/activity_log"
AS SELECT user_id, act_type, module_id, d_year FROM activity_log
REFRESH ON COMMIT means the cache data is updated automatically whenever the source table data is updated. With LOCATION we can specify where the cache data is stored; by pointing the cache at the HDFS of cluster B, the data from cluster A is synchronized to cluster B. In addition, the cache's column and partition definitions are kept consistent with those of the source table.
In cluster B, we also create an activity_log table with the following statement:
CREATE TABLE activity_log (user_id STRING, act_type STRING, module_id INT, d_year INT)
USING JSON
PARTITIONED BY (d_year)
LOCATION "hdfs:///user/hive/data/activity_log"
Execute MSCK REPAIR TABLE activity_log to repair the table's partition metadata, then run a query: in cluster B you can already see the two rows that were inserted into the table in cluster A, as in the sketch below.
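A minimal sketch of this check in cluster B (the SELECT statement is illustrative and assumed, not part of the original walkthrough):

MSCK REPAIR TABLE activity_log
SELECT * FROM activity_log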
Continue to insert new data in cluster A:
INSERT INTO TABLE activity_log PARTITION (d_year = 2018) VALUES ("user_011", "SUBSCRIBE", 24)
Then execute MSCK REPAIR TABLE activity_log in cluster B and query the activity_log table again: the new data has been automatically synchronized into cluster B's activity_log table, as in the sketch below. For partitioned tables, when data for a new partition is added, Relational Cache synchronizes only the new partition incrementally rather than resynchronizing all of the data.
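A hedged sketch of the same check after the new insert; the WHERE filter on the new partition is illustrative:

MSCK REPAIR TABLE activity_log
SELECT * FROM activity_log WHERE d_year = 2018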
If new data in cluster A's activity_log is not inserted through Spark but is imported into the Hive table externally, for example through Hive or some other tool, the user can trigger synchronization manually or from a script with the REFRESH TABLE activity_log_sync statement. If the new data is bulk-imported by partition, a statement such as REFRESH TABLE activity_log_sync WITH TABLE activity_log PARTITION (d_year=2018) synchronizes just that partition incrementally. Both statements are written out below.
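For reference, the two refresh statements mentioned above, as they would be run in cluster A (the partition value follows this article's d_year=2018 example):

REFRESH TABLE activity_log_sync
REFRESH TABLE activity_log_sync WITH TABLE activity_log PARTITION (d_year=2018)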
Relational Cache keeps the data of the activity_log table consistent between cluster A and cluster B, so downstream tasks or applications that depend on the activity_log table can switch to cluster B at any time. Users can also pause the applications or services that write to the activity_log table in cluster A, point them at the activity_log table in cluster B, and restart them, thereby completing the migration of the upper-layer applications or services. Once everything has been switched over, clean up activity_log and activity_log_sync in cluster A; a possible cleanup is sketched below.
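A hedged sketch of that cleanup in cluster A; UNCACHE TABLE is assumed here to be the counterpart of CACHE TABLE in EMR Spark, so verify against your EMR documentation before running:

-- assumption: UNCACHE TABLE removes the relational cache created earlier
UNCACHE TABLE activity_log_sync
DROP TABLE activity_log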
Synchronizing data tables between different big data clusters with Relational Cache is simple and convenient. Beyond that, Relational Cache can be applied to many other scenarios, such as building OLAP platforms with sub-second response times, interactive BI and dashboard applications, and accelerating ETL processes.
That concludes this look at how to use EMR Spark Relational Cache to synchronize data across clusters. I hope the above content is helpful to you and that you can learn more from it. If you think the article is good, feel free to share it so more people can see it.