This article explains how to migrate HDFS data and Hive metadata to an EMR cluster, and covers the difficulties you are likely to run into along the way.
Prerequisites
Create an EMR cluster on demand.
Migrating HDFS data
The migration relies mainly on DistCp. The core tasks are establishing network connectivity between the clusters, confirming the HDFS parameters, and deciding what to migrate, how fast, and when.
Network
The self-built cluster and the EMR nodes must be able to reach each other over the network. Within the same VPC it is enough to use the same security group; with different security groups, you must configure security group interconnection.
If the self-built cluster is on a classic network and the EMR cluster is in a VPC, you need to set up ClassicLink for network access; see the documentation, and consult ECS customer service for details.
After configuration, ssh from a node of the new cluster to a node of the old cluster to confirm connectivity. If a distcp run throws an exception saying that node xx cannot connect to node xx, connectivity is still missing and the settings need further adjustment. A quick check might look like the sketch below.
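A minimal connectivity check, assuming 10.0.0.10 is an old cluster node and 8020 is its NameNode RPC port (both values are illustrative):

```bash
# Run from a node of the new (EMR) cluster.
ssh root@10.0.0.10 hostname                  # basic reachability via ssh
nc -z 10.0.0.10 8020 && echo "NameNode RPC port reachable"
hdfs dfs -ls hdfs://10.0.0.10:8020/          # can the new cluster read the old HDFS?
```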
Confirming HDFS permission configuration
HDFS has permission settings. Determine whether the old cluster has ACL rules and whether they should be synchronized, check that dfs.permissions.enabled and dfs.namenode.acls.enabled are configured consistently on the new and old clusters, and adjust them according to your actual needs.
If there are ACL rules to synchronize, include the permission flags (p and a) in distcp's -p parameter. If a distcp run reports that cluster xx does not support ACLs, that cluster is not configured for them: if it is the new cluster, change the configuration and restart the NameNode; if it is the old cluster, it simply has no ACL settings at all, and nothing needs to be synchronized.
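A quick way to compare the relevant settings on both clusters, using standard HDFS commands (run on each cluster; the path is an example):

```bash
# Print the effective values on this cluster
hdfs getconf -confKey dfs.permissions.enabled
hdfs getconf -confKey dfs.namenode.acls.enabled
# Inspect any existing ACLs on a directory of the old cluster
hdfs dfs -getfacl /user/hive/warehouse
```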
Synchronization parameters
Synchronization is generally run on the new cluster, so the synchronization jobs consume the new cluster's resources and have less impact on the old one.
The distcp parameters are detailed below; the general command format is as follows:
hadoop distcp -Ddfs.replication=3 -pbugpcax -m 1000 -bandwidth 30 hdfs://oldclusterip:8020/user/hive/warehouse /user/hive/
Notes:
hdfs://oldclusterip:8020 is the old cluster's NameNode address; with multiple NameNodes, use the one that is currently active.
-Ddfs.replication=3 sets the replication factor to 3; to keep the original replication factor, add r to the -p flags, e.g. -prbugpcax. If permissions and ACLs should not be synchronized, remove p and a from the -p flags.
-m specifies the number of maps, which depends on cluster size and data volume; for example, if the cluster has 2000 CPU cores, you can specify 2000 maps. -bandwidth specifies the synchronization speed of a single map (an approximate value, enforced by throttling the replica copy speed).
The overall migration speed is affected by inter-cluster bandwidth and cluster size, and the more files there are, the longer the checksum phase takes. When migrating a large amount of data, synchronize a few directories first to estimate the total time. If synchronization can only run during a specific time window, split the tree into several smaller directories and synchronize them in turn, as sketched after these notes.
If the old cluster is still taking writes during the migration, use -update to synchronize the changes afterwards.
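A sketch of both techniques, with hypothetical warehouse database names. Note one documented DistCp subtlety: with -update, the contents of the source directory are copied into the target, so the target below is the full /user/hive/warehouse path rather than /user/hive/:

```bash
# Initial copy, one warehouse database at a time, to keep each run bounded
for db in sales.db logs.db ads.db; do
  hadoop distcp -pbugpcax -m 200 -bandwidth 30 \
    hdfs://oldclusterip:8020/user/hive/warehouse/$db \
    /user/hive/warehouse/$db
done

# Later incremental pass to pick up files written in the meantime
hadoop distcp -update -pbugpcax -m 1000 -bandwidth 30 \
  hdfs://oldclusterip:8020/user/hive/warehouse \
  /user/hive/warehouse
```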
A full synchronization generally still requires a short business pause at cut-over, either to enable double-write/double-compute or to switch the business directly over to the new cluster.
Hive metadata synchronization
Synchronizing Hive metadata is essentially synchronizing the Hive meta DB, which is usually a MySQL database. Compared with ordinary MySQL data synchronization, the points to watch are location changes and Hive version alignment.
Meta DB settings
When there is a lot of metadata, it is generally recommended to use RDS as the meta DB. The self-built cluster may already have an RDS database, but because the location differs, a new database is usually required. The best practice is to create the new RDS database in the same availability zone and VPC security group as the EMR cluster.
Log in to the master node of the new cluster (for an HA cluster, both masters), edit /usr/local/emr/emr-agent/run/meta_db_info.json, set use_local_meta_db to false, and replace the meta database's connection address, username, and password with the new RDS information. Then restart the metaserver of the Hive component. Beforehand, it is worth confirming that the master node can actually reach the new RDS instance, as sketched below.
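A quick connectivity check from the master node, with an illustrative RDS hostname and user (substitute your own):

```bash
# Should print 1 if the master node can reach and authenticate to the RDS instance
mysql -h rm-example.mysql.rds.aliyuncs.com -u hive -p -e 'select 1;'
```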
Initialize the meta table information:
```
cd /usr/lib/hive-current/bin
./schematool -initSchema -dbType mysql
```
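Afterwards, schematool can also report the connection details and schema version it detects, which is a quick sanity check (same working directory assumed):

```bash
# Prints the metastore connection info and the schema version it finds
./schematool -dbType mysql -info
```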
Location
Hive tables, partitions, and so on carry location information prefixed with the DFS nameservice, such as hdfs://mycluster:8020/, while the EMR cluster's nameservice prefix is uniformly emr-cluster, so the locations must be corrected. The best way to fix this is to export the data with mysqldump --databases hivemeta --single-transaction -u root -p > hive_databases.sql, replace hdfs://oldcluster:8020/ with the new prefix using sed, and import the result into the new DB with mysql hivemeta -p < hive_databases.sql.
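The three steps put together, assuming the old prefix is hdfs://oldcluster:8020/ and the new one is hdfs://emr-cluster/ (adjust both to your clusters):

```bash
# 1. Export the metastore database from the old meta DB
mysqldump --databases hivemeta --single-transaction -u root -p > hive_databases.sql
# 2. Rewrite the nameservice prefix in every location string
sed -i 's|hdfs://oldcluster:8020/|hdfs://emr-cluster/|g' hive_databases.sql
# 3. Import into the new RDS meta DB
mysql hivemeta -p < hive_databases.sql
```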
Version alignment
The Hive version on EMR is generally the latest stable community release, while the self-built cluster's Hive may be older, so the imported old metadata may not be usable directly. In that case you need to run the Hive metastore upgrade scripts that ship with Hive (under scripts/metastore/upgrade/mysql/). For example, upgrading Hive from 1.2 to 2.3.0 requires running upgrade-1.2.0-to-2.0.0.mysql.sql, upgrade-2.0.0-to-2.1.0.mysql.sql, upgrade-2.1.0-to-2.2.0.mysql.sql, and upgrade-2.2.0-to-2.3.0.mysql.sql in sequence. The scripts mainly create tables, add fields, and change content; "table already exists" and "field already exists" exceptions can be ignored.
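A sketch of running the scripts in order against the meta DB. The directory path is illustrative; use the scripts from the Hive release matching your target version, and run from inside that directory because the scripts include other files by relative path:

```bash
cd /usr/lib/hive-current/scripts/metastore/upgrade/mysql
for f in upgrade-1.2.0-to-2.0.0.mysql.sql \
         upgrade-2.0.0-to-2.1.0.mysql.sql \
         upgrade-2.1.0-to-2.2.0.mysql.sql \
         upgrade-2.2.0-to-2.3.0.mysql.sql; do
  # --force continues past "already exists" errors, per the note above
  mysql --force hivemeta -u root -p < "$f"
done
```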
Verification
After all the metadata is corrected, restart the metaserver. From the hive command line, query the databases and tables, then query some data to verify correctness.
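A few spot checks from the Hive CLI, with placeholder database and table names (substitute ones you migrated):

```bash
hive -e 'show databases;'
hive -e 'use some_db; show tables;'                     # some_db: a migrated database
hive -e 'select * from some_db.some_table limit 10;'    # sample a few rows
hive -e 'select count(*) from some_db.some_table;'      # compare counts with the old cluster
```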
"How to migrate hdfs data and hive meta data" is introduced here. Thank you for reading it. If you want to know more about industry-related knowledge, you can pay attention to the website. Xiaobian will output more high-quality practical articles for everyone!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.