Hadoop HDFS Balancer

2025-04-25 Update From: SLTechnology News&Howtos

An HDFS cluster easily develops uneven disk utilization across machines, for example after new DataNodes are added to the cluster. An imbalanced HDFS causes several problems: MapReduce (MR) programs cannot take full advantage of data-local computation, network bandwidth is used less effectively across machines, and the disks on some machines are left underused. Keeping the data in HDFS balanced is therefore very important.

Hadoop includes a Balancer program. Running it brings the HDFS cluster to a balanced state. The commands to start and stop it are as follows:

sh $HADOOP_HOME/bin/start-balancer.sh -threshold 10

sh $HADOOP_HOME/bin/stop-balancer.sh

The -threshold parameter in the start command is followed by the disk utilization deviation, as a percentage, at which HDFS is considered balanced. With a value of 10, the HDFS cluster is considered balanced once the deviation in disk utilization between machines is less than 10%.

Several parameters affect the Hadoop balancer tool:

-threshold. Default: 10. Value range: 0-100. Meaning: the target used to judge whether the cluster is balanced. The difference between each DataNode's storage utilization and the overall cluster storage utilization should be less than this threshold. In theory, the smaller this value, the more balanced the whole cluster will be. In a production environment, however, the Hadoop cluster keeps writing and deleting data while the balancer runs, so the configured threshold may never be reached.
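As a minimal sketch of the balance criterion described above, the check can be written in a few lines of Python. The function name `is_balanced` and its byte-count inputs are hypothetical and only illustrate the arithmetic; this is not the Balancer's actual code.

```python
def is_balanced(used_bytes, capacity_bytes, threshold=10.0):
    """Return True if every node's utilization is within `threshold`
    percentage points of the overall cluster utilization.

    used_bytes / capacity_bytes: per-DataNode lists (hypothetical inputs).
    threshold: allowed deviation in percent, 0-100 (default 10, as in the text).
    """
    cluster_util = 100.0 * sum(used_bytes) / sum(capacity_bytes)
    for used, capacity in zip(used_bytes, capacity_bytes):
        node_util = 100.0 * used / capacity
        # The text requires the deviation to be less than the threshold.
        if abs(node_util - cluster_util) >= threshold:
            return False
    return True


# Three equally sized nodes at 50%, 55% and 45%: cluster utilization is 50%.
print(is_balanced([50, 55, 45], [100, 100, 100]))   # True
# A nearly full node next to a nearly empty one deviates by 40 points.
print(is_balanced([90, 10, 50], [100, 100, 100]))   # False
```

A smaller threshold makes the check stricter, which matches the remark that lower values give a more evenly balanced cluster but may never be satisfied while data is still being written.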

dfs.balance.bandwidthPerSec. Default: 1048576 (1 MB/s). Meaning: the bandwidth that the balancer tool may use while it runs. Setting it too high may cause MapReduce (mapred) jobs to run slowly.
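To get a feel for what the default bandwidth means in practice, the following sketch estimates how long a transfer takes at a given limit. The helper `transfer_seconds` is a hypothetical name, and the estimate is only a lower bound that ignores protocol overhead and competing traffic.

```python
DEFAULT_BANDWIDTH = 1048576  # bytes per second, the default quoted above (1 MB/s)


def transfer_seconds(bytes_to_move, bandwidth=DEFAULT_BANDWIDTH):
    """Lower bound on the time needed to move `bytes_to_move` bytes
    when the balancer is limited to `bandwidth` bytes per second."""
    return bytes_to_move / bandwidth


ten_gib = 10 * 1024 ** 3
print(transfer_seconds(ten_gib))                 # 10240.0 seconds, about 2.8 hours
print(transfer_seconds(ten_gib, 10 * 1048576))   # 1024.0 seconds at 10 MB/s
```

Raising the limit shortens balancing time proportionally, at the cost of leaving less bandwidth for running jobs.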

Other features of the Hadoop balancer tool:

While the balancer tool runs, it iteratively moves file blocks from high-usage DataNodes to low-usage DataNodes. In each iteration, the amount of data a DataNode moves or receives does not exceed the smaller of two values, 10 GB or the threshold fraction of its capacity, and each iteration lasts no more than 20 minutes. At the end of each iteration, the balancer tool refreshes its view of the DataNodes' block distribution. The official documentation describes this in English as follows:

The tool moves blocks from highly utilized datanodes to poorly utilized datanodes iteratively. In each iteration a datanode moves or receives no more than the lesser of 10G bytes or the threshold fraction of its capacity. Each iteration runs no more than 20 minutes. At the end of each iteration, the balancer obtains updated datanodes information from the namenode.
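The per-iteration limit in this description is simple arithmetic, sketched below. The function name `max_bytes_per_iteration` is hypothetical, and reading "10G bytes" as 10 GiB (10 * 1024^3 bytes) is an assumption of this example.

```python
GIB = 1024 ** 3
ITERATION_CAP = 10 * GIB  # "10G bytes" from the description, read here as 10 GiB (assumption)


def max_bytes_per_iteration(capacity_bytes, threshold_percent=10.0):
    """Upper bound on the data one DataNode moves or receives in one iteration:
    the lesser of the fixed cap and the threshold fraction of its capacity."""
    return min(ITERATION_CAP, capacity_bytes * threshold_percent / 100.0)


# A 50 GiB node with a 10% threshold is limited by its capacity fraction.
print(max_bytes_per_iteration(50 * GIB) / GIB)     # 5.0
# A 1 TiB node would allow 102.4 GiB, so the fixed cap applies instead.
print(max_bytes_per_iteration(1024 * GIB) / GIB)   # 10.0
```

On large nodes the fixed cap dominates, so a heavily skewed cluster needs many iterations to converge.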

The Hadoop developers followed these principles when developing the Balancer program:

1. During data redistribution, no data may be lost, the number of replicas of the data must not change, and the number of blocks in each rack must not change.

2. The system administrator can start or stop the data redistribution program with a single command.

3. While being moved, blocks must not consume too many resources, such as network bandwidth.

4. While it runs, the data redistribution program must not affect the normal operation of the NameNode.

Based on these basic points, the current logical flow of the Hadoop data redistribution program is shown in the following figure:

The Rebalance program runs as its own process, separate from the NameNode.

1 The Rebalance Server gets information about all DataNodes from the NameNode: the disk usage of each DataNode.

2 The Rebalance Server calculates which machines need to move data out and which machines can accept data. It then gets the distribution of the data that needs to be moved from the NameNode.

3 The Rebalance Server calculates which blocks on which machine can be moved to another machine.

4-6 The machine that needs to move a block copies the data to the destination machine and deletes the block data from its own disk.

7 The Rebalance Server obtains the result of this data movement and repeats the process until there is no more data to move or the HDFS cluster meets the balance criterion.
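The flow above can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not the Balancer's actual algorithm: the `Node` class and `rebalance` function are hypothetical, each round moves a single block from the fullest node to the emptiest one, and the rack and replica constraints from the design principles are ignored.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: int  # total storage, in arbitrary units
    used: int      # storage in use, same units

    @property
    def utilization(self):
        return 100.0 * self.used / self.capacity


def rebalance(nodes, threshold=10.0, block_size=1, max_moves=10000):
    """Move blocks until every node is within `threshold` percentage points
    of the cluster utilization, or until no useful move remains."""
    moves = []
    while len(moves) < max_moves:
        # Step 1: collect the current usage of every node.
        total_used = sum(n.used for n in nodes)
        total_capacity = sum(n.capacity for n in nodes)
        cluster_util = 100.0 * total_used / total_capacity
        # Step 7 stop condition: the cluster meets the balance criterion.
        if all(abs(n.utilization - cluster_util) < threshold for n in nodes):
            break
        # Steps 2-3: choose a node to move data from and a node to receive it.
        source = max(nodes, key=lambda n: n.utilization)
        target = min(nodes, key=lambda n: n.utilization)
        # Step 7 stop condition: there is no data left that can be moved.
        if source.used < block_size or target.capacity - target.used < block_size:
            break
        # Skip moves that would only swap the roles of source and target.
        new_target_util = 100.0 * (target.used + block_size) / target.capacity
        if new_target_util >= source.utilization:
            break
        # Steps 4-6: copy the block to the target and delete it from the source.
        source.used -= block_size
        target.used += block_size
        moves.append((source.name, target.name))
    return moves


nodes = [Node("A", 100, 90), Node("B", 100, 10), Node("C", 100, 50)]
moves = rebalance(nodes, threshold=15.0, block_size=10)
print(len(moves))                          # 3
print([(n.name, n.used) for n in nodes])   # [('A', 60), ('B', 40), ('C', 50)]
```

The `max_moves` guard is a safety limit chosen for this sketch so that the loop always terminates, even for inputs where single-block moves cannot reach the threshold.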

The way Hadoop's existing Balancer program works is suitable for most cases.
