How to manage DataNode disk in HDFS

2025-02-23 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains in detail how to manage DataNode disks in HDFS. It is shared here as a practical reference; I hope you take something useful away from it.

When a single DataNode manages multiple disks, normal write operations spread usage fairly evenly across those disks. However, adding or replacing a disk leads to a serious imbalance in that DataNode's disk usage. The traditional HDFS balancer addresses imbalance between DataNodes (inter-node), not within a single DataNode (intra-node); in Hadoop 3.0 and later, the new intra-DataNode disk balancer solves this problem well.

HDFS now includes (available in CDH 5.8.2 and later) a comprehensive storage capacity management approach for moving data across the disks within a node. Anyone familiar with basic Hadoop configuration will know hdfs-site.xml: a DataNode spreads data blocks across local file system directories, which are specified with dfs.datanode.data.dir in hdfs-site.xml. In a typical installation, each directory (called a volume in HDFS terminology) is on a different device (for example, on separate HDDs or SSDs).
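As a concrete illustration (the directory paths are hypothetical; only the property name comes from the standard HDFS configuration), a DataNode with two volumes might declare them in hdfs-site.xml like this:

```xml
<!-- hdfs-site.xml: each comma-separated directory is one volume,
     typically mounted on its own physical disk -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/disk1/dfs/dn,/mnt/disk2/dfs/dn</value>
</property>
```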

When a new block is written to HDFS, the DataNode uses a volume selection policy to choose the disk for that block. Two policy types are currently supported: round-robin and available space (HDFS-1804).
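To switch from the default round-robin policy to the available-space policy, an hdfs-site.xml snippet along the following lines can be used (the property and class names are from the Apache HDFS configuration; tuning thresholds are omitted here):

```xml
<!-- hdfs-site.xml: choose volumes by available space instead of round-robin -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
```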

In short, as shown in the figure above, the round-robin policy distributes new blocks evenly among the available disks, while the available-space policy preferentially writes new data to the disk with the most free space (by percentage).

By default, DataNode writes new blocks using the round-robin policy. In a long-running cluster, however, a DataNode can still develop significantly unbalanced volumes after events such as the deletion of a large number of files in HDFS or the addition of new disks via hot swap. Even with an available-space-based volume selection policy, volume imbalance can make disk I/O inefficient: for example, every new write may go to a newly added empty disk while the other disks sit idle, turning the new disk into a bottleneck.
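The bottleneck effect can be seen with a toy sketch in plain shell (this is NOT HDFS code; the real available-space policy is more nuanced, but a strictly greedy "most free space first" chooser shows the problem): with one nearly full disk and one freshly added empty disk, every new block lands on the new disk, so all new-write I/O hits a single device.

```shell
disk1_used=90   # % used on the long-serving disk
disk2_used=0    # % used on the freshly added disk

# Greedily send each new block to whichever disk is less used
for block in 1 2 3 4 5; do
  if [ "$disk2_used" -lt "$disk1_used" ]; then
    target=disk2; disk2_used=$((disk2_used + 2))   # each block adds ~2% usage
  else
    target=disk1; disk1_used=$((disk1_used + 2))
  fi
  echo "block $block -> $target"
done
# every block goes to disk2 until it catches up with disk1
```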

The Apache Hadoop community had previously developed offline scripts to alleviate this kind of data imbalance. However, because those scripts live outside the HDFS code base, they require the DataNode to be taken offline before data can be moved between disks. HDFS-1312 therefore introduced an online disk balancer, designed to rebalance the volumes of a running DataNode according to various metrics. Like the HDFS Balancer, the disk balancer runs as a thread inside the DataNode and moves block files across volumes of the same storage type.

Disk balancer 101

Let's explore this feature step by step with an example. First, verify that dfs.disk.balancer.enabled is set to true on all DataNodes. Starting with CDH 5.8.2, you can set this configuration through the HDFS safety valve (Advanced Configuration Snippet) section in Cloudera Manager:
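Outside Cloudera Manager, the same switch can go directly into hdfs-site.xml (the property name is from the Apache HDFS configuration; its default value varies by version and distribution, so setting it explicitly is the safe choice):

```xml
<!-- hdfs-site.xml: enable the intra-DataNode disk balancer -->
<property>
  <name>dfs.disk.balancer.enabled</name>
  <value>true</value>
</property>
```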

In this example, we add a new disk to a DataNode that already holds data on /mnt/disk1, and mount the new disk at /mnt/disk2. In CDH, each HDFS data directory is on a separate disk, so we can use df to show disk usage:

# df -h
...
/var/disk1  5.8G  3.6G  1.9G  66%  /mnt/disk1
/var/disk2  5.8G   13M  5.5G   1%  /mnt/disk2

Obviously, it's time to balance the disks! A disk balancer task involves three steps (implemented by the hdfs diskbalancer command): plan, execute, and query. In the first step, the HDFS client reads the necessary information about the specified DataNode from the NameNode to generate an execution plan:

# hdfs diskbalancer -plan lei-dn-3.example.org
16/08/19 18:04:01 INFO planner.GreedyPlanner: Starting plan for Node : lei-dn-3.example.org:20001
16/08/19 18:04:01 INFO planner.GreedyPlanner: Disk Volume set 03922eb1-63af-4a16-bafe-fde772aee2fa Type : DISK plan completed.
16/08/19 18:04:01 INFO planner.GreedyPlanner: Compute Plan for Node : lei-dn-3.example.org:20001 took 5 ms
16/08/19 18:04:01 INFO command.Command: Writing plan to : /system/diskbalancer/2016-Aug-19-18-04-01

As the output shows, the HDFS disk balancer uses the disk usage information that DataNodes report to the NameNode and computes the data movement steps for the specified DataNode. Each step specifies the source volume, the destination volume, and the expected amount of data to move.

At the time of this writing, the only planner HDFS supports is GreedyPlanner, which keeps moving data from the most used device to the least used device until all data is spread evenly across all devices. You can also specify a space-utilization threshold with the plan command; if the difference in utilization falls below the threshold, the planner considers the disks balanced. Another noteworthy option is limiting the I/O bandwidth the disk balancer task may use (the -bandwidth option at plan time), so that block moves do not interfere with foreground work.

The plan command writes its output as JSON files stored in HDFS. By default, these plan files are placed under the /system/diskbalancer directory:

# hdfs dfs -ls /system/diskbalancer/2016-Aug-19-18-04-01
Found 2 items
-rw-r--r--   3 hdfs supergroup  1955 2016-08-19 18:04 /system/diskbalancer/2016-Aug-19-18-04-01/lei-dn-3.example.org.before.json
-rw-r--r--   3 hdfs supergroup   908 2016-08-19 18:04 /system/diskbalancer/2016-Aug-19-18-04-01/lei-dn-3.example.org.plan.json

To execute the plan on DataNode, run:

$ hdfs diskbalancer -execute /system/diskbalancer/2016-Aug-17-17-03-56/172.26.10.16.plan.json
16/08/17 17:22:08 INFO command.Command: Executing "execute plan" command

This command submits the JSON plan file to the DataNode, which executes it in the background in the BlockMover thread.

To check the status of the diskbalancer task on DataNode, use the query command:

# hdfs diskbalancer -query lei-dn-3:20001
16/08/19 21:08:04 INFO command.Command: Executing "query plan" command.
Plan File: /system/diskbalancer/2016-Aug-19-18-04-01/lei-dn-3.example.org.plan.json
Plan ID: ff735b410579b2bbe15352a14bf001396f22344f7ed5fe24481ac133ce6de65fe5d721e223b08a861245be033a82469d2ce943aac84d9a111b542e6c63b40e75
Result: PLAN_DONE

The output (PLAN_DONE) indicates that the disk balancing task has completed. To verify its effect, run df -h again to view the data distribution across the two local disks:

# df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
...
/var/disk1  5.8G  2.1G  3.5G   37%   /mnt/disk1
/var/disk2  5.8G  1.6G  4.0G   29%   /mnt/disk2

The output confirms that the disk balancer has reduced the difference in disk space usage to under 10%, which means the task succeeded.
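That check can also be done mechanically from the df output. The following is a small sketch, not part of HDFS; the numbers fed in below follow the post-balancing example above:

```shell
# usage_spread: print the difference, in percentage points, between the most-
# and least-used filesystems in `df` output read from stdin.
usage_spread() {
  awk 'NR > 1 { gsub(/%/, "", $5)
                if (max == "" || $5 + 0 > max + 0) max = $5
                if (min == "" || $5 + 0 < min + 0) min = $5 }
       END    { print max - min }'
}

# Example with the balanced volumes from the df output above:
printf '%s\n' \
  'Filesystem Size Used Avail Use% Mounted on' \
  '/var/disk1 5.8G 2.1G 3.5G 37% /mnt/disk1' \
  '/var/disk2 5.8G 1.6G 4.0G 29% /mnt/disk2' | usage_spread
# prints 8, i.e. under the 10% goal
```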

That concludes this article on how to manage DataNode disks in HDFS. I hope the content above has been helpful; if you found the article useful, please share it so more people can see it.



