How to use tiered storage to make HDFS more efficient

2025-07-04 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

This article explains how to use tiered storage to make HDFS more efficient.

Hadoop and its promise

It is well known that commodity hardware can be assembled into Hadoop clusters that provide storage and computing capacity for big data. The data is split into parts that are stored on separate machines, and the processing logic runs on the same machines that hold the data.

For example, a Hadoop cluster of 1,000 nodes, each with 20 TB of storage, can hold up to 20 PB of data. Because every one of these machines also has computing power, the cluster fulfills Hadoop's slogan, "take compute to data".

Temperature of data

Different types of datasets are usually stored in a cluster, and different teams share them to run different types of workloads. Each dataset grows continuously as data pipelines feed it.

A common trait of datasets is heavy initial usage. During this period, the dataset is considered "HOT". Our analysis found that usage declines over time: the stored data is then accessed only a few times a week and gradually becomes "WARM". Over the next 90 days, when usage falls to a few times a month, the data is defined as "COLD".

As a result, data is considered "hot" in its first few days and remains "warm" for the rest of its first month. During this period, jobs or applications use the data a few times. As usage declines further, the data becomes "cold" and may be used only a handful of times in the next 90 days. Finally, when the data is used very rarely, only once or twice a year, its "temperature" is "frozen".
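The age bands used in this article's example (under 7 days, 7 days to 1 month, 1 to 3 months, 3 months to 3 years) can be turned into a small classifier. This is only a sketch: the function name `classify_temperature` is hypothetical, a month is approximated as 30 days and a year as 365 days, and data older than 3 years, which the example does not cover, is simply kept as FROZEN.

```python
from datetime import timedelta

# Upper age bounds taken from the article's example.
# Assumption for this sketch: 1 month = 30 days, 1 year = 365 days.
TEMPERATURE_BANDS = [
    (timedelta(days=7), "HOT"),
    (timedelta(days=30), "WARM"),
    (timedelta(days=90), "COLD"),
    (timedelta(days=3 * 365), "FROZEN"),
]


def classify_temperature(age: timedelta) -> str:
    """Return the temperature label for a dataset of the given age."""
    if age < timedelta(0):
        raise ValueError("age must not be negative")
    for upper_bound, label in TEMPERATURE_BANDS:
        if age < upper_bound:
            return label
    # The example stops at 3 years; this sketch keeps older data FROZEN.
    return "FROZEN"


if __name__ == "__main__":
    for days in (1, 10, 45, 400):
        print(days, classify_temperature(timedelta(days=days)))
```

In practice the temperature may depend on more than age, for example on measured access counts, so the thresholds here are a starting point rather than a rule.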

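Data age                    Usage frequency    Temperature
Age < 7 days                20 times a day     HOT
7 days < Age < 1 month      5 times a week     WARM
1 month < Age < 3 months    5 times a month    COLD
3 months < Age < 3 years    2 times a year     FROZEN

In general, a temperature is associated with each dataset. In this example, temperature is inversely proportional to the age of the data. Other factors can also affect the temperature of a particular dataset, and you can also determine a dataset's temperature algorithmically.

Tiered storage in HDFS

HDFS has supported tiered storage since Hadoop 2.3. How does it work? Normally, when a machine is added to the cluster, it is assigned local file system directories in which to store block replicas. The parameter that specifies these local storage directories is dfs.datanode.data.dir. Another tier, such as ARCHIVE, can be added using an enum called StorageType. To indicate that a local directory belongs to the archive tier, the directory is configured with the prefix [ARCHIVE]. In theory, a Hadoop cluster administrator can define multiple tiers.

For example, suppose we add 100 nodes, each with 200 TB of storage, to an existing cluster of 1,000 nodes with a total capacity of 20 PB. These new nodes have less computing power than the existing 1,000 nodes. We then add the ARCHIVE prefix to the configuration of all their local directories. The 100 nodes in the archive tier now provide 20 PB of storage. The cluster as a whole is divided into two tiers, DISK and ARCHIVE, each with 20 PB of capacity, for a total of 40 PB.

Mapping data to storage tiers based on temperature

In this example, we store frequently used "HOT" data in the DISK tier, whose nodes have more computing power. For "WARM" data, we keep most of the replicas in the disk tier: for data with a replication factor of 3, we store two replicas in the disk tier and one in the archive tier. Once data has turned "COLD", we keep at least one replica of each block in the disk tier and put the remaining replicas in the archive tier.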

When a dataset is considered "FROZEN", it is almost never used, and it is wasteful to keep it on nodes with plenty of CPU that could instead run many tasks or containers. We store it on nodes with minimal computing power. Therefore, all replicas of all blocks of "FROZEN" data can be moved to the archive tier.
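The mapping from temperature to tiers described in this section can be written down as a small function. This is a sketch, not HDFS's own policy logic: `replica_placement` is a hypothetical name, "most replicas" for WARM data is interpreted as a strict majority, and "at least one" disk replica for COLD data is taken as exactly one.

```python
def replica_placement(temperature: str, replication: int = 3) -> dict:
    """Return how many replicas of each block go to each tier."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if temperature == "HOT":
        disk = replication               # everything stays on DISK
    elif temperature == "WARM":
        disk = replication // 2 + 1      # assumption: a strict majority on DISK
    elif temperature == "COLD":
        disk = 1                         # assumption: exactly one replica on DISK
    elif temperature == "FROZEN":
        disk = 0                         # all replicas go to ARCHIVE
    else:
        raise ValueError(f"unknown temperature: {temperature}")
    disk = min(disk, replication)
    return {"DISK": disk, "ARCHIVE": replication - disk}


if __name__ == "__main__":
    for label in ("HOT", "WARM", "COLD", "FROZEN"):
        print(label, replica_placement(label))
```

With the default replication factor of 3, WARM data yields two DISK replicas and one ARCHIVE replica, which matches the example in the text.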

Data flow across tiers

When data is first added to the cluster, it is stored in the default DISK tier. Based on the temperature of the data, one or more of its replicas are then moved to the archive tier. A tool called the mover moves data from one tier to another. The mover works like the balancer, except that it can move block replicas across tiers. The mover accepts an HDFS path, a number of replicas, and destination tier information. It then identifies the replicas to be moved based on the tier information and schedules the transfer from the source DataNodes to the destination DataNodes.
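To make the mover's job concrete, the sketch below works out which replicas of a single block would have to change tier to reach a target placement. It models tiers only: the real mover also chooses source and destination DataNodes, which is not attempted here, and the name `plan_moves` is hypothetical.

```python
from collections import Counter


def plan_moves(current_tiers, target_counts):
    """Plan tier changes for one block.

    current_tiers: list with the tier of each replica, e.g. ["DISK", "DISK", "DISK"]
    target_counts: wanted replicas per tier, e.g. {"DISK": 1, "ARCHIVE": 2}
    Returns a list of (replica_index, source_tier, destination_tier).
    """
    if len(current_tiers) != sum(target_counts.values()):
        raise ValueError("target must place exactly the existing replicas")
    current = Counter(current_tiers)
    # One entry per replica that a tier is still missing.
    deficits = []
    for tier, wanted in target_counts.items():
        deficits.extend([tier] * max(0, wanted - current[tier]))
    # How many replicas each tier holds beyond its target.
    surplus = {tier: current[tier] - target_counts.get(tier, 0) for tier in current}
    moves = []
    for index, tier in enumerate(current_tiers):
        if surplus[tier] > 0 and deficits:
            moves.append((index, tier, deficits.pop(0)))
            surplus[tier] -= 1
    return moves


if __name__ == "__main__":
    # A block whose dataset has just turned COLD: two replicas leave DISK.
    print(plan_moves(["DISK", "DISK", "DISK"], {"DISK": 1, "ARCHIVE": 2}))
```

The same function also covers the reverse direction, for example when data is reclassified from FROZEN to WARM and replicas have to return to the DISK tier.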

Changes to support tiered storage in Hadoop 2.6

Hadoop 2.6 includes many improvements that extend its support for tiered storage. You can attach a storage policy to a directory to mark it as "HOT", "WARM", "COLD", or "FROZEN". The storage policy defines the number of replicas to be stored in each tier. You can change the storage policy of a directory and then run the mover on that directory to make the policy take effect.
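The two steps above (attach a policy, then run the mover) can be sketched as a helper that only builds the command lines without executing them. It assumes the `hdfs storagepolicies -setStoragePolicy -path ... -policy ...` and `hdfs mover -p ...` forms of the HDFS command-line tools; the exact commands and the available policy names differ between Hadoop versions, so check the documentation for your release. The function name and the path `/data/example_dataset` are hypothetical.

```python
import shlex


def build_tiering_commands(path: str, policy: str) -> list:
    """Return argv lists that set a storage policy and then run the mover.

    Assumption: the cluster's HDFS CLI provides `hdfs storagepolicies`
    and `hdfs mover`; nothing is executed or validated here.
    """
    set_policy = [
        "hdfs", "storagepolicies", "-setStoragePolicy",
        "-path", path, "-policy", policy,
    ]
    run_mover = ["hdfs", "mover", "-p", path]
    return [set_policy, run_mover]


if __name__ == "__main__":
    # Hypothetical directory and policy, for illustration only.
    for argv in build_tiering_commands("/data/example_dataset", "COLD"):
        print(shlex.join(argv))
```

Returning argument lists rather than shell strings keeps paths with spaces safe if the commands are later passed to `subprocess.run`.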

Applications that use data

Depending on the temperature of the data, some or all of its replicas may be stored in either tier. For applications that use the data through HDFS, however, the location is transparent.

Although all replicas of "frozen" data are in the archive tier, applications can still access it like any other data in HDFS. Because the archive-tier nodes have little or no computing power to run tasks, map tasks running on the disk tier read the data from archive-tier nodes over the network, which increases the application's network traffic. If this happens frequently, you can reclassify the data as "warm" or "cold" and have the mover bring one or more replicas back to the disk tier.

Determining the temperature of data and moving the specified replicas to the predefined storage tiers can be fully automated.

Tiered storage at eBay

eBay uses tiered storage on one of its very large clusters. The cluster holds 40 PB of data. We added a further 10 PB of storage capacity on machines with limited computing power; each new machine can store 220 TB. We marked the added storage as the archive tier and marked some directories as "warm", "cold", or "frozen". All or some of their replicas were then moved to the archive tier according to their temperature.

The price per GB of the archive tier is four times lower than that of the disk tier. The difference is mainly because the machines in the archive tier have very limited computing power, which reduces their cost.
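A short calculation shows what this price difference means for each temperature. The sketch assumes that "four times lower" means the archive tier costs one quarter of the disk tier per GB, uses a replication factor of 3, and reuses the replica placements from the mapping section earlier in this article; costs are expressed relative to the disk-tier price rather than in any currency. The names are hypothetical.

```python
def relative_storage_cost(disk_replicas: int, archive_replicas: int,
                          archive_price_ratio: float = 0.25) -> float:
    """Cost of storing one GB of data, in units of the disk-tier price per GB.

    Assumption: archive storage costs `archive_price_ratio` times the
    disk-tier price (0.25 for "four times lower").
    """
    return disk_replicas * 1.0 + archive_replicas * archive_price_ratio


# (DISK replicas, ARCHIVE replicas) per temperature, replication factor 3.
PLACEMENTS = {"HOT": (3, 0), "WARM": (2, 1), "COLD": (1, 2), "FROZEN": (0, 3)}

for temperature, (disk, archive) in PLACEMENTS.items():
    cost = relative_storage_cost(disk, archive)
    print(f"{temperature}: {cost:.2f}x disk price per GB, "
          f"{cost / 3:.0%} of the all-disk cost")
```

Under these assumptions, fully archived "frozen" data costs a quarter as much to store as the same data kept entirely in the disk tier.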

