
Introduction of erasure codes with new features of HDFS 3.x


This article introduces erasure coding, one of the new features of HDFS 3.x. Many people run into these questions in real-world work, so let the editor walk you through how to handle them. I hope you read it carefully and come away with something useful!

New HDFS 3.x data storage feature: erasure coding

HDFS is a distributed file system with high throughput and high fault tolerance, but that fault tolerance comes at a high storage cost. For example, storing 5 TB of data on HDFS under the default 3-replica mechanism occupies 15 TB of storage space. Is there a mechanism that achieves the same fault tolerance as the replica mechanism while greatly reducing storage cost? Yes: the erasure coding mechanism introduced in HDFS 3.x.

1. EC introduction

Erasure Coding is abbreviated as EC.

EC (erasure coding) is a coding technique. Before HDFS adopted it, it was most widely used in Redundant Array of Inexpensive Disks (RAID). RAID applies EC through striping, a technique that automatically balances I/O load across multiple physical disks. The principle is to divide a logically contiguous piece of data into many small units and store them on different disks. This lets multiple processes access different parts of the data at the same time without disk contention (which can occur when multiple processes access the same disk simultaneously), and when such data is accessed sequentially, the maximum I/O parallelism is obtained, giving very good performance.

In HDFS, the small units into which contiguous data is divided are called striping cells (stripe units). For each stripe of original data cells, a certain number of parity cells are computed and stored; this computation is called encoding. An error in any striping cell can then be recovered by a decoding computation based on the remaining data cells and the parity cells.
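As a concrete illustration of striping, encoding and decoding (a toy sketch, not the actual HDFS or RAID code path), the following Python snippet stripes a byte buffer into fixed-size cells, spreads the cells round-robin across several notional data disks, and computes one XOR parity cell per stripe; any single lost cell in a stripe can then be rebuilt from the survivors. Production EC uses Reed-Solomon codes over a finite field and can tolerate multiple losses, but the striping/encode/decode flow is the same idea. CELL_SIZE and NUM_DATA_DISKS are made-up illustrative values.

```python
CELL_SIZE = 4          # bytes per striping cell (illustrative; HDFS defaults to 1 MB)
NUM_DATA_DISKS = 3     # number of data "disks" (k)

def xor_cells(cells):
    """XOR equal-length byte cells together."""
    out = bytearray(len(cells[0]))
    for cell in cells:
        for i, b in enumerate(cell):
            out[i] ^= b
    return bytes(out)

def stripe_and_encode(data: bytes):
    """Split data into cells, place them round-robin on the data disks,
    and append one XOR parity cell per stripe on an extra parity disk."""
    stripe_bytes = CELL_SIZE * NUM_DATA_DISKS
    data = data.ljust(-(-len(data) // stripe_bytes) * stripe_bytes, b"\0")  # pad
    disks = [[] for _ in range(NUM_DATA_DISKS + 1)]   # last list is the parity disk
    for start in range(0, len(data), stripe_bytes):
        stripe = [data[start + i * CELL_SIZE: start + (i + 1) * CELL_SIZE]
                  for i in range(NUM_DATA_DISKS)]
        for disk, cell in zip(disks, stripe):
            disk.append(cell)
        disks[-1].append(xor_cells(stripe))           # encoding: parity for this stripe
    return disks

def recover_cell(disks, lost_disk: int, stripe_idx: int):
    """Decoding: rebuild one lost cell of a stripe by XOR-ing the surviving cells."""
    survivors = [disks[d][stripe_idx] for d in range(len(disks)) if d != lost_disk]
    return xor_cells(survivors)

disks = stripe_and_encode(b"hello erasure coding")
assert recover_cell(disks, lost_disk=1, stripe_idx=0) == disks[1][0]
```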

2. HDFS data redundant storage strategy

The default storage strategy of HDFS is the replica mechanism, which improves data safety but also brings extra overhead. The default 3-replica scheme incurs 200% additional overhead in storage space and other resources (such as network bandwidth). Yet for data with relatively low I/O activity, the extra block replicas are rarely accessed during normal operation while still consuming the same amount of resources as the first replica.

Therefore, a major improvement in HDFS 3.x is the option to use erasure coding (EC) in place of the replica mechanism. Erasure coding provides the same fault tolerance as replication with much less storage space. In a typical EC setup, the storage overhead does not exceed 50%.

3. Implementation principle of EC algorithm

There are many algorithms that implement EC. One of the most common is Reed-Solomon (RS), which has two parameters and is denoted RS(k, m), where k is the number of data blocks and m is the number of parity blocks. The number of blocks that can be lost and still recovered (counting both data blocks and parity blocks) is at most the number of parity blocks m. The principle is explained with the following example:

We use RS(3, 2), meaning 3 original data blocks and 2 parity blocks.

For example, from the generator matrix GT of RS(3, 2) and three original data blocks Data with values 7, 8 and 9, two parity blocks with values 50 and 122 can be computed by matrix multiplication. At this point there are five blocks in total: the original data 7, 8, 9 plus the parity data 50, 122. Any two of them can be lost and then recovered through the decoding algorithm. The matrix multiplication is shown in the following figure:

Matrix multiplication

GT is the generator matrix; for RS(k, m), the generator matrix has m rows and k columns.

Data represents the original data; 7, 8 and 9 are the original data blocks.

Parity represents the parity data; 50 and 122 are the parity blocks.
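The figure itself is not reproduced here, but assuming the two parity rows of GT are (1, 2, 3) and (4, 5, 6), which is consistent with the parity values 50 and 122 above, the multiplication would look like this:

\[
GT \cdot Data =
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}
\begin{pmatrix} 7 \\ 8 \\ 9 \end{pmatrix}
=
\begin{pmatrix} 1\cdot 7 + 2\cdot 8 + 3\cdot 9 \\ 4\cdot 7 + 5\cdot 8 + 6\cdot 9 \end{pmatrix}
=
\begin{pmatrix} 50 \\ 122 \end{pmatrix}
= Parity
\]

(A production RS implementation performs this arithmetic over a finite field such as GF(2^8) rather than over the integers, but the recovery idea is the same: any three of the five values, together with the corresponding rows of the full generator matrix, are enough to solve for the original data.)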

So for three original data blocks protected by two parity blocks, EC occupies five blocks of disk space in total and can tolerate the loss of any two blocks, whereas the 2-replica mechanism would occupy six blocks and still tolerate only a single arbitrary failure.

4. Application scenarios of EC

Integrating EC into HDFS improves storage efficiency while still providing data durability comparable to a traditional replica-based HDFS deployment. For example, a file of 6 blocks stored with 3 replicas consumes 6 * 3 = 18 blocks of disk space, whereas stored with EC (6 data, 3 parity) it consumes only 9 blocks.
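A small sketch of the arithmetic behind those numbers; the helper functions are illustrative, not part of any HDFS API, and the partial-group handling is a simplification:

```python
def replica_blocks(data_blocks: int, replication: int = 3) -> int:
    """Disk blocks consumed when every block is stored `replication` times."""
    return data_blocks * replication

def ec_blocks(data_blocks: int, k: int = 6, m: int = 3) -> int:
    """Disk blocks consumed under an RS(k, m) policy: every group of up to k
    data blocks carries m extra parity blocks (simplified rounding-up model)."""
    groups = -(-data_blocks // k)      # ceil(data_blocks / k)
    return data_blocks + groups * m

print(replica_blocks(6))   # 18 blocks with 3 replicas (200% overhead)
print(ec_blocks(6))        # 9 blocks with RS(6, 3)   (50% overhead)
```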

However, EC consumes a great deal of CPU during encoding and during data reconstruction, and most of the data involved in reconstruction is read remotely, so there is also significant network overhead.

Therefore, when CPU resources are tight and storage is cheap, data can be stored with the replica mechanism; when CPU resources are plentiful and storage is expensive, data can be stored with the EC mechanism.

5. The architecture of EC in HDFS

HDFS uses online EC directly (writing data in EC format from the start), which avoids a conversion phase and saves storage space immediately. Online EC also enhances sequential I/O performance by exploiting multiple disk spindles in parallel, which is especially desirable in clusters with high-end networking. Second, it naturally spreads a small file across multiple DataNodes without needing to bundle several files into one coding group. This greatly simplifies file operations such as deletion, disk quota reporting, and migration between namespaces.

In typical HDFS clusters, small files can account for more than 3/4 of total storage consumption. To better support small files, the first phase of HDFS EC supports the striped layout (Striping Layout), while the contiguous layout (Contiguous Layout) is still under development.

Striped layout:

Advantages:

Less data is cached on the client

It applies regardless of file size.

Disadvantages:

Can hurt the performance of some locality-sensitive tasks, because blocks that would otherwise sit on one node are scattered across several different nodes

Conversion to and from the multi-replica storage strategy is more cumbersome.

Contiguous layout:

Advantages:

Easy to implement

Convenient to convert to and from the multi-replica storage strategy.

Disadvantages:

The client needs to cache enough data blocks

Not suitable for storing small files.

In traditional mode, the basic unit of a file in HDFS is the block, while in EC mode the basic unit of a file is the block group. Taking RS(3, 2) as an example, each block group contains 3 data blocks and 2 parity blocks.
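As an illustration of how the striped layout places a file inside a block group, here is a simplified sketch (my own simplification, not the actual HDFS code): consecutive cells of the file go round-robin to the k data blocks of the group, and each full row of k cells (a stripe) gets its own parity cells.

```python
K, M = 3, 2               # RS(3, 2): 3 data blocks + 2 parity blocks per block group
CELL_SIZE = 1024 * 1024   # 1 MB striping cell (the common default cell size)

def locate_cell(file_offset: int):
    """Map a file offset to (data block index within the group, offset in that block).
    Simplified: assumes the whole file fits in a single block group."""
    cell_index = file_offset // CELL_SIZE
    block_index = cell_index % K         # cells go round-robin over the k data blocks
    stripe_index = cell_index // K       # which "row" (stripe) of the block group
    offset_in_block = stripe_index * CELL_SIZE + file_offset % CELL_SIZE
    return block_index, offset_in_block

# The 4th megabyte of the file lands at the start of the second cell of data block 0:
print(locate_cell(3 * CELL_SIZE))   # -> (0, 1048576)
```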

The main extensions HDFS makes to introduce EC mode are as follows:

NameNode: an HDFS file is logically composed of block groups, and each block group contains a certain number of internal blocks. To reduce the NameNode memory consumed by these internal blocks, HDFS introduces a new hierarchical block naming protocol: the ID of a block group can be inferred from the ID of any of its internal blocks. This allows management at the block group level rather than the block level.
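A minimal sketch of the idea behind hierarchical naming, assuming for illustration that the low 4 bits of an internal block's ID encode its index within the group (the real NameNode scheme differs in details such as the ID range used for striped blocks):

```python
# Simplified illustration of hierarchical block naming for erasure-coded files.
# Assumption of this sketch: the low 4 bits of an internal block ID hold the
# block's index inside its group, so clearing those bits yields the group ID.
INDEX_BITS = 4
INDEX_MASK = (1 << INDEX_BITS) - 1

def internal_block_id(group_id: int, index_in_group: int) -> int:
    """ID of the index-th internal block (data or parity) of a block group."""
    return group_id | index_in_group

def block_group_id(internal_id: int) -> int:
    """Recover the block group ID from any of its internal block IDs."""
    return internal_id & ~INDEX_MASK

group = 0x1000                                             # hypothetical group ID
blocks = [internal_block_id(group, i) for i in range(5)]   # RS(3, 2): 5 internal blocks
assert all(block_group_id(b) == group for b in blocks)
```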

Client: the client read and write paths have been enhanced to process the multiple internal blocks of a block group in parallel.

DataNode: DataNodes run an additional ErasureCodingWorker (ECWorker) task for background recovery of failed erasure-coded blocks. When the NameNode detects a failed EC block, it selects a DataNode to carry out the recovery, much as it re-replicates a failed replica block. Reconstruction performs three key tasks (a sketch of the flow follows the list):

Read data from source nodes: input data is read from the source nodes in parallel using a dedicated thread pool. Based on the EC policy, read requests are scheduled against all the sources, and only the minimum number of input blocks is read for reconstruction.

Decode the data and generate output data: new data and parity blocks are decoded from the input data. All lost data and parity blocks are decoded together.

Transfer the generated blocks to the target nodes: once decoding is complete, the recovered blocks are transferred to the target DataNodes.
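A minimal pseudocode-style sketch of those three steps, written here in Python for illustration only; read_block, rs_decode and send_block are hypothetical helpers standing in for the real RPC, codec and transfer logic of ECWorker:

```python
from concurrent.futures import ThreadPoolExecutor

def reconstruct(source_nodes, lost_indices, k, targets,
                read_block, rs_decode, send_block):
    """Sketch of ECWorker-style recovery: read k surviving internal blocks in
    parallel, decode the lost ones, then push them to the target DataNodes.
    read_block / rs_decode / send_block are hypothetical callables."""
    # 1. Read the minimum number (k) of surviving internal blocks, in parallel.
    with ThreadPoolExecutor(max_workers=k) as pool:
        surviving = list(pool.map(read_block, source_nodes[:k]))

    # 2. Decode: regenerate every lost data/parity block from the k inputs.
    recovered = rs_decode(surviving, lost_indices)

    # 3. Transfer each recovered block to its target DataNode.
    for block, target in zip(recovered, targets):
        send_block(block, target)
```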

Erasure coding policies: files and directories in an HDFS cluster are allowed to have different replication and erasure coding policies to accommodate heterogeneous workloads. An erasure coding policy encapsulates how a file is encoded and decoded. Each policy is defined by the following information:

EC schema: this includes the numbers of data and parity blocks in an EC group (for example, 6+3), as well as the codec algorithm (such as Reed-Solomon or XOR).

The size of a striping cell. This determines the granularity of striped reads and writes, including buffer sizes and encoding work.

We can define our own EC policies through an XML file, which must contain the following three parts:

Layoutversion: this indicates the version of the EC policy XML file format.

Schemas: this includes all user-defined EC schemas.

Policies: this includes all user-defined EC policies; each policy consists of a schema id and the size of the striping cell (cellsize).

There is a sample XML file for configuring EC policies in the Hadoop conf directory, named user_ec_policies.xml.template, which you can refer to when writing your own.

6. Hardware configuration of cluster

Erasure codes have certain requirements for clusters in terms of CPU and network:

Encoding and decoding work consumes additional CPU on the HDFS client and DataNode.

Erasure-coded files are also spread across racks to achieve rack fault tolerance. This means that when reading and writing striped files, most operations are off-rack, so the network bisection bandwidth is very important.

For rack fault tolerance, it is also important to have at least as many racks as the configured EC stripe width. For the EC policy RS(6, 3), this means a minimum of 9 racks, and ideally 10 or 11, to handle planned and unplanned outages. For clusters with fewer racks than the stripe width, HDFS cannot maintain rack fault tolerance, but it will still attempt to spread a striped file across multiple nodes to preserve node-level fault tolerance.

7. Finally

By default, all EC policies are disabled in HDFS. We can enable an EC policy with the hdfs ec [-enablePolicy -policy <policyName>] command, based on the size of the cluster and the desired fault-tolerance properties.

For example, for a cluster with 9 racks, a policy such as RS-10-4-1024k will not retain rack-level fault tolerance, while RS-6-3-1024k or RS-3-2-1024k may be more appropriate.

RS-10-4-1024k indicates a Reed-Solomon policy with 10 data blocks, 4 parity blocks, and a striping cell size of 1024k (1 MB).
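The built-in policy names follow the pattern codec-data-parity-cellsize, so they can be unpacked mechanically. A small sketch (the parsing helper is my own, not an HDFS API, and it only handles the simple four-part names, not RS-LEGACY):

```python
def parse_ec_policy(name: str) -> dict:
    """Split a policy name such as 'RS-10-4-1024k' into its components."""
    codec, data, parity, cell = name.split("-")
    return {"codec": codec,
            "data_blocks": int(data),
            "parity_blocks": int(parity),
            "cell_size_bytes": int(cell.rstrip("kK")) * 1024}

print(parse_ec_policy("RS-10-4-1024k"))
# {'codec': 'RS', 'data_blocks': 10, 'parity_blocks': 4, 'cell_size_bytes': 1048576}
```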

Under the replica mechanism we can set the replication factor to specify the number of replicas, but under an EC policy specifying a replication factor is meaningless: it is always 1 and cannot be changed with the related commands.

This concludes the introduction to erasure coding, a new feature of HDFS 3.x. Thank you for reading!
