How to look at the design requirements of a distributed file system from HDFS

2025-04-03 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article shows how to look at the design requirements of a distributed file system through the lens of HDFS. It is easy to follow and well organized, and I hope it helps resolve your doubts. Let me lead you through studying this topic.

The design requirements of a distributed file system are roughly as follows: transparency, concurrency control, scalability, fault tolerance, and security. By observing the design and implementation of HDFS from these angles, we can see its application scenarios and design philosophy more clearly.

First, transparency. Judged by the Open Distributed Processing standard, there are eight kinds of transparency: access, location, concurrency, replication, failure, migration, performance, and scaling transparency. For a distributed file system, the most important thing is to meet five transparency requirements:

1) Access transparency: users can access local and remote file resources through the same operations. HDFS achieves this only partially: if HDFS is configured to use the local file system rather than a distributed one, programs that read and write HDFS can read and write local files without code changes, but the configuration files must be modified. The access transparency HDFS provides is therefore incomplete; after all, it is built in Java and cannot hook into the Unix kernel the way NFS or AFS do in order to handle local and remote files in a fully consistent manner.

2) Location transparency: a single file namespace is used, and files or file collections can be relocated without changing their path names. An HDFS cluster has a single Namenode that manages the file system namespace; the blocks of a file can be redistributed and replicated, replicas can be added or removed, and replicas can be stored across racks, all of which is transparent to clients.

3) Migration transparency: this is similar to location transparency. Files in HDFS are often copied or moved because of node failures, node additions, replication-factor changes, or rebalancing, yet clients and client programs do not need to change anything. The edits log on the Namenode records these changes.

4) Performance and scaling transparency: the goal of HDFS is to build a distributed file system cluster on large numbers of cheap machines, so there is no doubt about its scalability. For performance, refer to the benchmarks on its home page.

Second, concurrency control. One client's reads and writes of a file should not affect other clients' reads and writes of the same file. To achieve single-copy semantics similar to those of a local file system, a distributed file system needs complex interactions, such as timestamps or callback promises (similar to an RPC callback from server to client when a file is updated; a callback promise has two states, valid or cancelled, and the client checks the state of its callback promise to determine whether the file on the server has been updated).
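The callback-promise scheme just described (used by systems like AFS, not by HDFS) can be sketched as follows. The class and method names are illustrative, not any real API: the client caches a file together with a valid promise, and when the server updates the file it cancels every outstanding promise, so the client knows its cached copy is stale and re-fetches.

```python
class Server:
    def __init__(self, content):
        self.content = content
        self.promises = []  # clients currently holding a valid callback promise

    def fetch(self, client):
        # Hand out the content along with a callback promise.
        self.promises.append(client)
        return self.content

    def update(self, content):
        # Break every outstanding promise when the file changes.
        self.content = content
        for client in self.promises:
            client.promise_valid = False
        self.promises.clear()


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = None
        self.promise_valid = False

    def read(self):
        # A cancelled promise means the cache may be stale: re-fetch.
        if not self.promise_valid:
            self.cache = self.server.fetch(self)
            self.promise_valid = True
        return self.cache
```

The point of the scheme is that the server, not the client, bears the cost of invalidation: clients read from cache freely until a promise is cancelled.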

HDFS does none of this; its mechanism is very simple. Only one writing client is allowed at any time, and once a file has been created and written it no longer changes: its model is write-once-read-many. This is consistent with its applications. HDFS files usually range from megabytes to terabytes, and this data is rarely modified; the most frequent access pattern is sequential reading and processing, with very little random reading, so HDFS is well suited to the MapReduce framework or web-crawler applications. The size of HDFS files also means its clients cannot cache hundreds of commonly used files the way some distributed file systems do.
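A minimal sketch of this write-once-read-many, single-writer model; `WriteOnceFile` and its methods are hypothetical names for illustration, not the HDFS client API. A second writer is rejected while the lease is held, and once the file is closed it becomes immutable, while any number of readers may read it:

```python
class WriteOnceFile:
    def __init__(self):
        self.chunks = []
        self.closed = False
        self.writer = None  # the single client holding the write lease

    def open_for_write(self, client):
        if self.closed:
            raise PermissionError("file is immutable once written")
        if self.writer is not None and self.writer != client:
            raise PermissionError("another client holds the write lease")
        self.writer = client

    def append(self, client, data):
        if client != self.writer or self.closed:
            raise PermissionError("not the lease holder")
        self.chunks.append(data)

    def close(self, client):
        if client != self.writer:
            raise PermissionError("not the lease holder")
        self.closed = True
        self.writer = None

    def read(self):
        # Reads never block and never conflict: the data cannot change.
        return b"".join(self.chunks)
```

Because written data never changes, readers need no locks or cache-invalidation protocol at all, which is exactly why HDFS can avoid the complex machinery described above.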

Third, file replication. A file can be represented by multiple copies of its contents at different locations. This brings two benefits: the same file can be served from multiple servers, improving the scalability of the service, and fault tolerance is improved, because if one copy is corrupted the file can still be obtained from other server nodes. The blocks of an HDFS file are replicated for fault tolerance according to the configured replication factor, which defaults to 3. The placement strategy for replicas is also careful: one on a node in the local rack, one on another node of the same rack, and the third on a different rack. This minimizes, as far as possible, the chance that a single failure loses all copies. Moreover, when reading a file, HDFS prefers to fetch blocks from a replica on the same rack, or at least from a node in the same data center.
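The placement policy described above can be sketched roughly as follows. The topology representation and node names are made up for illustration; the real HDFS placement logic is considerably more involved (it also accounts for node load and free space):

```python
def place_replicas(writer_node, topology):
    """Choose 3 replica locations for a block written from writer_node.

    topology maps rack name -> list of datanode names.
    Policy sketched from the article: first replica on the writer's node,
    second on another node of the same rack, third on a different rack.
    """
    writer_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    replicas = [writer_node]                                  # local node
    same_rack = [n for n in topology[writer_rack] if n != writer_node]
    replicas.append(same_rack[0])                             # same rack, other node
    other_rack = next(r for r in topology if r != writer_rack)
    replicas.append(topology[other_rack][0])                  # remote rack
    return replicas


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))  # ['dn1', 'dn2', 'dn3']
```

The design trade-off is visible in the sketch: two replicas share a rack to keep write traffic mostly rack-local, while the third replica on a remote rack survives the loss of an entire rack.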

Fourth, heterogeneity of hardware and operating systems. Because it is built on the Java platform, there is no doubt about the cross-platform ability of HDFS. Thanks to the file I/O abstractions of the Java platform, HDFS can run the same client and server programs on different operating systems and machines.

Fifth, fault tolerance. In a distributed file system, it is very important that the file service keeps working when something goes wrong with a client or server. The fault tolerance of HDFS can be divided into two aspects: the fault tolerance of the file system, and the fault tolerance of Hadoop itself. The file system's fault tolerance is achieved by several means:

Heartbeat detection is maintained between the Namenode and Datanodes. When the heartbeat packets sent by a Datanode stop arriving at the Namenode, for example because of a network failure, the Namenode stops dispatching new I/O operations to that Datanode and considers the data on it unavailable. The Namenode then checks whether the number of replicas of each affected block has fallen below the configured value, and if so, it automatically creates new replicas and distributes them to other Datanodes.
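A rough sketch of the Namenode-side bookkeeping just described: Datanodes that miss the heartbeat deadline are treated as dead, and blocks whose live replica count falls below the replication factor are flagged for re-replication. The timeout value and data structures are simplifications for illustration, not the real Namenode internals:

```python
HEARTBEAT_TIMEOUT = 600   # seconds before a silent Datanode is considered dead
REPLICATION_FACTOR = 3

def find_under_replicated(last_heartbeat, block_map, now):
    """Return {block_id: replicas_to_create} for under-replicated blocks.

    last_heartbeat: datanode name -> time of its last heartbeat (seconds)
    block_map:      block id -> set of datanodes holding a replica
    """
    dead = {dn for dn, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT}
    under = {}
    for block, holders in block_map.items():
        live = holders - dead
        if len(live) < REPLICATION_FACTOR:
            under[block] = REPLICATION_FACTOR - len(live)
    return under
```

Note that the Namenode never contacts the dead node to confirm anything; silence past the timeout is itself the failure signal, which is what makes the scheme robust to network partitions as well as crashes.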

Integrity checking of file blocks: HDFS records a checksum for every block of each newly created file. When the file is later retrieved, a block fetched from a node is first checked against its checksum, and if they do not match, a copy of the block is obtained from another Datanode.
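The checksum check can be illustrated like this. Real HDFS stores CRC checksums per chunk of a block in separate metadata files; this sketch uses a single whole-block CRC32 for simplicity, and the function names are made up:

```python
import zlib

def store_block(data):
    # Record the checksum at write time, alongside the data.
    return {"data": data, "checksum": zlib.crc32(data)}

def read_block(stored, fallback_replicas):
    """Verify the block's checksum; on mismatch, try other replicas."""
    if zlib.crc32(stored["data"]) == stored["checksum"]:
        return stored["data"]
    for replica in fallback_replicas:
        if zlib.crc32(replica["data"]) == replica["checksum"]:
            return replica["data"]
    raise IOError("all replicas failed checksum verification")
```

This is where replication and integrity checking reinforce each other: a checksum only detects corruption, but the extra replicas are what make recovery possible.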

Cluster load balancing: node failures or additions may lead to an uneven data distribution. When the free space on a Datanode exceeds a critical threshold, HDFS can automatically migrate data to it from other Datanodes.
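The balancing criterion can be sketched as follows: a node whose utilization deviates from the cluster-wide average by more than a threshold becomes a candidate source or target for block migration. The 10% threshold mirrors the HDFS balancer's default; the function and variable names are illustrative:

```python
THRESHOLD = 0.10  # allowed deviation from cluster-average utilization

def classify_nodes(used, capacity):
    """Split datanodes into over- and under-utilized groups.

    used/capacity: datanode name -> bytes used / total bytes.
    """
    avg = sum(used.values()) / sum(capacity.values())
    over = [n for n in used if used[n] / capacity[n] > avg + THRESHOLD]
    under = [n for n in used if used[n] / capacity[n] < avg - THRESHOLD]
    return over, under
```

The balancer would then move blocks from the `over` list to the `under` list until every node sits within the threshold band around the average.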

The fsimage and edits log files on the Namenode are the core data structures of HDFS; if these files are corrupted, HDFS stops working. The Namenode can therefore be configured to maintain multiple copies of the FsImage and EditLog. Any change to the FsImage or EditLog is synchronized to all copies, and the Namenode always selects the most recent consistent pair to use. The Namenode is still a single point of failure in HDFS, and manual intervention is necessary if the machine it runs on fails. Deleted files are not immediately removed from the Namenode's namespace: they are moved to the /trash directory, from which they can be restored at any time, and are only permanently removed after the configured retention time has passed.
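Why both files matter can be seen from how the namespace is reconstructed: it is the fsimage snapshot plus a replay of every operation recorded in the edits log since that snapshot. The operation records below are illustrative, not the real EditLog format:

```python
def replay(fsimage, edits):
    """Rebuild the namespace from a snapshot plus logged operations.

    fsimage: iterable of paths present at snapshot time.
    edits:   list of (operation, path) tuples logged since the snapshot.
    """
    namespace = set(fsimage)
    for op, path in edits:
        if op == "create":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
    return namespace
```

Losing either piece loses namespace state, which is why the article stresses keeping multiple synchronized copies of both: the snapshot alone is stale, and the log alone has no starting point.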

As for the fault tolerance of Hadoop itself, Hadoop supports upgrade and rollback: if a bug or incompatibility appears after upgrading the Hadoop software, you can roll back to the old Hadoop version.

Last is security. The security of HDFS is relatively weak: there is only simple file permission control similar to that of a Unix file system, and a future version is expected to implement a Kerberos-based authentication system similar to that of NFS.

Those are all the contents of the article "how to look at the design requirements of distributed file systems from HDFS". Thank you for reading! I believe you now have some understanding of the topic, and I hope the content shared here helps you. If you want to learn more, welcome to follow the industry information channel!
