
What is the design concept of HDFS?


This article explains in detail the design concept of HDFS. The editor finds it very practical and shares it here for your reference; I hope you gain something from reading it.

1. Introduction to HDFS

HDFS (Hadoop Distributed File System) is a core component of the Hadoop ecosystem and its storage layer. It occupies a special position in Hadoop as the most fundamental part, because it handles data storage: MapReduce and other computing models all rely on data stored in HDFS. HDFS is a distributed file system that stores very large files with streaming data access, splitting data into blocks that are stored across different machines in a cluster of commodity hardware. HDFS was originally developed as infrastructure for the Apache Nutch search engine project and is now part of the Apache Hadoop Core project.
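To make the storage layer concrete, here is a minimal sketch of writing and then reading a file through Hadoop's Java FileSystem API. The NameNode URI hdfs://namenode:8020 is an assumed placeholder; in a real deployment fs.defaultFS normally comes from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed placeholder URI

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/hello.txt");

        // Write once: create (overwriting if present), write, close.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello HDFS");
        }

        // Read many times: open a stream and read the data back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
```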

Distributed file systems solve the problem of big-data storage: they are storage systems that span multiple computers. They have broad application prospects in the big-data era, providing the scalability needed to store and process very large-scale data.

2. HDFS design concept

Hardware failure is the norm: an HDFS cluster may consist of hundreds of servers, each of which can fail. Therefore, fault detection and automatic, rapid recovery is a core architectural goal of HDFS. Unlike general-purpose applications, applications on HDFS mainly read data as streams. HDFS is designed for batch processing rather than user interaction, so it emphasizes high throughput of data access over low response time. A typical HDFS file ranges from gigabytes to terabytes, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes and tens of millions of files in a single cluster.
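Because node failure is expected, every block of a file is replicated across DataNodes. A small sketch, assuming a running cluster and a hypothetical file path, of inspecting and raising a file's replication factor through the same FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log");   // hypothetical file

        FileStatus st = fs.getFileStatus(file);
        System.out.println("current replication: " + st.getReplication());
        System.out.println("block size (bytes):  " + st.getBlockSize());

        // Ask the NameNode to keep 3 replicas of every block of this file;
        // re-replication happens asynchronously in the background.
        fs.setReplication(file, (short) 3);
    }
}
```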

Most HDFS applications require a write-once-read-many access model for files. Once a file is created, written, and closed, it need not be modified. This assumption simplifies data-consistency problems and makes high-throughput data access possible.

Moving computation is cheaper than moving data. The closer a computation runs to the data it operates on, the more efficient it is, especially when the data reaches massive scale. Moving the computation near the data is clearly better than moving the data to where the application runs.
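Computation frameworks exploit this by asking the NameNode where each block of a file lives and scheduling tasks on or near those hosts. A minimal sketch of listing block locations (the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/data/events.log")); // hypothetical

        // One BlockLocation per block, listing the DataNodes holding a replica.
        // A scheduler would place tasks on (or near) these hosts.
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
    }
}
```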

Portability across heterogeneous hardware and software platforms promotes the wide adoption of HDFS as a platform for applications that require large datasets.

3. Concept introduction

Here are a few important concepts to introduce:

(1) Very large files. Current Hadoop clusters can store hundreds of terabytes or even petabytes of data.

(2) Streaming data access. The HDFS access pattern is write once, read many times, with emphasis on the overall time to read an entire dataset rather than the latency of any single read.

(3) Commodity hardware. An HDFS cluster does not need expensive, specialized equipment; ordinary, everyday hardware is enough. Precisely because of this, the probability of an HDFS node failing is fairly high, so there must be a mechanism for handling node failures that ensures the reliability of the data.

(4) Low-latency data access is not supported. HDFS targets high data throughput and is not suitable for applications that require low-latency access to data.

(5) Single writer, no arbitrary modification. HDFS data is mostly read; only a single writer is supported at a time, and writes always append at the end of the file. Modifying the file at an arbitrary position is not supported (see the sketch after this list).
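A minimal sketch of that append-only write path (the file path is hypothetical; append has been available and enabled by default in modern Hadoop releases):

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/data/events.log");   // hypothetical file

        // The only supported mutation: append bytes at the end of the file.
        // There is no API to overwrite bytes at an arbitrary offset.
        try (FSDataOutputStream out = fs.append(log)) {
            out.write("one more event\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```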

4. Why do we need HDFS?

1. The amount of data is huge; a single disk can no longer cope with the volume of information we need to store. Therefore, the file system must provide large-scale, distributed data-storage capacity.

2. Reading all the data from one disk takes a long time, and writing takes even longer (write time is typically about three times read time). Whether a dataset is 1 ZB or a comparatively small 10 EB, a single disk cannot serve it on demand. Therefore, the file system must support highly concurrent access.
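To see the scale of the problem, a back-of-the-envelope calculation; the 10 TB dataset size and ~100 MB/s per-disk streaming rate are illustrative assumptions:

```java
public class ScanTime {
    public static void main(String[] args) {
        double dataBytes = 10e12;   // assumed dataset: 10 TB
        double diskRate = 100e6;    // assumed sequential read rate: ~100 MB/s per disk

        double oneDiskHours = dataBytes / diskRate / 3600;
        double hundredDisksMinutes = dataBytes / (diskRate * 100) / 60;

        System.out.printf("1 disk:    %.1f hours%n", oneDiskHours);          // ~27.8 hours
        System.out.printf("100 disks: %.1f minutes%n", hundredDisksMinutes); // ~16.7 minutes
    }
}
```

Spreading blocks over many disks and reading them in parallel is exactly what turns a day-long scan into minutes.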

3. When the size of a dataset exceeds the storage capacity of an independent physical computer, it is necessary to partition it and store it on several separate computers.

4. A distributed file system inevitably increases the complexity of the system, and the network programming it introduces adds further complexity. Therefore, strong fault tolerance is needed.

5. HDFS's answer to the above problems is shard redundancy with local verification: data is stored redundantly as blocks, and block replicas are distributed to the storage servers together with checksums for verification. The redundancy gives an extra guarantee: as long as one replica of a block remains intact, the other replicas can be restored from it through coordinated recovery.

After coordinated verification, the files in the whole system remain intact, whether the fault was a transmission error, an I/O error, or an individual server going down.
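This verification surfaces in the client API as a file-level checksum, while DataNodes also verify block checksums on every read and in background scans. A small sketch (the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Combined checksum over all blocks of the file
        // (for HDFS, an MD5-of-MD5s of the per-chunk CRCs).
        FileChecksum cs = fs.getFileChecksum(new Path("/data/events.log")); // hypothetical
        if (cs != null) {
            System.out.println(cs.getAlgorithmName() + ": " + cs);
        }
    }
}
```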

6. A distributed file system faces an unavoidable problem: a missing block or failed disk delays read operations. This is the main problem HDFS encounters.

At this stage, HDFS is configured and optimized for high data throughput, possibly at the cost of higher latency. Fortunately, HDFS is highly flexible and can be re-tuned for specific applications.

In summary: because multiple servers can serve data at the same time, HDFS achieves load balancing and improves response efficiency.

This concludes the article on "what is the design concept of HDFS". I hope the above content is of some help and lets you learn more. If you think the article is good, please share it for more people to see.
