
How does HDFS work?

Shulou (Shulou.com) 06/01 Report

This article explains in detail how HDFS works. The editor finds it very practical and shares it here for your reference; I hope you get something out of it after reading.

How HDFS works

HDFS enables rapid data transfer between compute nodes. In its early days it was tightly coupled to MapReduce, a programming framework for parallel operations on large datasets.

When HDFS ingests data, it breaks the data into separate blocks and distributes them across different nodes in the cluster, which enables efficient parallel processing.
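The sketch below illustrates that behavior using the standard Java FileSystem client API: the client writes a file and specifies a block size, and HDFS splits anything larger than that size into blocks placed across DataNodes. The path, the 128 MB block size, and the replication factor are illustrative assumptions, not values taken from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // client handle that talks to the NameNode

        Path target = new Path("/data/example/input.log");  // hypothetical destination path
        long blockSize = 128L * 1024 * 1024;                 // assumed 128 MB block size
        short replication = 3;                               // assumed replication factor

        // Data written through this stream is split into blockSize-sized blocks,
        // and each block is placed on DataNodes chosen by the NameNode.
        try (FSDataOutputStream out = fs.create(target, true, 4096, replication, blockSize)) {
            out.writeBytes("sample payload line\n");
        }
        fs.close();
    }
}
```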

In addition, HDFS is designed for high fault tolerance. It replicates each block of data multiple times and distributes the copies across nodes, placing at least one copy on a different server rack. As a result, data on a crashed node can still be found elsewhere in the cluster, so processing can continue while the lost copies are restored.
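As a rough illustration of that replication behavior, the following sketch (again using the standard Java FileSystem API, with an assumed path) reads a file's current replication factor and asks the NameNode to keep three copies of each of its blocks; the actual re-replication then happens asynchronously on the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example/input.log");   // hypothetical existing file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication: " + status.getReplication());

        // Ask the NameNode to keep 3 copies of every block of this file;
        // DataNodes copy the missing replicas in the background.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```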

HDFS uses a master/slave architecture. In its original version, each Hadoop cluster consisted of a single NameNode that managed file system operations and supporting DataNodes that managed data storage on individual compute nodes. Together, these HDFS elements support applications with large datasets.
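The division of labor between the NameNode and the DataNodes is visible from the client API: the client asks the NameNode for metadata, here the locations of a file's blocks, and then streams the blocks directly from the listed DataNodes. A minimal sketch, with an assumed path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/example/input.log"));

        // Block metadata comes from the NameNode; the file data itself never passes through it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```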

This master-node, "data chunking" architecture draws some of its design guidance from the Google File System (GFS) and IBM's General Parallel File System (GPFS). GFS is a scalable distributed file system for large, distributed applications that access large amounts of data; it runs on inexpensive commodity hardware, provides fault tolerance, and delivers high overall performance to a large number of users. GPFS is a high-performance, scalable parallel file system designed specifically for cluster environments; it gives multiple nodes in a cluster fast access to files in a shared file system and provides stable fault-recovery and fault-tolerance mechanisms. In addition, although HDFS does not comply with the Portable Operating System Interface (POSIX) model, it echoes the POSIX design style in some respects.

[Figure] HDFS architecture diagram: applications interact with the NameNode and DataNodes through a client.

Why use HDFS

HDFS was originally created by Yahoo to meet part of the needs of the company's advertising services and search engine. Like other Web-oriented companies, Yahoo found that more and more users were accessing its applications, and those users were generating more and more data. Later, companies such as Facebook, eBay and Twitter also began using HDFS as the basis for big data analysis to address the same needs.

But HDFS is used for much more than that. The large-scale Web search mentioned above can be classified as data-intensive parallel computing. HDFS is also often used in compute-intensive parallel computing scenarios, such as meteorological computation, and is widely used in mixed data-intensive and compute-intensive scenarios, such as 3D modeling and rendering. In addition, HDFS sits at the core of many open source data warehouses (sometimes called data lakes).

HDFS is usually chosen for large-scale deployments because of an important feature: it can run on ordinary, inexpensive machines. Systems that run Web search and related applications often need to scale to hundreds of petabytes and thousands of nodes, so the system must be scalable, which is exactly what HDFS provides. In addition, server failures are common at this scale, and the fault tolerance HDFS offers is of practical value here.

Scenarios where HDFS does not apply

First of all, HDFS is not suitable for scenarios with strict latency requirements, such as real-time queries; low latency is not where HDFS has an advantage. Second, HDFS has difficulty supporting the storage of large numbers of small files. In Hadoop systems, "small files" are usually defined as files much smaller than HDFS's block size (64 MB by default in early versions, 128 MB in later versions). Because each file carries its own metadata, which Hadoop keeps in the NameNode, too many small files can consume a large amount of NameNode memory and cause seek time to exceed read time, creating a performance bottleneck for the system.
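To see why small files strain the NameNode, a back-of-the-envelope estimate helps. A commonly cited rule of thumb is that each file, directory, or block entry costs roughly 150 bytes of NameNode heap; the file count below is an assumption chosen only to illustrate the scale, not a measurement.

```java
public class SmallFilesEstimate {
    public static void main(String[] args) {
        long bytesPerObject = 150;       // commonly cited rule of thumb for NameNode heap per namespace object
        long files = 10_000_000;         // ten million small files (assumed for illustration)
        long blocksPerFile = 1;          // each small file still occupies at least one block entry

        long objects = files * (1 + blocksPerFile);            // one file object + one block object per file
        double heapGb = objects * bytesPerObject / 1e9;
        System.out.printf("~%.1f GB of NameNode heap for %,d small files%n", heapGb, files);

        // The same data packed into a few large, block-sized files would need far
        // fewer namespace objects, which is why HDFS favors large files.
    }
}
```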

In addition, HDFS does not support multiple concurrent writers, nor can files be modified at arbitrary positions; only appends to the end of a file are supported. HDFS is suited to storing semi-structured and unstructured data; if the data is strictly structured, forcing it into HDFS is not appropriate. Finally, HDFS is suited to TB- and PB-scale data processing, where the number of files usually exceeds one million; if the amount of data is very small, there is no need to use HDFS.
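The append-only model is reflected directly in the client API: there is no call to seek into an existing file and overwrite it, but a client can open a file for appending, as in the sketch below (assuming the cluster permits appends and the file already exists; the path is hypothetical).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/data/example/events.log");   // hypothetical existing file

        // append() returns a stream positioned at the current end of the file;
        // in-place edits and multiple concurrent writers are not supported.
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("new event record\n");
        }
        fs.close();
    }
}
```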

History of HDFS and Hadoop

Here are some key milestones. In 2006, the Apache Hadoop project was officially launched, and HDFS and MapReduce began to be developed as independent components. The software has since been widely used in big data analysis projects across many industries. In 2012, Hadoop version 1.0, including HDFS, was released.

The general-purpose YARN resource manager was added in Hadoop 2.0 in 2013, effectively decoupling MapReduce and HDFS. Since then, Hadoop has supported a variety of data processing frameworks and file systems. Although MapReduce is often replaced by Apache Spark, HDFS remains a popular file system for Hadoop.

After four alpha versions and one beta, Apache Hadoop 3.0.0 became generally available in December 2017, with HDFS enhancements including support for additional NameNodes, erasure coding tools, and better data compression. At the same time, advances in HDFS tooling, such as LinkedIn's open source Dr. Elephant and the Dynamometer performance testing tool, have supported further development and adoption of HDFS.

This concludes the article on "how HDFS works". I hope the content above has been helpful and that you have learned something new. If you found the article useful, please share it so more people can see it.
