Introduction to HDFS 07/04 Update SLTechnology News&Howtos

Introduction to HDFS

2025-07-04 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/02 Report--

Hadoop Distributed File System

HDFS Hadoop Distributed Filesystem

Distributed file system: When the data set is large enough to exceed the storage capacity of a single computer, it needs to be stored on several separate computers. A file system that manages storage across multiple computers in a network is called a distributed file system.

Distributed file system complexity: node failures need to be considered without losing any data.

HDFS stores very large files in streaming data access mode and runs on commodity hardware clusters.

1. Large files: terabytes or even petabytes.

2. Stream data access: Write once, read many is the most efficient access mode. Various types of analyses are performed on the dataset over time, and each analysis involves most or even all of the dataset.

3. Commercial hardware: no need for expensive and highly reliable hardware, commercial hardware, high failure rate, but undetected by users.

Not suitable for running on HDFS:

1. Low latency data access: High data throughput at the expense of high time latency.

2. Large number of small files: namenode stores file system metadata in memory, and the total number of files that can be stored is limited by the total memory of namenode.

3. Multi-user write, arbitrary file modification: a writer, write operations always add data to the end of the file.

Data block:

HDFS data block, default 64MB.

HDFS is faster than disk blocks in order to minimize addressing overhead. If the block is set large enough, the time to transfer data from disk can be significantly longer than the time required to locate the beginning of the block. Thus, the time to transfer a file consisting of multiple blocks depends on the disk transfer rate.

The abstract benefits of the block concept:

1. The size of a file can be larger than the capacity of any disk on the network. All blocks of a file need not be stored on the same disk.

2. Simplify storage management by using blocks instead of files as storage units.

3. Blocks are ideal for data backup to provide data fault tolerance and availability.

namenode and datanode

HDFS clusters have two types of nodes and operate in a manager-worker mode, namely a namenode (manager) and multiple datanodes (workers).

namenode manages the file system namespace, maintaining the file system tree and all files and directories within the tree.

Datanodes are worker nodes of a file system that store and retrieve data blocks as needed (scheduled by clients or namenode) and periodically send namenode lists of the blocks they store.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.