HDFS: the Block Concept

1.1 Block

1.1.1. The basic unit of storage in HDFS (Hadoop Distributed File System) is the data block, which defaults to 64 MB.

1.1.2. Files in HDFS are divided into block-sized chunks, which are stored as independent units called data blocks.

1.1.3. Unlike an ordinary file system, a file in HDFS that is smaller than a block does not occupy the block's full storage space.
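
For illustration, here is a minimal Java sketch that reads a file's block size through the HDFS client API. The namenode address and file path are hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; replace with your cluster's URI.
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // /user/demo/sample.txt is a hypothetical path.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
        System.out.println("File length: " + status.getLen());
        System.out.println("Block size:  " + status.getBlockSize()); // e.g. 67108864 (64 MB)
        // A 1 KB file still reports the configured block size here,
        // but occupies only about 1 KB of actual storage on the datanodes.
    }
}
```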

Each disk has a default block size, which is the smallest unit of data that can be read from or written to the disk. A file system built on a single disk manages its data in file system blocks, whose size is an integral multiple of the disk block size. File system blocks are typically a few kilobytes, while disk blocks are normally 512 bytes. This detail (the file system block size) is transparent to users who simply read and write files. However, the system still provides tools, such as df and fsck, that maintain the file system by operating on file system blocks.

First, what is a "disk block"? To be clear: 1. The smallest unit of communication between the operating system and the disk is the disk block, which is a virtual concept, meaningful only to the operating system (software). 2. The smallest unit for reading and writing the disk itself is the sector, which is real: it is part of the hardware, an actual physical area. Because we usually deal with the software side rather than the hardware side, we speak of disk blocks rather than sectors.

3. Size of a disk block: block = sector × 2^n. Since the basic unit of disk I/O is the sector, and the disk block sits between the operating system and the disk, the most economical choice is for the block to be an integral multiple of the sector; for example, with 512-byte sectors and n = 3, the block size is 4 KB.

2. Three nodes of HDFS: Namenode, Datanode, Secondary Namenode

Namenode: the HDFS daemon that manages the file system namespace. It records how files are divided into data blocks and which data nodes each block is stored on. Its main function is the centralized management of memory and I/O.

Datanode: the worker node of the file system. Datanodes store and retrieve blocks as needed and periodically send the namenode a list of the blocks they store.

Secondary Namenode: an auxiliary daemon that communicates with the NameNode and periodically saves snapshots of the HDFS metadata.
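
To see the namenode's role as the directory mapping blocks to datanodes, here is a hedged sketch using the public FileSystem API; the address and path below are hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000"); // hypothetical address
        FileSystem fs = FileSystem.get(conf);

        // Ask the namenode which datanodes hold each block of the file.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log")); // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```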

HDFS Federation:

Scaling is achieved by adding namenodes, each of which manages a portion of the file system namespace. Each namenode maintains a namespace volume, consisting of the metadata for its namespace and a block pool containing all the blocks for the files in that namespace.
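
As a rough sketch of what federation looks like in configuration terms (normally set in hdfs-site.xml, shown here through the Java Configuration API; the nameservice IDs and addresses are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;

public class FederationConfigSketch {
    public static Configuration federatedConf() {
        Configuration conf = new Configuration();
        // Two independent namespace volumes, each served by its own namenode.
        // Nameservice IDs and host:port values here are hypothetical.
        conf.set("dfs.nameservices", "ns1,ns2");
        conf.set("dfs.namenode.rpc-address.ns1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "nn2.example.com:8020");
        return conf;
    }
}
```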

High availability of HDFS (HA)

The 2.x releases of Hadoop add support for high availability (HA) in HDFS. In this implementation, a pair of namenodes is configured in an active-standby arrangement. When the active namenode fails, the standby namenode takes over its duties and continues serving client requests without significant interruption.

The implementation of the architecture includes:

The namenodes share their edit log through highly available shared storage.

Datanodes send block reports to both namenodes simultaneously.

Clients use a dedicated mechanism to handle namenode failover, and this mechanism is transparent to users (see the sketch after the failover controller below).

Failover controller: manages the transition from the active namenode to the standby namenode. It is based on ZooKeeper, which ensures that there is one and only one active namenode. Each namenode runs a lightweight failover controller whose job is to monitor its namenode for failure and trigger a failover when one occurs.
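
A minimal configuration sketch of the client side of HA, assuming a logical nameservice named mycluster with two namenodes; all IDs and addresses below are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;

public class HaClientConfigSketch {
    public static Configuration haConf() {
        Configuration conf = new Configuration();
        // Logical nameservice "mycluster" with two namenodes nn1/nn2
        // (all IDs and addresses below are hypothetical).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "master1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "master2.example.com:8020");
        // The client-side mechanism that makes namenode failover transparent:
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // Clients then address the cluster by its logical name, not a single host:
        conf.set("fs.default.name", "hdfs://mycluster");
        return conf;
    }
}
```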

3. Command line interface

Two property items: fs.default.name sets the default file system for Hadoop; giving it an hdfs:// URL configures HDFS as the default file system. dfs.replication sets the number of replicas kept for each file system block.
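
A short, hedged example of setting these two properties programmatically and writing a file; the namenode address and path are hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.default.name: an hdfs:// URL makes HDFS the default file system.
        conf.set("fs.default.name", "hdfs://localhost:9000"); // hypothetical address
        // dfs.replication: number of copies kept for each block.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) { // hypothetical path
            out.writeUTF("hello hdfs");
        }
    }
}
```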

4. Hadoop file system

Hadoop has an abstract notion of a file system, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem defines the file system interface in Hadoop; the concrete implementation of this abstract class for HDFS is org.apache.hadoop.hdfs.DistributedFileSystem.
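
A small sketch of the abstraction at work: code is written against FileSystem, while the runtime instance is the HDFS implementation (the address is a hypothetical placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class AbstractFsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000"); // hypothetical address

        // Code is written against the abstract FileSystem class...
        FileSystem fs = FileSystem.get(conf);
        // ...but at runtime the instance is the HDFS implementation.
        System.out.println(fs instanceof DistributedFileSystem); // true
        System.out.println(fs.getClass().getName()); // org.apache.hadoop.hdfs.DistributedFileSystem
    }
}
```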

Reference: https://www.cnblogs.com/caiyisen/p/7395843.html

Relationship between HDFS and the Linux file system:

Each disk has a default block size, which is the smallest unit of data that can be read from or written to the disk. A file system built on a single disk (such as a Linux file system) manages its data through disk blocks, and the file system block size is an integral multiple of the disk block size.

HDFS also has the concept of blocks. A block in the HDFS file system corresponds to a file in the Linux file system, and a distributed file is composed of multiple such Linux files (blocks). The smallest unit, the block, is stored as a single Linux file and defaults to 64 MB.

From this it can be seen that the HDFS file system does not manage disks directly.
