I. Basic concepts of HDFS
HDFS stands for Hadoop Distributed File System. It is designed for streaming access to large files and works well for files of hundreds of MB, GB, or TB that are written once and read many times. It is not well suited to low-latency data access, large numbers of small files, concurrent writers, or arbitrary file modification.
Besides Java, HDFS currently supports access via Thrift, C, FUSE, WebDAV, HTTP, and so on. HDFS stores file content in fixed-size blocks; the default block size is 64MB (in Hadoop 1.x; Hadoop 2.x and later use 128MB). A file smaller than one block still occupies a block in the namespace, but it does not consume a full 64MB on the underlying disks, so HDFS can be seen as a layer on top of the nodes' native file systems. The block size is made this large because, for large files, it makes the data transfer time far longer than the time spent locating a block, thereby minimizing the proportion of seek time in the total time needed to read a file.
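As a hedged illustration of how the block size appears in the client API, the sketch below prints the length and block size of an existing HDFS file; the path demo.txt is only an example and is assumed to exist.
Java code (illustrative sketch)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("demo.txt");           // example path, assumed to exist in HDFS
        FileStatus status = fs.getFileStatus(file);
        System.out.println("File length: " + status.getLen() + " bytes");
        System.out.println("Block size:  " + status.getBlockSize() + " bytes");
        fs.close();
    }
}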
II. Design principles of HDFS
HDFS is an open-source implementation of Google's GFS (Google File System). It has the following five basic design goals:
1. Hardware failure is the norm, not the exception. HDFS usually runs on commodity hardware, so component failures are expected; error detection and fast, automatic recovery are therefore core design goals of HDFS.
2. Streaming data access. Applications running on HDFS are mainly batch jobs rather than interactive, transactional workloads, and most of their access consists of streaming reads.
3. Large data sets. Typical files in HDFS are gigabytes to terabytes in size.
4. A simple consistency model. HDFS applications generally follow a write-once, read-many access pattern: once a file has been created, written, and closed, its contents generally do not change again. This simple consistency model makes high-throughput data access possible.
5. Moving computation to the data. HDFS provides interfaces that let applications move their own code to the nodes where the data resides, because moving computation is cheaper than moving the large files HDFS stores. This improves bandwidth utilization, increases system throughput, and reduces network congestion.
III. The architecture of HDFS
HDFS consists mainly of a Namenode (master) and a number of Datanodes (workers).
The Namenode manages the HDFS directory tree and the metadata of its files. This information is persisted on local disk as the namespace image (fsimage) and edit log files, from which the Namenode's in-memory metadata is rebuilt each time HDFS is restarted.
Datanodes store and serve the actual file contents, and they periodically report the list of blocks they hold to the Namenode.
Because the Namenode is the node that stores the metadata, HDFS cannot work properly if the Namenode goes down. The metadata is therefore usually persisted on a local or remote machine, or a secondary namenode is used to periodically synchronize the Namenode's metadata. The secondary namenode is somewhat like the Slave in MySQL's Master/Slaves replication, and the edit log is similar to the bin log. If the Namenode fails, the persisted metadata of the original Namenode is typically copied to the secondary namenode, which is then started as the new Namenode.
HDFS therefore has a master/slave structure, as shown in the figure:
IV. HDFS Reliability guarantee measures
One of the main design goals of HDFS is to ensure the reliability of data storage in case of failure.
HDFS has a complete redundant backup and fault recovery mechanism. Generally, the number of backup copies is set through dfs.replication. The default is 3.
1. Redundant backup. Data is written to multiple DataNode nodes, so that when some of those nodes go down the data can still be read from other nodes and copied onto additional nodes until the number of replicas is back to the configured value. The number of replicas is set by dfs.replication (a hedged Java sketch for setting it follows this list).
2. Replica placement. HDFS uses a rack-aware strategy to improve data reliability, availability, and network bandwidth utilization. With a replication factor of 3, the replica placement strategy described here is: the first copy goes on the node where the client runs (when the client runs inside the cluster) or on a random node (when the client runs outside the cluster); the second copy goes on another node in the local rack; the third copy goes on a node in a different rack. This strategy prevents data loss when an entire rack fails, while still exploiting the high bandwidth available within a rack.
3. Heartbeat detection. The NameNode periodically receives heartbeats and block reports from every DataNode in the cluster and uses these reports to validate the block mapping and other file-system metadata. If the NameNode stops receiving heartbeats from a DataNode, it marks that DataNode as down and no longer sends it any IO requests; the loss of the DataNode may also trigger re-replication of its data. Re-replication can be caused by several things: a DataNode becoming unavailable, a corrupted replica, a disk error on a DataNode, or an increase in the replication factor.
4. Safe mode. When HDFS starts up, it first enters safe mode, during which no block writes are allowed. The NameNode receives block reports from the DataNodes and checks how many blocks have reached their minimum number of replicas; only when the proportion of such blocks exceeds the threshold does it automatically leave safe mode and then replicate any blocks that are still under-replicated. The threshold is dfs.safemode.threshold.pct (default 0.999f), so when more than 1 - 0.999f of the blocks lose their minimum replication (for example because DataNodes are lost), HDFS enters safe mode.
5. Data integrity checking. HDFS verifies file contents with checksums (CRC cyclic redundancy codes): when a data file is written, the checksum of each data block is also written to a hidden checksum file. When a client reads the file, it compares the checksum of each data block obtained from a DataNode with the checksum stored in the hidden file; if they do not match, the client treats the block as corrupted, fetches it from another DataNode, and reports the corrupted block on that DataNode to the NameNode.
6. Recycle bin. Files deleted from HDFS are first moved to a trash folder (/trash) so that the data can be recovered. Once a deleted file has stayed in the trash longer than the configured time threshold (the default is 6 hours), HDFS removes its data blocks permanently.
7. Image file and transaction log. The namespace image and the edit log are the core data structures in HDFS.
8. Snapshot.
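As a hedged illustration of point 1, the sketch below sets dfs.replication in the client configuration and changes the replication factor of an existing file through the FileSystem API; the path demo.txt and the factor of 2 are example values only.
Java code (illustrative sketch)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);      // replication used for files created by this client (example value)
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("demo.txt");       // example path, assumed to exist
        fs.setReplication(file, (short) 2);     // ask the NameNode to adjust this file to 2 replicas
        System.out.println("Current replication: " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}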
V. Data storage operations
1. Data storage: block
The default block size is 128MB in Hadoop 2.x and later (64MB in earlier versions) and is configurable. A file smaller than one block is stored as a single block.
Why are the blocks so large? So that the data transfer time dominates the seek time, which gives high throughput.
How is a file stored? It is split into blocks of this size and stored on different nodes, with three copies of each block by default (see the sketch at the end of this section).
2. Data storage: staging
When the HDFS client uploads data to HDFS, it first caches the data locally; when the cached data reaches one block in size, the client asks the NameNode to allocate a block. The NameNode tells the HDFS client the addresses of the DataNodes that will hold the block, and the client then communicates directly with those DataNodes and writes the data into a block file on them.
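As a hedged illustration of the block size and replication parameters discussed above, the sketch below creates a file with an explicit block size and replication factor through an overload of FileSystem.create; the path, the 64MB block size, and the replication factor of 3 are example values only.
Java code (illustrative sketch)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("demo-blocks.txt");                  // example path
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = 3;                                    // example replication factor
        long blockSize = 64L * 1024 * 1024;                       // example block size: 64MB

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
        out.writeUTF("data written with an explicit block size and replication factor");
        out.close();
        fs.close();
    }
}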
VI. Writing process
1. The client initializes the FileSystem and calls create() to create the file.
2. FileSystem makes an RPC call to the metadata node to create a new file in the file system's namespace. The metadata node first checks that the file does not already exist and that the client has permission to create it, and then creates the new file.
3. FileSystem returns a DFSOutputStream and the client starts writing data.
4. DFSOutputStream splits the data into blocks and writes them to a data queue. The data queue is read by the Data Streamer, which asks the metadata node to allocate data nodes for the blocks (3 replicas by default). The allocated data nodes form a pipeline: the Data Streamer writes each block to the first data node in the pipeline, the first data node forwards it to the second, and the second forwards it to the third.
5. DFSOutputStream also keeps an ack queue for the blocks it has sent, waiting for the data nodes in the pipeline to acknowledge that the data has been written successfully.
6. When the client finishes writing data, it calls the stream's close function. This writes all remaining data blocks to the data nodes in the pipeline, waits for the ack queue to report success, and finally notifies the metadata node that the write is complete.
7. If a data node fails while data is being written, the pipeline is closed and the data blocks in the ack queue are put back at the head of the data queue. The current block is given a new identifier by the metadata node on the data nodes that have already written it, so that the failed node will recognize its copy as outdated and delete it after it restarts. The failed data node is removed from the pipeline, and the rest of the data block is written to the other two data nodes in the pipeline. The metadata node notices that the block is under-replicated and arranges for a third replica to be created later.
8. In summary, if a datanode fails during the write, the following steps are taken:
1) the pipeline is closed;
2) to prevent packet loss, the packets in the ack queue are moved back to the data queue;
3) the incomplete block being written on the failed datanode is deleted;
4) the rest of the block is written to the remaining two healthy datanodes;
5) the namenode later finds another datanode on which to create another replica of this block. All of these operations are transparent to the client.
Java code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// write a small file to HDFS through the FileSystem API
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("demo.txt");
FSDataOutputStream outStream = fs.create(file);
outStream.writeUTF("Welcome to the HDFS Java API!");
outStream.close();
Write process diagram:
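A common variation, shown as a hedged sketch below, streams an existing local file into HDFS with IOUtils.copyBytes instead of writing strings directly; the local path /tmp/local.txt and the HDFS path uploaded.txt are example values only.
Java code (illustrative sketch)
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CopyLocalFileToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        InputStream in = new FileInputStream("/tmp/local.txt");        // example local file
        FSDataOutputStream out = fs.create(new Path("uploaded.txt"));  // example HDFS path

        // copy the stream in 4KB chunks and close both streams when done
        IOUtils.copyBytes(in, out, 4096, true);
        fs.close();
    }
}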
VII. Reading process
1. The client initializes the FileSystem and opens the file with FileSystem's open() function.
2. FileSystem makes an RPC call to the metadata node to get the block information of the file; for each data block, the metadata node returns the addresses of the data nodes that hold it.
3. FileSystem returns an FSDataInputStream to the client, and the client calls the stream's read() function to start reading data.
4. DFSInputStream connects to the nearest data node holding the first block of the file, and the data is read from that data node to the client.
5. When the block has been read completely, DFSInputStream closes the connection to that data node and then connects to the nearest data node holding the next block of the file.
6. When the client has finished reading, it calls the close function of FSDataInputStream.
7. If the client encounters an error while communicating with a data node, it tries the next data node that holds the block.
8. Failed data nodes are remembered and are not contacted again for later blocks.
Java code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// read the file written above back from HDFS
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("demo.txt");
FSDataInputStream inStream = fs.open(file);
String data = inStream.readUTF();
System.out.println(data);
inStream.close();
Read process diagram:
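As a supplement to the read walkthrough, the hedged sketch below asks the NameNode for a file's block locations, which is the same block-to-DataNode mapping that step 2 above relies on; the path demo.txt is only an example and is assumed to exist.
Java code (illustrative sketch)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("demo.txt");   // example path, assumed to exist
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // print each block's offset, length and the DataNodes holding a replica
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}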