This article mainly documents my own learning process. It is largely a translation of the official text together with my understanding of it; in some places I have adjusted the wording to make it easier to follow.
Official original text: the Apache Hadoop "HDFS Architecture" guide.
HDFS has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications with large datasets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
Hardware failure
First, be clear about this: hardware failure is the norm, not the exception. Detecting faults and recovering from them quickly and automatically is a core architectural goal of HDFS.
Streaming data access
Applications running on HDFS need streaming access to their datasets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed by applications targeted for HDFS.
Large datasets
Applications running on HDFS have large datasets. A typical file in HDFS is gigabytes to terabytes in size. Therefore, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
Simple consistency model
HDFS applications need a write-once-read-many access model for files. Once a file has been created, written, and closed, it does not need to be changed except for appends and truncates. Content can be appended to the end of a file, but it cannot be updated at an arbitrary point. This assumption simplifies data consistency issues and enables high-throughput data access. MapReduce applications and web crawler applications fit this model well.
Moving computation is cheaper than moving data
If the calculation requested by the application is performed near the data it operates on, it will be much more efficient. This is especially true when the size of the dataset is large. This will minimize network congestion and improve the overall throughput of the system. The assumption here is that moving the calculation closer to the data is usually better than moving the data to the place where the application runs. HDFS provides an interface for applications to move themselves closer to where the data is located.
Portability across heterogeneous hardware and software platforms
HDFS is designed to be easily portable from one platform to another. This facilitates the widespread adoption of HDFS as a platform of choice for a large set of applications.
NameNode and DataNodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode (plus an optional secondary NameNode), a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per physical node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. DataNodes are responsible for serving read and write requests from the file system's clients. DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
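To see this master/slave layout on a running cluster, the FS shell can ask the NameNode for its view of the DataNodes; the commands below are only a sketch and assume a configured HDFS client:
# summary of the cluster: capacity, number of live/dead DataNodes, per-node details
hdfs dfsadmin -report
# print the NameNode host(s) this client is configured to talk to
hdfs getconf -namenodes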
The existence of a single NameNode in the cluster greatly simplifies the system architecture. NameNode is the arbiter and repository of all HDFS metadata. The system is designed in such a way that user data itself never flows through NameNode.
The File System Namespace
HDFS supports traditional hierarchical file organization. Users or applications can create directories and store files in these directories. The file system namespace hierarchy is similar to most existing file systems; you can create and delete files, move files from one directory to another, or rename files. HDFS supports user quotas and access rights. HDFS does not support hard or soft links. However, the HDFS architecture does not preclude implementing these features.
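A minimal FS shell session illustrating these namespace operations (the paths are hypothetical):
hdfs dfs -mkdir -p /user/alice/input                                       # create a directory
hdfs dfs -put localfile.txt /user/alice/input/                             # store a local file in HDFS
hdfs dfs -mv /user/alice/input/localfile.txt /user/alice/input/data.txt    # rename the file
hdfs dfs -ls /user/alice/input                                             # list the directory
hdfs dfs -rm /user/alice/input/data.txt                                    # delete the file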
NameNode maintains file system namespaces. Any changes to the file system namespace or its properties are recorded by NameNode. The application can specify the number of file copies that HDFS should maintain. The number of copies of a file is called the replication factor of the file. This information is stored by NameNode.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; the blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.
All blocks in a file are the same size except the last one. After support for variable-length blocks was added to append and hsync, users can start a new block without having to fill the last block up to the configured block size.
An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.
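For example, the replication factor can be set when a file is written and changed afterwards with the FS shell; the path and values below are only illustrative:
# write a file with a replication factor of 2 for this operation only
hdfs dfs -D dfs.replication=2 -put big.log /data/big.log
# later raise the replication factor to 3 and wait until replication completes
hdfs dfs -setrep -w 3 /data/big.log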
NameNode makes all decisions about replicated blocks. It periodically receives heartbeats and block reports from each data node in the cluster. Receiving a heartbeat means that the DataNode is working properly. The block report contains a list of all blocks on the DataNode.
Replica Placement: The First Baby Steps
The placement of replicas is critical to HDFS reliability and performance. Optimized replica placement distinguishes HDFS from most other distributed file systems.
NameNode uses the process outlined in Hadoop rack awareness to determine the rack id to which each DataNode belongs. A simple but not optimal strategy is to put copies on each unique rack. This prevents data loss in the event of an overall rack failure and allows the bandwidth of multiple racks to be used when reading data. This strategy distributes replicas evenly across the cluster, which makes it easy to balance the load in the event of a component failure. However, this strategy increases the cost of writing because writing requires the transfer of blocks to multiple racks.
In general, when the replication factor is 3, the placement strategy of HDFS is:
put one replica on the local machine if the writer is on a DataNode (otherwise on a randomly chosen DataNode), put another replica on a node in a different (remote) rack, and put the last replica on a different node in the same remote rack as the second.
This strategy reduces inter-rack write traffic, which generally improves write performance. The probability of a whole rack failing is far smaller than that of a single node failing, so this strategy does not affect the guarantees of data reliability and availability. However, it does reduce the aggregate network bandwidth used when reading data, because a block is placed on two racks rather than three. With this strategy, the replicas of a file are not evenly distributed across the racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the other third are evenly distributed across the remaining racks. This strategy improves write performance without compromising data reliability or read performance.
If the replication factor is greater than 3, the placement of the 4th and subsequent replicas is determined randomly, while keeping the number of replicas per rack below an upper limit (which is basically (replicas - 1) / racks + 2).
Because the NameNode does not allow a DataNode to hold multiple replicas of the same block, the maximum number of replicas a block can have is the total number of DataNodes at that time.
After support for Storage Types and Storage Policies was added to HDFS, the NameNode takes these policies into account in addition to the rack awareness described above. The NameNode first chooses nodes based on rack awareness and then checks whether the candidate node has the storage required by the policy associated with the file. If the candidate node does not have the required storage type, the NameNode looks for another node. If enough nodes to place the replicas cannot be found in the first pass, the NameNode looks for nodes offering fallback storage types in a second pass.
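To see where the replicas of a particular file actually ended up, across which racks and DataNodes, the fsck tool can be used; the path below is only an example:
hdfs fsck /data/big.log -files -blocks -racks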
Replica Selection
In short: reads are served from the nearest replica.
To minimize global bandwidth consumption and read latency, HDFS attempts to satisfy read requests from the replica closest to the reader. If a copy exists on the same rack as the reader node, the copy is preferred to satisfy the read request. If the HDFS cluster spans multiple data centers, the replica that resides in the local data center rather than any remote replica is preferred.
Safemode
At startup, the NameNode enters a special state called Safemode. While the NameNode is in Safemode, no block replication takes place. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the NameNode has detected that it has reached its specified minimum number of replicas. After a configurable percentage of safely replicated blocks have checked in with the NameNode (plus an additional 30 seconds), the NameNode exits Safemode. It then determines the list of blocks (if any) that still have fewer than the specified number of replicas and replicates those blocks to other DataNodes.
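Safemode can be inspected and, when necessary, controlled manually from the command line; the commands below assume administrative access to the cluster:
hdfs dfsadmin -safemode get     # report whether the NameNode is currently in Safemode
hdfs dfsadmin -safemode enter   # put the NameNode into Safemode manually
hdfs dfsadmin -safemode leave   # force the NameNode out of Safemode
hdfs dfsadmin -safemode wait    # block until the NameNode leaves Safemode on its own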
The Persistence of File System Metadata
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to the file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is also stored as a file in the NameNode's local file system.
The NameNode keeps an image of the entire file system namespace and file block map in memory. When the NameNode starts up, or when a checkpoint is triggered by a configurable threshold, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes this new version out to a new FsImage on disk. It can then truncate the old EditLog, because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. The purpose of a checkpoint is to ensure that HDFS has a consistent view of the file system metadata by taking a snapshot of that metadata and saving it to the FsImage. Even though it is efficient to read a FsImage, it is not efficient to make incremental edits directly to a FsImage. Instead of modifying the FsImage for each edit, the edits are persisted in the EditLog; during a checkpoint, the changes from the EditLog are applied to the FsImage. A checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period, expressed in seconds), or after a given number of accumulated file system transactions (dfs.namenode.checkpoint.txns). If both of these properties are set, the first threshold to be reached triggers a checkpoint.
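The two checkpoint thresholds can be read back from the active configuration, and an administrator can also force the NameNode to persist a fresh FsImage; the sequence below is only a sketch and assumes administrative access:
hdfs getconf -confKey dfs.namenode.checkpoint.period   # checkpoint interval in seconds
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # transaction-count threshold
# force the NameNode to write the namespace to a new FsImage (requires Safemode)
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave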
A DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge of HDFS files; it stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. Creating all local files in the same directory is not optimal, because the local file system might not efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans its local file system, generates a list of all HDFS blocks that correspond to these local files, and sends the report to the NameNode. This report is called the Blockreport.
The Communication Protocols
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode and talks to the NameNode using the ClientProtocol. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.
Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures, and network partitions.
Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of Heartbeat messages. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The need for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
The timeout for marking DataNodes dead is conservatively long (more than 10 minutes by default) in order to avoid replication storms caused by DataNode state flapping. For performance-sensitive workloads, users can configure a shorter interval to mark DataNodes as stale and avoid reading from and/or writing to stale nodes.
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.
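Note, however, that Hadoop does ship with a manually run balancer tool that an administrator can use to even out disk usage across DataNodes; it is separate from the automatic schemes described above:
# run the balancer until no DataNode's utilization deviates from the
# cluster average by more than 10 percentage points
hdfs balancer -threshold 10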
Data Integrity
Blocks fetched from a DataNode may arrive corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, the client can opt to retrieve that block from another DataNode that has a replica of the block.
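A checksum for a whole file can also be queried from the FS shell (the path is illustrative):
hdfs dfs -checksum /data/big.log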
Block files and their checksum (.meta) files as stored on a DataNode:
[root@datanode04] # ll -h /data/hadoop/tmp/dfs/data/current/BP-855898234-106.66.38.101-1483934264653/current/finalized/subdir116/subdir82
total 835M
-rw-rw-r-- 1 hadoop hadoop  50M May  3 05:01 blk_1114919483
-rw-rw-r-- 1 hadoop hadoop 393K May  3 05:01 blk_1114919483_41260856.meta
-rw-rw-r-- 1 hadoop hadoop  49M May  3 05:01 blk_1114919485
-rw-rw-r-- 1 hadoop hadoop 392K May  3 05:01 blk_1114919485_41260858.meta
...
Metadata Disk Failure
FsImage and EditLog are the central data structures of HDFS. Corruption of these files may cause HDFS instances to become unavailable. Therefore, NameNode can be configured to support the maintenance of multiple copies of FsImage and EditLog. Any update to FsImage or EditLog causes each FsImage and EditLog to be updated synchronously. Synchronous updates of multiple copies of FsImage and EditLog may slow down the speed of namespace transactions that NameNode can support per second. However, this loss is acceptable because even though HDFS applications are data-intensive in nature, they are not metadata-intensive. When NameNode restarts, it selects the latest consistent FsImage and EditLog to use.
Another option to increase resilience against failures is to enable High Availability using multiple NameNodes, either with shared storage on NFS or with a distributed edit log (called the Journal). The latter is the recommended approach.
Data Organization
Data Blocks
HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large datasets. These applications write their data only once, but they read it one or more times and require these reads to be served at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 128 MB; thus, an HDFS file is chopped up into 128 MB chunks, and, if possible, each chunk resides on a different DataNode.
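The effective block size comes from the cluster configuration and can also be overridden per write; the commands below assume a configured client and use a hypothetical path:
hdfs getconf -confKey dfs.blocksize          # default block size in bytes (134217728 = 128 MB)
# write one file with a 256 MB block size instead of the default
hdfs dfs -D dfs.blocksize=268435456 -put big.log /data/big.log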
Replication Pipelining
When a client is writing data to an HDFS file with a replication factor of three, the NameNode retrieves a list of target DataNodes using a replication target choosing algorithm. This list contains the DataNodes that will host a replica of that block. The client then writes to the first DataNode. The first DataNode starts receiving the data in portions, writes each portion to its local repository (its local file system), and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the block, writes that portion to its repository, and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline while forwarding data to the next one in the pipeline. In this way, the data is pipelined from one DataNode to the next.
Accessibility
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a file system Java API for applications to use. A C language wrapper for this Java API and a REST API are also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. By using the NFS gateway, HDFS can be mounted as part of the client's local file system.
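For example, the REST API (WebHDFS) can be exercised with plain HTTP; the host name, port, and path below are placeholders (9870 is the usual NameNode HTTP port in recent Hadoop versions):
# list the status of a directory through WebHDFS
curl -i "http://namenode.example.com:9870/webhdfs/v1/user/alice?op=LISTSTATUS"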
Space Reclamation
File Deletes and Undeletes
Files deleted by the FS Shell are not immediately removed from HDFS. Instead, HDFS moves them to a trash directory (each user has their own trash directory under /user/<username>/.Trash). As long as a file remains in the trash, it can be restored quickly. Most recently deleted files are moved to the current trash directory (/user/<username>/.Trash/Current), and at a configurable interval HDFS creates checkpoints (under /user/<username>/.Trash/<date>) of the files in the current trash directory and deletes old checkpoints when they expire. See the expunge command of the FS shell for trash checkpointing. After its life in the trash expires, the NameNode deletes the file from the HDFS namespace. Deleting a file causes the blocks associated with the file to be freed. Note that there can be an appreciable time delay between the time a user deletes a file and the time the corresponding increase in free space appears in HDFS.
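A short illustration of the trash behaviour, assuming trash is enabled and using a hypothetical user and paths:
hdfs dfs -rm /user/alice/old.log                                                  # moved to trash, not deleted
hdfs dfs -ls /user/alice/.Trash/Current/user/alice                                # the deleted file shows up here
hdfs dfs -mv /user/alice/.Trash/Current/user/alice/old.log /user/alice/old.log    # undelete it
hdfs dfs -expunge                                                                 # checkpoint and purge expired trash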
If the trash feature is enabled, a file can still be deleted immediately, bypassing the trash, with the following command:
hadoop fs -rm -r -skipTrash delete/test2
Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks, and the corresponding free space appears in the cluster. Once again, there may be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.