This article explains in detail how an HDFS distributed file system is designed. It is shared here as a practical reference, and I hope you get something out of it after reading.
The design and concept of HDFS
An HDFS cluster is a typical master/slave architecture: the master node is called the NameNode and the slave nodes are called DataNodes. The simplest HDFS cluster consists of one NameNode and multiple DataNodes. The key components of this architecture are as follows:
Block: the data block. HDFS splits stored files into multiple blocks, which serve as independent storage units; the default block size is 128 MB. Splitting files into blocks means a single file can be larger than any single disk in the cluster, and managing storage and replication per block also simplifies the design of the system. The default block size can be changed via the dfs.blocksize property in hdfs-site.xml.
NameNode: the master node of the HDFS cluster. It maintains the file system namespace (the directory tree) and the edit log, and keeps in memory the mapping from each block of a file to the DataNodes that store it.
DataNode: the slave nodes of the HDFS cluster, responsible for storing the actual data. They store and retrieve blocks as requested and periodically report the list of blocks they hold to the NameNode. For reliable storage, HDFS replicates each block on several different DataNodes, 3 by default; this can be changed via the dfs.replication property in hdfs-site.xml (both properties appear in the sketch below). If a block on one DataNode is corrupted, a correct replica can be copied from another DataNode.
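To make the two properties just mentioned concrete, here is a minimal hdfs-site.xml sketch; the values shown are simply the defaults described above, not settings taken from any particular cluster:

<configuration>
  <!-- block size in bytes: 128 MB (the default) -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- number of replicas per block (the default is 3) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>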
These are the core concepts of the basic architecture. Next, we introduce several concepts that sit behind it, in connection with the high availability and scalability of the design.
HDFS Federation: this mainly addresses scalability. As noted above, the NameNode process keeps the mapping between files and block locations in memory, so for a cluster with a very large number of files the NameNode's memory becomes the bottleneck for scaling out; a single NameNode is therefore not enough. The Hadoop 2.x releases introduced HDFS Federation, which lets you add NameNodes to the cluster to scale out horizontally. Each NameNode manages a portion of the namespace and maintains its own namespace volume; the volumes are independent of one another, so the failure of one NameNode does not affect the namespaces maintained by the others.
HDFS HA: this addresses high availability (HDFS High Availability). In this design a pair of active-standby NameNodes is configured; when the active NameNode fails, the standby NameNode takes over its work, transparently to users. Implementing this design requires the following architectural changes:
1. The two NameNodes of the HA pair share the edit log through highly available shared storage, so that the standby NameNode can synchronize its state with the active one and take over after a failure. The QJM (quorum journal manager) was designed to provide a highly available edit log and is recommended for most HDFS clusters. QJM runs as a group of journal nodes, called JournalNodes, usually 3; every edit must be written to a majority of them, so the system can tolerate the loss of any one node.
2. DataNodes must send block reports to both NameNodes, because the block mapping information is held in NameNode memory rather than on disk.
3. Clients must handle NameNode failover in a way that is transparent to users. A configuration sketch illustrating these points follows.
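As a hedged illustration of how such an HA pair is wired together, a typical hdfs-site.xml contains entries like the sketch below. The nameservice id mycluster and all hostnames are placeholders, not values from this article:

<configuration>
  <!-- logical name for the HA pair -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- the two NameNodes that make up the pair -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2:8020</value>
  </property>
  <!-- the QJM JournalNode quorum holding the shared edit log -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://journal1:8485;journal2:8485;journal3:8485/mycluster</value>
  </property>
  <!-- lets clients fail over between the two NameNodes transparently -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>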
Basic operation of HDFS
Command line interface
The command line interface is the simplest and most convenient way to operate HDFS, and HDFS commands are very similar to native Linux commands. You can list all the commands HDFS supports with hadoop fs -help. The most commonly used commands are introduced below:
hadoop fs -put   # upload a local file to HDFS
hadoop fs -ls    # similar to the Linux ls command
hadoop fs -cat   # view HDFS file data
hadoop fs -text  # same as cat, but can also view SequenceFile and zip files
hadoop fs -rm    # delete an HDFS file or directory
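As a quick usage sketch, the following uploads a local file and prints it back; the local file name words is hypothetical, and the HDFS path simply matches the example directory used below:

hadoop fs -put words /hadoop-ex/wordcount/input/
hadoop fs -cat /hadoop-ex/wordcount/input/words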
The commands above are the most commonly used ones; the help documentation also describes command-line options that change what each command outputs. Taking the ls command as an example, let's look at the file information HDFS prints.
hadoop fs -ls /hadoop-ex/wordcount/input
-rw-r--r--   3 root supergroup   32 2019-03-03 01:34 /hadoop-ex/wordcount/input/words
-rw-r--r--   3 root supergroup   28 2019-03-03 01:46 /hadoop-ex/wordcount/input/words2
The output is similar to that of the ls command on Linux. Field 1 shows the file type and permissions; field 2 is the replication factor (3 here); fields 3 and 4 are the owning user and group; field 5 is the file size (0 for directories); fields 6 and 7 are the date and time the file was last modified; and field 8 is the file's path and name. HDFS also has a superuser: the user that started the NameNode.
Java interface
The Java interface is more flexible and powerful than the command line interface, though less convenient to use; typically it is used to read HDFS data inside MapReduce or Spark jobs. This section only gives an example of reading file data from HDFS, implemented with the FileSystem API; readers can consult the API documentation for more specific usage.
package com.cnblogs.duma.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

public class FileSystemEx {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // the URI is the value of fs.defaultFS in core-site.xml
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop0:9000"), conf);
        InputStream in = null;
        try {
            // open the file to read
            in = fs.open(new Path("/hadoop-ex/wordcount/input/words"));
            // copy the input stream to the standard output stream
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // close the input stream
            IOUtils.closeStream(in);
        }
    }
}
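For completeness, here is a short sketch of writing a file through the same FileSystem API. It is not part of the original example; the output path and the hdfs://hadoop0:9000 endpoint simply mirror the read example above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class FileSystemWriteEx {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // same endpoint as the read example; replace with your own fs.defaultFS value
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop0:9000"), conf);
        // create (or overwrite) a file and write one line of text into it
        try (FSDataOutputStream out = fs.create(new Path("/hadoop-ex/wordcount/input/words3"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}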
This concludes the discussion of how an HDFS distributed file system is designed. I hope the content above has been helpful; if you found the article useful, please share it with others.