This article describes the types of nodes in the HDFS architecture, and then walks through the HDFS shell commands, the Java API, and the read and write data flow.
There are two types of nodes in the HDFS architecture: the NameNode, also known as the "metadata node", and the DataNode, also known as the "data node". They play the Master and Worker roles, respectively.
1) The metadata node manages the namespace of the file system.
It keeps the metadata of all files and directories in a file system tree.
This information is also persisted on disk in two files: the namespace image (fsimage) and the modification log (edit log).
The metadata node also records which data blocks make up each file and which data nodes hold them. This mapping is not stored on disk; it is rebuilt from the data nodes' block reports when the system starts (see the sketch after this list).
2) The data node is where the file data is actually stored.
The client, or the metadata node (namenode), can ask a data node to write or read data blocks.
Each data node periodically reports the blocks it stores to the metadata node.
3) The secondary metadata node (secondary namenode)
The secondary metadata node is not a backup that takes over when the metadata node fails; it is responsible for different things.
Its main function is to periodically merge the metadata node's namespace image file with the modification log, so that the log file does not grow too large. This is described in more detail below.
The merged namespace image file is also kept on the secondary metadata node, so that it can be used if the metadata node fails.
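As an illustration of the block-to-DataNode mapping kept by the metadata node, the following minimal sketch lists, for an existing file, which DataNodes hold each of its blocks. It assumes a running HDFS reachable with the default configuration; the class name and the file path are illustrative, in the style of the Java examples later in this article.

package com.hebut.file;

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path f = new Path("/user/hadoop/test/file1.txt");   // illustrative path
        FileStatus status = hdfs.getFileStatus(f);
        // the NameNode answers this metadata query; no file data is transferred
        BlockLocation[] blocks = hdfs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            // each block lists the DataNodes that hold one of its replicas
            System.out.println("block " + i + " hosts: " + Arrays.toString(blocks[i].getHosts()));
        }
        hdfs.close();
    }
}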
2.3 Directory structure of the metadata node
The VERSION file is a java properties file that holds the version information of HDFS, for example:
namespaceID=1232737062
cTime=0
storageType=NAME_NODE
layoutVersion=-18
2.4 Directory structure of data nodes
3. HDFS architecture
HDFS is a master/slave (Master/Slave) architecture. From the end user's point of view it looks like a traditional file system: files can be created, read, updated and deleted (CRUD) through directory paths. Because the storage is distributed, however, an HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode manages the file system metadata, while the DataNodes store the actual data. A client accesses the file system by interacting with both: it contacts the NameNode to obtain a file's metadata, while the actual file I/O goes directly to the DataNodes.
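To make this division of labor concrete, the following minimal sketch lists the root directory of HDFS. listStatus() is a pure metadata operation answered by the NameNode, whereas reading file contents, as in the later examples, additionally streams data from the DataNodes. It assumes a running cluster with the default configuration; the class name is illustrative.

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        // listStatus is a metadata-only operation: only the NameNode is contacted
        FileStatus[] entries = hdfs.listStatus(new Path("/"));
        for (FileStatus entry : entries) {
            System.out.println(entry.getPath() + "  " + entry.getLen() + " bytes");
        }
        hdfs.close();
    }
}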
Figure 5-1-1 Listing the files under HDFS
2) List the files in a directory under HDFS
The following browses the files in the HDFS directory named "input" with the "-ls <path>" command:
hadoop fs -ls input
The execution result is shown in figure 5-1-2.
Figure 5-1-3 Successfully uploading a file to HDFS
4) Copy a file from HDFS to the local system
The following copies the "output" file in HDFS to the local system under the name "getout", using the "-get <src> <localdst>" command:
hadoop fs -get output getout
The execution result is shown in figure 5-1-4.
Figure 5-1-5 Successfully deleting the newoutput directory under HDFS
6) View a file under HDFS
The following views the contents of the files under the "input" directory in HDFS with the "-cat <file>" command:
hadoop fs -cat input/*
The execution result is shown in figure 5-1-6.
Figure 5-2-1 basic HDFS statistics
2) Exit safe mode
The NameNode automatically enters safe mode when it starts. Safe mode is a state of the NameNode in which the file system does not allow any modification. Its purpose is to check the validity of the data blocks on each DataNode at startup and to copy or delete blocks as needed according to policy; the NameNode exits safe mode automatically once the minimum percentage of blocks satisfies the configured minimum replication.
When the system displays "Name node is in safe mode", it is in safe mode. You can simply wait a while, or exit safe mode explicitly with the following command:
hadoop dfsadmin -safemode leave
The result of successfully exiting safe mode is shown in figure 5-2-2.
Figure 5-2-3 Entering HDFS safe mode
4) Add nodes
Scalability is an important feature of HDFS, and adding nodes to an HDFS cluster is easy. To add a new DataNode, first install Hadoop on the new node, using the same configuration as the NameNode (it can be copied directly from the NameNode), and edit its "/usr/hadoop/conf/masters" file to add the NameNode hostname. Then edit the "/usr/hadoop/conf/slaves" file on the NameNode to add the new node's hostname, set up a password-free SSH connection to the new node, and run the start command:
start-all.sh
5) Load balancing
HDFS data can become unevenly distributed across DataNodes, especially after DataNodes fail or new DataNodes are added. The NameNode's placement strategy for new blocks can also lead to an uneven distribution. The following command rebalances the distribution of blocks across the DataNodes:
start-balancer.sh
Before executing the command, the data distribution on the DataNode node is shown in figure 5-2-4.
The result of executing the load balancing command is shown in figure 5-2-6.
6.2 Create an HDFS file
"FileSystem.create(Path f)" creates a file on HDFS, where f is the full path of the file. The specific implementation is as follows:
package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        byte[] buff = "hello hadoop world!\n".getBytes();
        Path dfs = new Path("/test");
        FSDataOutputStream outputStream = hdfs.create(dfs);
        outputStream.write(buff, 0, buff.length);
        // close the stream so the data is flushed to the DataNodes
        outputStream.close();
    }
}
The running results are shown in figures 6-2-1 and 6-2-2.
1) Project browser
Figure 6-2-2 running results (2)
6.3 Create an HDFS directory
"FileSystem.mkdirs(Path f)" creates a directory on HDFS, where f is the full path of the directory. The specific implementation is as follows:
package com.hebut.dir;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path dfs = new Path("/TestDir");
        hdfs.mkdirs(dfs);
    }
}
The running results are shown in figures 6-3-1 and 6-3-2.
1) Project browser
Figure 6-3-2 running results (2)
6.4 Rename an HDFS file
"FileSystem.rename(Path src, Path dst)" renames the specified HDFS file, where src and dst are the full paths of the file. The specific implementation is as follows:
package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Rename {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path frpath = new Path("/test");   // old file name
        Path topath = new Path("/test1");  // new file name
        boolean isRename = hdfs.rename(frpath, topath);
        String result = isRename ? "success" : "failure";
        System.out.println("File renaming result is: " + result);
    }
}
The running results are shown in figures 6-4-1 and 6-4-2.
1) Project browser
Figure 6-4-2 running results (2)
6.5 Delete files on HDFS
The specified HDFS file can be deleted with "FileSystem.delete(Path f, boolean recursive)", where f is the full path of the file to be deleted and recursive determines whether the deletion is recursive. The specific implementation is as follows:
package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path delef = new Path("/test1");
        boolean isDeleted = hdfs.delete(delef, false);
        // recursive deletion:
        // boolean isDeleted = hdfs.delete(delef, true);
        System.out.println("Delete? " + isDeleted);
    }
}
The running results are shown in figures 6-5-1 and 6-5-2.
1) console results
Figure 6-5-2 running results (2)
6.6 Delete directories on HDFS
The code is the same as for deleting a file; just pass the directory path instead, and if the directory contains files, delete recursively, as in the sketch below.
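For example, a minimal sketch based on the DeleteFile example above; the directory name matches the one created in section 6.3:

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path dir = new Path("/TestDir");
        // recursive = true deletes the directory together with any files inside it
        boolean isDeleted = hdfs.delete(dir, true);
        System.out.println("Delete? " + isDeleted);
    }
}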
6.7 Check whether an HDFS file exists
"FileSystem.exists(Path f)" checks whether the specified HDFS file exists, where f is the full path of the file. The specific implementation is as follows:
package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path findf = new Path("/test1");
        boolean isExists = hdfs.exists(findf);
        System.out.println("Exist? " + isExists);
    }
}
The running results are shown in figures 6-7-1 and 6-7-2.
1) console results
Figure 6-7-2 running results (2)
6.8 View the last modification time of the HDFS file
Use "FileSystem.getModificationTime ()" to view the modification time of the specified HDFS file. The specific implementation is as follows:
Package com.hebut.file
Import org.apache.hadoop.conf.Configuration
Import org.apache.hadoop.fs.FileStatus
Import org.apache.hadoop.fs.FileSystem
Import org.apache.hadoop.fs.Path
Public class GetLTime {
Public static void main (String [] args) throws Exception {
Configuration conf=new Configuration ()
FileSystem hdfs=FileSystem.get (conf)
Path fpath = new Path ("/ user/hadoop/test/file1.txt")
FileStatus fileStatus=hdfs.getFileStatus (fpath)
Long modiTime=fileStatus.getModificationTime ()
System.out.println ("modification time of file1.txt is" + modiTime)
}
}
The running result is shown in figure 6-8-1.
7. Read and write data flow of HDFS
The process of writing a file is more complex than reading it. Two explanations follow.
1) Explanation one
The client, using the client development library provided by HDFS, sends an RPC request to the remote NameNode.
The NameNode checks whether the file to be created already exists and whether the creator has permission to operate on it. If the checks pass, it creates a record for the file; otherwise the client throws an exception.
When the client starts writing the file, the development library splits it into multiple packets, manages them internally as a "data queue", and asks the NameNode for new blocks, obtaining a list of suitable DataNodes to store the replicas; the size of the list depends on the replication factor configured on the NameNode.
The packets are then written to all replicas in pipeline fashion: the development library streams each packet to the first DataNode, which stores it and forwards it to the next DataNode in the pipeline, and so on until the last DataNode.
After the last DataNode has stored a packet, an ack packet is passed back along the pipeline to the client. The development library maintains an "ack queue"; when the ack returned by the DataNodes is received, the corresponding packet is removed from the "ack queue".
If a DataNode fails during transmission, the failure is handled as follows:
The current pipeline is closed, and the packets in the "ack queue" are put back at the head of the "data queue".
The block being written is given a new identity on the DataNodes that have already stored it, so that the failed node, after it restarts, can detect that its copy of the block is stale and delete it.
The failed DataNode is removed from the pipeline, and the rest of the block is written to the remaining DataNodes in the pipeline.
The NameNode is notified that the block is under-replicated and will create another replica later, so that the configured number of replicas is maintained.
2) Explanation two
The client calls create() to create the file.
DistributedFileSystem makes an RPC call to the metadata node to create a new file in the file system's namespace.
The metadata node first checks that the file does not already exist and that the client has permission to create it, and then creates the new file.
DistributedFileSystem returns a DFSOutputStream, which the client uses to write data.
As the client writes data, DFSOutputStream splits it into packets and writes them to the "data queue".
The "data queue" is read by the DataStreamer, which asks the metadata node to allocate data nodes for storing the blocks (each block is replicated 3 times by default). The allocated data nodes form a pipeline.
The DataStreamer writes the data to the first data node in the pipeline; the first data node forwards it to the second, and the second forwards it to the third.
DFSOutputStream keeps an "ack queue" of the packets it has sent, waiting for the data nodes in the pipeline to confirm that the data has been written successfully.
If a data node fails during the write, it is handled as described in the failure steps under explanation one above.
When the client has finished writing data, it calls close() on the stream. This flushes all remaining packets to the data node pipeline and waits for the acknowledgements in the "ack queue"; finally, the metadata node is notified that the write is complete.
The process of reading a file is simpler:
The client, using the client development library provided by HDFS, sends an RPC request to the remote NameNode.
The NameNode returns part or all of the file's block list as appropriate; for each block it returns the addresses of the DataNodes that hold a copy.
The client development library selects the DataNode closest to the client and reads the block from it.
After the current block has been read, the connection to that DataNode is closed and the best DataNode for the next block is chosen.
When the blocks in the list have been read but the end of the file has not been reached, the client development library asks the NameNode for the next batch of block locations.
Each block read is verified with a checksum; if an error occurs while reading from a DataNode, the client notifies the NameNode and then continues reading from the next DataNode that holds a copy of the block.
Described again in terms of the client-side classes:
The client opens the file with the open() function of FileSystem.
DistributedFileSystem makes an RPC call to the metadata node to obtain the file's block information.
For each block, the metadata node returns the addresses of the data nodes that hold it.
DistributedFileSystem returns an FSDataInputStream to the client for reading the data.
The client calls the stream's read() function to start reading.
DFSInputStream connects to the nearest data node holding the first block of the file.
Data is read from the data node back to the client.
When the block has been read completely, DFSInputStream closes the connection to that data node and connects to the nearest data node holding the next block of the file.
When the client has finished reading, it calls the close() function of FSDataInputStream.
If the client encounters an error while communicating with a data node during the read, it tries the next data node that holds the block.
Failed data nodes are remembered and not contacted again for later blocks.
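A minimal client-side sketch of this read sequence is shown below, assuming a running cluster and an existing file; the class name and the file path are illustrative, in the style of the earlier Java examples.

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path f = new Path("/user/hadoop/test/file1.txt");  // illustrative path
        // open() asks the NameNode for block locations and returns an FSDataInputStream
        FSDataInputStream in = hdfs.open(f);
        byte[] buffer = new byte[4096];
        int bytesRead;
        // read() streams the data directly from the DataNodes, block by block
        while ((bytesRead = in.read(buffer)) > 0) {
            System.out.write(buffer, 0, bytesRead);
        }
        // close() releases the connection to the current DataNode
        in.close();
        hdfs.close();
    }
}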
There are several common approaches to handling large numbers of small files in HDFS:
Archive small files with SequenceFile, MapFile, HAR and similar mechanisms. The principle is to pack the small files together and manage them as archives; HBase is based on this idea. With this approach, retrieving the contents of an original small file requires knowing its mapping into the archive file (see the sketch after this list).
Scale out horizontally: a single Hadoop cluster can manage only a limited number of small files, so several Hadoop clusters can be placed behind a virtual server to form one large logical cluster. Google has done something similar.
A multi-Master design, whose benefit is obvious. The GFS II under development is also moving to a distributed multi-Master design; it supports Master failover, and the block size has been changed to 1 MB, deliberately optimizing the handling of small files.
There is also Alibaba's DFS design, likewise multi-Master, which separates the storage and management of the metadata mapping and consists of multiple metadata storage nodes and a single query Master node.
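As a sketch of the first approach, the following packs the files of a local directory into a single SequenceFile on HDFS, keyed by file name, so that each small file can later be recovered through that name-to-content mapping. It uses the classic SequenceFile.createWriter(FileSystem, Configuration, Path, keyClass, valueClass) form; the archive path and local directory are hypothetical.

package com.hebut.file;

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path archive = new Path("/smallfiles.seq");      // hypothetical archive path on HDFS
        // key = original file name, value = raw file contents
        SequenceFile.Writer writer =
                SequenceFile.createWriter(hdfs, conf, archive, Text.class, BytesWritable.class);
        File localDir = new File("/tmp/smallfiles");     // hypothetical local directory of small files
        for (File f : localDir.listFiles()) {
            if (!f.isFile()) {
                continue;                                // skip subdirectories
            }
            byte[] content = new byte[(int) f.length()];
            FileInputStream in = new FileInputStream(f);
            int off = 0;
            while (off < content.length) {               // read the whole small file into memory
                off += in.read(content, off, content.length - off);
            }
            in.close();
            // the file-name key is the mapping needed later to recover each small file
            writer.append(new Text(f.getName()), new BytesWritable(content));
        }
        writer.close();
        hdfs.close();
    }
}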
The basic flows can be summarized as follows.
File read:
The Client sends a file read request to the NameNode.
The NameNode returns information about the DataNodes that store the file.
The Client reads the file data from those DataNodes.
File write:
The Client sends a file write request to the NameNode.
The NameNode, based on the file size and the block configuration, returns information about the DataNodes it manages to the Client.
The Client splits the file into multiple blocks and writes them in order to each DataNode, using the DataNode address information.
The roles of the components are:
The NameNode can be regarded as the manager of the distributed file system; it is mainly responsible for the file system namespace, cluster configuration information and the replication of storage blocks. The NameNode keeps the file system's metadata in memory, including file information, the blocks that make up each file, and the DataNodes on which each block resides.
The DataNode is the basic unit of file storage. It stores blocks in its local file system, keeps the block metadata, and periodically reports all of its existing blocks to the NameNode.
The Client is any application that needs to access files in the distributed file system.
The checkpoint process carried out by the secondary metadata node is as follows:
The secondary metadata node asks the metadata node to roll its log: the metadata node creates a new log file, and subsequent log entries are written to the new file.
The secondary metadata node fetches the fsimage file and the old log file from the metadata node with HTTP GET.
The secondary metadata node loads the fsimage file into memory, replays the operations recorded in the log file, and generates a new fsimage file.
The new fsimage file is sent back to the metadata node with HTTP POST.
The metadata node replaces the old fsimage file and the old log file with the new fsimage file and the new log file (the one created in the first step), and updates the fstime file to record the time of the checkpoint.
After this, the fsimage file on the metadata node holds the metadata as of the latest checkpoint, and the log file starts over and does not grow without bound.
When a file system client performs a write, it is first recorded in the modification log (edit log).
The metadata node keeps the file system metadata in memory; after the modification log has been recorded, the metadata node updates its in-memory data structures.
The modification log is synced to disk before each write is reported as successful.
The fsimage file, i.e. the namespace image file, is an on-disk checkpoint of the in-memory metadata. It is a serialized format and cannot be modified directly on disk.
Similar to the way a database works, when the metadata node fails, the metadata of the latest checkpoint is loaded into memory from the fsimage, and the operations in the modification log are then replayed one by one.
The secondary metadata node exists to help the metadata node checkpoint its in-memory metadata to disk.
The checkpoint process itself is the one described in the steps above.
In the data node's storage directory, files with the blk_ prefix store the HDFS data blocks themselves, i.e. the actual binary data.
The corresponding blk_*.meta files store the block's attribute information: version information, type information and checksum.
When the number of blocks in a directory reaches a certain threshold, a subdirectory is created to hold further blocks and their attribute information.
The VERSION file of the data node is also a java properties file. The fields have the following meanings:
layoutVersion is a negative integer that records the version of HDFS's persistent data structure layout on disk.
namespaceID is the unique identifier of the file system, generated when the file system is first formatted.
cTime is 0 here.
storageType indicates which kind of node's data structures are stored in this directory (NAME_NODE for the metadata node, DATA_NODE for a data node).