Big data learning route: the distributed file system HDFS

Design goals:
1. Large storage capacity
2. Automatic, rapid detection of and recovery from hardware failures
3. Streaming access to data
4. Moving computation is cheaper than moving the data itself
5. Simple consistency model
6. Portability across heterogeneous hardware and software platforms
Characteristics
Advantages:
High reliability: Hadoop's ability to store and process data bit by bit can be trusted
High scalability: Hadoop distributes data and computation across clusters of machines that can easily grow to thousands of nodes
High efficiency: Hadoop moves data dynamically between nodes and keeps the load on each node balanced, so processing is fast
High fault tolerance: Hadoop automatically keeps multiple copies of data and automatically reassigns failed tasks
Disadvantages:
Not suitable for low-latency data access
Cannot store large numbers of small files efficiently (the NameNode keeps metadata for every file, so a huge amount of metadata becomes the bottleneck); see the rough estimate below
Does not support multiple concurrent writers or arbitrary in-place modification of files (a file can be deleted or appended to, but data in the middle of a file cannot be changed)
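To make the small-file problem concrete, here is a rough, illustrative estimate (the ~150-byte figure is a commonly cited rule of thumb for NameNode heap per file, directory, or block object, not something stated in this article): 10 million small files need at least one file entry and one block entry each, so roughly 10,000,000 x 2 x 150 bytes ≈ 3 GB of NameNode heap, regardless of how little data the files actually contain.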
Important features:
Files are physically stored in blocks; the block size is set by the configuration parameter dfs.blocksize (128 MB by default in Hadoop 2.x, 64 MB in older versions).
HDFS presents clients with a single abstract directory tree, and clients access files through paths of the form hdfs://namenode:port/dir-a/dir-b/dir-c/file.data.
Directory structure and file block information (metadata) are managed by the NameNode. The NameNode is the master node of the HDFS cluster; it maintains the directory tree of the whole file system and, for each path (file), the corresponding block information (block IDs and the DataNode servers that hold them).
Storage of the individual blocks is handled by the DataNodes, the slave nodes of the HDFS cluster; each block can be stored as multiple replicas on multiple DataNodes (the replication factor is set by dfs.replication). A client-side configuration sketch follows this list.
HDFS is designed for write-once, read-many access patterns and does not support modifying files in place.
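The two parameters mentioned above can also be overridden per client. A minimal sketch (assuming the NameNode address and user from the API examples later in this article; the specific values here are illustrative, not recommendations):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockSettingsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // use 256 MB blocks instead of the 128 MB default (value is in bytes)
        conf.set("dfs.blocksize", String.valueOf(256L * 1024 * 1024));
        // keep three replicas of every block written by this client
        conf.set("dfs.replication", "3");
        FileSystem hdfs = FileSystem.get(new URI("hdfs://hadoop001:8020"), conf, "root");
        System.out.println("connected to " + hdfs.getUri());
        hdfs.close();
    }
}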
The NameNode manages the file system namespace (metadata: file name, size, owner, block locations)
It regulates client access to files
The NameNode performs three checks when a client asks to create a file: a) whether the file would overload the cluster (exceed its capacity)
b) whether a file with the same name already exists
c) whether the client has permission to create the file
It also executes operations on files and paths, such as opening and closing files and opening directories (a client-side sketch of the create checks follows)
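From the client's point of view, checks b) and c) typically surface as exceptions. A minimal, illustrative sketch (the method and variable names are hypothetical; it assumes an already-initialized FileSystem handle like the one built in the write API section below):

import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.AccessControlException;

public class CreateChecks {
    // hdfs is an already-initialized FileSystem pointing at the cluster
    static void createChecked(FileSystem hdfs, String hdfsPath) throws Exception {
        try {
            // overwrite = false, so an existing file is reported rather than replaced
            hdfs.create(new Path(hdfsPath), false).close();
        } catch (FileAlreadyExistsException e) {
            System.err.println("check b) failed: the file already exists");
        } catch (AccessControlException e) {
            System.err.println("check c) failed: no permission to create the file");
        }
    }
}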
Every DataNode sends heartbeats to the NameNode, which uses them to confirm that the DataNode is online; each DataNode also sends block reports describing the status of all blocks it stores.
NameNode startup process:
1. Load the fsimage (metadata image) into memory, then read and replay the operations recorded in the edit log (editlog)
2. Once the file system metadata has been rebuilt in memory, write a new fsimage file (this step does not require the SecondaryNameNode) and start an empty editlog
3. In safe mode, each DataNode sends the latest status of its block list to the NameNode
4. While the NameNode is in safe mode, its file system is read-only to clients
5. The NameNode starts listening for RPC and HTTP requests
RPC: Remote Procedure Call protocol
A protocol for requesting a service from a program on a remote computer over the network, without needing to know anything about the underlying network technology.
The locations of blocks are not persisted by the NameNode; each DataNode stores its own list of blocks and reports it to the NameNode.
During normal operation, the NameNode keeps the mapping of all block information in memory.
Two important files make up the metadata mirror:
fsimage: the metadata image file (a snapshot of the file system directory tree)
edits (edit log): the metadata operation log (records modifications to the directory tree)
a) An up-to-date copy of the metadata is kept in memory
b) The in-memory image = fsimage + edits
SecondaryNameNode tasks
Periodically merge fsimage and edits:
c) An edits file that grows too large makes NameNode restarts slow
d) The SecondaryNameNode is therefore responsible for merging fsimage and edits regularly
DataNode: stores the actual data blocks, serves read and write requests from clients, and reports its blocks to the NameNode through heartbeats and block reports.
The HDFS write process
Write process description:
The client obtains a FileSystem instance (via FileSystem.get), which communicates with the NameNode, and then calls its create method to request creation of the file.
FileSystem sends a remote request to the NameNode to create a new file entry; no blocks are associated with it yet. The NameNode performs a number of checks to ensure that the file does not already exist in the file system and that the client has permission to create it. If these checks pass, the NameNode records the new file, and FileSystem returns an FSDataOutputStream to the client for writing data. As in the read case, FSDataOutputStream wraps a DFSOutputStream that handles communication with the DataNodes and the NameNode. If file creation fails, the client receives an IOException indicating the failure and stops the subsequent work.
The client starts writing data. FSDataOutputStream splits the data to be written into packets and puts them on an internal queue of the DFSOutputStream object, where they are consumed by the DataStreamer. The DataStreamer's job is to ask the NameNode to allocate new blocks, i.e. to find suitable DataNodes to store the replicated copies.
FSDataOutputStream also maintains an internal queue of packets waiting to be acknowledged by the DataNodes, the so-called ack (waiting) queue; a packet is removed from this queue only when it has been acknowledged by every node in the pipeline.
When the data has been written, the client calls the stream's close method; the remaining packets are flushed and the client waits for their acknowledgements before telling the NameNode the write is complete. The NameNode already knows which blocks the file consists of, so it only has to wait until the blocks are minimally replicated before returning success.
Write API: 1. Upload a file from the local file system to HDFS
Configuration hdfsConf = new Configuration(); // configuration for HDFS
String namenodeURI = "hdfs://hadoop001:8020"; // URI of the NameNode
String username = "root"; // access HDFS as the specified user
FileSystem hdfs = FileSystem.get(new URI(namenodeURI), hdfsConf, username); // HDFS file system object
FileSystem local = FileSystem.getLocal(new Configuration()); // local file system object
hdfs.copyFromLocalFile(new Path(localPath), new Path(hdfsPath));
2. Create a file on HDFS and write its contents directly
FSDataOutputStream out = hdfs.create(new Path(hdfsPath));
out.write(fileContent.getBytes());
out.close();
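For reference, a complete, self-contained version of the two snippets above (a minimal sketch: the NameNode address and user come from the snippets, while the class name, paths, and file content are placeholder values):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration hdfsConf = new Configuration();
        String namenodeURI = "hdfs://hadoop001:8020";   // NameNode address
        String username = "root";                       // act as this HDFS user
        FileSystem hdfs = FileSystem.get(new URI(namenodeURI), hdfsConf, username);

        // 1. upload a local file to HDFS
        hdfs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/data/local.txt"));

        // 2. create a file on HDFS and write its content directly
        FSDataOutputStream out = hdfs.create(new Path("/data/hello.txt"));
        out.write("hello hdfs".getBytes("UTF-8"));
        out.close();

        hdfs.close();
    }
}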
The HDFS read process
Read process description:
The client opens the file to be read by calling the open method of the FileSystem object, which for HDFS is an instance of the distributed file system.
FileSystem calls the NameNode over the RPC protocol to determine the locations of the first few blocks of the file. For each block, the NameNode returns the metadata for that block, i.e. the basic information about where it lives; the DataNodes are then sorted by their distance from the client, and if the client is itself a DataNode it reads from the local copy first. After this, the HDFS instance returns an FSDataInputStream to the client, and data is read from that FSDataInputStream. FSDataInputStream in turn wraps a DFSInputStream that manages the I/O with the DataNodes and the NameNode.
The NameNode returns the addresses holding the data to the client, and based on these the client creates an FSDataInputStream and starts reading.
Using the DataNode addresses of the first few blocks, FSDataInputStream connects to the nearest DataNode and starts reading from the beginning of the file. The client repeatedly calls the read() method to stream data from the DataNode.
When the end of a block is reached, FSDataInputStream closes the connection to the current DataNode and looks for the best DataNode from which to read the next block. These operations are transparent to the client, which simply sees a continuous stream: the address of the next block is located while the current one is still being read.
When reading is finished, the close() method is called to close the FSDataInputStream.
Read API: 1. Download a file from HDFS to the local file system
Configuration hdfsConf = new Configuration(); // configuration for HDFS
String namenodeURI = "hdfs://hadoop001:8020"; // URI of the NameNode
String username = "root"; // access HDFS as the specified user
FileSystem hdfs = FileSystem.get(new URI(namenodeURI), hdfsConf, username); // HDFS file system object
FileSystem local = FileSystem.getLocal(new Configuration()); // local file system object
hdfs.copyToLocalFile(new Path(hdfsPath), new Path(localPath));
2. Read the contents of a given file on HDFS
Path path = new Path(hdfsFilePath); // file path
FSDataInputStream in = hdfs.open(path); // open the file input stream
FileStatus status = hdfs.getFileStatus(path); // get the file's metadata
// get the file size from the metadata
byte[] bytes = new byte[Integer.parseInt(String.valueOf(status.getLen()))];
// read the whole content of the input stream at once
in.readFully(0, bytes);
System.out.println(new String(bytes)); // print the file content
in.close();
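Again for reference, a self-contained version of the read examples (a minimal sketch: the NameNode address and user come from the snippets above, while the class name and paths are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration hdfsConf = new Configuration();
        FileSystem hdfs = FileSystem.get(new URI("hdfs://hadoop001:8020"), hdfsConf, "root");

        // 1. download a file from HDFS to the local file system
        hdfs.copyToLocalFile(new Path("/data/hello.txt"), new Path("/tmp/hello.txt"));

        // 2. read the whole content of an HDFS file into memory
        Path path = new Path("/data/hello.txt");
        FileStatus status = hdfs.getFileStatus(path);   // metadata, including the length
        byte[] bytes = new byte[(int) status.getLen()];
        FSDataInputStream in = hdfs.open(path);
        in.readFully(0, bytes);                         // read everything starting at offset 0
        in.close();
        System.out.println(new String(bytes, "UTF-8"));

        hdfs.close();
    }
}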
The overall HDFS process