How to Analyze the Characteristics of HDFS and Its Java API Source Code

Many beginners are unsure how to analyze the characteristics of HDFS and its Java API source code, so this article walks through the topic step by step. I hope it helps you work through the problem.
1. HDFS Overview
HDFS is an Apache Software Foundation project and a subproject of the Apache Hadoop project. Hadoop is ideal for storing large data, on the order of terabytes and petabytes, and uses HDFS as its storage system. HDFS lets you connect the nodes (ordinary personal computers) contained in multiple clusters and distribute data files across them, and then access and store those data files as one seamless file system. Data files are accessed in a streaming manner, which means that applications or commands are run directly using the MapReduce processing model. HDFS is fault tolerant and provides high-throughput access to large data sets.
HDFS has many similarities with other distributed file systems, but also several differences. One obvious difference is HDFS's write-once-read-many model, which relaxes concurrency-control requirements, simplifies data coherency, and enables high-throughput access.
Another distinctive aspect of HDFS is its view that it is usually better to move processing logic close to the data than to move the data to where the application runs.
HDFS strictly limits data writes to one writer at a time. Bytes are always appended to the end of a stream, and byte streams are always stored in write order.
HDFS has many goals, and here are some of the most obvious:
Fault tolerance, through fault detection and fast, automatic recovery
Data access through MapReduce streaming
A simple and robust coherency model
Processing logic placed close to the data, rather than data moved to the processing logic
Portability across heterogeneous commodity hardware and operating systems
Scalability for reliably storing and processing large amounts of data
Cost savings by spreading data and processing across clusters of commodity PCs
Efficiency by distributing the data and the logic that processes it across the nodes where the data lives, for parallel processing
Reliability by automatically maintaining multiple copies of data and automatically redeploying processing logic after failures
HDFS Application Programming Interface (API)
You can access HDFS in many different ways. HDFS provides a native Java™ application programming interface (API) and a native C-language wrapper for that Java API. In addition, you can browse HDFS files with a web browser. The following sections cover access to HDFS through the shell and the Java API.
2. HDFS Shell Operations
Since HDFS is a distributed file system for storing and accessing data, operating on HDFS means performing the basic file-system operations: creating, modifying and deleting files, changing permissions, creating, deleting and renaming folders, and so on. The HDFS commands are similar to the shell commands Linux uses for files, such as ls, mkdir and rm. Before running the operations below, make sure Hadoop is running normally, and use the jps command to confirm that every Hadoop process is present. There are many shell operations; see the document Hadoop-Shell.pdf and the link http://hadoop.apache.org/docs/r1.0.4/cn/hdfs_shell.html rather than going through them one by one. A few typical commands are shown below.
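For example, a handful of common commands (the paths are only illustrative; -rmr is the Hadoop 1.x form, replaced by -rm -r in later versions):

    hadoop fs -ls /                               # list the HDFS root directory
    hadoop fs -mkdir /user/lisai/demo             # create a directory
    hadoop fs -put local.txt /user/lisai/demo/    # upload a local file
    hadoop fs -cat /user/lisai/demo/local.txt     # print a file's contents
    hadoop fs -rm /user/lisai/demo/local.txt      # delete a file
    hadoop fs -rmr /user/lisai/demo               # delete a directory recursively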
3. HDFS Architecture and Basic Concepts
The files we upload through the hadoop shell are stored in blocks on the DataNodes. Through the Linux shell you cannot see the files themselves, only the blocks. HDFS can be described in one sentence: a client's large files are stored as data blocks on many nodes. There are three keywords here: file, node and data block. HDFS is designed around these three keywords, and we should keep them in mind while learning it.
HDFS consists of interconnected clusters of nodes on which files and directories reside. An HDFS cluster contains a node called the NameNode, which manages the file-system namespace and regulates client access to files. In addition, DataNodes store the data of files as blocks. The architecture of HDFS is as follows:
As the figure above shows, in HDFS the NameNode is the management node of the entire file system. It maintains the file directory tree of the whole file system, the metadata of each file and directory, and the list of data blocks belonging to each file. The NameNode also maps data blocks to DataNodes and handles read and write requests from HDFS clients. DataNodes create, delete and replicate data blocks according to the NameNode's instructions.

3.1 NameNode
The NameNode's files are kept in a specified directory (determined by the dfs.name.dir property of the configuration, normally placed in hdfs-site.xml). This directory includes:
fsimage: the metadata image file; a snapshot of the NameNode's in-memory metadata at a certain point in time.
edits: the operation (edit) log file, recording changes made since the last fsimage.
fstime: the time of the last checkpoint.
These files are stored in the local Linux file system. Every operation issued by an HDFS client must go through the NameNode.
3.2 DataNode
DataNodes provide the actual file storage service.
Block: the most basic unit of storage. Given a file of length size, the file is split from offset zero into pieces of a fixed size, in order, and each piece is numbered; each piece is called a Block. The default HDFS block size is 128MB (configurable), so a 256MB file occupies two blocks. A block is essentially a logical concept: the data is not wrapped in any extra container, the file is merely divided. Unlike an ordinary file system, in HDFS a file smaller than one data block does not occupy the whole block's storage space.
The block files can be seen in the local Linux file system of each DataNode host:
Replication: multiple copies of each block. The default is three, configurable through the dfs.replication property in hdfs-site.xml. The sketch below shows how to inspect a file's block size, replication factor and block locations through the Java API.
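To make block size and replication concrete, here is a small sketch that prints a file's block size, replication factor and block locations through the public FileSystem API. The NameNode address, user name and file path are assumptions for illustration only.

    package com.codewatching.hadoop.service;

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfoDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new URI("hdfs://yun10-0:9000/"), new Configuration(), "lisai");
            FileStatus status = fs.getFileStatus(new Path("/user/lisai/demo.txt"));
            System.out.println("Block size : " + status.getBlockSize());    // e.g. 134217728 (128MB)
            System.out.println("Replication: " + status.getReplication());  // e.g. 3
            // One entry per block; each entry lists the DataNodes that hold a replica.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block at offset " + loc.getOffset()
                        + " on hosts " + String.join(",", loc.getHosts()));
            }
            fs.close();
        }
    }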
Rack awareness: typically, large HDFS clusters span multiple racks. Network traffic between nodes in the same rack is generally more efficient than traffic that crosses racks. The NameNode tries to place the replicas of a block on several racks to improve fault tolerance. HDFS lets administrators decide which rack a node belongs to; each node therefore knows its rack ID, that is, it is rack-aware.

3.3 Storage of NameNode Metadata
See the figure below; it needs no further explanation.
3.4 Communication mechanism between NameNode and DataNode
DataNodes loop continuously, asking the NameNode for instructions. The NameNode never connects to a DataNode directly; it only returns values from the functions the DataNode calls. Each DataNode keeps a server socket open so that client code or other DataNodes can read and write data; the NameNode knows the host and port of that server and hands this information to clients or other DataNodes. All HDFS communication protocols are built on top of TCP/IP. An HDFS client connects to an open Transmission Control Protocol (TCP) port on the NameNode and talks to it using a proprietary protocol based on Remote Procedure Call (RPC). DataNodes use a proprietary block-based protocol to communicate with the NameNode. The whole Hadoop ecosystem is built on this RPC mechanism; I will write a separate blog post about how RPC communication works. A simplified sketch of the polling pattern is shown below.
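To make the polling pattern concrete, here is a highly simplified, hypothetical sketch. NameNodeProtocol, sendHeartbeat and the command strings are invented for this illustration; they are not Hadoop's real RPC interfaces.

    // Hypothetical RPC interface: in reality the DataNode talks to the NameNode
    // through a Hadoop RPC proxy, not a plain Java reference.
    interface NameNodeProtocol {
        String[] sendHeartbeat(String dataNodeId);   // NameNode replies with commands, e.g. "REPLICATE"
    }

    class DataNodeHeartbeat implements Runnable {
        private final NameNodeProtocol nameNode;     // proxy to the NameNode
        private final String dataNodeId;

        DataNodeHeartbeat(NameNodeProtocol nameNode, String dataNodeId) {
            this.nameNode = nameNode;
            this.dataNodeId = dataNodeId;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                // The DataNode always initiates the call; the NameNode only answers it.
                for (String command : nameNode.sendHeartbeat(dataNodeId)) {
                    System.out.println("Command from NameNode: " + command);
                }
                try {
                    Thread.sleep(3000);              // heartbeat interval, illustrative only
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }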
3.5 A Non-True-HA Solution for HDFS: the Secondary NameNode
In fact, in order to improve the reliability and maintainability of the whole cluster, companies and the community have proposed many schemes to improve HDFS. Here I introduce only one of them; a later blog post will describe real HDFS HA solutions in detail.
The Secondary NameNode backs up the NameNode by periodically downloading the NameNode's metadata and log files, merging them and producing an updated image. When the NameNode fails, it can be restored from the Secondary NameNode. The shortcoming is that what the Secondary NameNode holds is only a checkpoint of the NameNode, not a real-time copy, so some metadata is lost when you restore from it, and the system is unavailable for a period of time during recovery. This scheme is therefore only a backup scheme, not a real HA scheme. The flow is as follows:
Execution flow of the NameNode:
The NameNode always keeps the metadata in memory so it can serve read requests. When a "write request" arrives, the NameNode first writes the edit log to disk, that is, it records the operation in the edits file; only after that succeeds does it modify the in-memory metadata and return to the client. Hadoop also maintains an fsimage file, a mirror of the metadata held in the NameNode's memory, but fsimage is not kept in sync with the in-memory metadata at all times; instead it is updated at intervals by merging in the edits file. Who does the merging? The Secondary NameNode: it merges fsimage and edits to refresh the NameNode's metadata image.
The workflow of the Secondary NameNode (a conceptual sketch follows the list):
1. The Secondary NameNode asks the NameNode to roll the edits file (the NameNode starts writing a new edits file).
2. The Secondary NameNode fetches fsimage and the old edits from the NameNode (over HTTP).
3. The Secondary NameNode loads fsimage into memory and then merges the edits into it (with a certain algorithm).
4. The Secondary NameNode sends the new fsimage back to the NameNode.
5. The NameNode replaces the old fsimage with the new one.
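Purely as a conceptual sketch of this five-step cycle (the interface and method names below are invented for illustration; this is not the real SecondaryNameNode code):

    class CheckpointCycle {
        // Hypothetical view of the operations the NameNode exposes for checkpointing.
        interface NameNodeCheckpointOps {
            void rollEditLog();                  // step 1: start writing a new edits file
            byte[] downloadFsimage();            // step 2: fetch the current fsimage (HTTP in real HDFS)
            byte[] downloadEdits();              // step 2: fetch the frozen edits file
            void uploadFsimage(byte[] newImage); // steps 4-5: install the merged fsimage
        }

        static void runOnce(NameNodeCheckpointOps nameNode) {
            nameNode.rollEditLog();                       // 1. NameNode switches to a new edits file
            byte[] fsimage = nameNode.downloadFsimage();  // 2. fetch fsimage ...
            byte[] edits = nameNode.downloadEdits();      //    ... and the old edits
            byte[] merged = merge(fsimage, edits);        // 3. replay edits on top of fsimage
            nameNode.uploadFsimage(merged);               // 4-5. ship it back; NameNode replaces the old fsimage
        }

        private static byte[] merge(byte[] fsimage, byte[] edits) {
            // Placeholder: the real merge loads fsimage into memory, replays every logged
            // operation from edits, and serializes the namespace again.
            byte[] result = new byte[fsimage.length + edits.length];
            System.arraycopy(fsimage, 0, result, 0, fsimage.length);
            System.arraycopy(edits, 0, result, fsimage.length, edits.length);
            return result;
        }
    }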
4. HDFS Java API Access

Go directly to the code:
HadoopUtil.java:

    package com.codewatching.hadoop.service;

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    /**
     * Hadoop simple utility class
     * @author LISAI
     */
    public class HadoopUtil {
        public static FileSystem getFileSystem() {
            try {
                Configuration conf = new Configuration();
                URI uri = new URI("hdfs://yun10-0:9000/");
                FileSystem fileSystem = FileSystem.get(uri, conf, "lisai");
                return fileSystem;
            } catch (Exception e) {
                e.printStackTrace();
            }
            return null;
        }
    }

HadoopBasicAPIService.java:

    package com.codewatching.hadoop.service;

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    /**
     * Hadoop API operations
     * @author LISAI
     */
    public class HadoopBasicAPIService {
        private FileSystem fileSystem = HadoopUtil.getFileSystem();

        /** traditional method - download */
        @Deprecated
        public void downLoadTraditional(String uri, String dest) throws Exception {
            FSDataInputStream dataInputStream = fileSystem.open(new Path(uri));
            OutputStream outputStream = new FileOutputStream(dest);
            IOUtils.copyBytes(dataInputStream, outputStream, fileSystem.getConf());
        }

        /** commonly used method - download (note the Windows environment) */
        public void downLoadSimple(String src, String dest) throws Exception {
            fileSystem.copyToLocalFile(new Path(src), new Path(dest));
        }

        /** upload */
        public void upload(String src, String dest) throws Exception {
            fileSystem.copyFromLocalFile(new Path(src), new Path(dest));
        }

        /** create a folder */
        public void mkdir(String makeDir) throws Exception {
            fileSystem.mkdirs(new Path(makeDir));
        }

        /** delete a folder (recursively) */
        public void deldir(String delDir) throws Exception {
            fileSystem.delete(new Path(delDir), true);
        }
    }
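A possible way to exercise the two classes above; the demo class and the paths are not from the original article and are only illustrative:

    package com.codewatching.hadoop.service;

    public class HadoopApiDemo {
        public static void main(String[] args) throws Exception {
            HadoopBasicAPIService service = new HadoopBasicAPIService();
            service.mkdir("/user/lisai/demo");                                       // create a folder in HDFS
            service.upload("/tmp/local.txt", "/user/lisai/demo/local.txt");          // local -> HDFS
            service.downLoadSimple("/user/lisai/demo/local.txt", "/tmp/copy.txt");   // HDFS -> local
            service.deldir("/user/lisai/demo");                                      // recursive delete
        }
    }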
5. HDFS 'Read' File Source Code Analysis

1. The client initializes a FileSystem and calls its open() function to open the file.
2. FileSystem calls the metadata node (NameNode) over RPC to obtain the file's block information; for each block, the NameNode returns the addresses of the DataNodes holding it.
3. FileSystem returns an FSDataInputStream to the client for reading data, and the client calls the stream's read() function to start reading.
4. DFSInputStream connects to the nearest DataNode holding the first block of the file, and data is read from that DataNode to the client.
5. When that block has been read, DFSInputStream closes the connection to that DataNode and then connects to the nearest DataNode holding the next block of the file.
6. When the client has finished reading, it calls the close() function of FSDataInputStream.
7. If, while reading, the client hits a communication error with a DataNode, it tries the next DataNode that holds the block.
8. Failed DataNodes are recorded and are not contacted again later.
A minimal client-side sketch of these calls follows.
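From the client's point of view, the whole sequence above is just open(), read() and close(); the DataNode selection happens inside the returned stream. A minimal sketch, assuming the same NameNode address and user as earlier and a hypothetical file path:

    package com.codewatching.hadoop.service;

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadPathDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new URI("hdfs://yun10-0:9000/"), new Configuration(), "lisai");
            FSDataInputStream in = fs.open(new Path("/user/lisai/demo/local.txt")); // steps 1-3: NameNode returns block locations
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {   // steps 4-5: bytes stream in from the nearest DataNodes
                System.out.write(buffer, 0, bytesRead);
            }
            in.close();                                     // step 6: close the stream
            fs.close();
        }
    }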
The following steps show how the client-side proxy object is created. Only after we obtain the client's proxy object can we operate on HDFS. The principle of this RPC process will be discussed at length in a future blog post.
Dfs= DFS [clientName = DFSClient_NONMAPREDUCE_1386880610_1, ugi=lisai (auth:SIMPLE)] fs= DFS [clientName = DFSClient_NONMAPREDUCE_1386880610_1, ugi=lisai (auth:SIMPLE)]]
Initialization complete.
Several method calls are omitted in the middle.
Now the client has obtained the FSDataInputStream; the next step is the actual copy.
After reading the above, have you grasped the characteristics of HDFS and how to analyze its Java API source code? Thank you for reading!