This article walks through an example-based analysis of a Hadoop outline; I hope you get something out of reading it.
HDFS Java API Sample
Read via FsUrlStreamHandlerFactory
Read/Write via the Hadoop FileSystem API (DistributedFileSystem)
package com.jinbao.hadoop.hdfs;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.net.URL;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.fs.Path;

/**
 * @author cloudera
 */
public class HdfsClient {

    static String sFileUrl = "hdfs://quickstart.cloudera/gis/gistool/README.md";

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        if (args.length >= 2) {
            String sUrl = sFileUrl;
            if (args[0].equalsIgnoreCase("-r-url")) {
                sUrl = args[1];
                // test read via FsUrlStreamHandlerFactory
                readHdfsFileByDfsUrl(sUrl);
            } else if (args[0].equalsIgnoreCase("-r-file")) {
                sUrl = args[1];
                // test read via the FileSystem API
                readHdfsFileByDfsFileApi(sUrl);
            } else if (args[0].equalsIgnoreCase("-w-file")) {
                sUrl = args[1];
                // test write via the FileSystem API
                writeHdfsFileByDfsFileApi(sUrl);
            } else if (args[0].equalsIgnoreCase("-w-del")) {
                sUrl = args[1];
                // test delete via the FileSystem API
                deleteHdfsFileByDfsFileApi(sUrl);
            }
        }
    }

    private static void deleteHdfsFileByDfsFileApi(String sUrl) {
        Configuration conf = new Configuration();
        try {
            FileSystem fs = FileSystem.get(URI.create(sUrl), conf);
            Path path = new Path(sUrl);
            // recursive delete
            fs.delete(path, true);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void writeHdfsFileByDfsFileApi(String sUrl) {
        Configuration conf = new Configuration();
        OutputStream out = null;
        byte[] data = "Writing Test".getBytes();
        try {
            // Get a FSDataOutputStream object, actually an HdfsDataOutputStream
            FileSystem fs = FileSystem.get(URI.create(sUrl), conf);
            Path path = new Path(sUrl);
            if (fs.exists(path)) {
                out = fs.append(path);
                IOUtils.write(data, out);
            } else {
                out = fs.create(path);
                out.write(data);
                // flush buffer to OS
                out.flush();
                FSDataOutputStream fsout = FSDataOutputStream.class.cast(out);
                // Sync data to disk
                fsout.hsync();
                // close() calls sync implicitly
                out.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(out);
        }
    }

    public static void readHdfsFileByDfsUrl(String sUrl) {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
        InputStream in = null;
        try {
            URL url = new URL(sUrl);
            in = url.openStream();
            IOUtils.copy(in, System.out);
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            IOUtils.closeQuietly(in);
        }
    }

    private static void readHdfsFileByDfsFileApi(String sUrl) {
        Configuration conf = new Configuration();
        InputStream in = null;
        try {
            FileSystem fs = FileSystem.get(URI.create(sUrl), conf);
            // Get a FSDataInputStream object, actually an HdfsDataInputStream
            in = fs.open(new Path(sUrl));
            IOUtils.copy(in, System.out);
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            IOUtils.closeQuietly(in);
        }
    }
}

Data Flow on Read
1. The client calls FileSystem.get(), which returns a DistributedFileSystem instance.
2. DistributedFileSystem asks the namenode, via RPC, for the locations of the first few blocks; the namenode returns the addresses of the datanodes holding each block, including the replica nodes.
3. DistributedFileSystem returns an FSDataInputStream to the client, which connects to the nearest datanode and reads the data through read() calls.
4. When the current block is exhausted, FSDataInputStream closes the connection to that datanode, finds the best datanode for the next block, and continues reading.
5. After the last block has been read, the client calls close() to close the stream. (A minimal client-side sketch follows.)
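As a small, self-contained illustration of this read path (the file URI below is a placeholder, not taken from the article), note that the FSDataInputStream returned by open() also supports seek(), which is what keeps the per-block switching described above invisible to the caller:

package com.jinbao.hadoop.hdfs;

import java.io.IOException;
import java.net.URI;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFlowSketch {
    public static void main(String[] args) throws IOException {
        // placeholder URI; point this at an existing file on your cluster
        String sUrl = "hdfs://quickstart.cloudera/tmp/sample.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(sUrl), conf);   // step 1: DistributedFileSystem
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(sUrl));                         // steps 2-3: namenode lookup, connect to datanode
            IOUtils.copy(in, System.out);                         // step 4: read block by block
            in.seek(0);                                           // rewind and read again through the same stream
            IOUtils.copy(in, System.out);
        } finally {
            IOUtils.closeQuietly(in);                             // step 5: close the stream
        }
    }
}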
Fault tolerance
1. If FSDataInputStream hits an error while communicating with a datanode, it tries the next-best datanode holding a replica. The bad node is remembered so that later reads skip it, and the namenode is notified which node had the problem.
Data Flow on Write
1. The client calls FileSystem.get(), which returns a DistributedFileSystem instance.
2. DistributedFileSystem calls create() on the namenode via RPC to create the file's metadata; if the file already exists, an IOException is thrown.
3. DistributedFileSystem returns an FSDataOutputStream to the client; it wraps a DFSOutputStream, which handles communication with the datanodes and the namenode.
4. As the client writes, FSDataOutputStream splits the data of the current block into packets and appends them to the data queue. The DataStreamer asks the namenode to allocate suitable datanodes and, from the returned list, forms a datanode pipeline whose length is determined by dfs.replication.
5. As each packet is written it is also placed on an acknowledgement queue (ack queue). The first datanode in the pipeline forwards the data to the second, which forwards it to the third; when the acknowledgement comes back, the packet is removed from the ack queue.
6. Once a block is full, steps 4-5 repeat for the next block until all data has been written; the client then calls close() to close the stream. (A minimal client-side sketch follows.)
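A minimal sketch of the same write path (the path, replication factor, and buffer size below are illustrative assumptions for a Hadoop 2.x client, not values from the article): create() sets up the pipeline described above, hflush() pushes the queued packets through it, and close() completes the block:

package com.jinbao.hadoop.hdfs;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFlowSketch {
    public static void main(String[] args) throws IOException {
        // placeholder URI; replace with a path on your cluster
        String sUrl = "hdfs://quickstart.cloudera/tmp/write-flow.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(sUrl), conf);
        Path path = new Path(sUrl);
        // steps 2-3: ask the namenode to create the file and get an output stream;
        // replication (short) 3 determines the pipeline length for this file
        FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, fs.getDefaultBlockSize(path));
        try {
            out.write("pipeline write test".getBytes());  // step 4: data is split into packets
            out.hflush();                                  // push queued packets through the datanode pipeline
        } finally {
            out.close();                                   // step 6: finish the block and close the stream
        }
    }
}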
Fault tolerance
If FSDataOutputStream encounters an error while communicating with a datanode:
1. The pipeline is closed.
2. The packets in the ack queue are moved back to the front of the data queue, so that datanodes downstream of the failed node do not lose data.
3. The current block on the healthy datanodes is given a new identity, and that identity is passed to the namenode so the partial block on the failed datanode can be deleted when it recovers.
4. The remaining data is written to the healthy datanodes in the pipeline. The namenode notices that the block is under-replicated and arranges another copy on a new node. The current write is considered successful as long as dfs.replication.min replicas have been written; the namenode handles the remaining replication asynchronously.
Replica placement
The placement strategy balances reliability against load and bandwidth.
Hadoop's default strategy puts the first replica on the node where the client is running; if the client is outside the cluster, a node in the cluster is chosen at random.
The second and third replicas are placed on two randomly chosen nodes that share a rack, different from the first replica's rack. (A sketch for inspecting the resulting layout follows.)
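To check where the replicas of a particular file actually ended up, the FileSystem API exposes the block locations that the namenode reports; a minimal sketch (the file URI is a placeholder):

package com.jinbao.hadoop.hdfs;

import java.io.IOException;
import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaLayoutSketch {
    public static void main(String[] args) throws IOException {
        // placeholder URI; point this at an existing file
        String sUrl = "hdfs://quickstart.cloudera/tmp/sample.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(sUrl), conf);
        FileStatus status = fs.getFileStatus(new Path(sUrl));
        // one BlockLocation per block, listing every datanode holding a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " hosts " + Arrays.toString(block.getHosts())
                    + " racks " + Arrays.toString(block.getTopologyPaths()));
        }
    }
}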
Distributed copy: distcp
% hadoop distcp hdfs://namenode/foo hdfs://namenode2/foo
distcp is implemented as a MapReduce job, which makes it well suited to synchronizing data between two data centers.
If the two clusters run different Hadoop versions, you can use the hftp protocol so that the job runs on the destination cluster:
% hadoop distcp hftp://namenode:50070/foo hdfs://namenode2/foo
Note: the hftp URL must specify the namenode's HTTP port, 50070.
To spread the map tasks evenly across the cluster, set the number of maps with -m at roughly 20 per node, i.e. -m 20*N where N is the total number of nodes; an example follows.
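For example, on a hypothetical 10-node cluster that would be about 200 maps; adding -update restricts the copy to files that have changed since the last run:

% hadoop distcp -m 200 -update hdfs://namenode/foo hdfs://namenode2/foo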
The balancer tool is not covered here.
Archiving tool: HAR
HAR reduces the namenode's memory footprint and is suited to managing large numbers of small files; access stays transparent, and HAR files can still be used as MapReduce input.
% hadoop archive -archiveName files.har /myfiles /my
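Because access stays transparent, the archive created above can be read back through the har filesystem like any other path; a minimal sketch (the archive location follows the example command above, and the cluster's default filesystem is assumed):

package com.jinbao.hadoop.hdfs;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarListSketch {
    public static void main(String[] args) throws IOException {
        // har:/// resolves the archive against the default filesystem (fs.defaultFS)
        String harUrl = "har:///my/files.har";
        Configuration conf = new Configuration();
        FileSystem harFs = FileSystem.get(URI.create(harUrl), conf);
        // list the files packed inside the archive
        for (FileStatus status : harFs.listStatus(new Path(harUrl))) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }
    }
}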
Limitations:
HAR is comparable in function to tar: it packages files but does not compress them; it only saves namenode memory.
Once created, an archive cannot be modified; to add or remove files you must rebuild the har.
After reading this article, I hope you have a clearer picture of this Hadoop outline. Thank you for reading!