
Building a Java Development Environment for Operating HDFS, and the HDFS Read and Write Processes


Building an HDFS Development Environment for Java Operations

Previously, we introduced how to build an HDFS pseudo-distributed environment on Linux, as well as some commonly used HDFS shell commands. But how do you operate HDFS at the code level? That is what this section covers.

1. First, create a Maven project using IDEA:

Maven does not include the CDH repository by default, so you need to configure it in pom.xml as follows:

<repositories>
    <repository>
        <id>cloudera</id>
        <name>Cloudera</name>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>

Note: if you have set the value of mirrorOf to * in Maven's settings.xml, you need to change it to *,!cloudera or to central. A value of * means the mirror covers all repository addresses, which would prevent Maven from downloading dependencies from the Cloudera repository, while *,!cloudera means the mirror does not cover the repository whose id is cloudera. You can read up on this further yourself. The specific configuration is as follows:

<mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>*,!cloudera</mirrorOf>
</mirror>

Finally, add related dependencies:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>2.6.0-cdh6.7.0</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.10</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Operating the HDFS file system with the Java API

After building the project environment, we can call Hadoop's API to operate the HDFS file system. Let's write a test case and create a directory on the HDFS file system:

package org.zero01.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.net.URI;

/**
 * @program: hadoop-train
 * @description: Hadoop HDFS Java API operation
 * @author: 01
 * @create: 2018-03-25 13:59
 **/
public class HDFSAPP {

    // HDFS file system server address and port
    public static final String HDFS_PATH = "hdfs://192.168.77.130:8020";
    // HDFS file system operation object
    FileSystem fileSystem = null;
    // configuration object
    Configuration configuration = null;

    /**
     * create an HDFS directory
     */
    @Test
    public void mkdir() throws Exception {
        // a Path object must be passed in
        fileSystem.mkdirs(new Path("/hdfsapi/test"));
    }

    // prepare resources
    @Before
    public void setUp() throws Exception {
        configuration = new Configuration();
        // the first parameter is the URI of the server, the second is the configuration object,
        // and the third is the file system user name
        fileSystem = FileSystem.get(new URI(HDFS_PATH), configuration, "root");
        System.out.println("HDFSAPP.setUp");
    }

    // release resources
    @After
    public void tearDown() throws Exception {
        configuration = null;
        fileSystem = null;
        System.out.println("HDFSAPP.tearDown");
    }
}

Running result:

You can see that it ran successfully. Then go to the server to check whether the directory we created is there:

[root@localhost ~]# hdfs dfs -ls /
Found 3 items
-rw-r--r--   1 root supergroup  311585484 2018-03-24 23:15 /hadoop-2.6.0-cdh6.7.0.tar.gz
drwxr-xr-x   - root supergroup          0 2018-03-25 22:17 /hdfsapi
-rw-r--r--   1 root supergroup         49 2018-03-24 23:10 /hello.txt
[root@localhost ~]# hdfs dfs -ls /hdfsapi
Found 1 items
drwxr-xr-x   - root supergroup          0 2018-03-25 22:17 /hdfsapi/test
[root@localhost ~]#

As above, our directory has been created successfully.
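Incidentally, instead of switching to the server every time, this kind of check can also be done from code with FileSystem.exists. A minimal sketch, not part of the original article, reusing the same fileSystem field as above:

/**
 * check whether a path exists (illustrative sketch)
 */
@Test
public void exists() throws Exception {
    // returns true if the path exists, whether it is a file or a directory
    System.out.println(fileSystem.exists(new Path("/hdfsapi/test")));
}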

Let's add another method to test creating a file and writing content to it:

/**
 * create a file and write to it
 * (requires import org.apache.hadoop.fs.FSDataOutputStream)
 */
@Test
public void create() throws Exception {
    // create the file
    FSDataOutputStream outputStream = fileSystem.create(new Path("/hdfsapi/test/a.txt"));
    // write something to the file
    outputStream.write("hello hadoop".getBytes());
    outputStream.flush();
    outputStream.close();
}

After it executes successfully, check on the server again to see whether the file we created is there and whether its content is what we wrote:

[root@localhost ~]# hdfs dfs -ls /hdfsapi/test
Found 1 items
-rw-r--r--   3 root supergroup         12 2018-03-25 22:25 /hdfsapi/test/a.txt
[root@localhost ~]# hdfs dfs -text /hdfsapi/test/a.txt
hello hadoop
[root@localhost ~]#

Checking on the server after every operation is troublesome. In fact, we can also read the contents of a file in the file system directly from code, as in the following example:

/**
 * view the contents of a file in HDFS
 * (requires import org.apache.hadoop.fs.FSDataInputStream and org.apache.hadoop.io.IOUtils)
 */
@Test
public void cat() throws Exception {
    // open the file for reading
    FSDataInputStream in = fileSystem.open(new Path("/hdfsapi/test/a.txt"));
    // copy the file contents to the console; the third parameter is the buffer size in bytes
    IOUtils.copyBytes(in, System.out, 1024);
    in.close();
}

Now that you know how to create directories, files, and read the contents of files, maybe we also need to know how to rename files, as shown in the following example:

/**
 * rename a file
 */
@Test
public void rename() throws Exception {
    Path oldPath = new Path("/hdfsapi/test/a.txt");
    Path newPath = new Path("/hdfsapi/test/b.txt");
    // the first parameter is the original file path, the second is the new path
    fileSystem.rename(oldPath, newPath);
}

We now know how to create, read, and rename; the only operation missing is delete, as in the following example:

/**
 * delete a file
 *
 * @throws Exception
 */
@Test
public void delete() throws Exception {
    // the second parameter specifies whether to delete recursively: false = no, true = yes
    fileSystem.delete(new Path("/hdfsapi/test/mysql_cluster.iso"), false);
}
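For a directory that still contains files, the recursive flag generally has to be true, otherwise the call fails. A minimal sketch of that case, not from the original article (the path is only an illustration, reusing the same fileSystem field):

/**
 * delete a directory and everything under it (illustrative sketch)
 */
@Test
public void deleteDirRecursively() throws Exception {
    // true = delete the directory together with all files and subdirectories inside it
    fileSystem.delete(new Path("/hdfsapi/test"), true);
}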

After introducing the addition, deletion, query and modification of files, let's take a look at how to upload local files to the HDFS file system. I have a local.txt file here, which contains the following contents:

This is a local file

Write the test code as follows:

/**
 * upload a local file to HDFS
 */
@Test
public void copyFromLocalFile() throws Exception {
    Path localPath = new Path("E:/local.txt");
    Path hdfsPath = new Path("/hdfsapi/test/");
    // the first parameter is the local file path, the second is the HDFS path
    fileSystem.copyFromLocalFile(localPath, hdfsPath);
}

After the above method has executed successfully, let's check HDFS to see whether the copy succeeded:

[root@localhost ~]# hdfs dfs -ls /hdfsapi/test/
Found 2 items
-rw-r--r--   3 root supergroup         12 2018-03-25 22:33 /hdfsapi/test/b.txt
-rw-r--r--   3 root supergroup         20 2018-03-25 22:45 /hdfsapi/test/local.txt
[root@localhost ~]# hdfs dfs -text /hdfsapi/test/local.txt
This is a local file
[root@localhost ~]#

The above demonstrates uploading a small file. If I need to upload a larger file and want a progress indicator, I have to use the following method:

/**
 * upload a large local file to HDFS and display a progress indicator
 * (requires java.io imports and org.apache.hadoop.util.Progressable)
 */
@Test
public void copyFromLocalFileWithProgress() throws Exception {
    InputStream in = new BufferedInputStream(
            new FileInputStream(new File("E:/Linux Install/mysql_cluster.iso")));
    FSDataOutputStream outputStream = fileSystem.create(new Path("/hdfsapi/test/mysql_cluster.iso"),
            new Progressable() {
                public void progress() {
                    // print a dot each time progress is reported
                    System.out.print(".");
                }
            });
    IOUtils.copyBytes(in, outputStream, 4096);
    in.close();
    outputStream.close();
}

Similarly, after the above method has executed successfully, we check HDFS to see whether the upload succeeded:

[root@localhost ~]# hdfs dfs -ls -h /hdfsapi/test/
Found 3 items
-rw-r--r--   3 root supergroup         12 2018-03-25 22:33 /hdfsapi/test/b.txt
-rw-r--r--   3 root supergroup         20 2018-03-25 22:45 /hdfsapi/test/local.txt
-rw-r--r--   3 root supergroup    812.8 M 2018-03-25 23:01 /hdfsapi/test/mysql_cluster.iso
[root@localhost ~]#

Since files can be uploaded, they can naturally be downloaded as well, and just as there are two ways to upload, there are two ways to download, as shown in the following example:

/**
 * download an HDFS file, method 1
 */
@Test
public void copyToLocalFile1() throws Exception {
    Path localPath = new Path("E:/b.txt");
    Path hdfsPath = new Path("/hdfsapi/test/b.txt");
    fileSystem.copyToLocalFile(hdfsPath, localPath);
}

/**
 * download an HDFS file, method 2
 * (requires java.io imports plus the FSDataInputStream and IOUtils imports noted earlier)
 */
@Test
public void copyToLocalFile2() throws Exception {
    FSDataInputStream in = fileSystem.open(new Path("/hdfsapi/test/b.txt"));
    OutputStream outputStream = new FileOutputStream(new File("E:/b.txt"));
    IOUtils.copyBytes(in, outputStream, 1024);
    in.close();
    outputStream.close();
}

Note: the first download method demonstrated above may report a null pointer error on the Windows operating system; the second method is recommended on Windows.
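If you still want to use copyToLocalFile on Windows, one commonly suggested workaround is the four-argument overload with useRawLocalFileSystem set to true, which writes directly to the local file and avoids the checksum/native-library code path. This is a minimal sketch under that assumption, not part of the original article:

/**
 * download an HDFS file on Windows via the raw local file system (illustrative sketch)
 */
@Test
public void copyToLocalFileOnWindows() throws Exception {
    Path hdfsPath = new Path("/hdfsapi/test/b.txt");
    Path localPath = new Path("E:/b.txt");
    // delSrc = false: keep the source file in HDFS
    // useRawLocalFileSystem = true: skip the checksummed local file system
    fileSystem.copyToLocalFile(false, hdfsPath, localPath, true);
}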

Let's now demonstrate how to list all the files in a directory, with an example:

/**
 * list all the files in a directory
 * (requires import org.apache.hadoop.fs.FileStatus)
 *
 * @throws Exception
 */
@Test
public void listFiles() throws Exception {
    FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/hdfsapi/test/"));
    for (FileStatus fileStatus : fileStatuses) {
        System.out.println("this is a: " + (fileStatus.isDirectory() ? "folder" : "file"));
        System.out.println("replication factor: " + fileStatus.getReplication());
        System.out.println("size: " + fileStatus.getLen());
        System.out.println("path: " + fileStatus.getPath() + "\n");
    }
}

The console print results are as follows:

this is a: file
replication factor: 3
size: 12
path: hdfs://192.168.77.130:8020/hdfsapi/test/b.txt

this is a: file
replication factor: 3
size: 20
path: hdfs://192.168.77.130:8020/hdfsapi/test/local.txt

this is a: file
replication factor: 3
size: 852279296
path: hdfs://192.168.77.130:8020/hdfsapi/test/mysql_cluster.iso

Notice a problem in this output: we previously set the replication factor to 1 in hdfs-site.xml, so why do these files show a replication factor of 3?

This is because these files were uploaded from the local machine through the Java API, and we did not set a replication factor on the client side, so Hadoop's client default replication factor of 3 was used.
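If you want files written through the Java API to use a replication factor of 1 as well, one option is to set dfs.replication on the client-side Configuration before obtaining the FileSystem. A minimal sketch of such a change to the setUp method shown earlier, not from the original article:

// in setUp(), before calling FileSystem.get(...):
configuration = new Configuration();
// client-side replication factor; overrides the client default of 3 for files written by this client
configuration.set("dfs.replication", "1");
fileSystem = FileSystem.get(new URI(HDFS_PATH), configuration, "root");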

If instead we upload from the server with the hdfs put command, the replication factor set in the configuration file is used. If you don't believe it, change the path in the code to the root directory; the console output is then as follows:

this is a: file
replication factor: 1
size: 311585484
path: hdfs://192.168.77.130:8020/hadoop-2.6.0-cdh6.7.0.tar.gz

this is a: folder
replication factor: 0
size: 0
path: hdfs://192.168.77.130:8020/hdfsapi

this is a: file
replication factor: 1
size: 49
path: hdfs://192.168.77.130:8020/hello.txt

The files in the root directory were all uploaded with the hdfs put command, so their replication factor is the one we set in the configuration file.
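One more note on listing: listStatus only returns the direct children of a directory. If you also want to walk subdirectories, FileSystem offers listFiles with a recursive flag. A minimal sketch, again reusing the same fileSystem field and not part of the original article (requires import org.apache.hadoop.fs.RemoteIterator and org.apache.hadoop.fs.LocatedFileStatus):

/**
 * recursively list the files under a directory (illustrative sketch)
 */
@Test
public void listFilesRecursively() throws Exception {
    RemoteIterator<LocatedFileStatus> files = fileSystem.listFiles(new Path("/hdfsapi"), true);
    while (files.hasNext()) {
        LocatedFileStatus status = files.next();
        System.out.println(status.getPath() + " (" + status.getLen() + " bytes, replication "
                + status.getReplication() + ")");
    }
}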

HDFS write data flow

With regard to the HDFS data flow, I found a very concise and easy-to-understand comic strip on the Internet that explains how HDFS works. The author is unknown. It is much easier to follow than the usual slide decks and is a rare find as learning material, so it is excerpted in this article.

1. Three roles: the client, the NameNode (the master, which can be thought of as something like the inode/file index in Linux), and the DataNodes (the servers where the actual data is stored)

2. HDFS data writing process:

HDFS read data flow

3. Process of reading data

4. Fault tolerance, part 1: fault types and their detection methods (node/server faults, network faults, and dirty data problems)

5. Fault tolerance part 2: read and write fault tolerance

6. Fault tolerance part 3: DataNode failure

7. Backup rules

8. Concluding remarks

There is also a Chinese version of this cartoon, the address is as follows:

https://www.cnblogs.com/raphael5200/p/5497218.html

Advantages and disadvantages of HDFS file system

Advantages of HDFS:

Data redundancy (multi-replica storage); hardware fault tolerance; streaming data access; a write-once, read-many model; suitability for storing large files; and the ability to run on cheap machines, which keeps costs down.

Disadvantages of HDFS:

Not suitable for low-latency data access; cannot efficiently store large numbers of small files, because even a file of only 1 MB has its own metadata. The more small files there are, the more storage the corresponding metadata requires, and too much metadata puts pressure on the NameNode.
