How to add files and directories on HDFS

2025-01-16 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31

This article explains how to add files and directories on HDFS. Along the way it covers the basic Hadoop file commands and the parts of the Java file API you are most likely to need.

HDFS file operations

You can store a very large data set (say, 100 TB) as a single file in HDFS, which most other file systems cannot do. Although the file is replicated across multiple machines to support parallel processing, you don't have to deal with those details.

What kind of file system is HDFS (Hadoop Distributed File System)?

It is not a native Unix file system, so it does not support standard Unix file commands such as ls and cp, nor standard file read/write operations such as fopen() and fread(). Hadoop does, however, provide a set of command-line tools that work much like the Linux file commands.

What is a typical Hadoop workflow?

1. Generate a data file (such as a log file) somewhere else and copy it to HDFS.

2. A MapReduce program processes the data, reading the HDFS file and parsing it into individual records (key/value pairs).

3. Unless you need to customize how data is imported or exported, you rarely have to write code to read or write HDFS files directly.

What is the form of the Hadoop file command?

hadoop fs -cmd <args>

Here cmd is the specific file command and <args> is a variable number of arguments. The name of cmd usually matches its UNIX equivalent. For example, the command that lists files is:

hadoop fs -ls

What are the most common file management tasks in Hadoop?

1. Add files and directories

2. Get the file

3. Delete files

Can the Hadoop file command interact with the local file system?

Hadoop's file commands can interact with both the HDFS file system and the local file system.

What does a URI specify? What is the complete URI format?

A URI pinpoints the location of a particular file or directory.

The complete URI format is scheme://authority/path. The scheme is similar to a protocol: it can be hdfs or file, to specify HDFS or the local file system, respectively.

For HDFS, the authority is the host name of the NameNode (with its port, if one is given), and the path is the path to a file or directory.

For HDFS running in standard pseudo-distributed mode on port 9000 of the local machine, what is the URI of the file example.txt in the user directory /user/chuck?

hdfs://localhost:9000/user/chuck/example.txt

You can display the file with the cat command:

hadoop fs -cat hdfs://localhost:9000/user/chuck/example.txt
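To make the three parts of the URI concrete, here is a minimal sketch that splits the example URI with the standard java.net.URI class. It does not use Hadoop at all. The class name HdfsUriParts and the helper split are hypothetical names chosen for this illustration.

```java
import java.net.URI;

// Illustration only: splits a URI into the scheme, authority, and path
// parts described above. HdfsUriParts and split() are hypothetical names.
public class HdfsUriParts {

    /** Returns {scheme, authority, path} for a fully qualified URI string. */
    public static String[] split(String uriText) {
        URI uri = URI.create(uriText);
        return new String[] { uri.getScheme(), uri.getAuthority(), uri.getPath() };
    }

    public static void main(String[] args) {
        String[] parts = split("hdfs://localhost:9000/user/chuck/example.txt");
        System.out.println("scheme    = " + parts[0]);
        System.out.println("authority = " + parts[1]);
        System.out.println("path      = " + parts[2]);
    }
}
```

For the example URI, the scheme is hdfs, the authority is localhost:9000 (the NameNode host and port), and the path is /user/chuck/example.txt.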

In practice, though, the scheme://authority part is usually omitted from Hadoop file commands.

Why can it be left out?

Most setups do not need it. The commands that copy files between the local file system and HDFS, for example, already imply which side is which:

1. The put command copies a local file to HDFS. The source is a local file and the destination is an HDFS file.

2. The get command copies a file from HDFS to the local machine. The source is an HDFS file and the destination is a local file.

For other commands, if the scheme://authority part of the URI is omitted, Hadoop falls back to the fs.default.name property in its configuration (newer releases name this property fs.defaultFS).

For example, with this setting in conf/core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

the URI hdfs://localhost:9000/user/chuck/example.txt can be shortened to /user/chuck/example.txt.

Note: some earlier documents write the file tools in the form hadoop dfs -cmd. For these commands dfs and fs are equivalent, but fs is the form used today.

What is the current working directory of HDFS by default?

The default working directory in HDFS is /user/$USER, where $USER is your login user name.

Example: if you log in as chuck, the URI hdfs://localhost:9000/user/chuck/example.txt can be shortened to example.txt. The Hadoop cat command that displays the file's contents can then be written as:

hadoop fs -cat example.txt
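The shortening rules can be sketched with plain java.net.URI resolution. This is only an illustration of the rule, not how Hadoop implements it internally. The class PathResolutionSketch and the method qualify are hypothetical names, and the defaultFs argument stands in for the configured default file system.

```java
import java.net.URI;

// Illustration only: mimics how a shortened path expands to a full URI.
// PathResolutionSketch and qualify() are hypothetical names, not Hadoop API.
public class PathResolutionSketch {

    /**
     * Expands a possibly shortened path into a full URI string.
     * defaultFs plays the role of the configured default file system,
     * and user is the login name that determines /user/<user>.
     */
    public static String qualify(String defaultFs, String user, String path) {
        // The default working directory is /user/<user> on the default file system.
        URI workingDir = URI.create(defaultFs + "/user/" + user + "/");
        // resolve() returns full URIs unchanged, anchors absolute paths at the
        // authority, and appends relative paths to the working directory.
        return workingDir.resolve(path).toString();
    }

    public static void main(String[] args) {
        String fs = "hdfs://localhost:9000";
        System.out.println(qualify(fs, "chuck", "example.txt"));
        System.out.println(qualify(fs, "chuck", "/user/chuck/example.txt"));
        System.out.println(qualify(fs, "chuck", "file:///tmp/example.txt"));
    }
}
```

The first two calls produce the same full HDFS URI, while the third keeps its explicit file scheme because a fully qualified URI is never rewritten.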

How do I add files and directories to HDFS?

First, make sure of a few points:

1. HDFS must be formatted before first use.

2. For learning purposes, pseudo-distributed mode is recommended.

3. The default working directory is /user/$USER, but it is not created automatically. You have to create it yourself.

Manually create the default working directory:

hadoop fs -mkdir /user/chuck

In the Hadoop release this walkthrough is based on, mkdir automatically creates any missing parent directories, like the UNIX mkdir command with the -p option, so the command above also creates /user. In Hadoop 2 and later, you need to pass -p explicitly (hadoop fs -mkdir -p /user/chuck) for parent directories to be created.

hadoop fs -ls /
Lists the files and directories in the root directory.

hadoop fs -lsr /
Recursively lists all files and subdirectories under the root directory (newer releases write this as hadoop fs -ls -R /).

hadoop fs -put example.txt .
Puts the local file example.txt into HDFS. The trailing dot (.) means the file goes into the default working directory, so the command is equivalent to:

hadoop fs -put example.txt /user/chuck

hadoop fs -ls
Lists the files and directories in the working directory.

The hadoop fs -ls command shows each entry's attributes, including its replication factor. Typical output looks like this:

Found 1 items
-rw-r--r-- 1 chuck supergroup 264 2009-01-14 11:02 /user/chuck/example.txt

Most of the attribute columns mean the same as in UNIX and need no further explanation. The exception is the "1" after the permissions, which is the file's replication factor. In a pseudo-distributed environment it is always 1. On production clusters it is usually 3, although it can be any positive integer. Replication factors do not apply to directories, so that column shows a dash (-) for them.
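As a small illustration of where the replication factor sits in a listing line, the sketch below pulls out the second whitespace-separated column. It assumes the column layout shown in the sample output. The class LsLineSketch, the method replication, and the directory line used in main are hypothetical and only serve this example.

```java
import java.util.OptionalInt;

// Illustration only: reads the replication factor from one line of
// "hadoop fs -ls" output, assuming the column layout shown above.
// LsLineSketch and replication() are hypothetical names.
public class LsLineSketch {

    /** Returns the replication factor, or empty for a directory ("-"). */
    public static OptionalInt replication(String line) {
        String[] fields = line.trim().split("\\s+");
        // Column 0 is the permission string; column 1 is the replication factor.
        if (fields.length < 2 || fields[1].equals("-")) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(fields[1]));
    }

    public static void main(String[] args) {
        String fileLine = "-rw-r--r-- 1 chuck supergroup 264 2009-01-14 11:02 /user/chuck/example.txt";
        // Hypothetical directory entry, made up for this example.
        String dirLine = "drwxr-xr-x - chuck supergroup 0 2009-01-14 10:23 /user/chuck";
        System.out.println(replication(fileLine));
        System.out.println(replication(dirLine));
    }
}
```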

How do I retrieve files?

Hadoop's get command is the opposite of put: it copies files from HDFS to the local file system.

For example, the following copies example.txt into the current local working directory:

hadoop fs -get example.txt .

To display the data without copying it, use the cat command:

hadoop fs -cat example.txt

You can also pipe the output into a Unix command:

hadoop fs -cat example.txt | head

To view the last kilobyte of a file, use the tail command:

hadoop fs -tail example.txt

How do I delete a file?

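The rm command deletes files and, in the release described here, empty directories as well (recent releases provide a separate -rmdir command for empty directories):

hadoop fs -rm example.txt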

How can I get help?

Run hadoop fs with no arguments to get the complete list of commands available in your version of Hadoop.

Use help to show the usage and a short description of a command. For example, for ls:

hadoop fs -help ls

What about the HDFS Java API?

Although the command-line tools cover most needs for interacting with HDFS, some tasks can only be done through the Java API. One example is a PutMerge program that merges files while putting them into HDFS, which the command-line tools do not support.

When would you merge files while putting them into HDFS?

The motivation arises when you need to analyze Apache log files from many web servers. We could copy each log file into HDFS separately, but Hadoop generally processes a single large file more efficiently than many small ones.

Why is the log data scattered across multiple files?

This is due to the distributed architecture of the web servers.

Why not merge the files before copying?

One solution is to merge all the files locally and then copy the result to HDFS. However, merging takes up a lot of disk space on the local machine. It would be better to merge the files while copying them into HDFS.

What about the getmerge command-line tool?

The getmerge command merges a set of HDFS files and copies the result to the local machine, which is exactly the opposite of what we want. What we need is a putmerge command.

What is the Hadoop file API?

The main classes for file manipulation in Hadoop are in the org.apache.hadoop.fs package. Hadoop's basic file operations include the familiar open, read, write, and close, and the file API can also be used with file systems other than HDFS.

The starting point of the Hadoop file API is the FileSystem class, an abstract class for interfacing with a file system. Different concrete subclasses handle HDFS and the local file system. You get the desired FileSystem instance by calling the factory method FileSystem.get(Configuration conf). The Configuration class is a special class that holds key/value configuration parameters. Its default instantiation is based on the resource configuration of your HDFS system.

How do I get a FileSystem object?

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);

To get a FileSystem object dedicated to the local file system, use the factory method FileSystem.getLocal(Configuration conf):

FileSystem local = FileSystem.getLocal(conf);

Other notes:

The Hadoop file API uses Path objects to encode file and directory names, and FileStatus objects to store metadata for files and directories. The PutMerge program merges all files in a local directory. We use the listStatus() method of FileSystem to get the list of files in that directory:

Path inputDir = new Path(args[0]);
FileStatus[] inputFiles = local.listStatus(inputDir);

The length of the inputFiles array equals the number of files in the specified directory. Each FileStatus object in inputFiles carries metadata such as file length, permissions, and modification time. What PutMerge cares about is the Path of each file, namely inputFiles[i].getPath(). We can open this Path to get an FSDataInputStream object for reading the file:

FSDataInputStream in = local.open(inputFiles[i].getPath());
byte[] buffer = new byte[256];
int bytesRead = 0;
while ((bytesRead = in.read(buffer)) > 0) {
    ...
}
in.close();

FSDataInputStream is a subclass of the standard Java class java.io.DataInputStream that adds support for random access. Similarly, there is an FSDataOutputStream object for writing data to an HDFS file:

Path hdfsFile = new Path(args[1]);
FSDataOutputStream out = hdfs.create(hdfsFile);
out.write(buffer, 0, bytesRead);
out.close();
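The fragments above fit together as one loop: open the output once, then for each input file read a buffer at a time and append it to the output. The sketch below shows that loop shape using only the standard java.io and java.nio.file classes, so it runs without a Hadoop installation. It is not the PutMerge program itself: java.nio.file.Path stands in for Hadoop's Path, plain streams stand in for FSDataInputStream and FSDataOutputStream, and the file-name sort order, the class MergeLoopSketch, and the demo helper are assumptions made for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustration only: the read/write loop of a merge-while-copying routine,
// written against the local file system instead of the Hadoop file API.
public class MergeLoopSketch {

    /** Appends every regular file in inputDir to out, in file-name order. */
    public static void mergeInto(Path inputDir, OutputStream out) {
        try {
            List<Path> inputFiles;
            try (Stream<Path> entries = Files.list(inputDir)) {
                // Sorting is an assumption of this sketch, to get a stable order.
                inputFiles = entries.filter(Files::isRegularFile)
                                    .sorted()
                                    .collect(Collectors.toList());
            }
            byte[] buffer = new byte[256];
            for (Path file : inputFiles) {
                try (InputStream in = Files.newInputStream(file)) {
                    int bytesRead;
                    // Same loop as above: read a chunk, then write exactly that chunk.
                    while ((bytesRead = in.read(buffer)) > 0) {
                        out.write(buffer, 0, bytesRead);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates two small files in a temporary directory and returns the merged text. */
    public static String demo() {
        try {
            Path dir = Files.createTempDirectory("putmerge-sketch");
            Files.write(dir.resolve("a.log"), "first\n".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("b.log"), "second\n".getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream merged = new ByteArrayOutputStream();
            mergeInto(dir, merged);
            return new String(merged.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Because the output stream is opened once and each input is streamed through a small buffer, no merged copy is ever written to local disk, which is the point of merging during the copy rather than before it.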
