This article explains in detail what HDFS is and is not suitable for. The editor finds it very practical and shares it here as a reference; I hope you get something out of it after reading.
HDFS is suitable for:
Storing large files, from gigabytes up to terabytes or even petabytes.
Write-once, read-many access, where each job reads most of the data.
Running on ordinary commodity clusters. Node failures are frequent on such hardware, but HDFS has a good fault-tolerance mechanism.
HDFS is not suitable for:
Low-latency data access. If you need this, consider HBase.
Large numbers of small files. The namenode keeps all HDFS metadata in memory (the directory tree, file names, ACLs, lengths, owners, block locations, and so on), so the number of files HDFS can hold is limited by the namenode's memory.
Concurrent writes and arbitrary file modification.
Block
The block size of a disk is usually 512 B, and the kernel cannot read or write less than that amount at a time. The default block size on HDFS is 64 MB and can be changed with the dfs.block.size property; on many clusters it is set to 128 MB. The block is made this large because files on HDFS are generally large: if blocks were small, a file would be spread across a great many blocks, and the location of every one of them would have to be recorded by the namenode. That both wastes namenode memory and makes retrieving a file more expensive.
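For illustration, here is a minimal sketch of overriding the block size from client code, assuming a Hadoop client on the classpath and an HDFS instance at hdfs://localhost:9000; the 128 MB value and the class name are only examples, and dfs.block.size can equally be set in hdfs-site.xml.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Files created through this client will use 128 MB blocks (the value is in bytes).
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Print the block size the file system reports as its default.
        System.out.println("Default block size: " + fs.getDefaultBlockSize());
    }
}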
When a file is shorter than one block, it still occupies a block of its own, but the disk space it consumes is only its actual length.
Namenode and Datanode
The namenode manages the file system namespace, while datanodes store and retrieve blocks. A block is normally replicated on several different datanodes to improve fault tolerance. When a client reads or writes an HDFS file, it asks the namenode where the blocks are located.
You can use the distcp command to copy large files in parallel between HDFS locations or between clusters:
$ hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
Files on HDFS are addressed with a URI whose prefix is hdfs://localhost:9000. You can assign this prefix to the property fs.default.name (either in a configuration file or in code, as sketched below) so that you do not have to spell it out every time. For example, the following two commands are equivalent:
$ hadoop fs -ls /
$ hadoop fs -ls hdfs://localhost:9000/
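As noted above, fs.default.name can also be set in code rather than in a configuration file. A minimal sketch, assuming the pseudo-distributed address used throughout this article (the class name is made up for illustration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to setting fs.default.name in core-site.xml.
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        // "/" now resolves to hdfs://localhost:9000/ rather than the local file system.
        System.out.println(fs.exists(new Path("/")));
    }
}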
The prefix of the local file system is file://
orisun@zcypc:~$ hadoop fs -ls file:///
Found 22 items
drwxr-xr-x   - root root 4096 2012-08-02 19:17 /home
dr-xr-xr-x   - root root    0 2012-08-20 22:14 /proc
drwxr-xr-x   - root root 4096 2010-04-23 18:11 /mnt
drwx------   - root root 4096 2012-08-18 10:46 /root
drwxr-xr-x   - root root 4096 2012-08-18 10:40 /sbin
The default number of replicas for an HDFS file is 3 and can be changed with the dfs.replication property; in pseudo-distributed mode it is set to 1 because there is only one datanode. When you run the hadoop fs -ls command, you get something like:
drwxr-xr-x   - orisun supergroup 0 2012-08-20 14:23 /tmp
-rw-         1 orisun supergroup 4 2012-08-20 14:23 /tmp/jobtracker.info
The output looks much like that of the UNIX ls command. The second column is the replication factor, and the fifth column is the file length in bytes (a directory's length is shown as 0, whereas in a UNIX file system a directory's length is a multiple of 512 B, because directory space is allocated in 512 B blocks).
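For reference, the replication factor of an existing file can also be changed from code. A minimal sketch, assuming the default configuration points at the cluster and reusing the file shown in the listing above (the class name is made up for illustration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the namenode to keep 2 replicas of this file from now on.
        boolean scheduled = fs.setReplication(new Path("/tmp/jobtracker.info"), (short) 2);
        System.out.println("Replication change scheduled: " + scheduled);
    }
}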
Opening a file with FileSystem.open() returns an FSDataInputStream, which inherits from Java's DataInputStream and supports random-access reads.
public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable {
}

public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
    boolean seekToNewSource(long targetPos) throws IOException;
}
FSDataInputStream can also read a portion of a file from a specified location.
public interface PositionedReadable {
    public int read(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer) throws IOException;
}
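A minimal sketch that puts Seekable and PositionedReadable together; the path /dir/file, the offsets, and the class name are made up for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/dir/file"));
        try {
            // Seekable: jump to byte offset 16 and read from there.
            in.seek(16);
            int b = in.read();
            // PositionedReadable: fill the buffer from offset 0 without moving the stream position.
            byte[] buffer = new byte[8];
            in.readFully(0, buffer);
            System.out.println("Byte at offset 16: " + b);
        } finally {
            in.close();
        }
    }
}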
If you want to create a new file on HDFS, you can use
public FSDataOutputStream create(Path f) throws IOException
Note two points when using create(): the file must not already exist, and it will create any number of missing parent directories for you.
Sometimes you may need append() instead, which adds data to the end of an existing file.
public FSDataOutputStream append(Path f) throws IOException
To rename a file:
public boolean rename(Path src, Path dst) throws IOException
Of course, you can also use mkdirs() to create directories:
public boolean mkdirs(Path f) throws IOException
Because create() already creates any missing parent directories, in practice you will rarely need mkdirs().
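Putting the calls above together, a minimal sketch combining create(), rename(), and mkdirs(); all paths and the class name are made up for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() also creates /demo and /demo/sub if they do not exist yet.
        FSDataOutputStream out = fs.create(new Path("/demo/sub/hello.txt"));
        out.writeUTF("hello hdfs");
        out.close();
        // Rename (i.e. move) the file within HDFS.
        boolean renamed = fs.rename(new Path("/demo/sub/hello.txt"), new Path("/demo/sub/hello2.txt"));
        // mkdirs() is only needed when you want an empty directory.
        boolean made = fs.mkdirs(new Path("/demo/empty-dir"));
        System.out.println(renamed + " " + made);
    }
}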
The getFileStatus() method of FileSystem returns the FileStatus of a file or directory.
Path file = new Path("/dir/file");
FileStatus stat = fs.getFileStatus(file);
You can then query:
stat.getPath()
stat.getLen()
stat.isDir()
stat.getModificationTime()
stat.getReplication()
stat.getBlockSize()
stat.getOwner()
stat.getGroup()
stat.getPermission()
In fact, all of the above information is stored on the namenode.
You can also get the FileStatus of all the files in a directory.
public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException
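For example, a minimal sketch that lists a directory and prints the path and length of each entry; the directory name and class name are made up for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // One FileStatus per direct child of /user/orisun.
        FileStatus[] statuses = fs.listStatus(new Path("/user/orisun"));
        for (FileStatus status : statuses) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
    }
}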
When specifying files, hadoop also supports globbing with the following wildcards:
* matches zero or more arbitrary characters
? matches any single character
[ab] matches a single character in the set; [^ab] matches a single character not in the set
[a-b] matches a single character in the range a to b; [^a-b] matches a single character outside that range
{exp1,exp2} matches exp1 or exp2
\c escapes the character c
fs.globStatus(new Path("/2007-*"), new RegexExcludeFilter("^.*/2007-12-31$"))
All files whose names begin with 2007- are matched, but 2007-12-31 is dropped by the filter.
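RegexExcludeFilter is not part of the Hadoop API. A minimal sketch of such a PathFilter, under the assumption that it simply rejects paths whose string form matches the regex, might look like this:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExcludeFilter implements PathFilter {
    private final String regex;

    public RegexExcludeFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // Keep every path except those whose full string form matches the regex.
        return !path.toString().matches(regex);
    }
}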
Deletion is done with delete():
public boolean delete(Path f, boolean recursive) throws IOException
The recursive flag controls whether a non-empty directory is deleted.
As mentioned above, large numbers of small files consume a lot of namenode memory, so in that case we need Hadoop Archives (HAR) to pack the small files into larger archive files.
$ hadoop archive -archiveName orisun.har -p /user/orisun /user
This packs all the files under /user/orisun into orisun.har and places the archive in the /user directory.
You can also see which files are contained in a har file:
orisun@zcypc:~$ hadoop fs -lsr har:///user/orisun.har
drwxr-xr-x   - orisun supergroup 0 2012-08-20 16:49 /user/orisun.har/mse
-rw-r--r--   1 orisun supergroup 0 2012-08-20 16:49 /user/orisun.har/mse/list
-rw-r--r--   1 orisun supergroup 0 2012-08-20 16:49 /user/orisun.har/book
orisun@zcypc:~$ hadoop fs -ls har:///user/orisun.har/mse
Found 1 items
-rw-r--r--   1 orisun supergroup 0 2012-08-20 16:49 /user/orisun.har/mse/list
HAR is itself a file system layered on top of HDFS. The full form of a HAR URI is har://<scheme>-<host>:<port>/<path to archive>, for example:
orisun@zcypc:~$ hadoop fs -lsr har://hdfs-localhost:9000/user/orisun.har/mse
-rw-r--r--   1 orisun supergroup 0 2012-08-20 16:49 /user/orisun.har/mse/list
To delete a har file, you must use the recursive rmr command rather than rm:
$ hadoop fs -rmr /user/orisun.har
Some restrictions on using HAR:
Creating an archive produces a full copy of the original files and therefore consumes extra disk space. Of course, you can delete the originals once the har file has been created.
HAR merely packs multiple files into one file; it does not apply any compression.
HAR files are immutable: to add or remove a file, you must re-create the archive.
InputFormat is unaware of the existence of har files, which means a har file still produces multiple InputSplits for MapReduce and therefore does not improve job efficiency. To solve the problem of too many small files producing too many map tasks, you can use CombineFileInputFormat.
This is the end of the article on what HDFS is suitable for. I hope the content above has been of some help; if you think the article is good, please share it so more people can see it.