
What are the shortcomings of HDFS and its improvement strategies


The editor would like to share with you the shortcomings of HDFS and the strategies for improving them. Most people are not very familiar with this topic, so this article is offered for your reference; I hope you will gain a lot from reading it. Let's take a look.

HDFS is a good distributed file system with many advantages, but it also has some shortcomings. At present, it falls short in the following areas:

Low latency access

HDFS is not suitable for applications that require low-latency access (on the order of tens of milliseconds), because HDFS is designed for high-throughput data access at the cost of some latency. HDFS also has a single Master through which all file requests pass, so when there are too many requests, delays are inevitable. Currently, HBase is a better choice for applications with low-latency requirements; HBase 0.20 is a big improvement over previous versions, and its slogan is "goes real time".

Using caching or a multi-Master design can reduce the request pressure clients place on the Master and thus reduce latency. Beyond that, there are internal changes to the HDFS system itself, which require a trade-off between high throughput and low latency; HDFS is not a silver bullet.
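To make the contrast concrete, here is a minimal sketch of the kind of random, single-row read that HBase serves with low latency, while plain HDFS is oriented toward streaming whole blocks. It uses a later HBase client API rather than the 0.20-era API mentioned above, and the table name, row key, and column names are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LowLatencyRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Connection setup is relatively expensive; the per-row Get below is the
            // operation that stays in the low-millisecond range once region locations are cached.
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("webtable"))) {   // hypothetical table
                Get get = new Get(Bytes.toBytes("row-00042"));                   // hypothetical row key
                get.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"));  // hypothetical column
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
                System.out.println(value == null ? "miss" : Bytes.toString(value));
            }
        }
    }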

A large number of small files

Because the Namenode keeps the file system's metadata in memory, the number of files the file system can hold is limited by the Namenode's memory size. As a rule of thumb, each file, directory, and Block occupies about 150 bytes, so 1 million files, each occupying one Block, amount to roughly 2 million objects and need at least 300 MB of memory. Millions of files are feasible today, but scaling to billions is not possible at the current hardware level. Another problem is that the number of Map tasks is determined by the number of splits, so using MapReduce to process a large number of small files generates far too many Map tasks, and the thread-management overhead increases job time. For example, when processing 10,000 MB of data, if each split is 1 MB there will be 10,000 Map tasks and a great deal of thread overhead; if each split is 100 MB there will be only 100 Map tasks, each Map task will do more work, and the thread-management overhead drops sharply.
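As a rough illustration of the split-size point above, here is a hedged sketch of a MapReduce job that combines many small input files into larger splits using CombineTextInputFormat. The 128 MB target size, job name, and input/output paths are assumptions made for the example, not anything prescribed by the article.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFilesJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files"); // hypothetical job name
            job.setJarByClass(CombineSmallFilesJob.class);

            // Pack many small files into splits of at most ~128 MB each, so that
            // 10,000 MB of input yields on the order of 100 map tasks instead of thousands.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            // The default identity Mapper/Reducer are used; the point of the sketch
            // is only the split configuration above.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            CombineTextInputFormat.addInputPath(job, new Path("/data/small-files"));  // hypothetical input dir
            FileOutputFormat.setOutputPath(job, new Path("/data/combined-out"));      // hypothetical output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }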

There are several ways to handle small files in HDFS:

1. Use SequenceFile, MapFile, Har (Hadoop Archive), and similar formats to archive small files. The principle of this method is to manage small files by packing them into archives; HBase is based on this idea. With this method, if you want to retrieve the contents of an original small file, you must know its mapping into the archive file. A minimal SequenceFile sketch is given after this list.

2. Scale out. A single Hadoop cluster can manage only a limited number of small files, so place several Hadoop clusters behind a virtual server to form one large Hadoop cluster. Google has done something similar.

3. Multi-Master design, whose effect is obvious. GFS II, which was under development, was also being changed to a distributed multi-Master design; it also supports Master failover, and its Block size was changed to 1 MB, deliberately optimizing the handling of small files.

Related to this is an Alibaba DFS design, which is also a multi-Master design. It separates the storage and the management of the metadata mappings and consists of multiple metadata storage nodes and a single query Master node.

(The Alibaba DFS design document is currently not available for download; it was shared in the Hadoop technology exchange QQ group 60534259.)
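As a minimal sketch of approach 1, the snippet below packs a directory of small files into a single SequenceFile, with each file's path as the key and its raw bytes as the value. The input and output paths are hypothetical and error handling is omitted; as noted above, retrieving an individual small file later requires knowing (or indexing) its mapping into the archive.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path("/data/small-files");          // hypothetical input directory
            Path archive  = new Path("/archive/small-files.seq");   // hypothetical output archive

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(archive),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (status.isDirectory()) continue;
                    byte[] buf = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, buf, 0, buf.length);   // read the whole small file
                    }
                    // Key = original file path, value = file contents.
                    writer.append(new Text(status.getPath().toString()), new BytesWritable(buf));
                }
            }
        }
    }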

Multi-user write, arbitrary file modification

Currently, Hadoop supports only a single writer per file and does not support concurrent writes by multiple users. You can use the append operation to add data to the end of a file, but modifications at arbitrary positions within a file are not supported. These features may be added in future releases, but adding them would reduce Hadoop's efficiency. Take GFS as an example: Google's own engineers are reportedly unhappy with using multiple writers.
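The append operation mentioned above can be sketched as follows. This assumes a Hadoop version and configuration in which append is enabled, and the file path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path logFile = new Path("/logs/events.log");   // hypothetical existing file

            // Only appending to the end of an existing file is supported;
            // there is no API to modify bytes in the middle of a file.
            try (FSDataOutputStream out = fs.append(logFile)) {
                out.writeBytes("new record appended to the end of the file\n");
            }
        }
    }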

Distributed coordination services such as Chubby and ZooKeeper can be used to solve the resulting consistency problems.
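As an illustration of that idea, here is a hedged sketch that uses the Apache Curator client for ZooKeeper to serialize writers with a per-file lock. This is not an HDFS feature, only one possible way a coordination service could enforce a single writer; the ensemble address and lock path are hypothetical.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class CoordinatedWriter {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper ensemble address.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // One lock node per file; only the current lock holder is allowed to write.
            InterProcessMutex lock = new InterProcessMutex(client, "/locks/data/file1"); // hypothetical path
            lock.acquire();
            try {
                // ... perform the single-writer append/write to HDFS here ...
            } finally {
                lock.release();
                client.close();
            }
        }
    }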

That is all of the content of this article, "What are the shortcomings of HDFS and its improvement strategies". Thank you for reading! I hope the shared content has been helpful to you; if you want to learn more, you are welcome to follow the industry information channel.
