This article looks at the problems that Hadoop cannot solve. The content is concise and easy to follow, and I hope the walkthrough below gives you something to take away.
I picked up Hadoop because a project needed it, and, as with every overheated technology, words like "big data" and "massive scale" are flying all over the Internet. Hadoop is an excellent distributed programming framework, well designed, and at present there is no alternative of the same level and weight. At work I have also used an internal framework that encapsulates and customizes Hadoop to fit business needs better. I had been meaning to write up my experience of learning and using Hadoop, but with so many such articles already online, yet another set of notes seemed pointless. Better to step back from the general enthusiasm and ask instead: which problems is Hadoop not suited to solve?
Start from the standard Hadoop architecture diagram. Map and Reduce are the two basic processing stages; before them come the input data format definition and the splitting of data into input slices, and after them the output data format definition. Between the two you can also add a combine step (a local reduce on the map side) and a partition policy that redirects mapper output to particular reducers. Customizations and enhancements that are commonly added include the following (a minimal sketch of how a combiner and partitioner are wired into a job appears after the list):
Enhancements around input and output data. For example, a data set management layer can unify and merge the various data sets, and even apply filtering as a first screening pass; in practice the business has many kinds of core data sources.
Extensions to the data slicing (input split) strategy, since we often need to process data that shares certain business traits.
Extensions to combine and partition, mainly because some policy implementations recur across many Hadoop jobs.
Extensions to the monitoring tools; I have seen customized monitoring tools inside other companies as well.
Enhancements to the communication protocols and the file system. The file system in particular can be made to feel close to local commands, and such customizations can also be found online.
Further encapsulation of the data access programming interface, mainly to make it more business-friendly and convenient to use.
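To make that pipeline concrete, here is a minimal driver sketch, assuming the standard org.apache.hadoop.mapreduce API and its library classes TokenCounterMapper and IntSumReducer; the HotKeyPartitioner and its "hot_" key prefix are hypothetical, only there to show where a partitioning policy plugs in.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class PipelineSketch {

    // Partition policy (hypothetical): keys with a "hot_" prefix go to reducer 0,
    // all other keys are hashed across the remaining reducers.
    public static class HotKeyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (numPartitions == 1 || key.toString().startsWith("hot_")) {
                return 0;
            }
            return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pipeline-sketch");
        job.setJarByClass(PipelineSketch.class);
        job.setMapperClass(TokenCounterMapper.class);     // map: emit (token, 1)
        job.setCombinerClass(IntSumReducer.class);        // combine: local reduce on map output
        job.setPartitionerClass(HotKeyPartitioner.class); // partition: redirect mapper output
        job.setReducerClass(IntSumReducer.class);         // reduce: sum counts per token
        job.setNumReduceTasks(4);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input format and slicing
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output format definition
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The customizations listed above attach to exactly this skeleton: a custom InputFormat replaces the default at the input step, and combiner or partitioner policies are swapped in with a single setter each.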
To some extent these customizations reflect limitations of Hadoop in real-world use, or things its design did not have time to cover, but those are minor problems that can be fixed through customization and extension. There are, however, problems that Hadoop inherently cannot solve, or at least is not suited to solve.
1. Most importantly, a problem Hadoop can solve must fit MapReduce. That carries two specific requirements. First, the problem must be splittable; some problems look big but are very hard to split. Second, the sub-problems must be independent. Many Hadoop textbooks cite the Fibonacci sequence as a counterexample: each step of the computation is not independent but depends on the results of the preceding steps. In other words, if a big problem cannot be divided into independent smaller problems, there is no way to use Hadoop at all.
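To see the dependence concretely: the recurrence is F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1, so the value at step n cannot be computed until the two preceding values exist, and there is no way to hand independent slices of the sequence to separate mappers.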
2. The data does not fit a key-value pattern. In Hadoop in Action the author compares Hadoop with relational databases and notes that structured data queries are a poor fit for Hadoop (even though tools such as Hive mimic ANSI SQL syntax). Even then, the performance overhead does not compare well with an ordinary relational database, queries with complex combinations of conditions are nowhere near as expressive as in SQL, and writing the code by hand is time-consuming.
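As a rough illustration of that coding overhead, the sketch below (with a made-up CSV column layout: customer id in column 0, amount in column 2) computes what a relational database expresses in one line, SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id; a driver like the earlier one is still needed on top of it.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GroupBySum {

    // Map: flatten each structured row into a (customer_id, amount) key-value pair.
    public static class OrderMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Text customer = new Text();
        private final DoubleWritable amount = new DoubleWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split(",");  // hypothetical column layout
            if (cols.length < 3) {
                return;                                   // skip malformed rows
            }
            customer.set(cols[0]);
            amount.set(Double.parseDouble(cols[2]));
            context.write(customer, amount);
        }
    }

    // Reduce: sum every amount that arrived under the same customer key.
    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        private final DoubleWritable total = new DoubleWritable();

        @Override
        protected void reduce(Text customer, Iterable<DoubleWritable> amounts, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable a : amounts) {
                sum += a.get();
            }
            total.set(sum);
            context.write(customer, total);
        }
    }
}

What is one declarative line in SQL becomes two classes plus a driver, and the job still pays the full startup and shuffle cost when it runs.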
3. Hadoop is not suitable for processing large numbers of small files. This is really a limitation of the NameNode: when files are too small, the metadata kept in the NameNode takes up a disproportionate amount of space, and the memory or disk overhead becomes very large. When each task processes a large file, fixed costs such as JVM startup and initialization, cleanup after the task finishes, and framework work like the shuffle are small relative to the useful work; otherwise they dominate and throughput drops. (Someone ran an experiment on this; see: link.)
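A back-of-the-envelope estimate, assuming the commonly cited ballpark of roughly 150 bytes of NameNode heap per file object and per block object, shows the scale of the problem: 100 million one-block files cost about 100,000,000 x (150 B + 150 B), roughly 30 GB of NameNode memory for metadata alone, whereas packing the same data into far fewer large, multi-block files shrinks that footprint by orders of magnitude.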
4. Hadoop is not suitable for tasks that require timely responses or that face highly concurrent requests. This is easy to understand given the overheads already mentioned, JVM startup, initialization and so on: a job whose tasks do nothing at all still takes a few minutes to run end to end.
5. Hadoop is meant for genuinely "big data", turning scale-up into scale-out. With a couple of small machines, or only a few GB or a few tens of GB of data, using Hadoop is clumsy. An asynchronous system is also inherently less intuitive than a synchronous one, so the cost of maintenance will not be low either.
Those are the kinds of problems Hadoop cannot solve. I hope the discussion above has given you something useful to take away.