
What is the development trend of MapReduce and Hadoop?


The editor would like to share this piece on the development trends of MapReduce and Hadoop. Most readers may not know much about the topic, so the article is offered for reference; we hope you learn a lot from it.

MapReduce and Hadoop

Hadoop

Hadoop is an open source distributed computing platform, composed mainly of a MapReduce execution engine and a distributed file system. InfoQ previously published a summary article by Jeremy Zawodny about Hadoop's speed improvements. This time, Scott Delap, senior Java editor at InfoQ, conducted an exclusive interview with Doug Cutting, the Hadoop project lead. In the interview, Cutting discusses how Hadoop is used at Yahoo, the challenges encountered in Hadoop's development, and the future direction of the project.
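To make the MapReduce programming model concrete, here is a minimal word-count job sketched against the classic org.apache.hadoop.mapred API of that era. It is an illustrative example, not code from the interview: the class name and the command-line input/output paths are invented for the sketch.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map step: emit (word, 1) for every token in a line of input.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not already exist
    JobClient.runJob(conf);
  }
}
```

The framework splits the input across the cluster, runs the map function on each split, shuffles and sorts the intermediate pairs by key, and feeds each key's values to a single reduce call, which is why the same code can scale from one machine to thousands.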

Scott Delap (SD): Does Hadoop already serve some of Yahoo's functions as an official product? If not, what plans are there to move Hadoop from an experimental project to a core infrastructure component?

Doug Cutting (DC): Yahoo regularly uses Hadoop in its search business to improve products and services, such as ranking functions and targeted advertising. There are also cases where Hadoop is used directly for data generation. Hadoop's long-term goal is to provide distributed computing tools, as well as web-scale services that support next-generation applications such as search-results analysis.

SD: What is the size of the Yahoo team responsible for the Hadoop project? How many other active code contributors are there besides Yahoo insiders?

DC: Yahoo has a dedicated team directly responsible for Hadoop's development, while active contributors to Apache open source projects generally have day jobs of their own. Even so, a number of non-Yahoo contributors still commit work to Hadoop on a monthly, weekly, or even daily basis.

SD: Yahoo has insisted on a different approach to scalable infrastructure than Google. Although Google has published many technical papers, its implementations are not available to the general public. Why do you think open source is the right direction?

DC: Running an open source project requires two conditions: first, everyone involved shares a common understanding of what the project does; second, there is an easy-to-understand, well-documented design. Because infrastructure software is used across so many fields, this kind of open source software develops particularly well. Yahoo already uses and supports infrastructure software such as FreeBSD, Linux, Apache, PHP, and MySQL, so anyone can use Hadoop to help Yahoo improve the state of the art in building large-scale distributed systems. The source code is only a small part of the problem; beyond that, an organization needs a very strong engineering team to solve major problems and put solutions into practice, and the ability to properly release and operate the infrastructure matters just as much. At present, few companies have all of these resources. As a result, software engineers are willing to work on open source projects, meet like-minded people in a large community, and pick up shared skills they can apply to other projects later. Such a community environment is an excellent place to train new, outstanding engineers. Both Yahoo and the Hadoop community benefit from this collaboration: we come to better understand what large-scale distributed computing requires, and we share our expertise and technology to build a solution that everyone can use and modify.

SD: Turning back to the technology itself: as Hadoop has developed over recent years, what factors do you think have affected its speed and stability? I noticed that the 500-record sort benchmark is now 20 times faster than last year. Is that due to a huge improvement in one component, or the combined optimization of many parts?

DC: In dealing with web-scale server software, Yahoo found that other companies and organizations tackling the same problems were achieving similar results, so Yahoo decided to open source its work rather than continue developing it as proprietary software. Yahoo then hired me to lead the project, and to date Yahoo has contributed most of the code.

As for the speed improvements, they are the sum of efforts made over the past few years, tested again and again. On a server cluster of a given size we get the system running very smoothly, then experiment with a cluster twice that size. Our goal is for performance to scale linearly with cluster size. We keep learning from this process and increase the cluster size again. With each increase, more and more kinds of errors appear, so stability becomes a major issue.

Each time we do this, we learn what is achievable and what experience we can contribute to the open source public knowledge base of grid computing. As cluster size grows, a variety of new failures keep occurring and rare errors become common ones, all of which must be solved. What we learn in this process shapes our next round of trial and error.

SD: Hadoop has been able to run on Amazon EC2 since last year, which lets developers quickly build their own server clusters. Is there any extra work required to manage such a cluster, HDFS, and MapReduce processing?

DC: Yahoo has a project called HOD (Hadoop on Demand) that allows MapReduce to run on very ordinary machines; it is also an open source project still under construction. Because running a large cluster is complex and resource-intensive, Amazon EC2 is a very good platform for ordinary people to get started with Hadoop.

SD: How does Hadoop's functionality objectively compare with what Google has published? Have any new features come out of the optimization work, from the program level down to the data level?

DC: Over the past decade, many large companies (including Yahoo) and research institutions have been developing and studying large-scale distributed computing software, and interest in this kind of work has grown even higher recently with the arrival of inexpensive commodity computing in the consumer market. Unlike Google, Yahoo chose to develop Hadoop as completely open source, so that anyone can use and modify the software for free. Hadoop's goals extend beyond replicating any existing technology: we are committed to building Hadoop into a system that is useful to everyone. We have implemented most of what Google has published, plus many other things that have not been mentioned. Yahoo plays a role in this project because its goals align closely with our needs, and we understand the significance of sharing this technology with the world.

SD: The current official release is 0.13.1. Will there be any major new features in the future? What kind of work will version 1.0 accomplish?

DC: Version 0.14.0 will include as many as 218 changes. The most significant change to the system is a direct improvement to data integrity. This is invisible to users, but important for the future development of the whole system: at our data and cluster sizes, both memory and disk problems occur frequently enough to become a crisis. We have also added the ability to change file times, a C++ API for MapReduce, a number of other features, and many bug fixes.
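As a small illustration of the file-time feature mentioned above, the sketch below uses the FileSystem.setTimes call as it exists in later Hadoop releases; the exact API surface in 0.14 may have differed, and the class name and path argument are invented for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TouchFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up the cluster's site configuration
    FileSystem fs = FileSystem.get(conf);       // the configured default file system (e.g. HDFS)

    Path p = new Path(args[0]);
    long now = System.currentTimeMillis();

    // setTimes(path, modificationTime, accessTime);
    // passing -1 leaves that field unchanged.
    fs.setTimes(p, now, -1);
  }
}
```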

Hadoop 0.15.0 is also taking shape, with 88 changes planned. This version adds authentication and authorization to the file system, making access to information within a server cluster more secure. We also plan to revise a large number of MapReduce's APIs. 0.15.0 will be a difficult release because it requires users to change their applications, which we hope they can do in one step. We also hope that 0.15 will be the last release before 1.0. After 1.0 we will be very conservative and will not suddenly make big changes. We will also pay close attention to backward compatibility, which becomes even more important for version 1.0: any code written for version 1.0 should continue to run on 1.x releases. So we need to make sure our existing APIs can be extended cleanly in future versions, and we will try to get that right in 0.15.
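To give a sense of what file-system authorization looks like to an application, here is a minimal sketch against the HDFS permissions API that eventually shipped; the path, owner, and group names are hypothetical, and this is not code from the interview.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class RestrictFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path(args[0]);

    // rw------- : only the file's owner may read or write it.
    fs.setPermission(p, new FsPermission((short) 0600));

    // Reassigning ownership typically requires superuser privileges.
    fs.setOwner(p, "alice", "analysts");
  }
}
```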

That concludes this article on the development trends of MapReduce and Hadoop. Thank you for reading! We hope the content shared here has been helpful to you.
