
What technical support does Hadoop need?


This article shares what technical support Hadoop needs. The editor thinks it is very practical, so it is shared here for your reference; follow along and have a look.

Hadoop is an open source software framework that can be installed on a cluster of commodity machines so that the machines can communicate with each other and work together to store and process large amounts of data in a highly distributed manner. Initially, Hadoop consisted of two main components: the Hadoop Distributed File System (HDFS) and a distributed computing engine that supports implementing and running programs as MapReduce jobs.

Hadoop also provides the software infrastructure to run a MapReduce job as a series of map and reduce tasks. Each map task calls the map function on a subset of the input data. After these calls complete, the reduce tasks begin calling the reduce function on the intermediate data generated by the map functions to produce the final output. Map and reduce tasks run independently of each other, which enables parallel and fault-tolerant computation.
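To make the map/reduce division concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API. The class names WordCountMapper and WordCountReducer are illustrative, not from the original article:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: invoked once per input record; emits (word, 1) for every token.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate (key, value) pair
        }
    }
}

// Reducer: invoked once per unique key, with all intermediate values for it.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final output pair
    }
}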

Most importantly, the Hadoop infrastructure handles all the complex aspects of distributed processing: parallelization, scheduling, resource management, machine-to-machine communication, software and hardware fault handling, and so on. Thanks to this clean abstraction, implementing distributed applications that process terabytes of data on hundreds (or even thousands) of machines has never been easier, even for developers with no previous experience of distributed systems.
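The driver for the hypothetical WordCount sketch above illustrates how little distributed-systems code the developer actually writes; the framework does the rest:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Everything else -- input splitting, task scheduling, the shuffle,
        // retries on failure -- is handled by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}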

[Figure: MapReduce process diagram]

Shuffle and combine

The whole shuffle consists of the following parts: the map-side shuffle, the sort stage, and the reduce-side shuffle. In other words, the shuffle spans both map and reduce, with the sort phase in the middle; it is the process that carries data from the output of a map task to the input of a reduce task.

Sorting and combining happen on the map side. A combine is essentially an early reduce performed on map output, and it must be configured explicitly, as the sketch below shows.
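In the Java API this is a single explicit call, placed in the WordCountDriver sketched earlier before job submission. Reusing the reducer as the combiner is only valid here because summing counts is commutative and associative:

// A combiner is an "early reduce" applied to map output before the shuffle.
// It is off by default and must be set explicitly.
job.setCombinerClass(WordCountReducer.class);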

In a Hadoop cluster, most map tasks and reduce tasks execute on different nodes, so in many cases a reduce task must pull map output from other nodes across the network. If many jobs are running in the cluster, the normal execution of tasks can consume a lot of network resources inside the cluster. Some of this consumption is unavoidable; the goal is to minimize the part that is not. Within a node, disk I/O (compared with memory) also has a considerable impact on job completion time. At the most basic level, tuning the shuffle for MapReduce job performance aims to:

Pull data completely from the map side to the reduce side.

When pulling data across nodes, reduce unnecessary bandwidth consumption as much as possible.

Reduce the impact of disk I/O on task execution.

Generally speaking, the shuffle can be optimized mainly by reducing the amount of data pulled and by using memory instead of disk wherever possible; a configuration sketch follows.
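As one illustration of those two levers, the sketch below sets a few shuffle-related properties through the Java Configuration API. The property keys are standard Hadoop 2.x names, but the values shown are assumptions that would need tuning for a real cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

class ShuffleTuning {
    // Returns a Configuration with shuffle-oriented settings applied.
    static Configuration tuned() {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut cross-node shuffle bandwidth.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        // Larger in-memory sort buffer (MB), so fewer map-side spills hit disk.
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // Start spilling to disk only once the buffer is 90% full.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        // More parallel copier threads pulling map output on the reduce side.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        return conf;
    }
}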

YARN

In YARN, the components of classic MapReduce are replaced as follows:

A ResourceManager replaces the cluster manager.

An ApplicationMaster replaces a dedicated, short-lived JobTracker.

A NodeManager replaces the TaskTracker.

A distributed application replaces a MapReduce job.

A global ResourceManager runs as the master daemon, usually on a dedicated machine, and arbitrates available cluster resources among competing applications.

When a user submits an application, a lightweight process instance called the ApplicationMaster is started to coordinate the execution of all tasks within the application. This includes monitoring tasks, restarting failed tasks, speculatively running slow tasks, and aggregating application counter values. Interestingly, an ApplicationMaster can run any type of task inside a container.

The NodeManager is a more general and efficient version of the TaskTracker. Instead of a fixed number of map and reduce slots, the NodeManager manages many dynamically created resource containers.
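A brief illustration of that difference: container capacity is declared per node rather than as fixed slots. The property keys below are standard YARN settings; the sizes are made-up examples that depend on the node's hardware:

import org.apache.hadoop.conf.Configuration;

class YarnContainerConfig {
    // Illustrative values only; real sizes depend on the node's hardware.
    static Configuration nodeResources() {
        Configuration conf = new Configuration();
        // Total memory and CPU a NodeManager may hand out as containers.
        conf.setInt("yarn.nodemanager.resource.memory-mb", 16384);
        conf.setInt("yarn.nodemanager.resource.cpu-vcores", 8);
        // Bounds on what a single container request may ask for.
        conf.setInt("yarn.scheduler.minimum-allocation-mb", 1024);
        conf.setInt("yarn.scheduler.maximum-allocation-mb", 8192);
        return conf;
    }
}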

Big data Hadoop vendors include Amazon Web Services, Cloudera, Hortonworks, IBM, MapR Technologies, Huawei and Fast Search. These vendors build on the Apache open source projects and then add packaging, support, integration and other features, as well as their own innovations.

The Fast Big Data General Computing Platform (DKH) integrates all the components of the development framework under a single version number. When the fast development framework is deployed on an open source big data framework, the platform needs the following component support:

Data sources and SQL engine: DK.Hadoop, Spark, Hive, Sqoop, Flume, Kafka

Data acquisition: DK.Hadoop

Data processing module: DK.Hadoop, Spark, Storm, Hive

Machine learning and AI: DK.Hadoop, Spark

NLP module: supported directly by uploading a server-side JAR package

Search engine module: not released independently

Thank you for reading! This concludes the article on "What technical support does Hadoop need?". I hope the above content has been of some help and leaves you knowing a little more. If you think the article is good, share it for more people to see!
