This article mainly explains what the Hadoop design concept is. The material introduced here is simple, fast, and practical, so interested readers may wish to take a look.
I. Introduction to Hadoop
Apache Hadoop is currently the most popular software framework for distributed storage and processing of large data sets through a simple, high-level programming model. Hadoop is an open-source project of the Apache Software Foundation that can be installed on a cluster of servers so that these servers can communicate and work together to store and process large data sets. Hadoop has become very successful in recent years thanks to its ability to process big data efficiently. It allows companies to store all of their data in one system and analyze that data, which would otherwise be impossible or prohibitively expensive with traditional solutions.
Many tools built around Hadoop offer a wide variety of processing techniques. They integrate well with Hadoop and with the surrounding systems and utilities, making practical work with Hadoop easier and more efficient. Together, these tools make up the Hadoop ecosystem.
You can think of Hadoop as a big data operating system that lets you run different types of workloads over the same huge data sets, ranging from offline batch processing through machine learning to real-time stream processing.
II. Hadoop Design Concept
To address the challenges of processing and storing large data sets, Hadoop is built around the following core characteristics:
Distribution - Storage and processing are not built on one large supercomputer but are spread across a set of smaller machines that communicate and work together.
Horizontal scalability - A Hadoop cluster can be scaled simply by adding new machines. Each new machine proportionally increases the total storage and processing power of the cluster.
Fault tolerance - Hadoop continues to function even when a few hardware or software components fail.
Cost optimization - Hadoop does not require expensive high-end servers and works properly without commercial licenses.
Programming abstractions - Hadoop takes care of all the messy details of distributed computing. Thanks to its high-level APIs, users can focus on implementing the business logic that solves their real-world problems (see the sketch after this list).
Data locality - Hadoop does not move large data sets to where an application is running; instead, it runs the application where the data already resides.
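To make the programming-abstraction point concrete, here is a minimal sketch of the classic WordCount MapReduce job in Java. It assumes the Hadoop MapReduce client libraries are on the classpath and that input and output paths are passed as command-line arguments; splitting the input, scheduling tasks near the data, shuffling intermediate results, and recovering from failures are all handled by the framework.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in a line of input.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Notice that the user code contains no networking, threading, or retry logic: only the map and reduce functions that express the business logic.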
III. Hadoop Components
Hadoop is divided into two core components: HDFS, a distributed file system, and YARN, a cluster resource management technology.
1. HDFS
HDFS is the Hadoop distributed file system. It can run on as many servers as you need; HDFS easily scales to thousands of nodes and petabytes of data. The larger an HDFS cluster grows, the greater the probability that some disks, servers, or network switches will fail. HDFS survives these kinds of failures by replicating data across multiple servers. It automatically detects when a given component has failed and takes the necessary recovery actions, all of which happen transparently to the user.
HDFS is designed to store large files of hundreds of megabytes or gigabytes and to provide high-throughput streaming access to them. Last but not least, HDFS supports the write-once-read-many model, for which it works like a charm. However, if you need to store a large number of small files with random read and write access, other systems such as an RDBMS or Apache HBase will do a better job.
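As a quick illustration of the write-once-read-many model, the following minimal Java sketch writes a file to HDFS and streams it back using Hadoop's FileSystem API. The path /data/example.txt is purely illustrative, and the cluster address is assumed to come from a core-site.xml on the classpath.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and other settings from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.txt"); // illustrative path

    // Write once: create the file and stream data into it.
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: open the file and stream its contents to stdout.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}

Behind these few calls, HDFS splits the file into blocks, replicates them across DataNodes, and reroutes reads if a replica fails, all transparently to the caller.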
2. YARN
YARN (Yet Another Resource Negotiator) manages resources on a Hadoop cluster and supports running distributed applications that process the data stored on HDFS. Like HDFS, YARN follows a master-slave design, with a ResourceManager process acting as the master and multiple NodeManagers acting as workers. They have the following responsibilities:
(1) ResourceManager
Tracks the live NodeManagers and the amount of compute resources available on each server in the cluster. Allocates the available resources to applications. Monitors the execution of all applications on the Hadoop cluster.
(2) NodeManager
Manages the compute resources (RAM and CPU) on an individual node in the Hadoop cluster. Runs tasks for various applications and enforces that they stay within the limits of their assigned compute resources.
YARN allocates cluster resources to applications in the form of resource containers, each of which represents a combination of an amount of RAM and a number of CPU cores, as the sketch below shows.
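The following Java sketch (not a complete ApplicationMaster) shows how an application would describe and queue such a container request through the AMRMClient API; the 2048 MB of RAM and 1 virtual core are illustrative values.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    // Client that an ApplicationMaster uses to talk to the ResourceManager.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // A container is a bundle of RAM and CPU: here, 2048 MB and 1 vcore.
    Resource capability = Resource.newInstance(2048, 1);
    ContainerRequest request =
        new ContainerRequest(capability, null, null, Priority.newInstance(0));
    rmClient.addContainerRequest(request);

    // A real ApplicationMaster would first register with the ResourceManager,
    // then receive allocated containers via rmClient.allocate(...) and launch
    // work on them through an NMClient.
  }
}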
Hadoop = HDFS + YARN
The HDFS and YARN daemons running on the same cluster provide us with a powerful platform for storing and processing large data sets.
At this point, I believe everyone has a deeper understanding of what the Hadoop design concept is. Now let's try it out in practice!