2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article mainly introduces what Hadoop is. Many people have questions about Hadoop in their daily work, so the editor has consulted a variety of materials and put together a simple, easy-to-follow explanation. I hope it helps answer your doubts about "what is Hadoop?" Now, let's study together!
In 2011, Baidu received only a handful of Hadoop-related searches per day; by 2015 the number had grown to more than 8 million, and it has since exceeded 100 million. Hadoop has become essential infrastructure for big data. It is widely recognized as the industry's standard open-source big data software, providing the ability to process massive data in a distributed environment. Almost all major vendors build their development tools, open-source software, commercial tools, and technical services around Hadoop. In recent years, large IT companies such as EMC, Microsoft, Intel, Teradata, and Cisco have significantly increased their investment in Hadoop. So what exactly is Hadoop? What does it do? What does its architecture look like? Today, let's go through these basic concepts of Hadoop.
What is Hadoop?
Hadoop is a distributed system infrastructure developed by the Apache Foundation: a software framework combining a storage system with a computing framework. It mainly solves the problems of storing and computing over massive data, and it is the cornerstone of big data technology. Hadoop processes data in a reliable, efficient, and scalable way. Users can develop distributed programs without knowing the underlying details of distribution, and can easily develop and run applications that handle massive data on Hadoop.
What problems does Hadoop solve?
1. Massive data storage
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to data, making it suitable for applications with very large data sets. An HDFS cluster consists of many machines running the DataNode process and one machine running the NameNode process (plus a standby). Each DataNode manages a portion of the data, while the NameNode manages the metadata for the entire HDFS cluster.
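The division of labor above can be sketched in a few lines of plain Python. This is a toy model, not the real HDFS API: the block size, replica placement, and naming scheme below are illustrative assumptions (real HDFS defaults to 128 MB blocks and 3 replicas, with rack-aware placement).

```python
# Toy model of HDFS storage: the "NameNode" dict holds only metadata
# (file -> list of block ids), while the "DataNode" dicts hold the bytes.

BLOCK_SIZE = 4   # bytes, tiny for demonstration (real HDFS default: 128 MB)
REPLICATION = 2  # replicas per block (real HDFS default: 3)

datanodes = {f"dn{i}": {} for i in range(3)}  # node -> {block_id: bytes}
namenode = {}                                 # file path -> list of block ids

def hdfs_put(path, data):
    """Split data into blocks, replicate each block, record metadata."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    namenode[path] = []
    for n, block in enumerate(blocks):
        block_id = f"{path}#blk{n}"
        namenode[path].append(block_id)
        # place replicas on the first REPLICATION DataNodes
        # (real HDFS uses rack-aware placement instead)
        for dn in list(datanodes)[:REPLICATION]:
            datanodes[dn][block_id] = block

def hdfs_read(path):
    """Ask the NameNode for block ids, then fetch each block from any replica."""
    out = b""
    for block_id in namenode[path]:
        holder = next(dn for dn in datanodes if block_id in datanodes[dn])
        out += datanodes[holder][block_id]
    return out

hdfs_put("/logs/a.txt", b"hello hadoop")
print(hdfs_read("/logs/a.txt"))  # b'hello hadoop'
```

Note how a read never touches the NameNode for actual data: it only asks "which blocks, on which nodes," which is why the NameNode can serve a huge cluster from in-memory metadata.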
2. Resource management, scheduling and allocation
Apache Hadoop YARN (Yet Another Resource Negotiator) is Hadoop's resource manager. It is a general-purpose resource management system and scheduling platform that provides unified resource management and scheduling for upper-level applications. Its introduction has brought great benefits to clusters in terms of utilization, unified resource management, and data sharing.
What is the architecture of Hadoop components?
Having covered the basic introduction to Hadoop, let's look at the core architecture and principles of HDFS and YARN, starting with the HDFS framework diagram:
After looking at the diagram above, let's consider a few questions:
1. What is the metadata, how does the NameNode maintain it, and how is its consistency ensured?
The NameNode maintains the metadata of the HDFS cluster, including the directory tree of files, the list of blocks for each file, permission settings, the number of replicas, and so on.
Since this metadata is kept in memory, what happens if the NameNode goes down unexpectedly?
The NameNode's modification of metadata consists of two parts:
1. Modify the data in memory
2. Write a record to the EditLog after modifying memory
Let's look at two concepts, FsImage and EditLog:
FsImage: the FsImage is a mirror file of the metadata in NameNode memory and a permanent checkpoint of that metadata. It contains the serialized information of all directories and files in HDFS. It can be compared to a bank account balance: just a simple snapshot.
EditLog: the EditLog is the operation log that bridges the in-memory metadata and the FsImage. It records all operations on the HDFS file system since the last checkpoint, such as adding files, renaming files, and deleting directories. It can be compared to a bank account's transaction records, one entry per operation; accumulated over time, this log can grow very large.
If the EditLog becomes very large, recovering the metadata after a crash requires replaying the whole EditLog, which is a very slow process. This is where the Standby NameNode comes in: it pulls the EditLog from the JournalNode set and periodically merges it into the FsImage. The merged FsImage is the consolidated snapshot of the metadata, and it is uploaded back to the Active NameNode at the same time.
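The write-ahead pattern above can be sketched as a toy in plain Python. This is a conceptual model under the bank-account analogy, not real NameNode code: in real HDFS the checkpoint merge is performed by the Standby NameNode via JournalNodes rather than in-process.

```python
# Toy write-ahead logging: every metadata change is applied in memory AND
# appended to an EditLog; a checkpoint folds the EditLog into the FsImage
# so recovery doesn't have to replay a huge log.

fsimage = {}   # last checkpoint: path -> metadata (the "account balance")
editlog = []   # operations since the last checkpoint (the "transactions")
memory  = {}   # live in-memory metadata the NameNode serves from

def apply_op(op, path, meta=None):
    """Apply a change in memory AND append it to the EditLog (write-ahead)."""
    if op == "create":
        memory[path] = meta
    elif op == "delete":
        memory.pop(path, None)
    editlog.append((op, path, meta))

def checkpoint():
    """Fold the EditLog into the FsImage, then truncate the log."""
    for op, path, meta in editlog:
        if op == "create":
            fsimage[path] = meta
        elif op == "delete":
            fsimage.pop(path, None)
    editlog.clear()

def recover():
    """After a crash: load the FsImage, then replay the short EditLog."""
    state = dict(fsimage)
    for op, path, meta in editlog:
        if op == "create":
            state[path] = meta
        elif op == "delete":
            state.pop(path, None)
    return state

apply_op("create", "/a.txt", {"replicas": 3})
apply_op("create", "/b.txt", {"replicas": 2})
checkpoint()                   # /a.txt and /b.txt now live in the FsImage
apply_op("delete", "/a.txt")   # recorded only in the EditLog so far
print(recover())               # {'/b.txt': {'replicas': 2}}
```

The key invariant is that `recover()` always reproduces `memory`: crashing loses nothing because every change hit the log before it was considered done, and checkpointing keeps the replay short.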
2. How do the Active and Standby NameNodes switch roles while always keeping exactly one Active node?
As we can see in the HDFS framework diagram above, the ZKFC component links the ZooKeeper cluster to the NameNodes:
1. ZKFC monitors the health status of its NameNode
2. ZKFC uses the active/standby election provided by ZooKeeper to decide the switch
3. ZKFC notifies the NameNode and changes its state
4. The new Active begins serving once metadata synchronization is complete
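The election step can be sketched with an exclusive lock standing in for ZooKeeper. This is a toy illustration of the idea only: real ZooKeeper elections use ephemeral znodes that disappear automatically when a client's session dies, and real ZKFC also performs fencing of the old Active, which is omitted here.

```python
# Toy active/standby election: whichever node grabs the exclusive "active"
# lock serves as Active; when it releases (crashes), the Standby wins the
# next election attempt.

import threading

class MiniZK:
    """Stand-in for ZooKeeper's ephemeral-node based election."""
    def __init__(self):
        self._lock = threading.Lock()
        self.active = None

    def try_elect(self, node):
        """Return True if `node` won the election and is now Active."""
        if self._lock.acquire(blocking=False):
            self.active = node
            return True
        return False

    def release(self, node):
        """Simulate the Active's session dying: the lock is freed."""
        if self.active == node:
            self.active = None
            self._lock.release()

zk = MiniZK()
assert zk.try_elect("nn1") is True    # nn1 becomes Active
assert zk.try_elect("nn2") is False   # nn2 stays Standby
zk.release("nn1")                     # nn1's health check fails
assert zk.try_elect("nn2") is True    # nn2 takes over as Active
```

Because the lock can only be held by one node at a time, the "exactly one Active" invariant from question 2 falls out of the election mechanism itself.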
Let's take a look at the YARN framework diagram:
The figure above depicts the process of submitting a task and allocating resources for it in YARN. The following components are involved:
ResourceManager: responsible for monitoring, allocating, and managing all cluster resources; handles client requests and starts and monitors the AppMaster and NodeManagers
NodeManager: handles resource management and task management on a single node; carries out commands from the ResourceManager and AppMaster
AppMaster: responsible for scheduling and coordinating a specific application, requesting resources for it, and monitoring its tasks
Container: YARN's unit of dynamic resource allocation, encapsulating a certain amount of memory and CPU cores
The overall process of a task submission:
(1) The client submits an application to YARN, including the ApplicationMaster program, the command to start it, the user program, required resources, and so on.
(2) ResourceManager assigns the first Container to the application and communicates with the corresponding NodeManager, asking it to start the application's ApplicationMaster in this Container.
(3) ApplicationMaster first registers with ResourceManager, so that users can view the application's running status directly through ResourceManager; it then requests resources for each task and monitors their running status.
(4) ApplicationMaster applies for and receives resources from ResourceManager via an RPC protocol, using polling.
(5) Once ApplicationMaster obtains a resource, it communicates with the corresponding NodeManager, asking it to start the task.
(6) After NodeManager sets up the running environment for the task (including environment variables, JAR packages, binary programs, etc.), it writes the task startup command into a script and launches the task by running that script.
(7) Each task reports its status and progress to ApplicationMaster through an RPC protocol, so ApplicationMaster always knows the running state of each task and can restart a task when it fails. While the application is running, users can query its current status from ApplicationMaster at any time via RPC.
(8) After the application finishes, ApplicationMaster unregisters from ResourceManager and shuts itself down.
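The eight steps above can be walked through with a toy simulation. This is plain Python, not the YARN API: the class names mirror the components, but the methods, the `wordcount` application name, and the log messages are all invented for illustration.

```python
# Toy walkthrough of the YARN submission flow: client -> ResourceManager ->
# ApplicationMaster in container-0 -> task containers on NodeManagers.

class ResourceManager:
    def __init__(self):
        self.log = []

    def submit(self, app):
        """Steps 1-2: accept the app and launch its AppMaster in container-0."""
        self.log.append(f"RM: launch AppMaster for {app} in container-0")
        return AppMaster(app, self)

    def allocate(self, n):
        """Step 4: hand out n containers to a polling AppMaster."""
        return [f"container-{i + 1}" for i in range(n)]

class AppMaster:
    def __init__(self, app, rm):
        self.app, self.rm = app, rm
        rm.log.append(f"AM[{app}]: registered with RM")        # step 3

    def run_tasks(self, n):
        """Steps 4-7: obtain containers and have NodeManagers start tasks."""
        containers = self.rm.allocate(n)
        for c in containers:
            self.rm.log.append(f"NM: started task of {self.app} in {c}")
        return containers

    def finish(self):
        self.rm.log.append(f"AM[{self.app}]: unregistered")    # step 8

rm = ResourceManager()
am = rm.submit("wordcount")
am.run_tasks(2)
am.finish()
print("\n".join(rm.log))
```

The important structural point the simulation preserves is that the ResourceManager only starts the ApplicationMaster; all per-task scheduling and monitoring is delegated to the AppMaster, which is what lets one cluster run many application types at once.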
Through the above, you should now have a basic impression of Hadoop's core frameworks. From here, you can deepen your understanding of the structure diagrams above with the Hadoop official website and community as you put it to use.
This concludes the study of "what is Hadoop". I hope it resolves your doubts. Combining theory with practice is the best way to learn, so go and try it! To keep learning more related knowledge, please continue to follow the website; the editor will keep working hard to bring you more practical articles!