Infrastructure of Hadoop

2025-01-16 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

The name Hadoop is no longer new to most developers, but many still do not know how it works under the hood: how Hadoop implements distributed storage and distributed computing, and why it improves computing performance so much. This article explains these questions starting from Hadoop's basic working principles. The author is a beginner, so please read charitably and feel free to offer corrections.

I. The Origin of the Name Hadoop

When you search for the keyword Hadoop on the Internet, most of the images that appear are of a cute baby elephant, and many people do not understand the connection between this distributed platform and the elephant. Hadoop is Apache's open-source implementation of Google's GFS (Google File System) and MapReduce. Doug Cutting, the head of the project, later explained that "Hadoop" is not a dictionary word but a made-up one: his young son had a stuffed elephant toy that he called Hadoop (after the way he pronounced it). Doug Cutting found this charming: the word is simple to spell, easy to remember, unlikely to be used elsewhere, and carries no other meaning, which made it very suitable as the name of the open-source project. So the name stuck, and the baby elephant became synonymous with Hadoop.

II. What Is Hadoop?

First, let us understand what distributed computing is. In distributed computing, a very complex calculation, or one involving a very large amount of data, is split into many small calculations that run at the same time, improving computational efficiency and reducing time cost. It stands in contrast to centralized computing, and it is usually associated with computer clusters, on whose resources it depends. More formally, Hadoop is a distributed system infrastructure developed by the Apache Foundation (Baidu encyclopedia). Put simply, Hadoop is a distributed system platform, and among such products it is functionally complete, convenient to use, and the most widely adopted.

Hadoop mainly implements a distributed file system (HDFS) and the corresponding distributed computing model, MapReduce. Hadoop can be deployed on cheap machines while still guaranteeing high fault tolerance; a cluster of such machines provides high data throughput, and the MapReduce model makes full use of that throughput for massive-scale computation. This is why more and more enterprises are inclined to build their own products on Hadoop.

III. The Structure of Hadoop

Viewed from within Hadoop itself, and as mentioned above, Hadoop consists mainly of two parts: the distributed file system HDFS and the distributed computing model MapReduce. HDFS sits at the bottom of Hadoop and mainly provides file storage; MapReduce forms the upper layer and draws its input data mainly from HDFS. The output of a computation is usually saved back to HDFS or to local disk. Both parts are described in more detail later.

HDFS

From the client's point of view, HDFS looks like an ordinary file system. It is very similar in function to the file system in a Linux system, and its file operations resemble the familiar Linux file commands. Users who are familiar with Linux will therefore find HDFS easy to pick up, which is one of the reasons Hadoop has been so widely accepted by developers.
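To make the Linux analogy concrete, here is what typical file operations look like with the standard `hdfs dfs` command-line tool (this assumes a running Hadoop cluster; the paths are hypothetical examples):

```
# Linux: ls /user/hadoop        HDFS equivalent:
hdfs dfs -ls /user/hadoop

# Linux: mkdir, cp, cat         HDFS equivalents:
hdfs dfs -mkdir /user/hadoop/input
hdfs dfs -put localfile.txt /user/hadoop/input/
hdfs dfs -cat /user/hadoop/input/localfile.txt
```

Each subcommand (`-ls`, `-mkdir`, `-put`, `-cat`) deliberately mirrors the name of the corresponding Linux command, which is what makes the learning curve so gentle.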

But from the system's point of view, it is not so simple. HDFS is composed of many servers, and these servers work together to provide distributed storage. File data is stored in blocks, and each block is 64 MB by default (the size is configurable). In other words, a large file is cut into 64 MB blocks that are stored across HDFS, and the final part smaller than 64 MB still occupies a block of its own. Each block has three replicas by default (the replication factor is also configurable), i.e. data redundancy. HDFS has many advantages:
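The block size and replication factor just mentioned correspond to properties in the `hdfs-site.xml` configuration file. A minimal sketch, using the Hadoop 2.x property names (`dfs.blocksize`, `dfs.replication`; older releases used `dfs.block.size`, and newer releases default to 128 MB blocks):

```
<!-- hdfs-site.xml: illustrative settings matching the defaults described above -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value>   <!-- 64 MB, expressed in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- three copies of each block -->
  </property>
</configuration>
```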

(1) Easy to scale: because HDFS is deployed on a cluster, its capacity grows as the cluster grows, and with good cluster management it scales out to thousands of nodes.

(2) High throughput: thanks to its close coupling with the cluster and its good scalability, HDFS can easily store terabytes of data or more, as long as the cluster has the capacity.

(3) High reliability: because HDFS adopts a data-redundancy strategy in which every block is replicated, the failure of some servers does not affect the integrity of the data.

(4) High efficiency: because the data is distributed, and blocks can be moved quickly between nodes to keep them balanced, HDFS supports distributed computing very efficiently.

(5) Low cost: Hadoop is designed to run distributed computing on cheap servers, so even many small companies can afford it.

Disadvantages:

HDFS is not suitable for real-time computing. Its data is written once and read many times, and access latency is relatively high: the original design goal was high throughput, so real-time performance was sacrificed.

MapReduce

MapReduce is the other core technology of Hadoop: distributed computing. The idea is actually very simple: building on HDFS's distributed storage, a sub-computation is set up on each block or logical unit of a file, trading computing space for computing time. The model runs in two steps, Map (mapping) and Reduce (collecting). The Map step establishes a mapping over each piece of data and performs a simple computation on it; the Reduce step collects the outputs of the Map step to form the final result. The details will be described later.
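The two steps can be sketched with the classic word-count example. This is a plain-Python simulation of the Map, shuffle/sort, and Reduce phases, not actual Hadoop API code; in a real cluster the framework would run the map function on each block's DataNode and handle the grouping itself:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one line of input.
    for word in line.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reduce: sum all the counts collected for one key.
    return (word, sum(counts))

def mapreduce_wordcount(lines):
    # Shuffle/sort: gather all intermediate pairs and group them by key,
    # as the MapReduce framework would do between the two phases.
    intermediate = sorted(
        (pair for line in lines for pair in map_phase(line)),
        key=itemgetter(0),
    )
    return dict(
        reduce_phase(word, (count for _, count in pairs))
        for word, pairs in groupby(intermediate, key=itemgetter(0))
    )

print(mapreduce_wordcount(["hello hadoop", "hello hdfs"]))
# → {'hadoop': 1, 'hdfs': 1, 'hello': 2}
```

The key property to notice is that `map_phase` touches each line independently, so in a cluster every block of the file can be mapped in parallel on the node that stores it; only the small intermediate pairs travel over the network.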

From a cluster perspective, Hadoop is composed of a NameNode, DataNodes, a SecondaryNameNode, and Clients, as shown in the following figure (image from Baidu encyclopedia):

The NameNode is the master node of a Hadoop cluster and its control node. The SecondaryNameNode is an auxiliary master node; despite its name, it is not an automatic hot standby for the NameNode. Rather, it periodically merges the NameNode's edit log into the filesystem image (checkpointing), which makes recovery easier if the NameNode fails. The master node itself does not store large amounts of data or provide computing resources; it only maintains the cluster's namespace, which holds the file-to-block mappings, i.e. the metadata, the means of locating files in the file system. DataNodes are the data nodes of the cluster: they store the bulk of the data and provide the computing resources. MapReduce actually runs on the DataNodes, with the NameNode only coordinating. The Client is the user-facing client. Communication within the cluster uses TCP/IP.

That concludes this first introduction to the architecture of Hadoop; more will be added later as needed. Thank you!
