
What are the design features of Hadoop


This article explains what the design features of Hadoop are. The concepts involved are simple and practical, so let's take a look at them together.

History of Hadoop

Hadoop's prototype began in 2002 with Apache Nutch, an open-source search engine written in Java. Nutch provides all the tools needed to run your own search engine, including full-text search and a web crawler.

Then in 2003 Google published a technical paper on the Google File System (GFS), a file system Google designed to store its massive amounts of search data.

In 2004 Nutch founder Doug Cutting implemented a distributed file storage system called NDFS based on Google's GFS paper.

In 2004 Google published a technical paper on MapReduce, a programming model for parallel analysis of large-scale datasets (greater than 1 TB).

In 2005 Doug Cutting implemented MapReduce in the Nutch search engine, based on Google's paper.

In 2006 Yahoo hired Doug Cutting, who evolved NDFS and MapReduce into Hadoop, and Yahoo set up a dedicated team around Doug Cutting to work on Hadoop.

Google and Yahoo have contributed a lot to Hadoop.

Hadoop core

The core of Hadoop is HDFS and MapReduce. They are the underlying foundations rather than finished, high-level applications. Hadoop has many classic subprojects, such as HBase and Hive, which are built on top of HDFS and MapReduce. To understand Hadoop, you must first know what HDFS and MapReduce are.

HDFS

HDFS (Hadoop Distributed File System) is a highly fault-tolerant system suitable for deployment on inexpensive machines. HDFS provides high-throughput data access for applications with very large data sets.

HDFS design features are:

1. Large data files: HDFS is well suited to storing terabyte-scale files or large collections of big data files; for files of only a few gigabytes or less it offers little benefit.

2. Block storage: HDFS splits a complete large file into blocks and stores them on different machines. The benefit is that when a file is read, its blocks can be fetched from multiple hosts at the same time, which is far more efficient than reading from a single host.

3. Streaming data access: files are written once and read many times. Unlike traditional file systems, HDFS does not support modifying file content in place; once written, a file can only be changed by appending content at its end (see the sketch after this list).

4. Cheap hardware: HDFS runs on ordinary commodity PCs, so a company can support a big-data cluster with just dozens of inexpensive machines.
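To make feature 3 concrete, here is a hedged sketch using the HDFS Java client API: the file is created and written once, and the only change allowed afterwards is appending to its end. The NameNode address and file path are assumptions for illustration, and append also requires an HDFS version and configuration that support it.

Java code

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceAppend {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address and file path, for illustration only.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/data/deposits.txt");

        // Write the file once; in-place edits are not supported afterwards.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("120\n4500\n87\n");
        }

        // The only allowed change is appending new content at the end of the file.
        try (FSDataOutputStream appendOut = fs.append(file)) {
            appendOut.writeBytes("999999\n");
        }
        fs.close();
    }
}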

HDFS assumes that any machine can fail. To prevent a failed host from making its block files unreadable, HDFS distributes replicas of each file block across several other hosts; if one host fails, a copy of the block can quickly be found on another.

Key elements of HDFS:

Block: the unit into which a file is divided, usually 64 MB.

NameNode: stores the directory, file, and block metadata of the entire file system, held by a single dedicated host. If that host fails, the NameNode becomes unavailable. Hadoop 2.* supports an active-standby mode: if the primary NameNode fails, a standby host is started to run the NameNode.

DataNode: distributed across inexpensive machines and used to store the block files. The sketch below shows how a client can ask the NameNode where a file's blocks and their replicas live.
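As an illustrative sketch (not taken from the article), the HDFS Java client API lets a program query the NameNode for a file's block layout; getHosts() lists the DataNodes that hold each block's replicas, which is the replication mechanism described above. The connection URI and file path are assumptions.

Java code

import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address and file path, for illustration only.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/data/deposits.txt");

        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers with the block layout; the data itself stays on the DataNodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " replicas on " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}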

MapReduce

MapReduce is a programming model for extracting and analyzing the elements we need from massive source data and returning a result set. Distributing files onto disks is only the first step; extracting and analyzing what we need from that massive data is what MapReduce does.

Here is an example of computing the maximum of a large amount of data: a bank has hundreds of millions of depositors and wants to find out what the largest deposit is. With the traditional single-machine approach we would write something like this:

Java code

Long[] moneys = ...;     // all deposit amounts loaded into one array
Long max = 0L;
for (int i = 0; i < moneys.length; i++) {
    if (moneys[i] > max) {
        max = moneys[i];
    }
}

If the array being processed is small, this implementation has no problems, but it does run into problems when faced with massive amounts of data, which a single machine cannot hold or scan efficiently.

MapReduce does it this way: the numbers are first stored in different blocks; each group of blocks is handled by a Map task, which computes the maximum within its blocks; Reduce then takes the maximum over all the Map results and returns that overall maximum to the user.
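To make the idea concrete, here is a minimal single-machine sketch in plain Java (not the Hadoop API; the data values and names are purely illustrative): the "map" step computes the maximum of each block independently, and the "reduce" step combines the per-block maxima.

Java code

public class MaxByBlocks {
    // "Map" step: compute the maximum within one block of numbers.
    static long blockMax(long[] block) {
        long max = Long.MIN_VALUE;
        for (long m : block) {
            if (m > max) {
                max = m;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // The data set, already split into blocks (illustrative values).
        long[][] blocks = {
            {120L, 4500L, 87L},
            {999999L, 3L},
            {42L, 750L, 68000L}
        };
        // "Reduce" step: take the maximum of the per-block maxima.
        long globalMax = Long.MIN_VALUE;
        for (long[] block : blocks) {
            globalMax = Math.max(globalMax, blockMax(block));
        }
        System.out.println("max = " + globalMax);   // prints max = 999999
    }
}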

The basic principle of MapReduce is to split the analysis of a big data set into small pieces, analyze the pieces one by one (and in parallel), and finally merge the extracted results to get what we want. How the data is partitioned and how the Reduce step is carried out is of course quite complex, but Hadoop already provides that machinery; we only need to write the simple pieces that express our requirements to obtain the data we want.
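As a hedged illustration of how little code that leaves us to write, below is a sketch of the same maximum-deposit job using Hadoop's MapReduce Java API. It assumes the deposit amounts are stored one per line in a text file; the class names and the use of the reducer as a combiner are this sketch's own assumptions, not something given in the article.

Java code

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxDeposit {

    // Map: each input split reads its lines and emits only its local maximum.
    public static class MaxMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private long localMax = Long.MIN_VALUE;

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            String line = value.toString().trim();
            if (!line.isEmpty()) {
                localMax = Math.max(localMax, Long.parseLong(line));
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new Text("max"), new LongWritable(localMax));
        }
    }

    // Reduce: take the maximum over all the per-split maxima.
    public static class MaxReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long max = Long.MIN_VALUE;
            for (LongWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new LongWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "max deposit");
        job.setJarByClass(MaxDeposit.class);
        job.setMapperClass(MaxMapper.class);
        job.setCombinerClass(MaxReducer.class);   // combine per split before the shuffle
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input file or directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}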

At this point, I believe everyone has a deeper understanding of what the design features of Hadoop are. Now go and try it out in practice!
