In this issue, the editor walks you through the operating principles of Hadoop. The article is rich in content and analyzes the subject from a professional perspective; I hope you gain something from reading it.
Hadoop consists of three main components:
1、HDFS
2、MapReduce
3、HBase
The core design of the Hadoop framework is MapReduce and HDFS. The idea of MapReduce was introduced in a Google paper and widely circulated; in one sentence, MapReduce means "decompose the task, then aggregate the results." HDFS stands for Hadoop Distributed File System and provides the underlying storage support for distributed computing.
How MapReduce works can be roughly guessed from its name, with its two verbs Map and Reduce: "Map" decomposes a job into multiple tasks, and "Reduce" summarizes the results of those decomposed tasks to obtain the final analysis result. This isn't a new idea; it already appears in the multithreaded, multitasking designs mentioned earlier. In real life as in programming, a job can usually be split into multiple tasks, and the relationships between tasks fall into two kinds: unrelated tasks that can run in parallel, and tasks with mutual dependencies whose order cannot be reversed. Back in college, professors had everyone analyze critical paths in class, which is nothing more than finding the most time-saving way to break down a job. In a distributed system, the machine cluster can be regarded as a pool of hardware resources; splitting out the parallel tasks and handing them to idle machines greatly improves computing efficiency, and this independence of resources provides the best design guarantee for expanding the computing cluster. (Actually, I have always thought Hadoop's cartoon mascot should be an ant rather than a small elephant: distributed computing is ants eating an elephant, where a cluster of cheap machines can rival any high-performance computer, and the horizontal-scaling curve always beats the vertical-scaling line.) After the tasks are decomposed and processed, the results need to be summarized; that is what Reduce does.
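To make the two verbs concrete, here is a minimal word-count sketch in Java against Hadoop's mapreduce API; the class and field names (WordCountMapper, WordCountReducer, ONE) are illustrative, not from this article. The map phase decomposes each input line into (word, 1) sub-results, and the reduce phase summarizes them per word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: decompose each input line into (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // one sub-result per word occurrence
            }
        }
    }
}

// Reduce: summarize the per-word counts produced by all the map tasks.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // final aggregated result
    }
}

Between the two phases the framework groups all values with the same key together, which is exactly the "aggregate the results" step described above.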
(The classic MapReduce data-flow picture that originally appeared here is not reproduced.)
HDFS has three key roles: NameNode, DataNode, and Client. The NameNode can be regarded as the administrator of the distributed file system; it is mainly responsible for managing the file system namespace, cluster configuration information, and the replication of storage blocks. The NameNode keeps the file system's metadata in memory, chiefly file information, the blocks that make up each file, and the DataNodes on which each block resides. The DataNode is the basic unit of file storage: it stores blocks in its local file system, keeps each block's metadata, and periodically reports all the blocks it holds to the NameNode. A Client is an application that needs to read or write files in the distributed file system. Three operations illustrate how they interact (a small client-side sketch follows them).
File write:
a) The Client initiates a file-write request to the NameNode.
b) The NameNode, based on the file size and the file-block configuration, returns to the Client the information of the DataNodes it manages.
c) The Client divides the file into multiple blocks and, according to the DataNode address information, writes them in sequence to each DataNode.
File read:
a) The Client initiates a file-read request to the NameNode.
b) The NameNode returns the information of the DataNodes that store the file's blocks.
c) The Client reads the file data from those DataNodes.
File block replication:
a) The NameNode finds that some file blocks do not meet the minimum number of replicas, or that some DataNodes have failed.
b) It notifies the DataNodes to copy blocks to one another.
c) The DataNodes copy the blocks directly among themselves.
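As a small illustration of the write and read interactions above, here is a minimal client sketch using Hadoop's FileSystem API; the path /demo/hello.txt and the class name are assumptions. The NameNode lookups described above happen inside fs.create() and fs.open():

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // behind the scenes, talks to the NameNode

        Path file = new Path("/demo/hello.txt");    // hypothetical path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the block data to those DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client asks the NameNode which DataNodes hold the blocks,
        // then reads the data directly from them.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}

Note that in both directions the file data itself flows between the Client and the DataNodes; the NameNode only answers the metadata lookups.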
Here's a look at the structure of Hadoop, combining MapReduce and HDFS:
(Hadoop structure diagram; the figure itself is not reproduced here.)
In a Hadoop system there is a Master, which is mainly responsible for the NameNode and JobTracker work. The JobTracker's primary responsibility is to launch, track, and schedule the task execution on the individual Slaves. There are also multiple Slaves, each of which usually provides DataNode functionality and is responsible for the TaskTracker work. The TaskTracker executes Map tasks and Reduce tasks, in conjunction with local data, according to the application's requirements.
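How does a client find the Master's two roles? A minimal sketch, assuming the classic (pre-YARN) configuration keys that match the JobTracker/TaskTracker era described here; the hostname master and the ports are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;

public class ClusterConfigSketch {
    public static Configuration masterConfig() {
        Configuration conf = new Configuration();
        // The NameNode (HDFS metadata) runs on the Master; host/port are assumptions.
        conf.set("fs.default.name", "hdfs://master:9000");
        // The JobTracker (MapReduce scheduling) also runs on the Master.
        conf.set("mapred.job.tracker", "master:9001");
        return conf;
    }
}

In practice these values usually come from core-site.xml and mapred-site.xml on the classpath rather than being set in code.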
It may be clearer to explain this with a piece of code posted from everyday work:
FileInputFormat.setInputPaths(tempJob, hdfsHome); // set the job's input path (a directory on HDFS)
LOG.info(tempJobName + " data start ..... ");
tempJob.setJarByClass(tempMain.class); // set the class whose jar this job runs from
tempJob.setMapperClass(MultithreadedMapper.class); // run the map through the framework's MultithreadedMapper wrapper; its built-in multithreading is very important and can help improve job efficiency
MultithreadedMapper.setMapperClass(tempJob, tempMapper.class); // set the real map class that runs inside the wrapper
MultithreadedMapper.setNumberOfThreads(tempJob, Integer.parseInt(tempThread)); // set how many threads the multithreaded map runs
tempJob.setMapOutputKeyClass(LongWritable.class); // set the key type output by the map
tempJob.setMapOutputValueClass(StringArrayWriteable.class); // set the value type output by the map (a custom Writable)
...... // the reduce settings follow; they are business-specific, so they are not listed here
long start = System.currentTimeMillis();
boolean result = tempJob.waitForCompletion(true); // submit the job and wait for it to finish
long end = System.currentTimeMillis();
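To round off the fragment, the timing variables above would typically be logged once the job returns; this closing sketch is an assumption, not from the original code:

LOG.info(tempJobName + " data end, cost " + (end - start) + " ms, success = " + result); // illustrative log line
System.exit(result ? 0 : 1); // conventional driver exit status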
The above is the editor's explanation of how Hadoop operates. If you have similar doubts, the analysis above may help clarify them.