
MapReduce Experiment (1): Principles


Official website

http://hadoop.apache.org/

The three components of Hadoop:

HDFS: distributed storage system
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

MapReduce: distributed computing system
http://hadoop.apache.org/docs/r2.8.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

YARN: Hadoop's resource scheduling system
http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html

I once worked on a laser track-leveling project for China Railway. The database for a 50 km section of track was 400 GB, so large that even finding the space to copy it out was a problem. With a distributed database and computing platform, that kind of work can now be carried out very conveniently.

Mapper

The mapper maps the input key / value pair to a set of intermediate key / value pairs.

A map is an individual task that transforms input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records; a given input pair may map to zero or many output pairs. The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Overall, Mapper implementations are passed to the job via the Job.setMapperClass(Class) method. The framework then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit for that task. Applications can then override the cleanup(Context) method to perform any required cleanup. Output pairs are collected with calls to context.write(WritableComparable, Writable).
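As a concrete sketch, here is the classic word-count Mapper from the official Hadoop tutorial, written against the Hadoop 2.x API (the class name TokenizerMapper follows the tutorial):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Turns each input line into a set of (word, 1) intermediate pairs.
    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }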

Applications can use counters to report their statistics.

All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer(s) to determine the final output. Users can control the grouping by specifying a Comparator via Job.setGroupingComparatorClass(Class). The Mapper outputs are sorted and then partitioned per Reducer; the total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner. Users can optionally specify a combiner via Job.setCombinerClass(Class) to perform local aggregation of the intermediate outputs, which helps cut down the amount of data transferred from the Mapper to the Reducer. The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control whether, and how, the intermediate outputs are compressed, and which CompressionCodec is used, via the configuration.
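All of these hooks are set on the Job object. A minimal sketch, assuming an existing Job named job; MyGroupingComparator is a hypothetical user-supplied comparator, while IntSumReducer and WordPartitioner are shown in the Reducer and Partitioner sections below:

    // Grouping, combining and partitioning hooks:
    job.setGroupingComparatorClass(MyGroupingComparator.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setPartitionerClass(WordPartitioner.class);

    // Compress the intermediate map output (the codec choice is illustrative):
    Configuration conf = job.getConfiguration();
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
            org.apache.hadoop.io.compress.SnappyCodec.class,
            org.apache.hadoop.io.compress.CompressionCodec.class);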

Reducer

A Reducer reduces the set of intermediate values that share a key to a smaller set of values. The number of reduces for the job is set by the user via Job.setNumReduceTasks(int). Overall, Reducer implementations are passed to the job via the Job.setReducerClass(Class) method and can override it to initialize themselves. The framework then calls reduce(WritableComparable, Iterable, Context) for each <key, (list of values)> pair in the grouped inputs. Applications can then override the cleanup(Context) method to perform any required cleanup. The Reducer has three primary phases: shuffle, sort and reduce.
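Here is the matching word-count Reducer from the Hadoop tutorial; because addition is associative, the same class can also serve as the combiner:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums the counts emitted for each word.
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total count)
        }
    }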

Shuffle

Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

Partitioner

The Partitioner partitions the key space; it controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. This therefore controls which reduce task an intermediate key (and hence its records) is sent to. HashPartitioner is the default Partitioner.
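A custom Partitioner that mimics the default HashPartitioner looks like this (the class name WordPartitioner is ours):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hash the key, mask off the sign bit, and take it modulo the
    // number of reduce tasks -- the same scheme HashPartitioner uses.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It is registered on the job with job.setPartitionerClass(WordPartitioner.class).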

Counter

Counters are a facility for MapReduce applications to report their statistics; Mapper and Reducer implementations can use a Counter to report them. Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners.
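A minimal counter sketch (the enum and counter names are ours; both the enum form and the group/name string form of getCounter exist in the API):

    // Declared at class level inside a Mapper or Reducer:
    enum WordStats { MALFORMED_LINES }

    // Then, inside map() or reduce():
    context.getCounter(WordStats.MALFORMED_LINES).increment(1);
    // Equivalent string form:
    context.getCounter("WordCount", "malformed lines").increment(1);

The framework aggregates counter values across all tasks and reports them with the job's final status.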

In essence, MapReduce is divide and conquer: a complex task is split into several simple tasks that are handled separately. Beyond that there is the scheduling problem: deciding which Mapper handles which task is a key consideration. The fundamental principle of MapReduce is locality of processing: whichever machine holds a given piece of data is the one responsible for processing it, which reduces the burden on network communication. The classic MapReduce flow diagram sums this up well; a chart is often more persuasive than words.

Returning to the 400 GB database above: divide it into 400 tasks, each processing about 1 GB of data, and in theory the work is done 400 times faster.

For details, refer to Google's MapReduce paper:

https://wenku.baidu.com/view/1aa777fd04a1b0717fd5dd4a.html

How MapReduce works

Let's use an example to understand this.

Suppose the following input data is given to a MapReduce program that counts the number of occurrences of each word:

Welcome to Hadoop Class

Hadoop is good

Hadoop is bad

The final output of the MapReduce task is:

bad      1
Class    1
good     1
Hadoop   3
is       2
to       1
Welcome  1

The data goes through the following phases:

Input Splits:

The input to a MapReduce job is divided into fixed-size pieces called input splits; each input split is consumed by a single map task.
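By default a split corresponds to an HDFS block, but its size can be bounded through configuration. A sketch using the standard Hadoop 2.x FileInputFormat helper, assuming an existing Job named job (the 128 MB figure is illustrative):

    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Cap each input split at 128 MB, regardless of block size:
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    // Equivalent property: mapreduce.input.fileinputformat.split.maxsize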

Mapping

This is the very first phase in the execution of a MapReduce program. The data in each split is passed to a mapping function, which produces output values. In our example, the job of the mapping phase is to count the occurrences of each word in the input split (input splits are described above) and prepare a list in the form of (word, frequency) pairs.

Shuffling

This phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping-phase output. In our example, identical words are grouped together along with their respective frequencies.

Reducing

In this phase, output values from the shuffling phase are aggregated: the values from shuffling are combined and a single output value is returned. In short, this phase summarizes the complete dataset.

In our example, it sums the values from the shuffling phase, i.e. it calculates the total number of occurrences of each word.
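Putting the pieces together, a minimal driver in the style of the official Hadoop WordCount example (input and output paths come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class); // mapping phase
            job.setCombinerClass(IntSumReducer.class); // optional local aggregation
            job.setReducerClass(IntSumReducer.class);  // reducing phase
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Run with something like: hadoop jar wordcount.jar WordCount /input /output.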

How does MapReduce organize its work?

Hadoop divides work into tasks. There are two types of tasks:

Map tasks (splits and mapping)
Reduce tasks (shuffling and reducing)

As mentioned above, the complete execution process (executing both Map and Reduce tasks) is controlled by two types of entities:

JobTracker: acts like a master, responsible for complete execution of the submitted job
Multiple TaskTrackers: act like slaves, each of them carrying out part of the job

For every job submitted for execution in the system, there is one JobTracker, which resides on the NameNode, and multiple TaskTrackers, which reside on the DataNodes.

A job is divided into multiple tasks, which are then run on multiple data nodes in the cluster. It is the responsibility of the JobTracker to coordinate this activity by scheduling tasks to run on different data nodes. Execution of the individual tasks is then looked after by the TaskTracker, which resides on every data node and executes its part of the job. The TaskTracker's responsibility is to send progress reports to the JobTracker. In addition, the TaskTracker periodically sends a "heartbeat" signal to the JobTracker to notify it of the current state of the system. This lets the JobTracker track the overall progress of each job, and if a task fails, the JobTracker can reschedule it on a different TaskTracker.
