
What's the use of MapReduce?


This article explains what MapReduce is for. The approach introduced here is simple, fast, and practical, so interested readers may want to follow along as we walk through it.

1. What does MapReduce do?

Hadoop is essentially an open-source implementation of Google's "three treasures": Hadoop MapReduce corresponds to Google MapReduce, HBase to BigTable, and HDFS to GFS. HDFS (or GFS) provides efficient unstructured storage for the layers above it, HBase (or BigTable) is a distributed database offering structured data services, and Hadoop MapReduce (or Google MapReduce) is a programming model for parallel computation together with a framework for job scheduling.

GFS and BigTable already give us high-performance, high-concurrency services, but parallel programming is not something every programmer can do. If our application itself cannot run in parallel, GFS and BigTable are meaningless. The great thing about MapReduce is that it lets programmers who know nothing about parallel programming bring the full power of a distributed system to bear.

In a nutshell, MapReduce is a framework for splitting one large job into many small jobs (the large and small jobs should be essentially the same kind of work, differing only in size). All the user has to do is decide how many pieces to split into and define the job itself.

Let's use an example that runs through the rest of this article to explain how MapReduce works.

2. Example: counting word frequency

Suppose I want to find the most frequent words in the computer science papers of the past 10 years, to see what everyone has been studying. Once the papers are collected, what should I do?

Method 1: I can write a small program that goes through all the papers in order, counting each word it encounters; at the end I will know which words are the hottest.

This method is very effective when the data set is small, and it is the simplest to implement, so it is a perfectly appropriate solution in that case.
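A minimal sketch of method 1 in Python (my own illustration: the directory layout of one paper per .txt file and the whitespace tokenization are both assumptions):

    # Method 1 sketch: one sequential pass over all papers.
    from collections import Counter
    from pathlib import Path

    def count_words(paper_dir):
        counts = Counter()
        for path in Path(paper_dir).glob("*.txt"):   # assumed layout: one paper per .txt file
            counts.update(path.read_text(encoding="utf-8").lower().split())
        return counts

    print(count_words("papers/").most_common(10))    # the ten hottest words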

Method 2: write a multithreaded program that traverses the papers concurrently.

In theory this problem can be highly concurrent, because counting one file does not affect the statistics of another. When our machine has multiple cores or processors, method 2 is certainly more efficient than method 1. But writing a multithreaded program is much harder than method 1: we have to handle synchronization and data sharing ourselves, for example to prevent two threads from counting the same file twice.
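Under the same assumptions, here is a sketch of method 2. Handing each thread whole files and merging per-file counters afterwards sidesteps the synchronization problem, since no two threads ever share a counter or a file:

    # Method 2 sketch: one thread per file, merged afterwards.
    # (In CPython, threads only help while the work is I/O-bound;
    # a ProcessPoolExecutor would be needed for CPU-bound counting.)
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    def count_one(path):
        return Counter(path.read_text(encoding="utf-8").lower().split())

    def count_words_parallel(paper_dir, workers=4):
        files = list(Path(paper_dir).glob("*.txt"))  # each file is assigned exactly once
        total = Counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for partial in pool.map(count_one, files):
                total.update(partial)                # merging happens in one thread
        return total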

Method 3: hand the job over to multiple computers.

We can take the program from method 1, deploy it to N machines, split the collection of papers into N parts, and let each machine run one job. This method runs fast enough, but deployment is a pain: we have to copy the program to the other machines by hand, split the paper collection by hand, and, most painful of all, merge the N partial results (though of course we could write yet another program for that).

Method 4: let MapReduce help us!

MapReduce is essentially method 3, except that how to split the file set, how to copy the program, and how to merge the results are all defined by the framework. We only have to define the task itself (the user program) and leave the rest to MapReduce.

Before looking at how MapReduce works, let's first discuss its two core functions, map and reduce, along with the pseudocode of a MapReduce program.

3. Map function and reduce function

The map and reduce functions define the task itself, and implementing them is left to the user.

Map function: accepts a key-value pair and produces a set of intermediate key-value pairs. The MapReduce framework passes all intermediate key-value pairs that share the same key to a single reduce function.

Reduce function: accepts a key and its associated set of values, and merges these values to produce a smaller set of values (usually zero or one value).

The core code of the word-counting MapReduce program is very short; it mainly implements these two functions.

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

In the word-counting example, the key accepted by the map function is the file name and the value is the file's contents. Map traverses the words one by one, and each time it encounters a word w it emits the intermediate key-value pair <w, "1">, meaning we have found one more occurrence of w. MapReduce passes all key-value pairs that share the same key (all those for the word w) to the reduce function, so reduce receives the word w as its key and a list of "1"s as its values (this is the most basic implementation, and it can be optimized). The length of that list equals the number of key-value pairs whose key is w, so adding up the "1"s gives the number of occurrences of w. Finally, these word counts are written to a user-defined location in the underlying distributed storage system (GFS or HDFS).
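To make the flow concrete, here is a single-machine Python transliteration of that pseudocode (my own sketch: a plain list stands in for the framework's shuffle, and the two documents are made up):

    from itertools import groupby

    intermediate = []                          # stands in for the framework's shuffle

    def map_fn(key, value):                    # key: document name, value: document contents
        for w in value.split():
            intermediate.append((w, "1"))      # EmitIntermediate(w, "1")

    def reduce_fn(key, values):                # key: a word, values: an iterator of counts
        return key, sum(int(v) for v in values)

    docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog"}
    for name, contents in docs.items():
        map_fn(name, contents)

    intermediate.sort()                        # identical keys become adjacent
    results = [reduce_fn(k, (v for _, v in grp))
               for k, grp in groupby(intermediate, key=lambda kv: kv[0])]
    print(results)   # [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]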

4. How does MapReduce work?

Everything starts with the user program at the top, which links in the MapReduce library and implements the basic map and reduce functions. The order of execution in the figure is marked with numbers.

The MapReduce library first divides the user program's input file into M pieces (M is user-defined), each typically 16MB to 64MB, shown on the left of the figure as split0 through split4; it then uses fork to copy the user process onto the other machines in the cluster.
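As a rough illustration of the splitting step, here is a sketch (my own, not Hadoop's or Google's actual splitter; the file name is made up) that divides a file into M byte ranges:

    import os

    def make_splits(path, m):
        # Divide the file into m byte ranges of roughly equal size; the
        # last range absorbs the remainder. A real splitter also aligns
        # ranges to record boundaries so no key-value pair straddles two splits.
        size = os.path.getsize(path)
        step = size // m
        return [(i * step, step if i < m - 1 else size - i * step)
                for i in range(m)]

    print(make_splits("papers.txt", 5))   # five (offset, length) pairs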

One copy of the user program is called the master, and the rest are called workers. The master is responsible for scheduling: it assigns jobs (Map jobs or Reduce jobs) to idle workers. The number of workers can also be specified by the user.

A worker assigned a Map job begins reading the input data of its split. The number of Map jobs is determined by M, one job per split. The Map job extracts key-value pairs from the input data and passes each one as a parameter to the map function; the intermediate key-value pairs produced by the map function are cached in memory.

The cached intermediate key-value pairs are periodically written to the local disk, partitioned into R regions. R is user-defined, and each region will later correspond to one Reduce job. The locations of these intermediate key-value pairs are reported to the master, which forwards them to the Reduce workers.

The master tells each worker assigned a Reduce job where the partition it is responsible for lives (certainly more than one place, since the intermediate key-value pairs produced by every Map job may map to any of the R partitions). Once a Reduce worker has read all the intermediate key-value pairs it is responsible for, it first sorts them so that pairs with the same key are adjacent. This sorting is necessary because many different keys can map to the same partition, that is, to the same Reduce job (there are far fewer partitions than keys).
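A toy illustration of the partition-then-sort step (my own sketch; note that Python randomizes string hashes per process, so which partition a word lands in will vary from run to run):

    from itertools import groupby

    R = 3
    pairs = [("the", "1"), ("fox", "1"), ("the", "1"), ("dog", "1")]

    partitions = [[] for _ in range(R)]
    for key, value in pairs:
        partitions[hash(key) % R].append((key, value))   # route by hash(key) mod R

    for i, part in enumerate(partitions):
        part.sort()                      # pairs with the same key are now adjacent
        for key, grp in groupby(part, key=lambda kv: kv[0]):
            print(f"partition {i}: reduce({key!r}, values={[v for _, v in grp]})")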

The Reduce worker traverses the sorted intermediate key-value pairs; for each unique key, it passes the key and the associated values to the reduce function, and the output produced by the reduce function is appended to this partition's output file.

When all Map and Reduce jobs are complete, the master wakes the user program, and the MapReduce call returns to the user's code.

After everything has executed, the MapReduce output sits in R partitioned output files (one per Reduce job). Users usually do not need to merge these R files; instead they hand them as input to another MapReduce program for further processing. Throughout the whole process, the input data come from the underlying distributed file system (GFS), the intermediate data sit on the local file system, and the final output is written back to the underlying distributed file system (GFS).

Note the difference between a Map/Reduce job and the map/reduce function: a Map job processes one split of the input data and may call the map function many times, once per input key-value pair; a Reduce job processes the intermediate key-value pairs of one partition, calling the reduce function once for each distinct key, and each Reduce job ultimately corresponds to one output file.

I prefer to divide this process into three stages. The first is the preparation stage, covering steps 1 and 2, where the protagonist is the MapReduce library, which splits the jobs and copies the user program; the second is the running stage, covering steps 3, 4, 5, and 6, where the protagonists are the user-defined map and reduce functions and every small job runs independently; the third is the clean-up stage, when all jobs are done and the results sit in the output files, to be dealt with however the user pleases.

5. How is word frequency counted?

Combining this with Section 4, we can see how the code in Section 3 works. Suppose we set M = 5 and R = 3, and that there are six machines, one of which is the master.

The figure shows how MapReduce handles the word-frequency job. Because there are not enough map workers, splits 1, 3, and 4 are processed first, generating intermediate key-value pairs; once all the intermediate values are ready, the Reduce jobs begin reading their corresponding partitions and output the statistical results.
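Here is an end-to-end single-machine sketch of this scenario (my own illustration with M = 5 and R = 3; the document contents are made up, and real workers would of course run on separate machines):

    from itertools import groupby

    M, R = 5, 3
    splits = ["the fox", "the dog", "a fox", "a dog ran", "the end"]   # M made-up splits

    # Map phase: one Map job per split, emitting into R partitions.
    partitions = [[] for _ in range(R)]
    for split in splits:
        for w in split.split():
            partitions[hash(w) % R].append((w, "1"))

    # Reduce phase: one Reduce job per partition, each yielding one output file.
    outputs = []
    for part in partitions:
        part.sort()
        outputs.append({k: sum(int(v) for _, v in grp)
                        for k, grp in groupby(part, key=lambda kv: kv[0])})
    print(outputs)   # R dicts that together hold every word's count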

6. What the user can control

The user's main task is to implement the map and reduce interfaces, but several other useful interfaces are also open to the user.

An input reader. This function divides the input into M parts and defines how to extract the initial key-value pairs from the raw data; in the word-frequency example, it specifies that the file name and the file contents form the key-value pair.

A partition function. This function maps an intermediate key-value pair produced by the map function to one of the partitions; the simplest implementation hashes the key and takes it modulo R.

A compare function. This function is used when a Reduce job sorts its input; it defines the ordering of keys.

An output writer. Responsible for writing the results to the underlying distributed file system.

A combiner function. This is really the reduce function, used for the optimization mentioned earlier. In word counting, for example, shipping every single "1" across the network is a waste of time, because map and reduce usually do not run on the same machine; running a combiner once where the map executes pre-aggregates the counts, so the reduce has far less to read. A small sketch of the partition and combiner functions follows at the end of this section.

I won't say much about the map and reduce functions.
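To make the partition and combiner hooks concrete, here is a toy sketch (illustrative names only, not a real framework API). The combiner runs the reduce logic at the map side, so far fewer pairs cross the network:

    from collections import Counter

    R = 3

    def partition(key):
        return hash(key) % R          # the simplest partition function: hash mod R

    def combine(map_output):
        # Pre-aggregate one map task's output locally: (w, "1") pairs
        # collapse to (w, local_count) before being sent to the reducers.
        local = Counter()
        for word, one in map_output:
            local[word] += int(one)
        return [(w, str(c)) for w, c in local.items()]

    print(combine([("the", "1"), ("fox", "1"), ("the", "1")]))
    # [('the', '2'), ('fox', '1')]: two pairs shipped instead of three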

7. Implementation of MapReduce

There are many implementations of MapReduce today. Besides Google's own, there is the famous Hadoop; the main difference is that Google's is written in C++ while Hadoop uses Java. In addition, Stanford University has implemented a MapReduce for multi-core / multiprocessor, shared-memory environments, called Phoenix.

At this point, I believe you have a deeper understanding of "what's the use of MapReduce". You might as well try it out in practice. Follow us and keep learning!
