What does the MapReduce computing framework mean? 07/15 Update SLTechnology News&Howtos


2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

In this issue, the editor explains what the MapReduce computing framework is. The article walks through the framework step by step from a practitioner's point of view, and I hope you get something out of it.

1 What is MapReduce?

Mr. Meow said: "MapReduce is Hadoop's computing framework. Put simply, HDFS is responsible for storage, and the statistics and calculations are left to MapReduce, which is divided into a map process and a reduce process.

The map process is disassembly. For example, given a red car and a group of workers, the workers take the car apart into parts. That is map.

The reduce process is assembly. Given a pile of car parts and parts from other devices, we put them together into a Transformer. That is reduce."

Xiaobai listened and said: "That sounds very vivid. So how exactly do the map process and the reduce process work?"

"Let me explain it to you slowly," Mr. Meow swallowed.

2 The principle of MapReduce

Mr. Meow said, "First of all, let's look at the following data: the student information record sheet."

1. We can filter the data, keeping only the records whose gender is 1.

2. We can convert 1 in the gender field to male and 0 to female.

3. We can also expand the address field.

The above process is map: processing one record at a time (filter / transform / expand).
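The three operations above can be sketched in a few lines of Python. The student sheet itself is not reproduced in this article, so the records, field names and the "city-district" address layout below are assumptions made purely for illustration.

```python
# Hypothetical student records; the field names and values are made up for illustration.
students = [
    {"name": "stu1", "gender": 1, "major": "python", "address": "Beijing-Haidian"},
    {"name": "stu2", "gender": 0, "major": "java", "address": "Shanghai-Pudong"},
    {"name": "stu3", "gender": 1, "major": "java", "address": "Shenzhen-Nanshan"},
]

def keep_gender_1(record):
    """Filter: decide, for one record, whether it is kept."""
    return record["gender"] == 1

def label_gender(record):
    """Transform: replace the code 1/0 with male/female."""
    return {**record, "gender": "male" if record["gender"] == 1 else "female"}

def expand_address(record):
    """Expand: split the address field into two new fields."""
    city, district = record["address"].split("-", 1)
    return {**record, "city": city, "district": district}

# Each function looks at a single record and never needs to see the others.
filtered = [s for s in students if keep_gender_1(s)]
labeled = [label_gender(s) for s in students]
expanded = [expand_address(s) for s in students]
```

Because every function only needs one record, the work can be spread over many map tasks without any coordination between them.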

Xiaobai said: "The principle of map feels similar to MySQL syntax. select * from student where sex=1 also processes the data one record at a time."

Mr. Meow: "Good, you learn fast. Let's continue with the reduce process: when we want to count the number of students in each major, we need to group the records by major (python, java and c) and compute the statistics within each group."

The above process is reduce: calculate on a group basis.
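As a minimal sketch of this grouping step, assume the map side has already emitted one (major, 1) pair per student. The pairs below are invented sample data, and `count_by_major` is a hypothetical helper name.

```python
from collections import defaultdict

def count_by_major(pairs):
    """Group (major, 1) pairs by major, then sum the values inside each group."""
    groups = defaultdict(list)
    for major, one in pairs:
        groups[major].append(one)
    return {major: sum(values) for major, values in groups.items()}

# Invented sample pairs, one per student.
pairs = [("python", 1), ("java", 1), ("c", 1), ("java", 1), ("python", 1)]
counts = count_by_major(pairs)
```

Unlike map, the calculation here needs all values of a group at once, which is why the records must be brought together by key first.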

Xiaobai said: "Isn't this the principle of group by in MySQL? The statistics are computed per group."

Mr. Meow added: "Yes, the idea is similar to MySQL's group by."

Finally, Mr. Meow concluded: "The input data is mapped record by record (the map method), which outputs kv key-value pairs. These pairs are grouped and passed to reduce as input, and reduce calculates and outputs the final result."

The studious Xiaobai continued to ask, "OK, I understand the general MapReduce process. How does it get data from HDFS, and how do the map and reduce tasks interact with each other?"

Mr. Meow: "Not bad, Xiaobai, you are making real progress. Let's take a look at the MapReduce interaction diagram. The process is divided into four steps."

1. The map task reads data from HDFS through splits, and each split corresponds to one map task.

2. The map method outputs its data in key, value, partition format.

3. The map task puts the output into memory, where it is partitioned, grouped and sorted.

4. The reduce task knows which partition its keys are in, pulls the data of that partition from the map output, and calculates the final output data.

"Why doesn't map get data directly from HDFS? Why does it have to go through split?" Xiaobai scratched his head and looked at Mr. Meow.

Mr. Meow nodded to Xiaobai and said, "That is a great question. By default, the size of a split equals the size of a block on HDFS (64 MB in Hadoop 1.x, 128 MB in Hadoop 2.x and later), but you can adjust the split size to suit different types of computation."

For CPU-bound (compute-intensive) jobs, we can make the split smaller, so that multiple splits correspond to one block. More map tasks then run in parallel, which improves computing speed.

For I/O-bound (I/O-intensive) jobs, we can make the split larger, so that one split corresponds to N blocks, which improves I/O read and write efficiency.

CPU-bound (compute-intensive):

Suppose there is a math problem stated in a single line.

It takes 1 second to read the problem and 1 month to solve it.

This is CPU-bound. (CPU utilization is almost 100%.)

I/O-bound (I/O-intensive):

Suppose there is a math problem as thick as a history book.

It takes 2 months to finish reading, but the question only asks for something trivial, such as 1 + 1.

This is I/O-bound. (The CPU is mostly idle.)
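To make the effect of the split size concrete, here is a rough sketch of the arithmetic. It follows the rule Hadoop's FileInputFormat uses to pick a split size, max(minSize, min(maxSize, blockSize)), where the two bounds come from the mapreduce.input.fileinputformat.split.minsize and split.maxsize settings. The 1 GB file, the 64 MB block and the function names are assumptions for illustration, and the count ignores details such as the slack Hadoop allows on the last split.

```python
import math

MB = 1024 * 1024
NO_LIMIT = 2**63 - 1  # stands in for "no maximum configured"

def compute_split_size(block_size, min_size, max_size):
    """Split size rule: the block size, clamped between the min and max settings."""
    return max(min_size, min(max_size, block_size))

def estimate_num_splits(file_size, split_size):
    """Simplified estimate of how many splits (and map tasks) a file produces."""
    return math.ceil(file_size / split_size)

block = 64 * MB          # assumed HDFS block size
file_size = 1024 * MB    # assumed 1 GB input file

# Default: split size equals the block size.
default_split = compute_split_size(block, 1, NO_LIMIT)
# CPU-bound job: cap the split at 16 MB so several splits share one block.
small_split = compute_split_size(block, 1, 16 * MB)
# I/O-bound job: raise the minimum to 256 MB so one split spans several blocks.
large_split = compute_split_size(block, 256 * MB, NO_LIMIT)

for label, size in [("default", default_split), ("small", small_split), ("large", large_split)]:
    print(label, size // MB, "MB ->", estimate_num_splits(file_size, size), "map tasks")
```

Under these assumptions the same file yields 16 map tasks by default, 64 with the smaller split and 4 with the larger one.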

Xiaobai concluded: "So split controls the parallelism of map, because it determines how many map tasks are started. Each split goes to one map task, which outputs k, v, p key-value pairs."

"Why are the output kv pairs put in memory? Memory may be 100000 times faster than the hard disk, but won't the data be written to disk eventually anyway? Isn't that like taking off your pants to fart?" Xiaobai asked, puzzled.

"Well, the wording is crude but the point is fair. Here the kv pairs output by map are put into a 100 MB memory buffer, and another important thing is done there: the k, v, p data is sorted, so that the data under the same partition p is put together and the keys k within each partition are in order. This makes the merge sort in the following reduce step convenient," Mr. Meow explained.

"Slow down. I'm confused. Give me an example."

"Let's look at the following example, which counts the occurrences of java, python and mysql." Mr. Meow quickly drew a picture.

Input phase: the files containing the java, python and mysql records are stored in blocks on HDFS.

Split phase: splits are used to slice the files on HDFS. Here the java, python and mysql information is stored on file partitions 0, 2, 3, 15, 16, 17 and 205.

Map phase: the data containing java, python and mysql information on each partition is output as k, v, p key-value pairs. For example, java,1,0 means that java appears once under partition 0.

Shuffle phase: the data with the same key is sorted together in memory. For example, java appears on partitions 0, 3, 15 and 205.

Reduce phase: each reduce task fetches the corresponding data from the specified partitions, following the sorted output of the shuffle phase.
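The five phases can be imitated in memory with a small Python sketch. This is a toy model, not Hadoop code: the input lines, the fixed `PARTITION_OF` lookup table and all function names are assumptions for illustration, and a real job would compute the partition with a partitioner rather than a hard-coded table.

```python
from itertools import groupby

# Input phase: lines standing in for file content stored in HDFS blocks.
lines = ["java python", "mysql java", "python java", "mysql"]

# Hypothetical partition assignment, hard-coded to keep the sketch short.
PARTITION_OF = {"java": 0, "mysql": 1, "python": 1}

def make_splits(data, lines_per_split):
    """Split phase: slice the input into chunks, one per map task."""
    return [data[i:i + lines_per_split] for i in range(0, len(data), lines_per_split)]

def map_split(split):
    """Map phase: emit one (k, v, p) record per word."""
    return [(word, 1, PARTITION_OF[word]) for line in split for word in line.split()]

def shuffle(records):
    """Shuffle phase: sort by partition, then by key, and group equal keys."""
    ordered = sorted(records, key=lambda kvp: (kvp[2], kvp[0]))
    return [
        (p, k, [v for _, v, _ in group])
        for (p, k), group in groupby(ordered, key=lambda kvp: (kvp[2], kvp[0]))
    ]

def reduce_group(key, values):
    """Reduce phase: combine all values of one key."""
    return key, sum(values)

splits = make_splits(lines, 2)
records = [rec for split in splits for rec in map_split(split)]
grouped = shuffle(records)
counts = dict(reduce_group(k, vs) for _, k, vs in grouped)
print(counts)
```

After the sort, all records of one partition sit next to each other and equal keys are adjacent, so each reduce step can read its input in one sequential pass.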

"That's amazing. It seems that sorting in memory really matters: it effectively reduces the number of file reads, since many records are read at a time, and processing is faster as a result." Xiaobai suddenly understood.

"I have another question. In the example above there are 3 keys (java, python and mysql), and there are also 3 reduce tasks. Is the number of keys always equal to the number of reduce tasks?" Xiaobai asked.

"You have observed very carefully. The number of reduce tasks is set in the programmer's code, and the number of keys does not have to equal the number of reduce tasks. What if there are 100000 keys? Would we need 100000 reduce tasks? There are certainly not that many resources, so the number of reduce tasks is generally chosen according to the resources available on the servers," Mr. Meow added.

"In addition, note that if the data is not evenly distributed across keys, data skew may occur. Suppose there are two keys, male and female, with 10 TB of data for male and only 1 GB for female. Because all records with the same key are assigned to the same reduce task, one reduce task will process 10 TB of data while the other processes only 1 GB. This is data skew," Mr. Meow continued.

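Both points, many keys sharing a few reduce tasks and the skew caused by one heavy key, can be sketched as follows. Hadoop's default partitioner takes the key's hash modulo the number of reduce tasks. This sketch uses zlib.crc32 as a stand-in hash so the result is stable between runs, and the function names and key names are assumptions for illustration.

```python
import zlib
from collections import Counter

def partition_for(key, num_reducers):
    """Map a key to a reducer index; the same key always gets the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_load(key_sizes, num_reducers):
    """Sum the data volume (in GB) that lands on each reducer."""
    load = Counter()
    for key, size_gb in key_sizes.items():
        load[partition_for(key, num_reducers)] += size_gb
    return load

# Many keys, few reducers: 100000 keys of 1 GB each are spread over 4 reducers.
many_keys = {f"key{i}": 1 for i in range(100_000)}
balanced = reducer_load(many_keys, 4)

# Skewed keys: 10 TB (10240 GB) for one key and 1 GB for the other.
skewed = reducer_load({"male": 10 * 1024, "female": 1}, 2)

print("balanced:", dict(balanced))
print("skewed:", dict(skewed))
```

Whichever reducer receives the male key carries at least 10240 of the 10241 GB, so adding more reducers does not help as long as a single key holds most of the data.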
The above is the editor's explanation of what the MapReduce computing framework refers to. If you have similar questions, the analysis above may help you work through them.

