3.1 The shuffle mechanism in MapReduce
3.1.1 Overview
In mapreduce, how the data processed in the map phase is handed over to the reduce phase is the most critical process in the mapreduce framework; this process is called shuffle.
Shuffle means, literally, to shuffle and deal. Its core mechanisms are data partitioning, sorting, and caching.
Concretely: the result data output by each maptask is distributed to the reducetasks, and in the course of that distribution the data is partitioned and sorted by key.
3.1.2 Main process
The shuffle caching process:
Shuffle is one stage of the MR processing flow. Each of its steps is carried out on the individual map task and reduce task nodes; as a whole, it consists of three operations:
1. Partition: partition the data
2. Sort: sort by key
3. Combiner: merge values locally (see the driver sketch after this list)
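A minimal driver sketch showing where these three operations are wired in. HashPartitioner is Hadoop's default partitioner; the reducer-as-combiner line is an illustrative assumption, not something stated in the original text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class ShuffleWiringSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "shuffle-wiring");
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 1. Partition: decides which reducetask receives each key (default shown)
        job.setPartitionerClass(HashPartitioner.class);
        // 2. Sort: map output keys are sorted by the key class's compareTo(),
        //    unless a custom comparator is set with job.setSortComparatorClass(...)
        // 3. Combiner: a map-side "mini reduce" that merges values locally before
        //    the network copy; often the reducer class itself is reused here:
        // job.setCombinerClass(WordCountReducer.class);   // hypothetical reducer
    }
}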
3.1.3 Detailed process
1. The maptask collects the kv pairs output by our map() method and places them in an in-memory buffer.
2. The memory buffer continually spills to local disk files; this may produce several spill files.
3. The multiple spill files are merged into one large spill file.
4. During both spilling and merging, the Partitioner is invoked to partition the data, and the keys are sorted (a sketch of such a Partitioner follows this list).
5. Each reducetask goes to every maptask machine and fetches the result partition that matches its own partition number.
6. A reducetask thus obtains result files for the same partition from different maptasks; it then merges (merge-sorts) these files.
7. Once they are merged into one large file, the shuffle process is over, and the logical work of the reducetask begins (reading one group of key-value pairs at a time from the file and calling the user-defined reduce() method).
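A minimal sketch of a custom Partitioner of the kind invoked in step 4, assuming Text keys and a hypothetical first-character routing rule (Hadoop's default is HashPartitioner):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each kv pair to a reducetask; called on the map side during spill/merge.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Hypothetical rule: bucket keys by their first character.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be enabled in the driver with job.setPartitionerClass(FirstLetterPartitioner.class), with job.setNumReduceTasks(...) set to match the intended number of partitions.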
The buffer size used during shuffle affects the execution efficiency of a mapreduce program: in principle, the larger the buffer, the fewer the disk IO operations and the faster the execution.
The buffer size can be adjusted via a parameter: io.sort.mb, which defaults to 100 MB.
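A minimal sketch of adjusting this buffer from the driver. In Hadoop 2 and later the property was renamed mapreduce.task.io.sort.mb, with io.sort.mb remaining as the deprecated alias:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SortBufferSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Raise the map-side sort buffer from the 100 MB default to 200 MB.
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        Job job = Job.getInstance(conf, "sort-buffer-demo");
    }
}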
3.1.4 Detailed process diagram
(The original article's shuffle flow diagram is not reproduced here.)
3.2 Serialization in MapReduce
3.2.1 Overview
Java serialization (Serializable) is a heavyweight serialization framework: once an object is serialized, it carries a lot of extra information (various verification data, headers, the inheritance hierarchy, and so on), which makes it inefficient to transmit over the network.
Hadoop therefore developed its own serialization mechanism (Writable), which is compact and efficient.
3.2.2 Comparison between JDK serialization and MR serialization
The following simple code verifies the difference between the two serialization mechanisms:
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.ObjectOutputStream;
import org.apache.hadoop.io.Text;

public class TestSeri {
    public static void main(String[] args) throws Exception {
        // Two ByteArrayOutputStreams receive the output of the two serialization mechanisms
        ByteArrayOutputStream ba = new ByteArrayOutputStream();
        ByteArrayOutputStream ba2 = new ByteArrayOutputStream();
        // A DataOutputStream for Hadoop Writable serialization, and an
        // ObjectOutputStream (wrapping the second stream) for JDK standard serialization
        DataOutputStream dout = new DataOutputStream(ba);
        DataOutputStream dout2 = new DataOutputStream(ba2);
        ObjectOutputStream obout = new ObjectOutputStream(dout2);
        // Two beans serve as the serialization source objects
        ItemBeanSer itemBeanSer = new ItemBeanSer(1000L, 89.9f);
        ItemBean itemBean = new ItemBean(1000L, 89.9f);
        // Can also be used to compare serialization of String versus Text
        Text atext = new Text("a");
        // atext.write(dout);
        itemBean.write(dout);
        byte[] byteArray = ba.toByteArray();
        // Print the Writable serialization result
        System.out.println(byteArray.length);
        for (byte b : byteArray) {
            System.out.print(b);
            System.out.print(":");
        }
        System.out.println("-----------------------");
        String astr = "a";
        // dout2.writeUTF(astr);
        obout.writeObject(itemBeanSer);
        obout.flush();   // flush so the buffered JDK serialization output reaches ba2
        byte[] byteArray2 = ba2.toByteArray();
        // Print the JDK serialization result
        System.out.println(byteArray2.length);
        for (byte b : byteArray2) {
            System.out.print(b);
            System.out.print(":");
        }
    }
}
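The two bean classes referenced above are not shown in the original; the following is a minimal sketch of what they might look like (the field names itemId and price are assumptions). With these definitions the Writable path writes only the raw field bytes (8 + 4 = 12 here), while the JDK path also writes stream headers, class metadata, and type descriptors, so its output is markedly larger.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.Serializable;
import org.apache.hadoop.io.Writable;

public class ItemBean implements Writable {
    private long itemId;    // assumed field
    private float price;    // assumed field

    public ItemBean() {}    // no-arg constructor required by the Writable mechanism

    public ItemBean(long itemId, float price) {
        this.itemId = itemId;
        this.price = price;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(itemId);     // raw fields only
        out.writeFloat(price);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        itemId = in.readLong();    // read in the same order as written
        price = in.readFloat();
    }
}

class ItemBeanSer implements Serializable {
    private static final long serialVersionUID = 1L;
    private long itemId;
    private float price;

    public ItemBeanSer(long itemId, float price) {
        this.itemId = itemId;
        this.price = price;
    }
}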
3.2.3 Custom objects implementing the serialization interface in MR
If a custom bean needs to be transferred as the key, it must also implement a comparable interface, because the shuffle process in the mapreduce framework must sort the keys. In that case the custom bean should implement the combined interface:
public class FlowBean implements WritableComparable<FlowBean>
The methods that need to be implemented are:
/**
 * Deserialization method. The fields must be read from the stream in the same
 * order in which they were written out during serialization.
 */
@Override
public void readFields(DataInput in) throws IOException {
    upflow = in.readLong();
    dflow = in.readLong();
    sumflow = in.readLong();
}

/**
 * Serialization method
 */
@Override
public void write(DataOutput out) throws IOException {
    out.writeLong(upflow);
    out.writeLong(dflow);
    // One could choose not to serialize the total flow, since it can be
    // recomputed from the upstream and downstream flows
    out.writeLong(sumflow);
}

@Override
public int compareTo(FlowBean o) {
    // Sort in descending order of sumflow
    return this.sumflow > o.getSumflow() ? -1 : 1;
}
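A minimal sketch of wiring FlowBean in as the map output key (the driver class name and the value type are assumptions); the shuffle will then order records by compareTo(), that is, in descending sumflow order:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class FlowSortDriverSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "flow-sort");
        // Because FlowBean is the map output key, shuffle sorts by its compareTo().
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        // ... mapper/reducer classes and input/output paths as usual ...
    }
}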
3.3 MapReduce and YARN
3.3.1 Overview of YARN
Yarn is a resource-scheduling platform responsible for supplying server computing resources to computing programs. It is equivalent to a distributed operating-system platform, while computing programs such as mapreduce are equivalent to applications running on that operating system.
3.3.2 Important concepts of YARN
1. Yarn knows nothing about the internal running mechanism of the programs users submit.
2. Yarn only provides scheduling of computing resources (a user program applies to yarn for resources, and yarn is responsible for allocating them); a client-side sketch follows this list.
3. The supervisor role in yarn is called ResourceManager.
4. The role that provides computing resources in yarn is called NodeManager.
5. Yarn is thereby completely decoupled from the user programs it runs, which means all kinds of distributed computing programs can run on yarn (mapreduce is just one of them), for example mapreduce, storm, spark, and tez.
6. Computing frameworks such as spark and storm can therefore be integrated to run on yarn, as long as each framework provides a resource-request mechanism that conforms to the yarn specification.
7. Yarn has thus become a general-purpose resource-scheduling platform: the various computing clusters that previously existed separately in an enterprise can all be consolidated onto one physical cluster, improving resource utilization and making data sharing easier.
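A minimal client-side sketch of item 2: an MR job is just one kind of yarn application, and the client only has to point at yarn and let it allocate the resources (the ResourceManager hostname here is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask the MR client to submit to yarn instead of running locally.
        conf.set("mapreduce.framework.name", "yarn");
        // Hypothetical ResourceManager host; yarn allocates containers from here.
        conf.set("yarn.resourcemanager.hostname", "rm01");
        Job job = Job.getInstance(conf, "yarn-demo");
        // ... mapper/reducer wiring as usual; job.waitForCompletion(true) submits ...
    }
}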
3.3.3 Example of running a program on YARN
The scheduling process of a mapreduce program on yarn. (The original's scheduling diagram is not reproduced here.)