How to deal with the problem of data skew with MapReduce

2025-07-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article focuses on how to use MapReduce to deal with the data skew problem. The method introduced here is simple, fast and practical.

When a map/reduce program runs, most of the reduce nodes finish, but one or a few reduce nodes run slowly, which makes the whole program take a long time. This happens because one key has far more records than the other keys (sometimes a hundred or a thousand times more), so the reduce node that handles this key processes much more data than the other nodes and finishes late. This is called data skew.
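The effect can be reproduced with a small simulation. The sketch below is not Hadoop code: it assumes a simple "hash of the key modulo the number of reducers" partitioner (here `zlib.crc32` stands in for the hash), and the key names and record counts are made up for illustration.

```python
import zlib

def partition(key, num_reducers):
    """Stand-in for a hash partitioner: hash(key) mod number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_loads(keys, num_reducers):
    """Count how many records each reducer would receive."""
    loads = [0] * num_reducers
    for key in keys:
        loads[partition(key, num_reducers)] += 1
    return loads

# Hypothetical input: one hot key with 1000 records, 50 ordinary keys with 10 each.
records = ["hot_key"] * 1000 + [f"key_{i}" for i in range(50) for _ in range(10)]
loads = reducer_loads(records, num_reducers=4)
print(loads)  # one entry is at least 1000, because every hot_key record lands together
```

All records of the hot key go to the same reducer, so that reducer's load is dominated by a single key no matter how many reducers are added.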

Solution:

(1) Set a number of hash copies N, which is used to break up a key that has a large number of records.

(2) Process the dataset that contains many records with the same key: append a number from 1 to N to the key and use the result as the new key. If this dataset needs to be joined with another dataset, override the comparator class and the partitioner (distribution) class. In this way, the records of the heavy key are distributed evenly.
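Steps (1) and (2) can be sketched as follows. This is a minimal illustration in plain Python rather than a Hadoop mapper; the value of `N`, the `#` separator and the helper names `salt_keys` and `strip_salt` are assumptions of this sketch, and the suffix is assigned round-robin.

```python
from collections import Counter
from itertools import cycle

N = 8  # number of hash copies; an assumed value for this sketch

def salt_keys(keys, n):
    """Append a suffix 1..n to each key, cycling through the suffixes in order."""
    suffixes = cycle(range(1, n + 1))
    return [f"{key}#{next(suffixes)}" for key in keys]

def strip_salt(salted_key):
    """Recover the original key by dropping the '#<number>' suffix."""
    return salted_key.rsplit("#", 1)[0]

hot_records = ["hot_key"] * 1000          # hypothetical skewed input
salted = salt_keys(hot_records, N)
per_new_key = Counter(salted)             # records per new key
print(sorted(per_new_key.items()))
```

The 1000 records of the hot key are now spread over N new keys of equal size, which a hash partitioner can send to different reducers. For an aggregation, the partial results per suffixed key would still have to be merged in a later step, which is not shown here.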

If the data needs to be joined with another dataset, every reduce node must also receive the matching key. To ensure this, the other dataset, which has a single record per key, is processed as well: loop from 1 to N and append each number to the key as a new key, so that every record is emitted N times. When the dataset being replicated is large, the amount of data in the reduce shuffle becomes so huge that the loss outweighs the gain, and the problem of slow running time is not solved.
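The join variant, and the cost it carries, can be sketched like this. It is an in-memory illustration under assumed data: the `clicks` and `users` datasets, the `#` separator and all function names are hypothetical.

```python
def salt_skewed(records, n):
    """Skewed side: each record gets ONE suffix, cycling through 1..n."""
    return [(f"{key}#{i % n + 1}", value) for i, (key, value) in enumerate(records)]

def replicate_other(records, n):
    """Other side: each record is emitted n times, once per suffix 1..n."""
    return [(f"{key}#{s}", value) for key, value in records for s in range(1, n + 1)]

def join(left, right):
    """In-memory stand-in for a reduce-side join on the suffixed key."""
    by_key = {}
    for key, value in right:
        by_key.setdefault(key, []).append(value)
    return [(key.rsplit("#", 1)[0], lv, rv)
            for key, lv in left
            for rv in by_key.get(key, [])]

N = 4
clicks = [("user_1", f"click_{i}") for i in range(8)]  # skewed side (hypothetical)
users = [("user_1", "Alice")]                          # one record per key (hypothetical)

salted_clicks = salt_skewed(clicks, N)
replicated_users = replicate_other(users, N)
joined = join(salted_clicks, replicated_users)
print(len(users), len(replicated_users), len(joined))
```

Every click still finds its user, but the second dataset has grown by a factor of N before the shuffle. With one user row this is negligible; with a large second dataset it is the cost described above.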

To avoid this, find something the two datasets have in common. For example, besides the join field, the two datasets may contain another field with the same meaning. If that field is a number, it can be taken modulo the number of hash copies; if it is a string, its hashcode can be taken modulo the number of hash copies (a numeric field can also go through hashcode, to avoid too much data falling on the same reduce). If the values of this field are distributed evenly enough, the above problem can be solved.
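A minimal sketch of this idea follows, assuming that matching records in both datasets carry the same value in the shared field. The field names, the `#` separator and the use of `zlib.crc32` as the hashcode are assumptions of this sketch.

```python
import zlib

def bucket_for(field_value, n):
    """Map the shared field to a bucket in 0..n-1."""
    if isinstance(field_value, int):
        return field_value % n  # numeric field: take it modulo n directly
    # string field: compute a deterministic hash code, then take it modulo n
    return zlib.crc32(str(field_value).encode("utf-8")) % n

def composite_key(join_key, field_value, n):
    """New key = join key plus the bucket derived from the shared field."""
    return f"{join_key}#{bucket_for(field_value, n)}"

N = 4
# Hypothetical records from two datasets that share join key and order id.
order_row = {"shop": "shop_1", "order_id": 10, "amount": 25}
payment_row = {"shop": "shop_1", "order_id": 10, "status": "paid"}

key_a = composite_key(order_row["shop"], order_row["order_id"], N)
key_b = composite_key(payment_row["shop"], payment_row["order_id"], N)
print(key_a, key_b)  # both are "shop_1#2", since 10 % 4 == 2
```

Both sides compute the same new key independently, so neither dataset has to be replicated N times. How evenly the hot key is spread depends entirely on how evenly the shared field is distributed.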

Other solutions: (1) increase the JVM memory of the reduce tasks; (2) increase the number of reduce tasks.
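As a rough sketch of how these two settings might be passed to a job, the helper below builds `-D` options from the Hadoop properties `mapreduce.job.reduces`, `mapreduce.reduce.memory.mb` and `mapreduce.reduce.java.opts`. The helper name, the example values and the 0.8 heap fraction are assumptions of this sketch, and `-D` options only take effect when the job driver parses generic options (for example through `ToolRunner`).

```python
def reduce_tuning_options(num_reducers, container_mb, heap_fraction=0.8):
    """Build -D options that raise reducer count and reducer memory.

    heap_fraction is an assumed rule of thumb: the JVM heap is kept
    below the container size to leave room for non-heap memory.
    """
    heap_mb = int(container_mb * heap_fraction)
    return [
        "-D", f"mapreduce.job.reduces={num_reducers}",
        "-D", f"mapreduce.reduce.memory.mb={container_mb}",
        "-D", f"mapreduce.reduce.java.opts=-Xmx{heap_mb}m",
    ]

opts = reduce_tuning_options(num_reducers=20, container_mb=4096)
print(" ".join(opts))
```

More reducers help only when the skew comes from many moderately large keys; a single hot key still lands on one reducer.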

At this point, you should have a deeper understanding of how to use MapReduce to deal with data skew. You might as well try it in practice.
