Shulou (SLTechnology News & Howtos), updated 2025-02-25
1. Overview
JOIN operations are common and expensive in traditional databases such as MySQL. They are just as common and expensive in Hadoop, but because of Hadoop's distinctive design, there are some special techniques for implementing them.
This article first introduces the common JOIN implementations on Hadoop, then presents several optimizations suited to different input datasets.
2. Common join methods
Suppose the data to be joined comes from two files, File1 and File2.
2.1 Reduce side join
Reduce side join is the simplest way to implement a join. Its main idea is as follows:
In the map phase, the map function reads both File1 and File2. To distinguish key/value pairs from the two sources, a tag is attached to each record, for example tag=0 for records from File1 and tag=2 for records from File2. In other words, the main task of the map phase is to label records according to the file they came from.
In the reduce phase, the reduce function receives, for each key, the list of tagged values from both File1 and File2, and joins the File1 and File2 data that share that key. That is, the reduce phase performs the actual join.
REF: reduce side join of hadoop join: http://blog.csdn.net/huashetianzu/article/details/7819244
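The tag-then-join flow above can be sketched outside Hadoop with plain Python. The record layout, keys, and the tag values 0 and 2 follow the description above; the sample data is invented for illustration only.

```python
from collections import defaultdict
from itertools import product

def map_phase(records, tag):
    """Map phase: label each (key, value) record with its source file's tag."""
    return [(key, (tag, value)) for key, value in records]

def reduce_phase(tagged):
    """Reduce phase: group tagged records by key, then join the two sources."""
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)
    joined = []
    for key, tagged_values in groups.items():
        left = [v for t, v in tagged_values if t == 0]   # records from File1
        right = [v for t, v in tagged_values if t == 2]  # records from File2
        for lv, rv in product(left, right):              # cross product per key
            joined.append((key, lv, rv))
    return joined

file1 = [("u1", "Alice"), ("u2", "Bob")]
file2 = [("u1", "order-9"), ("u3", "order-7")]
result = reduce_phase(map_phase(file1, 0) + map_phase(file2, 2))
# only key "u1" appears in both files, so only "u1" is joined
```

In real Hadoop the grouping by key is done by the shuffle, not by the reducer itself; the `defaultdict` here only simulates that step.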
2.2 Map side join
Reduce side join exists because the map phase cannot see all of the join fields at once: records with the same key may be processed by different map tasks. Reduce side join is inefficient because the shuffle phase transfers a large amount of data.
Map side join is an optimization for the case where one of the two tables is very large and the other is small enough to fit in memory. We can then ship a copy of the small table to every map task, hold it in memory (for example, in a hash table), and scan only the large table: for each key/value record in the large table, look up whether the hash table contains a record with the same key, and if so, join and output it.
To support this file replication, Hadoop provides the DistributedCache class, which is used as follows:
(1) The user registers the file to copy with the static method DistributedCache.addCacheFile(), whose parameter is the file's URI (for a file on HDFS, something like hdfs://namenode:9000/home/XXX/file, where 9000 is your configured NameNode port). The JobTracker fetches this URI list before the job starts and copies the corresponding files to the local disk of each TaskTracker.
(2) The user obtains the local file paths with DistributedCache.getLocalCacheFiles() and reads the files with the standard file I/O API.
REF: map side join of hadoop join: http://blog.csdn.net/huashetianzu/article/details/7821674
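The hash-table lookup described above can be sketched as follows; in real Hadoop the small table would arrive via DistributedCache, while here it is simply passed in as a list, and the sample data is made up for illustration.

```python
def map_side_join(small_table, big_table):
    """Build an in-memory hash table from the small table, then stream the
    big table through it; no shuffle or reduce phase is needed."""
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    for key, big_value in big_table:            # single scan of the large input
        for small_value in lookup.get(key, []):  # hash lookup, O(1) per record
            yield (key, small_value, big_value)

small = [("u1", "Alice")]
big = [("u1", "order-9"), ("u2", "order-7")]
result = list(map_side_join(small, big))
```

Because each map task joins its own slice of the large table independently, the expensive shuffle of a reduce side join disappears entirely.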
2.3 Semi Join
Semi join is a method borrowed from distributed databases. Its motivation: in a reduce side join, the volume of data shipped between machines is very large and becomes the bottleneck of the join; if the records that will never participate in the join can be filtered out on the map side, a great deal of network IO is saved.
The implementation is simple: take the small table, say File1, extract the keys that participate in the join, and save them to a new file, File3. File3 is usually small and fits in memory. In the map phase, use DistributedCache to copy File3 to every TaskTracker, then drop the records in File2 whose keys do not appear in File3. The reduce phase then works exactly as in a reduce side join.
For more on semi joins, see the semi join introduction: wenku.baidu.com/view/ae7442db7f1922791688e877.html
REF: semi join of hadoop join: http://blog.csdn.net/huashetianzu/article/details/7823326
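The map-side filtering step of a semi join can be sketched like this, with File3 modeled as a Python set; the file names and sample records follow the File1/File2/File3 convention above and are otherwise invented.

```python
def extract_join_keys(small_table):
    """Step 1: pull the small table's join keys into "File3" (a set here)."""
    return {key for key, _ in small_table}

def semi_join_filter(big_table, key_set):
    """Map-phase filter: drop big-table records whose key is not in File3.
    Only the survivors continue on to the normal reduce side join."""
    return [(key, value) for key, value in big_table if key in key_set]

file1 = [("u1", "Alice"), ("u2", "Bob")]
file2 = [("u1", "order-9"), ("u3", "order-7"), ("u2", "order-5")]
surviving = semi_join_filter(file2, extract_join_keys(file1))
# ("u3", "order-7") is dropped before the shuffle, saving network IO
```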
2.4 Reduce side join + BloomFilter
In some cases, even the key set extracted from the small table for a semi join is too large to hold in memory; a BloomFilter can then be used to save space.
A BloomFilter answers whether an element is in a set. Its two most important methods are add() and contains(). Its key property is that it has no false negatives: if contains() returns false, the element is definitely not in the set. It does, however, have a certain false positive rate: if contains() returns true, the element may or may not actually be in the set.
The keys of the small table can therefore be stored in a BloomFilter, which filters the large table in the map phase. Some large-table records whose keys are not in the small table may slip through the filter (records whose keys are in the small table are never dropped), but this is harmless: it merely adds a small amount of network IO.
For more information about BloomFilter, please refer to blog.csdn.net/jiaomeng/article/details/1495500
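A toy BloomFilter with exactly the add()/contains() interface described above can be sketched in a few lines; the bit-array size and hash count are arbitrary illustration values, and a production filter (such as Hadoop's own org.apache.hadoop.util.bloom.BloomFilter) would size them from the expected key count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: no false negatives, a small false-positive rate."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def contains(self, item):
        # False => definitely absent; True => probably present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for key in ("u1", "u2"):
    bf.add(key)
# contains() is guaranteed True for every added key
```

Note that contains("u3") would *probably* return False here, but a Bloom filter cannot guarantee it, which is exactly why the stray records it lets through must still be discarded in the reduce phase.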
3. Secondary sort
Hadoop sorts by key by default, but what if the values must be sorted as well? That is, for each key, the value list received by the reduce function should itself be sorted by value. This requirement is common in join operations, for example when you want the values from the small table to come first within each key.
There are two approaches to secondary sort: buffer-and-in-memory sort, and value-to-key conversion.
In buffer-and-in-memory sort, the reduce() function collects all the values for a key and then sorts them. Its biggest drawback is that it may run out of memory.
In value-to-key conversion, the key and part of the value are concatenated into a composite key (by implementing the WritableComparable interface or calling setSortComparatorClass()), so that reduce output is sorted first by key and then by value. Note that users must implement their own Partitioner so that data is partitioned by the original key only. Hadoop supports secondary sort explicitly: the Job class provides a setGroupingComparatorClass() method for setting the comparator that groups values for the reducer. For details, see www.cnblogs.com/xuxm2007/archive/2011/09/03/2165805.html
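The value-to-key conversion idea can be sketched outside Hadoop: sorting on the composite (key, value) pair plays the role of the composite-key sort comparator, while grouping on the key alone mimics the custom Partitioner plus setGroupingComparatorClass(). The sample records are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

def secondary_sort(records):
    """Value-to-key conversion: sort on the composite (key, value) pair,
    then regroup on the key alone, so each key's values arrive sorted."""
    shuffled = sorted(records)  # composite-key sort: by key, then by value
    for key, group in groupby(shuffled, key=itemgetter(0)):
        yield key, [value for _, value in group]

records = [("k1", 3), ("k2", 1), ("k1", 1), ("k1", 2)]
result = list(secondary_sort(records))
# each "reduce call" sees its value list already sorted: k1 -> [1, 2, 3]
```

The point of doing this in the framework rather than in reduce() is that the sort happens during the shuffle's merge, so the reducer never has to buffer a whole value list in memory.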
4. Postscript
I have recently been job hunting. Since my résumé mentions Hadoop, almost every interviewer asks about it, and implementing joins on Hadoop has become a must-ask question; a few companies also dig into how DistributedCache works and how to use it for join operations. This article was put together to better prepare for those interviews.
5. References
(1) Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, University of Maryland, College Park
(2) Hadoop in Action, pages 107-131
(3) Secondary sort in MapReduce: www.cnblogs.com/xuxm2007/archive/2011/09/03/2165805.html
(4) Semi join introduction: wenku.baidu.com/view/ae7442db7f1922791688e877.html
(5) BloomFilter introduction: blog.csdn.net/jiaomeng/article/details/1495500
(6) This article originally appeared at: dongxicheng.org/mapreduce/hadoop-join-two-tables/