What is the principle of Spark Join 04/28 Update SLTechnology News&Howtos

What is the principle of Spark Join

2025-04-28 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/01 Report--

This article will explain in detail what the principle of Spark Join is, and the content of the article is of high quality, so the editor will share it with you for reference. I hope you will have a certain understanding of the relevant knowledge after reading this article.

It is a common scenario to Join two datasets in data analysis. In the physical planning phase of Spark, the Join Selection class of Spark will select the final Join policy based on the Join hints policy, the size of the Join table, whether the Join is equivalent or not, and whether the key participating in the Join can be sorted, and finally the Spark will use the selected Join policy to perform the final calculation. Currently, Spark supports a total of five Join policies:

Broadcast hash join (BHJ)

Shuffle hash join (SHJ)

Shuffle sort merge join (SMJ)

Shuffle-and-replicate nested loop join, also known as Cartesian product (Cartesian product join)

Broadcast nested loop join (BNLJ)

The two Join strategies, BHJ and SMJ, are the most common ones for us to run Spark jobs. JoinSelection will first select one of Broadcast hash join, Shuffle hash join and Shuffle sort merge join based on the Key of Join as the equivalent Join; if the Key of Join is not equivalent Join or no Join condition is specified, Broadcast nested loop join or Shuffle-and-replicate nested loop join will be selected. The efficiency of different Join strategies varies greatly, so it is necessary to understand the implementation process and applicable conditions of each Join strategy.

1 、 Broadcast Hash Join

The implementation of Broadcast Hash Join is to broadcast the data of a small table to all Executor ends of Spark. This broadcasting process is no different from that of broadcasting data by ourselves:

Use the collect operator to pull the data of the small table from the Executor side to the Driver side, call sparkContext.broadcast on the Driver side to broadcast to all Executor sides, and use the broadcast data on the Executor side to perform the Join operation with the large table (actually performing the map operation)

This Join strategy avoids Shuffle operations. Generally speaking, Broadcast Hash Join executes faster than other Join policies.

To use this Join strategy, the following conditions must be met: the data of the small table must be very small and can be configured through the spark.sql.autoBroadcastJoinThreshold parameter. The default is 10MB. If the memory is large, you can set the threshold value to-1 and set the spark.sql.autoBroadcastJoinThreshold parameter to-1. This connection mode can only be used for equivalent Join, and the keys participating in the Join is not required to be sortable.

2 、 Shuffle Hash Join

When the data in the table is large and not suitable for broadcasting, you can consider using Shuffle Hash Join. Shuffle Hash Join is also a strategy of choice when Join is performed on large and small tables. Its calculation idea is: partition the large table and the small table according to the same partition algorithm and the number of partitions (partition according to the keys participating in Join), so as to ensure that the data with the same hash value is distributed to the same partition, and then the partitions with the same hash value of two tables in the same Executor can be hash Join locally. Prior to Join, Hash Map is also built on the partitions of the small table. Shuffle hash join uses the idea of divide and conquer to break down big problems into small ones to solve them.

To enable Shuffle Hash Join, the following conditions must be met: only equivalent Join is supported, and the Keys sortable spark.sql.join.preferSortMergeJoin parameter participating in Join must be set to false, which is introduced from Spark version 2.0.0, and the default value is true. That is, by default, the size of the Sort Merge Join small table (plan.stats.sizeInBytes) must be less than spark.sql.autoBroadcastJoinThreshold * spark.sql.shuffle.partitions (the default is 200) and three times the size of the small table (stats.sizeInBytes) must be less than or equal to the size of the large table (stats.sizeInBytes), that is, a.stats.sizeInBytes * 3

< = b.stats.sizeInBytes 3、Shuffle Sort Merge Join 前面两种 Join 策略对表的大小都有条件的，如果参与 Join 的表都很大，这时候就得考虑用 Shuffle Sort Merge Join 了。 Shuffle Sort Merge Join 的实现思想：将两张表按照 join key 进行shuffle，保证join key值相同的记录会被分在相应的分区对每个分区内的数据进行排序排序后再对相应的分区内的记录进行连接无论分区有多大，Sort Merge Join都不用把一侧的数据全部加载到内存中，而是即用即丢；因为两个序列都有序。从头遍历，碰到key相同的就输出，如果不同，左边小就继续取左边，反之取右边。从而大大提高了大数据量下sql join 的稳定性。

To enable Shuffle Sort Merge Join, the following conditions must be met:

Only equivalent Join is supported, and Keys participating in Join is required to be sortable

4 、 Cartesian product join

If the two tables participating in Join in Spark do not specify the join conditions, then Cartesian product join will be generated, and the result of this Join is actually

Is the product of the number of rows of two tables.

5 、 Broadcast nested loop join

You can think of the execution of Broadcast nested loop join as the following calculation:

For record_1 in relation_1:

For record_2 in relation_2:

Join condition is executed

It can be seen that Broadcast nested loop join will scan a table repeatedly in some cases, which is very inefficient. As can be seen from the name, this kind of

Join broadcasts the small table according to the relevant conditions to reduce the number of table scans.

Broadcast nested loop join supports equivalent and unequivalent Join, and supports all Join types.

About what the principle of Spark Join is shared here, I hope that the above content can be of some help to you, can learn more knowledge. If you think the article is good, you can share it for more people to see.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.