Hash join (Hash Join) of what's new in MySQL 8.0 04/19 Update SLTechnology News&Howtos

Hash join (Hash Join) of what's new in MySQL 8.0

2025-04-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

The MySQL development team officially released MySQL 8.0.18 GA on October 14, 2019, with a number of new features and enhancements. The most striking of these is the support of hash join for multi-table join queries. Let's take a look at the official description:

MySQL implements the hash join mode for inner join queries. For example, starting with MySQL 8.0.18, the following query can use hash join for join queries:

SELECT * FROM T1 JOIN T2 ON t1.c1=t2.c1

Hash join does not require index support. In most cases, hash join is more efficient than the previous Block Nested-Loop algorithm for equivalent joins without indexes. Create three test tables using the following statement:

CREATE TABLE T1 (C1 INT, c2 INT); CREATE TABLE T2 (C1 INT, c2 INT); CREATE TABLE T3 (C1 INT, c2 INT)

Use the EXPLAIN FORMAT=TREE command to see the hash join in the execution plan, for example:

Mysql > EXPLAIN FORMAT=TREE-> SELECT *-> FROM T1-> JOIN T2-> ON t1.c1=t2.c1\ gateway * 1. Row * * EXPLAIN:-> Inner hash join (t2.c1 = t1.c1) (cost=0.70 rows=1)-> Table scan On T2 (cost=0.35 rows=1)-> Hash-> Table scan on T1 (cost=0.35 rows=1)

You must use the FORMAT=TREE option of the EXPLAIN command to see the hash join in the node. In addition, the EXPLAIN ANALYZE command can also display the usage information of hash join. This is also a new feature in this version.

Queries that use equivalent joins between multiple tables also do this optimization. For example, the following query:

SELECT * FROM T1 JOIN T2 ON (t1.c1 = t2.c1 AND t1.c2

< t2.c2) JOIN t3 ON (t2.c1 = t3.c1); 在以上示例中，任何其他非等值连接的条件将会在连接操作之后作为过滤器使用。可以通过EXPLAIN FORMAT=TREE命令的输出进行查看： mysql>

EXPLAIN FORMAT=TREE-> SELECT *-> FROM T1-> JOIN T2-> ON (t1.c1 = t2.c1 AND t1.c2

< t2.c2) ->

JOIN T3-> ON (t2.c1 = t3.c1)\ row * 1. Row * * EXPLAIN:-> Inner hash join (t3.c1 = t1.c1) (cost=1.05 rows=1)-> Table scan on T3 (cost=0.35 rows=1)-> Hash-> Filter: (t1.c2

< t2.c2) (cost=0.70 rows=1) ->

Inner hash join (t2.c1 = t1.c1) (cost=0.70 rows=1)-> Table scan on T2 (cost=0.35 rows=1)-> Hash-> Table scan on T1 (cost=0.35 rows=1)

As you can see from the above output, queries with multiple equivalent join conditions can (will) use multiple hash join joins.

However, if the equivalent join condition is not used in any join statement (ON), the hash join connection method will not be used. For example:

Mysql > EXPLAIN FORMAT=TREE-> SELECT *-> FROM T1-> JOIN T2-> ON (t1.c1 = t2.c1)-> JOIN T3-> ON (t2.c1)

< t3.c1)\G*************************** 1. row ***************************EXPLAIN: 此时，将会采用性能更慢的 block nested loop 连接算法。这与 MySQL 8.0.18 之前版本中没有索引时的情况一样： mysql>

EXPLAIN-> SELECT *-> FROM T1-> JOIN T2-> ON (t1.c1 = t2.c1)-> JOIN T3-> ON (t2.c1)

< t3.c1)\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t1 partitions: NULL type: ALLpossible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 1 filtered: 100.00 Extra: NULL*************************** 2. row *************************** id: 1 select_type: SIMPLE table: t2 partitions: NULL type: ALLpossible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 1 filtered: 100.00 Extra: Using where; Using join buffer (Block Nested Loop)*************************** 3. row *************************** id: 1 select_type: SIMPLE table: t3 partitions: NULL type: ALLpossible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 1 filtered: 100.00 Extra: Using where; Using join buffer (Block Nested Loop) Hash join 连接同样适用于不指定查询条件时的笛卡尔积（Cartesian product），例如： mysql>

EXPLAIN FORMAT=TREE-> SELECT *-> FROM T1-> JOIN T2-> WHERE t1.c2 > 50\ gateway * 1. Row * * EXPLAIN:-> Inner hash join (cost=0.70 rows=1)-> Table scan on T2 (cost=0.35 rows=1)-> Hash -> Filter: (t1.c2 > 50) (cost=0.35 rows=1)-> Table scan on T1 (cost=0.35 rows=1)

When configured by default, MySQL uses hash join whenever possible. It also provides two ways to control whether or not to use hash join:

Set the hash_join=on or hash_join=off option in the server system variable optimizer_switch at the global or session level. The default is hash_join=on.

Specifies that the optimizer prompts HASH_JOIN or NO_HASH_JOIN for a specific connection at the statement level.

The amount of memory allowed by hash join can be controlled through the system variable join_buffer_size; hash join does not use more than the amount set by this variable. If the memory required by hash join exceeds this threshold, MySQL will perform the operation on disk. It is important to note that if hash join cannot be completed in memory and the number of files opened exceeds the value of the system variable open_files_limit, the connection operation may fail. To solve this problem, you can use one of the following methods:

Increase the value of join_buffer_size to ensure that hash join can be done in memory.

Increase the value of open_files_limit.

Next, they compare the performance of hash join and block nested loop, first generating 1000000 records for T1, T2, and T3, respectively:

Set join_buffer_size=2097152000;SET @ @ cte_max_recursion_depth = 99999999 FROM t WHERE t.c1 insert INTO T1 FROM t WHERE t.c1-INSERT INTO T2 muri-INSERT INTO t3WITH RECURSIVE t AS (SELECT 1 AS C1, 1 AS c2 UNION ALL SELECT t.c1 + 1, t.c1 * 2 FROM t WHERE t.c1

< 1000000)SELECT * FROM t; 没有索引情况下的 hash join： mysql>

EXPLAIN ANALYZE-> SELECT COUNT (*)-> FROM T1-> JOIN T2-> ON (t1.c1 = t2.c1)-> JOIN T3-> ON (t2.c1 = t3.c1)\ ON * 1. Row * * EXPLAIN:-> Aggregate: count (0) (actual time=22993.098..22993.099 rows=1 loops=1)-> Inner hash join (t3.c1 = t1.c1) (cost=9952535443663536.00 rows=9952435908880402) (actual time=14489.176..21737.032 rows=1000000 loops=1)-> Table scan on T3 (cost=0.00 rows=998412) (actual time=0.103..3973.892 rows=1000000 loops=1)-> Hash-> Inner hash join (t2.c1 = t1.c1) (cost=99682753413.67 rows=99682653660) (actual time=5663.592..12236. 984 rows=1000000 loops=1)-> Table scan on T2 (cost=0.01 rows=998412) (actual time=0.067..3364.105 rows=1000000 loops=1)-> Hash-> Table scan on T1 (cost=100539.40 rows=998412) (actual time=0.133..3395.799 rows=1000000 loops=1) 1 row in set (23.22 sec) mysql > SELECT COUNT (*)-> FROM T1-> JOIN T2-> ON (t1.c1 = t2.c1)-> JOIN T3-> ON (t2.c1 = t3.c1) +-+ | COUNT (*) | +-+ | 1000000 | +-+ 1 row in set (12.98 sec)

The actual operation took 12.98 seconds. If you use block nested loop at this time:

Mysql > EXPLAIN FORMAT=TREE-> SELECT / * + NO_HASH_JOIN (T1, T2) T3) * / COUNT (*)-> FROM T1-> JOIN T2-> ON (t1.c1 = t2.c1)-> JOIN T3-> ON (t2.c1 = t3.c1)\ t2.c1 * 1. Row * * EXPLAIN: 1 row in set (0. 00 sec) SELECT / * + NO_HASH_JOIN (T1 T2, T3) * / COUNT (*) FROM T1 JOIN T2 ON (t1.c1 = t2.c1) JOIN T3 ON (t2.c1 = t3.c1)

EXPLAIN shows that hash join cannot be used. The query ran for dozens of minutes without results, and one of the CPU utilization reached 100%, because the nested loop (to the third power of 1000000) has been executed.

Take a look at the block nested loop method when there is an index, and add the index:

Mysql > CREATE index idx1 ON T1 (C1); Query OK, 0 rows affected (7.39 sec) Records: 0 Duplicates: 0 Warnings: 0mysql > CREATE index idx2 ON T2 (C1); Query OK, 0 rows affected (6.77 sec) Records: 0 Duplicates: 0 Warnings: 0mysql > CREATE index idx3 ON T3 (C1); Query OK, 0 rows affected (7.23 sec) Records: 0 Duplicates: 0 Warnings: 0

View the execution plan and run the same query statement:

Mysql > EXPLAIN ANALYZE-> SELECT COUNT (*)-> FROM T1-> JOIN T2-> ON (t1.c1 = t2.c1)-> JOIN T3-> ON (t2.c1 = t3.c1)\ ON * 1. Row * * EXPLAIN:-> Aggregate: Count (0) (actual time=47684.034..47684.035 rows=1 loops=1)-> Nested loop inner join (cost=2295573.22 rows=998412) (actual time=0.116..46363.599 rows=1000000 loops=1)-> Nested loop inner join (cost=1198056.31 rows=998412) (actual time=0.087..25788.696 rows=1000000 loops=1)-> Filter: (t1.c1 is not null) (cost=100539.40 rows=998412) (actual time=0.050..5557.847 rows=1000000 loops=1)-> Index scan on T1 Using idx1 (cost=100539.40 rows=998412) (actual time=0.043..3253.769 rows=1000000 loops=1)-> Index lookup on T2 using idx2 (c1=t1.c1) (cost=1.00 rows=1) (actual time=0.012..0.015 rows=1 loops=1000000)-> Index lookup on T3 using idx3 (c1=t1.c1) (cost=1.00 rows=1) (actual time=0.012..0.015 rows=1 loops=1000000) 1 row in set (47.68sec) mysql > SELECT COUNT (*)-> FROM T1-> JOIN T2-> ON (t1.c1 = t2.c1)-> JOIN T3-> ON (t2.c1 = t3.c1) +-+ | COUNT (*) | +-+ | 1000000 | +-+ 1 row in set (19.56 sec)

The actual operation took 19.56 seconds. So the test results in our scenario are as follows:

Hash Join (no index) Block Nested Loop (no index) Block Nested Loop (with index) 12.98 s did not return 19.56 s

Add another hash join result without index in Oracle 12c: 1.282 s.

Add another hash join result without index in PostgreSQL 11.5: 6.234 s.

Add another hash join result without index in SQL 2017: 5.207 s.

Summary

The above is the hash connection (Hash Join) of the new features of MySQL 8.0 introduced by the editor. I hope it will be helpful to you. If you have any questions, please leave me a message and the editor will reply to you in time. Thank you very much for your support to the website! If you think this article is helpful to you, you are welcome to reprint it, please indicate the source, thank you!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.