What is the database hash connection in MySQL? 07/02 Update SLTechnology News&Howtos

What is the database hash connection in MySQL?

2025-07-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/03 Report--

What is the database hash connection in MySQL? I believe many inexperienced people are at a loss about it. Therefore, this paper summarizes the causes and solutions of the problem. Through this article, I hope you can solve this problem.

Overview

For a long time, the only algorithm for MySQL to perform join is a variant of the nested Loop algorithm (nested loop algorithm), but the nested Loop algorithm is very inefficient in some scenarios, which is also a problem that MySQL has been criticized.

With the release of MySQL 8.0.18, MySQL Server can use hash joins (hash join). This article will briefly introduce how hash joins are implemented, see how it works in MySQL, when to use it, and what are the limitations.

Recommended study: MySQL tutorial

Introduction to hash connection

What is a hash connection?

Hash join is a join algorithm used in relational databases, which can only be used in connections with equal connection conditions (on a.b = c.b). It is usually more efficient than nested loop algorithms (except that the probe is very small), especially if the index is not hit.

To put it simply, the hash join algorithm first loads a small table into the memory hash table, then traverses the data of the large table, goes row by row to match the qualified data in the hash table, and returns it to the client.

(the hash table is just an example. The aspect understands that the key of the actual hash is the value of the join, and the value is the linked table of data rows.)

Hash joins are usually divided into two phases, the build phase (build phase) and the probe phase (probe phase). In the build phase, first select the appropriate table as the "build input", build the hash table, and then iterate through another "probe input" table record to probe the hash table to find records that meet the join conditions.

The above figure is an example of querying the corresponding province of a city. Let's assume that city is the build input. During the build phase, the server builds a city hash table, traverses the city table, and puts the rows into the hash table in turn, with the key hash (province_id), and the value is the corresponding city row. `

During the probe phase, the server begins to read rows from the probe input (province). For each row, the hash (province.province_id) value is used as the lookup key to probe the hash table to match the rows.

That is, when the build input can all be loaded into memory, scan each probe line only once, and use the constant time lookup to find the line that matches between the two inputs.

What if there is too much data to be put into memory?

Loading all the build input into memory is undoubtedly the most efficient, but in some cases, there is not enough memory to load the entire table into memory and needs to be processed in batches.

There are two common practices:

Load into memory in batches

1. Read records that can be held in maximum memory to create a hash table to build input to generate a hash table

two。 Traversing the probe input to make a full probe of this part of the hash table

3. Clean up the hash table and restart the process until all the processing is complete.

This approach causes the probe input to be scanned multiple times.

Write to file processing

1. When the memory is used up in the construction hash table phase, the server will write the remaining build input to many small files on disk. Small file blocks can be calculated to be read into memory and create a hash table (to prevent file blocks from being too large to be loaded into memory and need to be separated again)

two。 During the probe phase, since the probe line may match a row of the build input written to the disk, you also need to write the probe input to the disk

3. After the probe phase is complete, read the block file from disk and load it into the memory hash table, then read the response block file from the probe input and detect the match

4. After processing, move to the next pair of block files until all processing is complete.

Implementation of Hash connection in MySQL

MySQL selects the smaller of the two inputs as the build input (in bytes), loads the build input into memory for processing when there is enough memory, and writes to the file if there is not enough.

You can use the join_buffer_size system variable to control the memory usage of hash connections, which cannot exceed this amount, and when this amount is exceeded, MySQL will use files to process it.

If the memory exceeds join_buffer_size and the file exceeds open_files_limit, execution may fail.

You can use the following two solutions:

● increases join_buffer_size to prevent hash connections from overflowing to disk

● increases open_files_limit

When does MySQL use a hash connection?

In MySQL version 8.0.18, if tables are joined together with one or more equal join conditions, and there is no index available for join conditions, a hash join is used. If indexes are available, MySQL tends to use index lookups to support nested loops.

By default, MySQL uses hash connections as much as possible and can be enabled or turned off in two ways:

● sets the global or session variable (hash_join = on or hash_join = off)

SET optimizer_switch= "hash_join=off"

● uses hints (HASH_JOIN or NO_HASH_JOIN).

We will use the following query as an example:

EXPLAIN FORMAT = treeSELECT city.name AS city_name, province.name AS province_nameFROM city JOIN province ON city.province_id = province.province_id

The output is:

| |-> Inner hash join (city.province_id = province.province_id) (cost=1333.82 rows=1329)-> Table scan on city (cost=0.14 rows=391)-> Hash-> Table scan on province (cost=3.65 rows=34) |

Hash joins can also be used in multiple join queries, using hash joins as long as there is an equivalent join.

For example, the following query:

EXPLAIN FORMAT= TREESELECT city.name AS city_name, province.name AS province_name, country.name AS country_nameFROM city JOIN province ON city.province_id = province.province_id AND city.id

< 50 JOIN country ON province.province_id = country.id 输出为： | ->

Inner hash join (city.province_id = country.id) (cost=23.27 rows=2)-> Filter: (city.id

< 50) (cost=5.32 rows=5) ->

Index range scan on city using PRIMARY (cost=5.32 rows=49)-> Hash-> Inner hash join (province.province_id = country.id) (cost=4.00 rows=3)-> Table scan on province (cost=0.59 rows=34)-> Hash-> Table scan on country (cost=0.35 rows=1)

The hash join also applies to the "Cartesian product", that is, no query condition is specified, as follows:

EXPLAIN FORMAT= TREESELECT * FROM city JOIN province

The output is:

| |-> Inner hash join (cost=1333.82 rows=13294)-> Table scan on city (cost=1.17 rows=391)-> Hash-> Table scan on province (cost=3.65 rows=34) |

When does MySQL not use a hash connection?

1. Currently, MySQL hash joins only support inner joins, while reverse joins, semi-joins, and outer joins are still performed using block nesting loops.

two。 If indexes are available, MySQL will prefer to use index lookups to support nested loops

3. When no equivalent query exists, a nested loop is used.

As follows:

EXPLAIN FORMAT=TREESELECT * FROM city JOIN province ON city.province_id < province.province_id

The output is:

| |

How do I see if hash joins are used in statement execution?

EXPLAIN FORMAT= TREE is available in MySQL 8.0.16 and later, and TREE provides tree-like output that describes query processing more accurately than the traditional format, which is the only format that shows the use of hash joins.

In addition, you can also use EXPLAIN ANALYZE to view hash connection information.

After reading the above, have you mastered the method of database hash connection in MySQL? If you want to learn more skills or want to know more about it, you are welcome to follow the industry information channel, thank you for reading!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.