
PostgreSQL Source Code Interpretation (97) - Query Statement #79 (ExecHashJoin Function #5)

2025-03-28 Update. From: SLTechnology News&Howtos > Database


This article, the fifth part of the introduction to the ExecHashJoin function, mainly introduces the implementation logic of the functions that ExecHashJoin depends on in the HJ_NEED_NEW_BATCH phase, chiefly ExecHashJoinNewBatch.

I. Data structures

JoinState

Base class for Hash/NestLoop/Merge Join

/* ----------------
 *   JoinState information
 *
 *   Superclass for state nodes of join plans.
 *   Base class for Hash/NestLoop/Merge Join.
 * ----------------
 */
typedef struct JoinState
{
    PlanState   ps;             /* base class PlanState */
    JoinType    jointype;       /* join type */
    bool        single_match;   /* True if we should skip to next outer tuple
                                 * after finding one inner match */
    ExprState  *joinqual;       /* JOIN quals (in addition to ps.qual) */
} JoinState;

HashJoinState

Hash Join runtime state structure

/* these structs are defined in executor/hashjoin.h: */
typedef struct HashJoinTupleData *HashJoinTuple;
typedef struct HashJoinTableData *HashJoinTable;

typedef struct HashJoinState
{
    JoinState   js;             /* base class; its first field is NodeTag */
    ExprState  *hashclauses;    /* hash join conditions */
    List       *hj_OuterHashKeys;   /* outer key expressions; list of ExprState nodes */
    List       *hj_InnerHashKeys;   /* inner key expressions; list of ExprState nodes */
    List       *hj_HashOperators;   /* list of operator OIDs */
    HashJoinTable hj_HashTable;     /* hash table */
    uint32      hj_CurHashValue;    /* current hash value */
    int         hj_CurBucketNo;     /* current bucket number */
    int         hj_CurSkewBucketNo; /* current skew bucket number */
    HashJoinTuple hj_CurTuple;      /* current tuple */
    TupleTableSlot *hj_OuterTupleSlot;      /* outer relation slot */
    TupleTableSlot *hj_HashTupleSlot;       /* hash tuple slot */
    TupleTableSlot *hj_NullOuterTupleSlot;  /* outer virtual slot for outer joins */
    TupleTableSlot *hj_NullInnerTupleSlot;  /* inner virtual slot for outer joins */
    TupleTableSlot *hj_FirstOuterTupleSlot; /* first outer tuple slot */
    int         hj_JoinState;       /* JoinState status */
    bool        hj_MatchedOuter;    /* whether the outer tuple has a match */
    bool        hj_OuterNotEmpty;   /* whether the outer relation is non-empty */
} HashJoinState;

HashJoinTable

Hash table data structure

typedef struct HashJoinTableData
{
    int         nbuckets;       /* # buckets in the in-memory hash table */
    int         log2_nbuckets;  /* its log2 (nbuckets must be a power of 2) */
    int         nbuckets_original;  /* # buckets when starting the first hash */
    int         nbuckets_optimal;   /* optimal # buckets (per batch) */
    int         log2_nbuckets_optimal;  /* log2(nbuckets_optimal) */

    /* buckets[i] is head of list of tuples in i'th in-memory bucket */
    union
    {
        /* unshared array is per-batch storage, as are all the tuples */
        struct HashJoinTupleData **unshared;
        /* shared array is per-query DSA area, as are all the tuples */
        dsa_pointer_atomic *shared;
    }           buckets;

    bool        keepNulls;      /* true to store unmatchable NULL tuples */
    bool        skewEnabled;    /* are we using skew optimization? */
    HashSkewBucket **skewBucket;    /* hashtable of skew buckets */
    int         skewBucketLen;  /* size of skewBucket array (a power of 2!) */
    int         nSkewBuckets;   /* number of active skew buckets */
    int        *skewBucketNums; /* array indexes of active skew buckets */
    int         nbatch;         /* number of batches */
    int         curbatch;       /* current batch #; 0 during 1st pass */
    int         nbatch_original;    /* nbatch when we started inner scan */
    int         nbatch_outstart;    /* nbatch when we started outer scan */
    bool        growEnabled;    /* flag to shut off nbatch increases */
    double      totalTuples;    /* # tuples obtained from inner plan */
    double      partialTuples;  /* # tuples obtained from inner plan by me */
    double      skewTuples;     /* # tuples inserted into skew buckets */

    /*
     * These arrays are allocated for the life of the hash join, but only if
     * nbatch > 1.  A file is opened only when we first write a tuple into it
     * (otherwise its pointer remains NULL).  Note that the zero'th array
     * elements never get used, since we will process rather than dump out
     * any tuples of batch zero.
     */
    BufFile   **innerBatchFile; /* buffered virtual temp file per batch */
    BufFile   **outerBatchFile; /* buffered virtual temp file per batch */

    /*
     * Info about the datatype-specific hash functions for the datatypes
     * being hashed.  These are arrays of the same length as the number of
     * hash join clauses (hash keys).
     */
    FmgrInfo   *outer_hashfunctions;    /* lookup data for hash functions */
    FmgrInfo   *inner_hashfunctions;    /* lookup data for hash functions */
    bool       *hashStrict;     /* is each hash join operator strict? */

    Size        spaceUsed;      /* memory space currently used by tuples */
    Size        spaceAllowed;   /* upper limit for space used */
    Size        spacePeak;      /* peak space used */
    Size        spaceUsedSkew;  /* skew hash table's current space usage */
    Size        spaceAllowedSkew;   /* upper limit for skew hashtable */

    MemoryContext hashCxt;      /* context for whole-hash-join storage */
    MemoryContext batchCxt;     /* context for this-batch-only storage */

    /* used for dense allocation of tuples (into linked chunks) */
    HashMemoryChunk chunks;     /* one list for the whole batch */

    /* Shared and private state for Parallel Hash. */
    HashMemoryChunk current_chunk;  /* this backend's current chunk */
    dsa_area   *area;           /* DSA area to allocate memory from */
    ParallelHashJoinState *parallel_state;  /* parallel execution state */
    ParallelHashJoinBatchAccessor *batches; /* parallel batch accessors */
    dsa_pointer current_chunk_shared;   /* shared pointer to the current chunk */
} HashJoinTableData;

typedef struct HashJoinTableData *HashJoinTable;
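As the comments stress, nbuckets must be a power of 2 so that its log2 can be stored alongside it and bucket selection can use a bit mask. A minimal sketch of that relationship; the helper name is ours, not the source's:

```c
#include <assert.h>

/* Sketch (hypothetical helper, modeled on the idea behind log2_nbuckets):
 * returns ceil(log2(num)), i.e. the exponent of the smallest power of 2
 * that is >= num.  For an exact power of 2 this is its log2. */
static int
my_log2_sketch(long num)
{
    int     i;
    long    limit;

    for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
        ;
    return i;
}
```

With nbuckets = 16384 this yields log2_nbuckets = 14, and the power-of-2 property itself can be checked with `(nbuckets & (nbuckets - 1)) == 0`.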

HashJoinTupleData

Hash join tuple data

/* ----------------------------------------------------------------
 *              hash-join hash table structures
 *
 * Each active hashjoin has a HashJoinTable control block, which is
 * palloc'd in the executor's per-query context.  All other storage needed
 * for the hashjoin is kept in private memory contexts, two for each
 * hashjoin.  This makes it easy and fast to release the storage when we
 * don't need it anymore.  (Exception: data associated with the temp files
 * lives in the per-query context too, since we always call buffile.c in
 * that context.)
 *
 * The hashtable contexts are made children of the per-query context,
 * ensuring that they will be discarded at end of statement even if the
 * join is aborted early by an error.  (Likewise, any temporary files we
 * make will be cleaned up by the virtual file manager in event of an
 * error.)
 *
 * Storage that should live through the entire join is allocated from the
 * "hashCxt", while storage that is only wanted for the current batch is
 * allocated in the "batchCxt".  By resetting the batchCxt at the end of
 * each batch, we free all the per-batch storage reliably and without
 * tedium.
 *
 * During first scan of inner relation, we get its tuples from executor.
 * If nbatch > 1 then tuples that don't belong in first batch get saved
 * into inner-batch temp files.  The same statements apply for the first
 * scan of the outer relation, except we write tuples to outer-batch temp
 * files.  After finishing the first scan, we do the following for each
 * remaining batch:
 *  1. Read tuples from inner batch file, load into hash buckets.
 *  2. Read tuples from outer batch file, match to hash buckets and output.
 *
 * It is possible to increase nbatch on the fly if the in-memory hash
 * table gets too big.  The hash-value-to-batch computation is arranged so
 * that this can only cause a tuple to go into a later batch than
 * previously thought, never into an earlier batch.  When we increase
 * nbatch, we rescan the hash table and dump out any tuples that are now
 * of a later batch to the correct inner batch file.  Subsequently, while
 * reading either inner or outer batch files, we might find tuples that no
 * longer belong to the current batch; if so, we just dump them out to the
 * correct batch file.
 * ----------------------------------------------------------------
 */

/* these are in nodes/execnodes.h: */
/* typedef struct HashJoinTupleData *HashJoinTuple; */
/* typedef struct HashJoinTableData *HashJoinTable; */

typedef struct HashJoinTupleData
{
    /* link to next tuple in same bucket */
    union
    {
        struct HashJoinTupleData *unshared;
        dsa_pointer shared;
    }           next;
    uint32      hashvalue;      /* tuple's hash code */
    /* Tuple data, in MinimalTuple format, follows on a MAXALIGN boundary */
} HashJoinTupleData;

#define HJTUPLE_OVERHEAD  MAXALIGN(sizeof(HashJoinTupleData))
#define HJTUPLE_MINTUPLE(hjtup)  \
    ((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))

II. Source code interpretation

ExecHashJoinNewBatch

Switch to a new hashjoin batch. Returns true if successful, false if there are no more batches.

/*----------------------------------------------------------------------
 * HJ_FILL_OUTER_TUPLE stage   -- see ExecHashJoin
 * HJ_FILL_INNER_TUPLES stage  -- see ExecHashJoin
 * HJ_NEED_NEW_BATCH stage
 *----------------------------------------------------------------------*/

/*
 * ExecHashJoinNewBatch
 *      switch to a new hashjoin batch
 *
 * Returns true if successful, false if there are no more batches.
 */
static bool
ExecHashJoinNewBatch(HashJoinState *hjstate)
{
    HashJoinTable hashtable = hjstate->hj_HashTable;    /* hash table */
    int         nbatch;         /* number of batches */
    int         curbatch;       /* current batch */
    BufFile    *innerFile;      /* inner relation batch file */
    TupleTableSlot *slot;
    uint32      hashvalue;      /* hash value */

    nbatch = hashtable->nbatch;
    curbatch = hashtable->curbatch;

    if (curbatch > 0)
    {
        /*
         * We no longer need the previous outer batch file; close it right
         * away to free disk space.
         */
        if (hashtable->outerBatchFile[curbatch])
            BufFileClose(hashtable->outerBatchFile[curbatch]);
        hashtable->outerBatchFile[curbatch] = NULL;
    }
    else                        /* we just finished the first batch */
    {
        /*
         * Reset some of the skew optimization state variables, since we no
         * longer need to consider skew tuples after the first batch.  The
         * memory context reset we are about to do will release the skew
         * hashtable itself.
         */
        hashtable->skewEnabled = false;
        hashtable->skewBucket = NULL;
        hashtable->skewBucketNums = NULL;
        hashtable->nSkewBuckets = 0;
        hashtable->spaceUsedSkew = 0;
    }

    /*
     * We can always skip over any batches that are completely empty on both
     * sides.  We can sometimes skip over batches that are empty on only one
     * side, but there are exceptions:
     *
     * 1. In a left/full outer join, we have to process outer batches even if
     * the inner batch is empty.  Similarly, in a right/full outer join, we
     * have to process inner batches even if the outer batch is empty.
     *
     * 2. If we have increased nbatch since the initial estimate, we have to
     * scan inner batches since they might contain tuples that need to be
     * reassigned to later inner batches.
     *
     * 3. Similarly, if we have increased nbatch since starting the outer
     * scan, we have to rescan outer batches in case they contain tuples that
     * need to be reassigned.
     */
    curbatch++;
    while (curbatch < nbatch &&
           (hashtable->outerBatchFile[curbatch] == NULL ||
            hashtable->innerBatchFile[curbatch] == NULL))
    {
        /* ... skip-or-process logic for this batch ... */
        curbatch++;
    }
    /* ... */

III. Tracking analysis

Inside the batch-skip loop, inspect the outer batch file:

(gdb)
997         while (curbatch < nbatch &&
(gdb) p hashtable->outerBatchFile[curbatch]
$7 = (BufFile *) 0x1d85290
(gdb) p hashtable->outerBatchFile[curbatch]
$8 = (BufFile *) 0x1d85290
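The three numbered rules in the comment above, plus the "both sides non-empty" case, can be restated as a small predicate. This is a hypothetical helper written for illustration, not a function in the PostgreSQL source; it answers whether a batch must be processed, given which temp files exist for it:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: must this batch be processed, or can it be skipped?
 * (Hypothetical restatement of the rules; parameter names are ours.) */
static bool
must_process_batch(bool has_outer_file, bool has_inner_file,
                   bool fill_outer,             /* left/full outer join */
                   bool fill_inner,             /* right/full outer join */
                   bool nbatch_grew_since_inner_scan,
                   bool nbatch_grew_since_outer_scan)
{
    if (has_outer_file && fill_outer)
        return true;            /* rule 1: outer side must be emitted */
    if (has_inner_file && fill_inner)
        return true;            /* rule 1: inner side must be emitted */
    if (has_inner_file && nbatch_grew_since_inner_scan)
        return true;            /* rule 2: inner tuples may move to later batches */
    if (has_outer_file && nbatch_grew_since_outer_scan)
        return true;            /* rule 3: outer tuples may move to later batches */

    /* otherwise, process only if both sides actually have tuples */
    return has_outer_file && has_inner_file;
}
```

A batch with both files present is always processed; a batch empty on one side is skipped unless one of the rules applies.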

Set the current batch and rebuild the Hash table

(gdb)
1023        if (curbatch >= nbatch)
(gdb)
1026        hashtable->curbatch = curbatch;
(gdb)
1031        ExecHashTableReset(hashtable);

Get the inner relation's batch temporary file

(gdb)
1033        innerFile = hashtable->innerBatchFile[curbatch];
(gdb)
1035        if (innerFile != NULL)
(gdb) p innerFile
$9 = (BufFile *) 0x1cc0540

The temporary file is not NULL; read the file

(gdb) n
1037            if (BufFileSeek(innerFile, 0L, SEEK_SET))
(gdb)
1042            while ((slot = ExecHashJoinGetSavedTuple(hjstate,

Enter the function ExecHashJoinGetSavedTuple

(gdb) step
ExecHashJoinGetSavedTuple (hjstate=0x1c40fd8, file=0x1cc0540, hashvalue=0x7ffeace60824, tupleSlot=0x1c4cc20) at nodeHashjoin.c:1259
1259        CHECK_FOR_INTERRUPTS();
(gdb)

ExecHashJoinGetSavedTuple-> read the 8-byte header (two uint32 values: the hash value and the tuple length)

(gdb) n
1266        nread = BufFileRead(file, (void *) header, sizeof(header));
(gdb)
1267        if (nread == 0)             /* end of file */
(gdb) p nread
$10 = 8
(gdb) n
1272        if (nread != sizeof(header))
(gdb)

ExecHashJoinGetSavedTuple-> get the hash value (1978434688)

(gdb)
1276        *hashvalue = header[0];
(gdb) n
1277        tuple = (MinimalTuple) palloc(header[1]);
(gdb) p *hashvalue
$11 = 1978434688

ExecHashJoinGetSavedTuple-> set the tuple length and inspect the (not yet filled) tuple

(gdb) n
1278        tuple->t_len = header[1];
(gdb)
1281                             header[1] - sizeof(uint32));
(gdb) p tuple->t_len
$16 = 24
(gdb) p *tuple
$17 = {t_len = 24, mt_padding = "\177\177\177\177\177\177", t_infomask2 = 32639,
  t_infomask = 32639, t_hoff = 127 '\177', t_bits = 0x1c5202f "\177\177\177..."}
(gdb)

ExecHashJoinGetSavedTuple-> read the tuple data from the file according to its size

(gdb) n
1279        nread = BufFileRead(file,
(gdb)
1282        if (nread != header[1] - sizeof(uint32))
(gdb) p header[1]
$18 = 24
(gdb) p sizeof(uint32)
$19 = 4
(gdb) p *tuple
$20 = {t_len = 24, mt_padding = "\000\000\000", t_infomask2 = 3, t_infomask = 2,
  t_hoff = 24 '\030', t_bits = 0x1c5202f ""}

ExecHashJoinGetSavedTuple-> store the tuple in the slot and finish the call

(gdb) n
1286        return ExecStoreMinimalTuple(tuple, tupleSlot, true);
(gdb)
1287    }
(gdb) n
ExecHashJoinNewBatch (hjstate=0x1c40fd8) at nodeHashjoin.c:1051
1051                ExecHashTableInsert(hashtable, slot, hashvalue);
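The read path traced above implies a simple record format in the batch temp files: a uint32 hash value, followed by the tuple, whose own first uint32 is its total length t_len (so only t_len - 4 more bytes remain to be read). A minimal stand-in sketch of that round trip, using plain stdio instead of BufFile and a simplified fixed-size tuple; all names here are ours:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for a MinimalTuple: total length, then payload.
 * With 20 payload bytes, t_len = 24, matching the traced session. */
typedef struct
{
    uint32_t    t_len;          /* total record length, including this field */
    char        data[20];       /* simplified payload */
} MiniTuple;

/* Write one record: hash value, then the whole tuple (t_len bytes). */
static void
save_tuple(FILE *f, uint32_t hashvalue, const MiniTuple *tup)
{
    fwrite(&hashvalue, sizeof(uint32_t), 1, f);
    fwrite(tup, tup->t_len, 1, f);
}

/* Read one record back: header[0] = hash value, header[1] = t_len,
 * then the remaining t_len - sizeof(uint32) bytes of tuple data.
 * Returns 1 on success, 0 at end of file. */
static int
load_tuple(FILE *f, uint32_t *hashvalue, MiniTuple *tup)
{
    uint32_t    header[2];

    if (fread(header, sizeof(header), 1, f) != 1)
        return 0;               /* end of file */
    *hashvalue = header[0];
    tup->t_len = header[1];
    /* the length word was already consumed as header[1] */
    if (fread((char *) tup + sizeof(uint32_t),
              header[1] - sizeof(uint32_t), 1, f) != 1)
        return 0;
    return 1;
}
```

This mirrors why the trace shows nread = 8 for the header and then a read of header[1] - sizeof(uint32) = 20 bytes for a t_len of 24.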

Insert into the Hash table

(gdb)
1051                ExecHashTableInsert(hashtable, slot, hashvalue);

Enter ExecHashTableInsert

(gdb) step
ExecHashTableInsert (hashtable=0x1c6e1c0, slot=0x1c4cc20, hashvalue=3757101760) at nodeHash.c:1593
1593        MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot);
(gdb)

ExecHashTableInsert-> get the batch number and hash bucket number

(gdb) n
1597        ExecHashGetBucketAndBatch(hashtable, hashvalue,
(gdb)
1603        if (batchno == hashtable->curbatch)
(gdb) p batchno
$21 = 1
(gdb) p bucketno
$22 = 21184
(gdb) p hashtable->curbatch
$23 = 1
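ExecHashGetBucketAndBatch splits the hash value with bit masks: the low log2_nbuckets bits select the bucket, and the next bits select the batch (both counts are powers of 2, which is why the struct stores log2_nbuckets). A sketch of that computation; the concrete values nbuckets = 32768 (log2 = 15) and nbatch = 2 are assumptions chosen to be consistent with the bucketno/batchno printed above, since the session does not print them:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the bucket/batch split described above (our own function,
 * modeled on the scheme the text describes). */
static void
get_bucket_and_batch(uint32_t hashvalue,
                     int nbuckets, int log2_nbuckets, int nbatch,
                     int *bucketno, int *batchno)
{
    if (nbatch > 1)
    {
        /* low bits pick the bucket, the next bits pick the batch */
        *bucketno = (int) (hashvalue & (uint32_t) (nbuckets - 1));
        *batchno = (int) ((hashvalue >> log2_nbuckets) & (uint32_t) (nbatch - 1));
    }
    else
    {
        *bucketno = (int) (hashvalue & (uint32_t) (nbuckets - 1));
        *batchno = 0;
    }
}
```

Because the batch number comes from bits above the bucket bits, growing nbatch can only move a tuple to a later batch, never an earlier one, exactly as the HashJoinTupleData comments explain.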

ExecHashTableInsert-> the batch number matches the hash table's current batch, so put the tuple into the hash table.

The number of normal (non-skew) tuples = 100000

(gdb) n
1610        double      ntuples = (hashtable->totalTuples - hashtable->skewTuples);
(gdb) n
1613        hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
(gdb) p ntuples
$24 = 100000
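hashTupleSize is HJTUPLE_OVERHEAD (the MAXALIGN'ed HashJoinTupleData header) plus the tuple's t_len. A sketch of that arithmetic with simplified stand-in types and our own names, assuming a typical 64-bit build where MAXALIGN rounds up to 8 bytes; with the traced t_len = 24 this gives 16 + 24 = 40 bytes per entry:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of MAXALIGN for an 8-byte maximum alignment (assumption for a
 * typical 64-bit build). */
#define MAXIMUM_ALIGNOF_SKETCH 8
#define MAXALIGN_SKETCH(LEN) \
    (((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF_SKETCH - 1)) & \
     ~((uintptr_t) (MAXIMUM_ALIGNOF_SKETCH - 1)))

/* Simplified stand-in for HashJoinTupleData: an 8-byte next link
 * (pointer or dsa_pointer) plus the 4-byte hash value; padding brings
 * sizeof to 16. */
typedef struct HashJoinTupleSketch
{
    union
    {
        struct HashJoinTupleSketch *unshared;
        uint64_t    shared;
    }           next;
    uint32_t    hashvalue;
    /* MinimalTuple data follows on a MAXALIGN boundary */
} HashJoinTupleSketch;

#define HJTUPLE_OVERHEAD_SKETCH MAXALIGN_SKETCH(sizeof(HashJoinTupleSketch))
```

So each inserted entry costs the 16-byte aligned header plus the MinimalTuple body, which is what spaceUsed is later incremented by.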

ExecHashTableInsert-> create the HashJoinTuple and clear the tuple's match flag

(gdb) n
1614        hashTuple = (HashJoinTuple) dense_alloc(hashtable, hashTupleSize);
(gdb)
1616        hashTuple->hashvalue = hashvalue;
(gdb)
1617        memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
(gdb)
1625        HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
(gdb)

ExecHashTableInsert-> the tuple is placed at the head of the hash table's bucket list

(gdb) n
1628        hashTuple->next.unshared = hashtable->buckets.unshared[bucketno];
(gdb)
1629        hashtable->buckets.unshared[bucketno] = hashTuple;
(gdb)
1636        if (hashtable->nbatch == 1 &&
(gdb)
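The two traced statements are a push onto the front of the bucket's singly linked list: the new entry points at the old head, then becomes the new head. A simplified sketch with our own types:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a bucket-chain entry. */
typedef struct TupleNode
{
    struct TupleNode *next;     /* next tuple in the same bucket */
    uint32_t    hashvalue;
} TupleNode;

/* Sketch of the traced insert: the new tuple becomes the head of
 * buckets[bucketno], exactly as lines 1628-1629 above do. */
static void
bucket_push_front(TupleNode **buckets, int bucketno, TupleNode *t)
{
    t->next = buckets[bucketno];    /* hashTuple->next = buckets[bucketno] */
    buckets[bucketno] = t;          /* buckets[bucketno] = hashTuple */
}
```

Push-front is O(1) and needs no tail pointer, which is why the bucket arrays only store list heads.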

ExecHashTableInsert-> update the hash table's space usage (recording the peak if needed) and return to ExecHashJoinNewBatch

(gdb)
1649        hashtable->spaceUsed += hashTupleSize;
(gdb) ...
(gdb)
1667    }
(gdb) n
ExecHashJoinNewBatch (hjstate=0x1c40fd8) at nodeHashjoin.c:1042
1042            while ((slot = ExecHashJoinGetSavedTuple(hjstate,

Loop, inserting the saved tuples into the hash table one by one

1042            while ((slot = ExecHashJoinGetSavedTuple(hjstate,
(gdb) n
1051                ExecHashTableInsert(hashtable, slot, hashvalue);

DONE!

IV. Reference materials

Hash Joins: Past, Present and Future/PGCon 2017

A Look at How Postgres Executes a Tiny Join-Part 1

A Look at How Postgres Executes a Tiny Join-Part 2

Assignment 2 Symmetric Hash Join
