What are the Cost functions used to calculate merge join in PostgreSQL 07/06 Update SLTechnology News&Howtos

What are the Cost functions used to calculate merge join in PostgreSQL

2025-07-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

< 0) rescannedtuples = 0; } /* 对于重复扫描,需要额外增加成本.We'll inflate various costs this much to account for rescanning */ rescanratio = 1.0 + (rescannedtuples / inner_path_rows); /* * Decide whether we want to materialize the inner input to shield it from * mark/restore and performing re-fetches. Our cost model for regular * re-fetches is that a re-fetch costs the same as an original fetch, * which is probably an overestimate; but on the other hand we ignore the * bookkeeping costs of mark/restore. Not clear if it's worth developing * a more refined model. So we just need to inflate the inner run cost by * rescanratio. * 决定是否要物化inner relation的访问路径，以使其不受mark/restore和执行重新获取操作的影响。 * 对于定期重取的成本模型是，一次重取的成本与一次原始取回的成本相同，当然这可能高估了该成本; * 但另一方面，忽略了mark/restore的簿记成本。 * 目前尚不清楚是否值得开发一个更完善的模型。所以只需要把内部运行成本乘上相应的比例。 */ bare_inner_cost = inner_run_cost * rescanratio; /* * When we interpose a Material node the re-fetch cost is assumed to be * just cpu_operator_cost per tuple, independently of the underlying * plan's cost; and we charge an extra cpu_operator_cost per original * fetch as well. Note that we're assuming the materialize node will * never spill to disk, since it only has to remember tuples back to the * last mark. (If there are a huge number of duplicates, our other cost * factors will make the path so expensive that it probably won't get * chosen anyway.) So we don't use cost_rescan here. * 当插入一个物化节点时，重新取回的成本被假设为cpu_operator_cost/每个元组，该成本独立于底层计划的成本; * 对每次原始取回增加额外的cpu_operator_cost。 * 注意，我们假设物化节点永远不会溢出到磁盘上，因为它只需要记住元组的最后一个标记。 * (如果有大量的重复项，其他成本因素会使路径变得非常昂贵，可能无论如何都不会被选中。) * 因此在这里,我们不调用cost_rescan。 * * Note: keep this estimate in sync with create_mergejoin_plan's labeling * of the generated Material node. * 注意:与create_mergejoin_plan函数生成的物化节点标记保持一致。 */ mat_inner_cost = inner_run_cost + cpu_operator_cost * inner_path_rows * rescanratio; /* * If we don't need mark/restore at all, we don't need materialization. * 如果不需要mark/restore,那么也不需要物化 */ if (path->

Skip_mark_restore) path- > materialize_inner = false; / * * Prefer materializing if it looks cheaper, unless the user has asked to * suppress materialization. * if materialization seems to cost less, then choose materialization, as long as the user allows materialization. * / else if (enable_material & & mat_inner_cost

< bare_inner_cost) path->

Materialize_inner = true; / * * Even if materializing doesn't look cheaper, we * must* do it if the * inner path is to be used directly (without sorting) and it doesn't * support mark/restore. * even if the cost of materialization seems not low, * if the access path of inner relation is directly used (no sorting) and does not support mark/restore, we must do so. * * Since the inner side must be ordered, and only Sorts and IndexScans can * create order to begin with, and they both support mark/restore, you * might think there's no problem-but you'd be wrong. Nestloop and * merge joins can * preserve* the order of their inputs, so they can be * selected as the input of a mergejoin, and they don't support * mark/restore at present. * because the inner side must be ordered, and only Sorts and IndexScan can create sorting from the beginning, * and they both support mark/restore, you might think there's nothing wrong with doing this-- but wrong. * Nestloop join and merge join can "preserve" their input order, * so they can be selected as inputs to merge join, and they do not currently support mark/restore. * * We don't test the value of enable_material here, because * materialization is required for correctness in this case, and turning * it off does not entitle us to deliver an invalid plan. * ignore the enable_material setting here, because in this case, priority is given to ensuring correctness, * and turning it off will cause the optimizer to produce an invalid execution plan. * / else if (innersortkeys = = NIL & &! ExecSupportsMarkRestore (inner_path)) path- > materialize_inner = true; / * * Also, force materializing if the inner path is to be sorted and the * sort is expected to spill to disk This is because the final merge * pass can be done on-the-fly if it doesn't have to support mark/restore. * We don't try to adjust the cost estimates for this consideration, * though. * in addition, it is enforced if the inner relation access path is to be sorted and the sort may overflow to disk. * this is because if you do not need to support mark/restore, the final merge process can be completed in operation. * however, we have not tried to adjust the cost estimate for this purpose. * * Since materialization is a performance optimization in this case, * rather than necessary for correctness, we skip it if enable_material is * off. * because in this case, materialization is a performance optimization, not a necessary condition to ensure correctness, * so if enable_material is turned off, ignore it. * / else if (enable_material & & innersortkeys! = NIL & & relation_byte_size (inner_path_rows, inner_path- > pathtarget- > width) > (work_mem * 1024L)) path- > materialize_inner = true; else path- > materialize_inner = false / * adjust the runtime cost. Charge the right incremental cost for the chosen case * / if (path- > materialize_inner) run_cost + = mat_inner_cost; else run_cost + = bare_inner_cost; / * CPU cost calculation. CPU costs * / * The number of tuple comparisons needed is approximately number of outer * rows plus number of inner rows plus number of rescanned tuples (can we * refine this?). At each one, we need to evaluate the mergejoin quals. * the number of tuples to be compared is approximately the number of external rows plus the number of internal rows plus the number of rescan tuples (can it be improved?) * at each point, the number of merged tuples needs to be calculated. * / startup_cost + = merge_qual_cost.startup; startup_cost + = merge_qual_cost.per_tuple * (outer_skip_rows + inner_skip_rows * rescanratio); run_cost + = merge_qual_cost.per_tuple * ((outer_rows-outer_skip_rows) + (inner_rows-inner_skip_rows) * rescanratio) / * * For each tuple that gets through the mergejoin proper, we charge * cpu_tuple_cost plus the cost of evaluating additional restriction * clauses that are to be applied at the join. (This is pessimistic since * not all of the quals may get evaluated at each tuple.) * for each tuple obtained through a merge connection, the cost of each tuple is cpu_tuple_cost, plus the cost of additional restrictions that will be applied on the connection. * (this is pessimistic, of course, because not all conditions can be applied to every tuple.) * * Note: we could adjust for SEMI/ANTI joins skipping some qual * evaluations here, but it's probably not worth the trouble. * Note: we can adjust the half / reverse connection to skip some conditional evaluation, but it may not be worth it. * / startup_cost + = qp_qual_cost.startup; cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple; run_cost + = cpu_per_tuple * mergejointuples; / * the projection column is estimated by the output line rather than the scanned tuple .tlist eval costs are paid per output row, not per tuple scanned * / startup_cost + = path- > jpath.path.pathtarget- > cost.startup Run_cost + = path- > jpath.path.pathtarget- > cost.per_tuple * path- > jpath.path.rows; path- > jpath.path.startup_cost = startup_cost; path- > jpath.path.total_cost = startup_cost + run_cost;} III. Tracking analysis

The test script is as follows

Select a.from t_grxx on t1.dwbh b.grbhjje from t_dwxx a, lateral (select t1.dwbhreign t1.grbhrecoveryt2.je from t_grxx T1 inner join t_jfxx T2 on t1.dwbh = a.dwbh and t1.grbh = t2.grbh) border by b.dwbh

Start gdb and set breakpoint

(gdb) b try_mergejoin_pathBreakpoint 1 at 0x7aeeaf: file joinpath.c, line 572. (gdb) cContinuing.Breakpoint 1, try_mergejoin_path (root=0x166c880, joinrel=0x16864d0, outer_path=0x167f190, inner_path=0x167f9d0, pathkeys=0x1 innersortkeys=0x1686c28, jointype=JOIN_INNER, extra=0x7ffea604f500, is_partial=false) at joinpath.c:572572 if (is_partial)

Enter the initial_cost_mergejoin function

(gdb) 615 initial_cost_mergejoin (root, & workspace, jointype, mergeclauses, (gdb) stepinitial_cost_mergejoin (root=0x166c880, workspace=0x7ffea604f360, jointype=JOIN_INNER, mergeclauses=0x1686bc8, outer_path=0x167f190, inner_path=0x167f9d0, outersortkeys=0x1686b68, innersortkeys=0x1686c28, extra=0x7ffea604f500) at costsize.c:26072607 Cost startup_cost = 0

Initialization parameter

2607 Cost startup_cost = 0; (gdb) n2608 Cost run_cost = 0.

There is a merge condition, JOIN_INNER connection, enter the corresponding branch

... 2639 if (mergeclauses & & jointype! = JOIN_FULL) (gdb) 2641 RestrictInfo * firstclause = (RestrictInfo *) linitial (mergeclauses);

View the constraints, that is, t_dwxx.dwbh = t_grxx.dwbh

(gdb) set $arg1= (RelabelType *) ((OpExpr *) firstclause- > clause)-> args- > head- > data.ptr_value (gdb) set $arg2= (RelabelType *) ((OpExpr *) firstclause- > clause)-> args- > head- > next- > data.ptr_value (gdb) p * (Var *) $arg1- > arg$9 = {xpr = {type = T_Var}, varno = 1, varattno = 2, vartype = 1043, vartypmod = 24, varcollid = 100,varlevelsup = 0, varnoold = 1, varoattno = 2 Location = 218} (gdb) p * (Var *) $arg2- > arg$10 = {xpr = {type = T_Var}, varno = 3, varattno = 1, vartype = 1043, vartypmod = 14, varcollid = 100, varlevelsup = 0, varnoold = 3, varoattno = 1, location = 208}

Get the sort keys for the connected outer and inner relation

(gdb) 2649 opathkeys = outersortkeys? Outersortkeys: outer_path- > pathkeys; (gdb) n2650 ipathkeys = innersortkeys? Innersortkeys: inner_path- > pathkeys;...

Get the cache selection rate

(gdb) 2663 cache = cached_scansel (root, firstclause, opathkey); (gdb) 2666 outer_path- > parent- > relids)) (gdb) p * cache$15 = {opfamily = 1994, collation = 100, strategy = 1, nulls_first = false, leftstartsel = 0, leftendsel = 0.99989010989010996, rightstartsel = 0.0091075159436798652, rightendsel = 1}

Selection rate assignment, outer relation on the left side of the connection clause

2665 if (bms_is_subset (firstclause- > left_relids, (gdb) 2669 outerstartsel = cache- > leftstartsel; (gdb) 2670 outerendsel = cache- > leftendsel; (gdb) 2671 innerstartsel = cache- > rightstartsel; (gdb) 2672 innerendsel = cache- > rightendsel

Convert the selection rate to the number of rows

(gdb) 2705 outer_skip_rows = rint (outer_path_rows * outerstartsel); (gdb) 2706 inner_skip_rows = rint (inner_path_rows * innerstartsel); (gdb) 2707 outer_rows = clamp_row_est (outer_path_rows * outerendsel); (gdb) 2708 inner_rows = clamp_row_est (inner_path_rows * innerendsel); (gdb) p outer_skip_rows$16 = 0 (gdb) p inner_skip_rows$17 = 9999 (gdb) p inner_rows$19 = 100000

Calculate the sorting cost of outer relation and assign a value

... (gdb) 2728 if (outersortkeys) / * do we need to sort outer? * / (gdb) 2730 cost_sort (& sort_path, (gdb) n2735 outer_path- > pathtarget- > width,... (gdb) n2739 startup_cost + = sort_path.startup_cost (gdb) p sort_path$24 = {type = T_Invalid, pathtype = T_Invalid, parent = 0x167f9d0, pathtarget = 0x0, param_info = 0x0, parallel_aware = false, parallel_safe = false, parallel_workers = 0, rows = 10000, startup_cost = 828.38561897747286, total_cost = 853.38561897747286, pathkeys = 0x167f9d0}

Calculate the sorting cost of inner relation and assign a value

... 2754 if (innersortkeys) / * do we need to sort inner? * / (gdb) 2756 cost_sort (& sort_path,... (gdb) p sort_path$25 = {type = T_Invalid, pathtype = T_Invalid, parent = 0x167f9d0, pathtarget = 0x0, param_info = 0x0, parallel_aware = false, parallel_safe = false, parallel_workers = 0, rows = 100000, startup_cost = 10030.82023721841, total_cost = 10280.82023721841, pathkeys = 0x167f9d0}

Assign a value to workspace to end the initial_cost_mergejoin function call.

(gdb) 2791 workspace- > startup_cost = startup_cost; (gdb) 2792 workspace- > total_cost = startup_cost + run_cost + inner_run_cost; (gdb) 2794 workspace- > run_cost = run_cost; (gdb) 2795 workspace- > inner_run_cost = inner_run_cost; (gdb) 2796 workspace- > outer_rows = outer_rows; (gdb) 2797 workspace- > inner_rows = inner_rows; (gdb) 2798 workspace- > outer_skip_rows = outer_skip_rows (gdb) 2799 workspace- > inner_skip_rows = inner_skip_rows; (gdb) 2800}

Enter the add_path_precheck function

(gdb) ntry_mergejoin_path (root=0x166c880, joinrel=0x16864d0, outer_path=0x167f190, inner_path=0x167f9d0, pathkeys=0x1686b68, mergeclauses=0x1686bc8, outersortkeys=0x1686b68, innersortkeys=0x1686c28, jointype=JOIN_INNER, extra=0x7ffea604f500, is_partial=false) at joinpath.c:620620 if (add_path_precheck (joinrel, (gdb) stepadd_path_precheck (parent_rel=0x16864d0, startup_cost=10861.483356195882, total_cost=11134.203356195881, pathkeys=0x1686b68, required_outer=0x0) at pathnode.c:666666 new_path_pathkeys = required_outer? NIL: pathkeys

Parent_rel- > pathlist is NULL. End the call and return true.

671 foreach (p1, parent_rel- > pathlist) (gdb) p * parent_rel- > pathlistCannot access memory at address 0x0 (gdb) n719 return true

Enter the final_cost_mergejoin function

(gdb) b final_cost_mergejoinBreakpoint 2 at 0x7a00c9: file costsize.c, line 2834. (gdb) cContinuing.Breakpoint 2, final_cost_mergejoin (root=0x166c880, path=0x1686cb8, workspace=0x7ffea604f360, extra=0x7ffea604f500) at costsize.c:28342834 Path * outer_path = path- > jpath.outerjoinpath

View access paths for outer relation and inner relation

2834 Path * outer_path = path- > jpath.outerjoinpath; (gdb) n2835 Path * inner_path = path- > jpath.innerjoinpath; (gdb) 2836 double inner_path_rows = inner_path- > rows (gdb) p * outer_path$26 = {type = T_Path, pathtype = T_SeqScan, parent = 0x166c2c0, pathtarget = 0x1671670, param_info = 0x0, parallel_aware = false, parallel_safe = true, parallel_workers = 0, rows = 10000, startup_cost = 0, total_cost = 164,pathkeys = 0x0} (gdb) p * inner_path$27 = {type = T_Path, pathtype = T_SeqScan, parent = 0x166c4d8, pathtarget = 0x1673510, param_info = 0x0, parallel_aware = false, false = parallel_safe, parallel_safe = 0 Rows = 100000, startup_cost = 0, total_cost = 1726, pathkeys = 0x0}

Assignment of other parameters

(gdb) n2837 List * mergeclauses = path- > path_mergeclauses; (gdb) 2838 List * innersortkeys = path- > innersortkeys;...

Calculate the cost of mergequals and qpquals respectively

(gdb) 2886 cost_qual_eval (& merge_qual_cost, mergeclauses, root); (gdb) 2887 cost_qual_eval (& qp_qual_cost, path- > jpath.joinrestrictinfo, root); (gdb) 2888 qp_qual_cost.startup-= merge_qual_cost.startup; (gdb) p merge_qual_cost$28 = {startup = 0, per_tuple = 0.0025000000000000001} (gdb) p qp_qual_cost$29 = {startup = 0, per_tuple = 0.0025000000000000001}

Obtain the approximate number of tuples through the mergequals condition

(gdb) 2910 mergejointuples = approx_tuple_count (root, & path- > jpath, mergeclauses); (gdb) 2938 if (IsA (outer_path, UniquePath) | | path- > skip_mark_restore) (gdb) p mergejointuples$30 = 100000

Calculate the repeat scan rate caused by the same tuple

(gdb) n2942 rescannedtuples = mergejointuples-inner_path_rows; (gdb) 2944 if (rescannedtuples)

< 0)(gdb) 2948 rescanratio = 1.0 + (rescannedtuples / inner_path_rows); (gdb) p rescanratio$31 = 1(gdb) p rescannedtuples$32 = 0 物化处理,结果是inner relation扫描无需物化:path->

Materialize_inner = false

(gdb) 2974 mat_inner_cost = inner_run_cost + (gdb) n2980 if (path- > skip_mark_restore) (gdb) 2987 else if (enable_material & & mat_inner_cost)

< bare_inner_cost)(gdb) 3006 else if (innersortkeys == NIL &&(gdb) 3021 else if (enable_material && innersortkeys != NIL &&(gdb) 3023 inner_path->

Pathtarget- > width) > (gdb) p mat_inner_cost$34 = 497.722500000003 .3021 else if (enable_material & & innersortkeys! = NIL & & (gdb) 3027 path- > materialize_inner = false

Calculate the cost

(gdb) n3043 startup_cost + = merge_qual_cost.per_tuple * (gdb) 3044 (outer_skip_rows + inner_skip_rows * rescanratio); (gdb) 3043 startup_cost + = merge_qual_cost.per_tuple * (gdb) 3045 run_cost + = merge_qual_cost.per_tuple * (gdb) 3046 ((outer_rows-outer_skip_rows) + (gdb) 3047 (inner_rows-inner_skip_rows) * rescanratio) (gdb) 3046 ((outer_rows-outer_skip_rows) + (gdb) 3045 run_cost + = merge_qual_cost.per_tuple * (gdb) 3058 startup_cost + = qp_qual_cost.startup; (gdb) 3059 cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple; (gdb) 3060 run_cost + = cpu_per_tuple * mergejointuples; (gdb) 3063 startup_cost + = path- > jpath.path.pathtarget- > cost.startup (gdb) 3064 run_cost + = path- > jpath.path.pathtarget- > cost.per_tuple * path- > jpath.path.rows; (gdb) 3066 path- > jpath.path.startup_cost = startup_cost; (gdb) 3067 path- > jpath.path.total_cost = startup_cost + run_cost; (gdb) 3068}

Complete the call

(gdb) create_mergejoin_path (root=0x166c880, joinrel=0x16864d0, jointype=JOIN_INNER, workspace=0x7ffea604f360, extra=0x7ffea604f500, outer_path=0x167f190, inner_path=0x167f9d0, restrict_clauses=0x16869a8, pathkeys=0x1686b68, required_outer=0x0, mergeclauses=0x1686bc8, outersortkeys=0x1686b68, innersortkeys=0x1686c28) at pathnode.c:22982298 return pathnode

MergePath Node Information

2298 return pathnode (gdb) p * pathnode$36 = {jpath = {path = {type = T_MergePath, pathtype = T_MergeJoin, parent = 0x16864d0, pathtarget = 0x1686708, param_info = 0x0, parallel_aware = false, parallel_safe = true, parallel_workers = 0, rows = 100000, startup_cost = 10863.760856195882, total_cost = 12409.200856195883, pathkeys = 0x1686b68}, jointype = JOIN_INNER, inner_unique = false, outerjoinpath = 0x167f190, innerjoinpath = 0x167f9d0, joinrestrictinfo = 0x16869a8}, path_mergeclauses = 0x1686bc8, 0x1686bc8 = outersortkeys, outersortkeys = outersortkeys Skip_mark_restore = false, materialize_inner = false} so far I believe that you have a deeper understanding of "what are the Cost functions used to calculate merge join in PostgreSQL?" you might as well do it in practice. Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.