Implementation Logic of query_planner Master Planning function make_one_rel in PostgreSQL 04/21 Update SLTechnology News&Howtos

Implementation Logic of query_planner Master Planning function make_one_rel in PostgreSQL

2025-04-21 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article mainly introduces the implementation logic of query_planner master planning function make_one_rel in PostgreSQL, which is very detailed and has certain reference value. Interested friends must read it!

1. Paths and Join Pairs

Paths and Join Pairs

Access path and connection pair

During the planning/optimizing process, we build "Path" trees representing

The different ways of doing a query. We select the cheapest Path that

Generates the desired relation and turn it into a Plan to pass to the

Executor. (There is pretty nearly an one-to-one correspondence between the

Path and Plan trees, but Path nodes omit info that won't be needed during

Planning, and include info needed for planning that won't be needed by the

Executor.)

During the planning / optimization process, different methods of executing queries are represented by building a "Path" tree, in which the lowest cost path (Path structure) for generating the required Relation is selected and transformed into a Plan structure and passed to the executor. (there is almost an one-to-one correspondence between the Path and the Plan tree. However, the Path node omits information that is not needed during the planning period but contains information that is not needed during execution but is needed during the planning period)

The optimizer builds a RelOptInfo structure for each base relation used in

The query. Base rels are either primitive tables, or subquery subselects

That are planned via a separate recursive invocation of the planner. A

RelOptInfo is also built for each join relation that is considered during

Planning. A join rel is simply a combination of base rels. There is only

One join RelOptInfo for any given set of baserels-for example, the join

{A B C} is represented by the same RelOptInfo no matter whether we build it

By joining An and B first and then adding C, or joining B and C first and

Then adding A, etc. These different means of building the joinrel are represented as Paths. For each RelOptInfo we build a list of Paths that

Represent plausible ways to implement the scan or join of that relation.

Once we've considered all the plausible Paths for a rel, we select the one

That is cheapest according to the planner's cost estimates. The final plan

Is derived from the cheapest Path for the RelOptInfo that includes all the

Base rels of the query.

The optimizer creates a RelOptInfo structure for each base relation used in the query. The basic relationship (base relation) includes the original table (primitive table), subqueries (generating plans through recursive calls to independent planners), and Relation. The relation generated by join (join rel) is a combination of basic relations (which can be understood as a new relationship generated by join operation). For a given set of underlying relationships, only one connected RelOptInfo structure is generated, and it doesn't matter how the RelOptInfo is generated (for example, the underlying set {A B C}, An and B can be connected first and then C, or B and C can be connected first and then with A). These methods of building join rel are represented by Paths. For each RelOptInfo, the optimizer builds a Paths linked list to represent these "plausible" methods for scanning or joining. For all these possible paths, the optimizer chooses the lowest-cost access path based on cost estimates. The final execution plan is generated from the access path with the lowest cost of RelOptInfo (the resulting new relationship).

Possible Paths for a primitive table relation include plain old sequential

Scan, plus index scans for any indexes that exist on the table, plus bitmap

Index scans using one or more indexes. Specialized RTE types, such as

Function RTEs, may have only one possible Path.

Possible paths to access the original table include ordinary sequential scan / index-based index scan / bitmap index scan. For some special RTE types, such as the function RTEs, there may be only one possible path.

Joins always occur using two RelOptInfos. One is outer, the other inner.

Outers drive lookups of values in the inner. In a nested loop, lookups of

Values in the inner occur by scanning the inner path once per outer tuple

To find each matching inner row. In a mergejoin, inner and outer rows are

Ordered, and are accessed in order, so only one scan is required to perform

The entire join: both inner and outer paths are scanned in-sync. (There's

Not a lot of difference between inner and outer in a mergejoin...) In a

Hashjoin, the inner is scanned first and all its rows are entered in a

Hashtable, then the outer is scanned and for each row we lookup the join

Key in the hashtable.

Joins usually occur between two RelOptInfo, commonly known as external tables and internal tables, in which external tables are also called driver tables. In the Nested Loop join mode, for each tuple outside, the internal table is accessed to scan the data rows that meet the conditions. Merge Join join mode, the tuples of the external table and the internal table are sorted, and each tuple of the external table and the internal table is accessed sequentially, which only needs to be scanned synchronously once. Hash Join connection mode scans the internal table and establishes the HashTable first, and then scans the external table. Scan the hash table for each row of the external table to find matching rows.

A Path for a join relation is actually a tree structure, with the topmost

Path node representing the last-applied join method. It has left and right

Subpaths that represent the scan or join methods used for the two input

Relations.

The access path of join rel is actually a tree structure, and the path node at the top level represents the best connection method. The tree has left and right subpaths (subpath) to indicate the scanning or connecting method of two relations.

II. Join Tree Construction

Join Tree Construction

Construction of connection tree

The optimizer generates optimal query plans by doing a more-or-less

Exhaustive search through the ways of executing the query. The best Path

Tree is found by a recursive process:

The optimizer generates the optimal query execution plan through exhaustive search as far as possible, and the optimal access path tree is realized by the following recursive process:

1). Take each base relation in the query, and make a RelOptInfo structure

For it. Find each potentially useful way of accessing the relation

Including sequential and index scans, and make Paths representing those

Ways. All the Paths made for a given relation are placed in its

RelOptInfo.pathlist. (Actually, we discard Paths that are obviously

Inferior alternatives before they ever get into the pathlist-what

Ends up in the pathlist is the cheapest way of generating each potentially

Useful sort ordering and parameterization of the relation.) Also create a

RelOptInfo.joininfo list including all the join clauses that involve this

Relation. For example, the WHERE clause "tab1.col1 = tab2.col1" generates

Entries in both tab1 and tab2's joininfo lists.

1)。 Generate a RelOptInfo structure for each underlying relationship in the query. Generate sequential scan or index scan access paths for each underlying relationship. These generated access paths are stored in the RelOptInfo.pathlist linked list (in fact, the optimizer has abandoned obviously unreasonable access paths in the process, and the path in pathlist is the most likely path to generate sorting paths and parameterized Relation). During this time, a RelOptInfo.joininfo linked list is generated to hold the join statement (join clauses) of the index associated with this Relation. For example, the WHERE statement "tab1.col1 = tab2.col1" produces the corresponding data structure (entries) in the joininfo linked list of tab1 and tab2.

If we have only a single base relation in the query, we are done.

Otherwise we have to figure out how to join the base relations into a

Single join relation.

If there is only one basic relationship in the query, the optimizer has done all the work, otherwise, the optimizer needs to figure out how to join the basic relationship in order to get a new relationship (through a join join).

2). Normally, any explicit JOIN clauses are "flattened" so that we just

Have a list of relations to join. However, FULL OUTER JOIN clauses are

Never flattened, and other kinds of JOIN might not be either, if the

Flattening process is stopped by join_collapse_limit or from_collapse_limit

Restrictions. Therefore, we end up with a planning problem that contains

Lists of relations to be joined in any order, where any individual item

Might be a sub-list that has to be joined together before we can consider

Joining it to its siblings. We process these sub-problems recursively

Bottom up. Note that the join list structure constrains the possible join

Orders, but it doesn't constrain the join implementation method at each

Join (nestloop, merge, hash), nor does it say which rel is considered outer

Or inner at each join. We consider all these possibilities in building

Paths. We generate a Path for each feasible join method, and select the

Cheapest Path.

2) generally speaking, explicit JOIN statements have been flattened, and the optimizer can join directly according to the relational list. However, full external connections (FULL OUTER JOIN) and some types of JOIN cannot be flattened (for example, due to constraints such as join_collapse_limit or from_collapse_limit). There is a problem here: for relational linked lists that try to join in any order, the sublinked lists in the linked list must be joined before pairwise joins. The optimizer deals with these sub-problems recursively in a bottom-up manner. Note: the join list restricts the join order, but does not limit how the join is implemented or that the relationship is inner or outer. These problems are solved when generating the access path. The optimizer generates an access path for each possible connection and selects the one with the lowest cost.

For each planning problem, therefore, we will have a list of relations

That are either base rels or joinrels constructed per sub-join-lists.

We can join these rels together in any order the planner sees fit.

The standard (non-GEQO) planner does this as follows:

For each problem that occurs during the planning process, the optimizer combines each of the subjoin list (sub-join-list)

The underlying relationship or connection relationship is stored in a linked list. The optimizer connects these relationships in the order in which they seem appropriate. The standard (non-genetic algorithm, that is, dynamic programming algorithm) planner does the following:

Consider joining each RelOptInfo to each other RelOptInfo for which there

Is a usable joinclause, and generate a Path for each possible join method

For each such pair. (If we have a RelOptInfo with no join clauses, we have

No choice but to generate a clauseless Cartesian-product join; so we

Consider joining that rel to each other available rel. But in the presence

Of join clauses we will only consider joins that use available join

Clauses. Note that join-order restrictions induced by outer joins and

IN/EXISTS clauses are also checked, to ensure that we find a workable join

Order in cases where those restrictions force a clauseless join to be done.)

For each available connection condition, consider using pairwise connections to generate access paths for each pair of RelOptInfo connections. If there is no connection condition, Cartesian connections will be used, so priority will be given to connecting to other available Relation.

If we only had two relations in the list, we are done: we just pick

The cheapest path for the join RelOptInfo. If we had more than two, we now

Need to consider ways of joining join RelOptInfos to each other to make

Join RelOptInfos that represent more than two list items.

If there are only two Relations in the linked list, the optimizer has done all the work, and you can choose the best access path for the RelOptInfo participating in the connection. Otherwise (> 3 Relations), the optimizer needs to consider how to connect in pairs.

The join tree is constructed using a "dynamic programming" algorithm:

In the first pass (already described) we consider ways to create join rels

Representing exactly two list items. The second pass considers ways

To make join rels that represent exactly three list items; the next pass

Four items, etc. The last pass considers how to make the final join

Relation that includes all list items-obviously there can be only one

Join rel at this top level, whereas there can be more than one join rel

At lower levels. At each level we use joins that follow available join

Clauses, if possible, just as described for the first level.

The connection tree is constructed by the dynamic programming algorithm:

In the first round (previously described), the optimizer has completed the connection of two Relations; in the second round, the optimizer considers how to create three Relations join rels; for the next round, four Relations, and so on. In the last round, consider how to construct a join rel that contains all the Relations. Obviously, at the top level, there is only one join rel, while at the lower level there may be multiple join rel. At each level, as mentioned earlier, the optimizer connects according to the available connection conditions, if possible.

For example:

SELECT *

FROM tab1, tab2, tab3, tab4

WHERE tab1.col = tab2.col AND

Tab2.col = tab3.col AND

Tab3.col = tab4.col

Tables 1, 2, 3, and 4 are joined as:

{1 2}, {2 3}, {3 4}

{1 2 3}, {2 3 4}

{1 2 3 4}

(other possibilities will be excluded for lack of join clauses)

SELECT *

FROM tab1, tab2, tab3, tab4

WHERE tab1.col = tab2.col AND

Tab1.col = tab3.col AND

Tab1.col = tab4.col

Tables 1, 2, 3, and 4 are joined as:

{1 2}, {1 3}, {1 4}

{1 2 3}, {1 3 4}, {1 2 4}

{1 2 3 4}

We consider left-handed plans (the outer rel of an upper join is a joinrel

But the inner is always a single list item); right-handed plans (outer rel

Is always a single item); and bushy plans (both inner and outer can be

Joins themselves). For example, when building {1 2 3 4} we consider

Joining {1 2 3} to {4} (left-handed), {4} to {1 2 3} (right-handed), and

{12} to {34} (bushy), among other choices. Although the jointree

Scanning code produces these potential join combinations one at a time

All the ways to produce the same set of joined base rels will share the

Same RelOptInfo, so the paths produced from different join combinations

That produce equivalent joinrels will compete in add_path ().

Let's take a look at the left-handed plan, the bushy plan and the right-handed plan. For example, when building a connection of {1 2 3 4} 4 relationships, among the many choices, there are the following ways: left-handed: {1 2 3} connection {4}, right-handed: {4} connection {12 3} and bushy: {12} connection {3 4}. Although these potential connection combinations are generated at once when scanning the connection tree, all methods that produce the same connection base rels set share the same RelOptInfo data structure, so these different connection combinations compete with each other when calling the add_path method when generating equivalent join rel.

The dynamic-programming approach has an important property that's not

Immediately obvious: we will finish constructing all paths for a given

Relation before we construct any paths for relations containing that rel.

This means that we can reliably identify the "cheapest path" for each rel

Before higher-level relations need to know that. Also, we can safely

Discard a path when we find that another path for the same rel is better

Without worrying that maybe there is already a reference to that path in

Some higher-level join path. Without this, memory management for paths

Would be much more complicated.

Dynamic programming has an important feature (bottom-up): before constructing the access path of the high-level RelOptInfo, the access path of the lower-level RelOptInfo is clear, and the optimizer ensures that the access path is the lowest cost.

Once we have built the final join rel, we use either the cheapest path

For it or the cheapest path with the desired ordering (if that's cheaper

Than applying a sort to the cheapest other path).

Once you have completed the construction of the final result join rel, there are two paths: the lowest cost or the lowest requirement for sorting

If the query contains one-sided outer joins (LEFT or RIGHT joins), or

IN or EXISTS WHERE clauses that were converted to semijoins or antijoins

Then some of the possible join orders may be illegal. These are excluded

By having join_is_legal consult a side list of such "special" joins to see

Whether a proposed join is illegal. (The same consultation allows it to

See which join style should be applied for a valid join, ie, JOIN_INNER

JOIN_LEFT, etc.)

If there are outer joins or IN or EXISTS statements converted to semi-joins or disjoins in query statements, then some possible join orders are illegal. The optimizer handles it in additional ways.

These are all the contents of the article "implementation Logic of query_planner Master Planning function make_one_rel in PostgreSQL". Thank you for reading! Hope to share the content to help you, more related knowledge, welcome to follow the industry information channel!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.