
Example Analysis of PostgreSQL Optimizer

2025-01-16 Update From: SLTechnology News&Howtos


This article walks through the PostgreSQL optimizer with examples. It is fairly detailed and should serve as a useful reference; if the topic interests you, read it to the end!

Brief introduction

PostgreSQL was developed in the 1980s, originally as the POSTGRES project created by Michael Stonebraker and others with the support of the U.S. Department of Defense. At the end of the last century, Andrew Yu and others built the first SQL parser on top of it; that version is called Postgres95, and it is the cornerstone of the UC Berkeley lineage of PostgreSQL.

The optimizer code we see today is mainly the work of Tom Lane over the past 20 years. Remarkably, the changes over those 20 years have been consistent in style, and Tom Lane himself is worthy of the title of "top ten outstanding contributors to open source software".

But from today's perspective, the PostgreSQL optimizer is not a great implementation. It is written in C, so its extensibility is poor; it does not follow the Volcano optimizer model [2], so its flexibility is poor; and many of its optimizations have very high complexity (for example, Join rearrangement is a System R [3] style dynamic programming algorithm), so its performance is not great either.

Even so, PostgreSQL is an excellent implementation and a source of innovation for optimizers (consider the series of projects such as Greenplum and ORCA), and some of its optimization techniques and code structures are still worth studying today, including:

Parameterized paths, used for index lookup joins

Partition pruning and parallel optimization

Strongly consistent cardinality estimation guarantees

This article attempts a quick tour of the PostgreSQL optimizer code and compares it with modern optimizers. Most of the optimizer code discussed lives at https://github.com/postgres/postgres/tree/master/src/backend/optimizer. By "modern optimizer" we mainly mean Apache Calcite and ORCA.

Terminology interpretation

__Datum__: the generic container type used to pass around a single value

__Qual__: a qualification, i.e. a filter or join condition (a boolean expression)

__Path__: a candidate physical plan (described in detail below)

Key data structures: Query

__Query__: the Parse Tree, the input to the optimizer

__RangeTblEntry__: a node of the Parse Tree that describes the view of a dataset, which may come from a subquery, a Join, Values, or any simple relational algebra expression. The Join implementation expresses all of its inputs as RangeTblEntry (hereinafter RTE).

Key data structures: execution plan

__PlannedStmt__: the top node of the execution plan

__PlannerInfo__: context information for the optimizer. It is a tree structure that points to its parent node via the parent_root variable. A Query contains one or more PlannerInfo, with each subquery split out into its own tree node. It holds a pointer to the RelOptInfo.

__RelOptInfo__: the core data structure of the optimizer, containing, among other things, the set of Paths for a subquery. This concept corresponds to Group in ORCA and Set in Calcite.

__Path__: unlike the Parser, which calls its relational algebra nodes Relational Expression Nodes, the optimizer's unit of relational algebra is the Path. A Path is a physical plan that contains the optimizer's understanding of a single relational algebra operation, including parallelism, PathKeys, and cost.

__PathKey__: a sort property. This concept is equivalent to a Physical Property in Volcano or a Trait in Calcite. Because PostgreSQL is a single-node database, sort properties alone suffice to express the requirements and delivered characteristics of all algorithms; a distributed database usually also needs distribution properties.
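As a rough illustration (a Python sketch, not PostgreSQL's C code; the function name pathkeys_satisfied is made up), an ordering requirement is met when the required sort keys form a prefix of the keys a path already delivers, which is essentially what PostgreSQL's pathkeys_contained_in checks for canonical pathkeys:

```python
def pathkeys_satisfied(required, delivered):
    """Return True if the `delivered` sort order satisfies `required`.

    With canonical pathkeys, containment reduces to a prefix check:
    a path sorted by (a, b, c) also satisfies a requirement of (a, b).
    """
    if len(required) > len(delivered):
        return False
    return list(delivered[:len(required)]) == list(required)

# A path sorted on (a, b) satisfies ORDER BY a, but not ORDER BY b.
print(pathkeys_satisfied(["a"], ["a", "b"]))   # True
print(pathkeys_satisfied(["b"], ["a", "b"]))   # False
```

This prefix discipline is why an index scan whose sort order matches the query's ORDER BY can skip an explicit Sort node.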

Main process

Subquery pull-up

Because the unit of optimization (RelOptInfo) is a subquery, merging subqueries simplifies the optimization process. The related procedures include:

__pull_up_sublinks__: converts convertible ANY/EXISTS clauses into (anti-) semi-joins. Some optimizers call this process de-correlation.

__pull_up_subqueries__: pulls subqueries that can be pulled up into the current query and deletes the original subquery. If the subquery is a Join, this operation is equivalent to flattening a binary join into a multi-join.

EquivalenceClass parsing

Equivalence Class (EC) is a term about quals; it denotes equivalence between expressions. For example, given the expression

a = b AND b = c

{a, b, c} is an EC. As a special case, for the expression

a = b AND b = 5

PostgreSQL only generates the simplified quals a = 5 and b = 5 (the constant joins the EC, and each member is equated with it directly).

EC is a key data structure, and its application scenarios include:

In a Join, ECs determine the Join keys, which in turn drive Outer Join simplification and PathKey settings

Deciding where quals are evaluated (pushed down) within a Join

Determining the parameter list of a parameterized path

Matching primary/foreign key constraints to improve (Join) cardinality estimation

EC is a tree structure: each node is an EC linked to the parent it was merged into. For the example a = b AND b = c, the final EC tree is

{a, b, c}
 |- {a, b}
 |- {b, c}

The expressions inside each EC are called EquivalenceMembers (EM).

The entry point for generating ECs is generate_base_implied_equalities, which is called from query_planner. That is, ECs are generated just before Join planning; parsing them at this stage is cheapest, but it also means ECs can only be applied to Join optimization.
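To make the merging behavior concrete, here is a minimal union-find sketch in Python. This is a toy model, not the C implementation: PostgreSQL stores ECs and EquivalenceMembers in much richer structures, but the merge-into-parent mechanics look similar.

```python
class EquivalenceClasses:
    """Toy union-find over expressions, mimicking how equality quals
    like `a = b AND b = c` merge into one equivalence class."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the representative (root EC) of x."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_equality(self, a, b):
        """Process a qual `a = b`: merge the two ECs into one."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra  # link the child EC to its merged parent

    def members(self, x):
        """All EquivalenceMembers in the same EC as x."""
        root = self.find(x)
        return {m for m in self.parent if self.find(m) == root}

ecs = EquivalenceClasses()
ecs.add_equality("a", "b")   # qual: a = b
ecs.add_equality("b", "c")   # qual: b = c
print(sorted(ecs.members("a")))   # ['a', 'b', 'c']
```

A constant behaves like any other member here: adding `b = 5` would pull "5" into the same class, after which each member can be equated with the constant directly, mirroring the simplification described above.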

Join rearrangement

(For background on Join rearrangement, see my earlier article, SQL Optimizer Principles: Join Rearrangement.)

make_rel_from_joinlist is the entry point for Join rearrangement. The current version of PostgreSQL has three algorithms:

A pluggable slot for a custom Join rearrangement algorithm

GEQO: Genetic Query Optimization (a genetic algorithm [4]), an implementation of a non-exhaustive optimization algorithm

Standard: a slightly pruned dynamic programming algorithm.

GEQO is enabled by default for complex Joins of 12 or more relations. The behavior is controlled by parameters in postgresql.conf:

geqo = on
geqo_threshold = 12

Now let's examine the Standard algorithm. Its main entry is join_search_one_level. At each level, building on the local plans already generated, it:

Uses ECs to find not-yet-joined Join inputs and adds them to a generated local plan. This step produces only left-deep trees.

Finds two EC-related Join inputs that have never joined a local plan and joins them, generating additional local plans (bushy trees).

Generates a Cartesian product if no EC association is found at the current level.

The description above is already complex enough, so let's summarize the Standard algorithm:

It is still an exhaustive dynamic programming algorithm.

It deduplicates mirror images (A-B vs. B-A) and skips Cartesian products when EC associations exist. These engineering concessions effectively reduce the search complexity.
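A compressed Python sketch of this style of search: a System R-like dynamic program over relation subsets that, like the Standard algorithm, skips Cartesian products whenever an EC-connected pair is available. The cost model (fixed 0.1 selectivity, cost = sum of intermediate result sizes) is invented purely for illustration.

```python
from itertools import combinations

def dp_join_search(rels, card, edges):
    """Simplified System R style dynamic programming over join orders.

    rels  : list of base relation names
    card  : dict rel -> row count
    edges : set of frozenset({r1, r2}) pairs sharing a join qual (EC)

    best[S] holds (cost, plan) for each joinable subset S. Subsets whose
    two sides share no join qual are skipped (no Cartesian products).
    """
    best = {frozenset([r]): (0.0, r) for r in rels}
    rows = {frozenset([r]): float(card[r]) for r in rels}

    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            for left_size in range(1, size):
                for left in map(frozenset, combinations(subset, left_size)):
                    right = subset - left
                    if left not in best or right not in best:
                        continue
                    # require a join qual between the sides (EC check)
                    if not any(frozenset({a, b}) in edges
                               for a in left for b in right):
                        continue
                    out = rows[left] * rows[right] * 0.1  # toy selectivity
                    cost = best[left][0] + best[right][0] + out
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, (best[left][1], best[right][1]))
                        rows[subset] = out
    return best[frozenset(rels)]

cost, plan = dp_join_search(
    ["A", "B", "C"],
    {"A": 1000, "B": 10, "C": 100},
    {frozenset({"A", "B"}), frozenset({"B", "C"})},
)
print(round(cost))   # the cheapest order joins B and C first
```

Note how the subset enumeration is exponential in the number of relations, which is exactly why GEQO takes over past geqo_threshold.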

Path generation and dynamic planning

As mentioned above, the optimization process centers on rebuilding a subquery (RelOptInfo), which can be understood as logical optimization: typically the more complex optimization across relational algebra operators. In fact, PostgreSQL performs physical optimization at the same time.

Physical optimization means adding Paths to a RelOptInfo. Taking Join as an example, the entry point is populate_joinrel_with_paths. For each JoinRel (a Join RelOptInfo), it considers:

sort_inner_and_outer: MergeJoin paths with both sides sorted

match_unsorted_outer: paths where the outer side is not sorted, including MergeJoin and NestedLoopJoin

hash_inner_and_outer: HashJoin paths with both sides hashed

An interesting case is the HashJoin path (hash_inner_and_outer), which, as the name implies, requires hashes to be computed on both sides of the Join. While generating Paths, parameter information for both sides must also be computed. For example, in A JOIN B ON A.x = B.y, the parameter for A is x and for B it is y. If A is chosen as the probe side and there is an index on B.y, each probed value of x is passed as a parameter into the index on y. Parameter information is generated by calling get_joinrel_parampathinfo.

The entry point for path generation is add_path. Each time a path is generated, the RelOptInfo's best path and minimum cost are updated, so that subsequent dynamic programming selects the global optimum.
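The idea behind add_path can be sketched as dominance pruning: a new path survives only if no existing path is both at least as cheap and at least as usefully ordered, and it evicts any paths it dominates. This Python model is a deliberate simplification (the real add_path also compares parameterization, parallel safety, and startup vs. total cost):

```python
def add_path(pathlist, new_path):
    """Toy version of the optimizer's add_path dominance pruning.

    A path is a (cost, pathkeys) tuple; pathkeys is a tuple of sort
    keys. Ordering a "covers" ordering b when b is a prefix of a.
    """
    def covers(a, b):
        return a[:len(b)] == b

    # q dominates p: q is no more expensive AND q's ordering covers p's
    def dominated(p, q):
        return q[0] <= p[0] and covers(q[1], p[1])

    if any(dominated(new_path, old) for old in pathlist):
        return pathlist                      # strictly worse: reject
    kept = [old for old in pathlist if not dominated(old, new_path)]
    kept.append(new_path)
    return kept

paths = []
paths = add_path(paths, (100.0, ()))          # cheap unsorted path: kept
paths = add_path(paths, (150.0, ("a",)))      # pricier but sorted: kept
paths = add_path(paths, (200.0, ("a",)))      # dominated: rejected
print(paths)   # [(100.0, ()), (150.0, ('a',))]
```

Keeping a pricier-but-sorted path alive is what later lets the planner avoid an explicit Sort when a parent node needs that ordering.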

Flow chart

Planner
|- subquery_planner: iterative subquery optimization
   |- pull_up_sublinks: de-correlation
   |- pull_up_subqueries: subquery pull-up
   |- preprocess_expression: Query/PlannerInfo expression parsing, constant folding
   |- remove_useless_groupby_columns
   |- reduce_outer_joins: Outer Join degradation
   |- grouping_planner
      |- plan_set_operations: SetOp optimization
      |- query_planner: subquery optimization main entry
         |- generate_base_implied_equalities: generate/merge EC
         |- make_one_rel: Join optimization entry
            |- set_base_rel_pathlists: build the base RelOptInfo list
            |- make_rel_from_joinlist: Join rearrangement and planning
               |- standard_join_search: standard Join rearrangement algorithm
                  |- join_search_one_level
                     |- make_join_rel: generate JoinRel and corresponding Paths
      |- create_XXX_paths: grouping, window, and other expression optimization

Scalability and flexibility

First of all, the PostgreSQL optimizer code is very complex, which greatly limits its extensibility and flexibility. If you look at the change log of this part of the code, you will find only a handful of authors.

Part of the extensibility limitation comes from the programming language: C itself is not easy to extend, which means adding a new Node or Path is rarely straightforward. You need to define a whole series of data structures, cardinality estimation logic, parallelism logic, and Path interpretation logic, and there is no interface-like abstraction telling you what must be implemented. That said, the PostgreSQL code is neatly written, and there are plenty of articles that walk you through it (such as Introduction to Hacking PostgreSQL and The Internals of PostgreSQL).

Another part of the extensibility limitation comes from the structure of the optimizer itself. Modern optimizers are basically Volcano Model [2] (such as SQL Server and Oracle, by their own account), while PostgreSQL is not implemented as a general-purpose, pluggable Volcano Model. The effects include:

Logical and physical optimization cannot interoperate. For example, as mentioned earlier, an EC generated for a Join must be combined with the RTE it belongs to in order to produce an IndexedLookupJoin; other optimizers can instead push the EC (which in a sense is already physical) down to the appropriate logical plan node to guide physical transformations there.

It is not easy to customize optimization rules.

The surface a developer must master is too large: to implement one optimization rule, you must also learn the data structures of other optimization rules, the dynamic programming updates, RelOptInfo creation and cleanup, and even the memory allocation itself.

PostgreSQL does still provide some hand-written plugin points, including:

Customizable Join rearrangement algorithm

Customizable PathKey Generation algorithm

Customized Join Path generation algorithm

And so on.

Performance

Although no experiments are presented here, the optimization speed of PostgreSQL can be expected to be good; to a large extent this is what its flexibility was traded for.

First, unlike a Volcano Optimizer, the PostgreSQL optimizer does not need to keep generating intermediate nodes, and its number of RelOptInfo is relatively stable (approximately equal to the number of Joins). Its search for the optimal plan takes place within RelOptInfo, and as long as Join rearrangement does not produce a huge number of RelOptInfo, the search width stays very low.

Second, RelOptInfo glosses over a large number of details of cross-relational-expression optimization, so its search width is further reduced compared to Calcite, which organizes sets of equivalent paths per Relational Expression. Measured by the number of equivalence sets, PostgreSQL's search width is about an order of magnitude lower than Calcite's; of course, what is traded away is additional optimization opportunities, as discussed above.

Finally, PostgreSQL mixes a lot of rewriting logic into the optimization phase, which makes the code harder to read but speeds optimization up. During optimization, PostgreSQL continuously performs constant folding, PathKey de-duplication, Union flattening, and subquery flattening, and none of these operations go through a memo structure.

Compared with Calcite/ORCA, PostgreSQL optimizes faster and is better suited to transactional scenarios. However, I cannot tell whether Calcite/ORCA could also support transactional scenarios given an appropriate mix of pruning and optimization rules.

That is all of the content of "Example Analysis of PostgreSQL Optimizer". Thank you for reading, and I hope the article was helpful!
