What is the query processing process of PostgreSQL? 07/12 Update SLTechnology News&Howtos

What is the query processing process of PostgreSQL?

2025-07-12 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article mainly explains "what is the query processing process of PostgreSQL". The content of the explanation is simple and clear, and it is easy to learn and understand. Please follow the editor's train of thought to study and learn "what is the query processing process of PostgreSQL".

I. introduction

Database query processing (Query Processing) is not only the core technology of the database, but also the nearest subsystem to the user. In addition to the isolation of transactions, the database system also needs to achieve a certain degree of compatibility on SQL, because the database itself is doing query processing, and many kernel modules work to support this function.

Relational Algebra and SQL (structured query language)

What you learn in school is probably more relational algebra (Relational Algebra), which defines a set of operators that operate on Relation. The Operand of relational algebra is a relation (that is, a two-dimensional table in a database), and the result is also a relation. Operators include the following categories:

Set operators: intersection, union, difference

Filter / projection

Connect

Alias (alias)

Some extended operators, such as grouping, deduplication, Aggregate.

In addition to relational algebra, there is another way to describe two-dimensional relational tables: DataLog (Database Logic). This method is relatively powerful and can be used to express operators in relational algebra, but some relational operations can only be expressed in DataLog, such as recursive query.

Using relational algebra directly is more obscure and difficult for database operation. Therefore, today's commercial databases have implemented a more advanced query language-SQL (Structured Query Language), which is more concise and easier to understand and easier to learn.

In fact, within the database system, SQL statements are also converted into operators of the corresponding relational algebra and then processed, except that the work is invisible to the end user. In fact, the direct "local language" of relational database is relational algebra, and SQL language is only a "more convenient" bridge for human beings to communicate with relational database.

You may be wondering why SQL is used as a communication bridge instead of C, Java or Python as the query language for the database.

Because a shorter SQL can do the work of thousands of rows of C or Java, especially when accessing some hierarchical data models (for example, Oracle's hierarchical query, a statement can output the hierarchical structure; PostgreSQL's WITH-RECURSIVE statement can do a similar function).

More importantly, when implementing SQL queries, the database kernel can optimize SQL specifically to produce more efficient access methods, which are functions that high-level languages are unlikely to have.

3. PostgreSQL query processing flow

A SQL statement is sent from the user to the client, which is transmitted to PostgreSQL through the network for processing and execution. The process goes through the following steps:

1. Grammatical analysis

The SQL string can be thought of as a large regular, which is parsed to check whether the large "regular" is a match-defined rule.

In PostgreSQL, pg_parse_query is the entry function of parsing, and syntax checking is actually done by scan.l (Flex file) and gram.y (Bison file).

Scan.l is lexical analysis, which decomposes the input SQL into Token one by one and inputs it into gram.y for rule matching. The syntax rules for all SQL types and the precedence and association rules for operators are defined in gram.y. For example, the following code defines the precedence and association rules for operators:

The following code sets the syntax rules:

After parsing, take the query (SELECT) as an example, and the returned structure is SelectStmt, which is used as input to the semantic analysis module. SelectStmt saves the grammatical subparts of the SQL statement, such as from clause, projection column, group clause, and so on. You can see more details from its definition:

2. Grammar check

The parse_analyze () function is the entry function for this step, calling the transformXXXXStmt () function for analysis and processing according to different statement types. For SelectStmt, the transformSelectStmt () is called, and transformDeleteStmt () is called for DeleteStmt. In this step, it will:

Check whether the table exists and whether the column is legal, and convert the table, permutation, projection column, etc., to the internal object ID

Whether the semantics of SQL are correct and legal.

For example, the Aggregate function cannot be used in WHERE. The query is as follows:

Select 1 from x where max (x2) > 1

The adjustment aggregation function is calculated at the appropriate level, as shown in the following query:

Select (select max (x.x2) from y) from x

Max (x.x2) should be semantically evaluated in the outermost query of the SQL, rather than passing x.x2 into the inner subquery, where the value of the Aggregate function max () should be calculated. For the following query:

Select (select max (x.x2+y.x2) from y) from x

Max (x.x2+y.x2) is evaluated in the inner subquery, not as the Aggregate function of the outer query.

After semantic checking, the SelectStmt is transformed into a Query structure as input for query rewriting. The Query structure contains parts similar to SelectStmt, except that the content is richer:

What is saved is the object information inside the database.

Some flag tags indicating whether to include: Aggregate function, window function, Sublink subquery, etc.

The Query hierarchy in which the expression is located is determined.

As mentioned earlier, the database kernel transforms SQL into relational algebra-related elements, as you can see in the Query structure:

For example:

The projection of relational algebra is: targetList

The filter / join of relational algebra is: jointree

The Aggregate of relational algebra is: targetList

Grouping of Relational Algebra: groupClause

The sort of relational algebra is: sortClause.

All subsequent work is based on the above elements.

3. Query rewriting

Rewriting a query according to user-defined rules actually modifies or replaces members in the Query structure, which can be created using CREATE RULE. If the user has no rules on the corresponding table of the query, this step is skipped.

4. Query optimization

Query optimization is a complex subsystem, which is usually called "optimizer" and is also used to measure the excellence of the database system. Another complex subsystem in the database domain is transaction processing, which is not expanded here.

The input of PostgreSQL in this step is the Query object, the entry function is planner (), and the output query plan (Query Plan). The query plan is a structure that guides how the query is executed and how to execute it, usually a tree structure.

The main work of the optimizer is to select a better execution algorithm and output a better execution plan for each syntax part of the Query structure. In PostgreSQL, it is usually divided into the following steps:

1) subquery processing

There are two types of subqueries within PostgreSQL: one is called SubQuery after the from statement, and the other, as part of the expression, can appear in targetList, filter criteria, join conditions, called sub-link.

Both of these can be collectively referred to as Sub-Select, and the optimizer does Sub-Select Elimination at this step: pull the subquery up to the top-level query and eliminate the subquery.

This can reduce the number of query layers and increase the number of upper tables, thus increasing the search space of join order, which is helpful to find a better join order. Take sub-link as an example to illustrate the work of this step. For queries:

Select * from x where x.x2 in (select y.x2 from y)

In this step, PostgreSQL can transform the IN statement into Semi-Join, and the original O (m\ n) search algorithm is simplified to O (1) HASH-JOIN algorithm.

Hash Semi-Join is not used in the execution plan here, because inner plantree uses group hashagg to remove duplicates, so the original Semi-Join can be further optimized to Hash Join, which further expands the Join sequential search space.

2) perform expression preprocessing

In this step, the targetList, filter criteria and other columns will be modified to reference the base table; the SubLink recursive call optimizer in the expression will be optimized first; the constant expression in the expression will be evaluated.

3) remove useless GROUP BY columns

If the kernel can determine that some of the attribute set Y functions in GROUP BY depend on other attribute set X, you can delete the attribute set Y in GROUP BY. Functional dependency checking is done by check_functional_grouping. This can reduce the cost of grouping calculation.

4) Reduce outer join

Convert some OUTER JOIN to INNER JOIN.

5) choose the optimized Join order

In this step, the main tasks are as follows: pushing down the conditions, generating equivalence classes based on connection conditions, and selecting a better JOIN order through dynamic programming. On the whole, the choice of JOIN order is Condition-Driven, rather than completely permuting and combining all the tables. For example, for a query:

Select * from r, p, Q where R1 = (p1+q1) and r2=q2

In general, we might think that r and Q are connected under the condition of r2=q2, and then to p on R1 = (p1+q1); but the PostgreSQL kernel will also try to product join p and Q and then connect to r under the condition R1 = (p1+q1) and R2 to Q2; to connect, the connection between p and Q is entirely determined by R1 = (p1+q1).

6) other clause optimization processing

After finishing the Join Plan, we will deal with the GROUP BY, Aggregate, ORDER BY, LIMIT and other clauses. Take GROUP BY as an example, within PostgreSQL, there are two algorithms to realize GROUP BY: Sort Group By and HashAgg Group By. The costs of Sort Group By and HashAgg Group By are calculated by functions cost_group and cost_agg respectively, and the better algorithm is selected for execution.

After completing these steps, after calling the set_plan_references () and SS_finalize_plan () functions to finally process the parameters and variable references, you can output the final query plan (Execution Query Plan). The query plan consists of many nodes: projection, scanning, join, Aggregate, GROUP BY, sorting, and so on. These names also show that they are the operators of relational algebra, and they are passed to the query execution component for execution. The following example of a query plan:

5. Query execution

This is the last step in query processing, initializing and executing the execution plan output by the optimizer. Query execution subsystem is generally called executor. The execution process includes three entry functions: ExecutorStart, ExecutorRun and ExecutorFinish, which respectively complete the initialization, execution and cleaning of the query plan. In this process, other subsystems of the database are accessed, such as transaction system, storage system, and log system.

This is the entire life cycle of query processing in the PostgreSQL kernel. You can basically understand how a SQL string is parsed step by step in the database kernel until it is executed.

Some of the methods and theories described above are not only effective in PostgreSQL databases, but can also be deduced to other database systems.

4. SeqScan of PostgreSQL actuator algorithm

The above describes the basic flow of query processing in the database kernel, and now let's start with the executor algorithm.

The executor of the database contains many operator execution algorithms, and a relatively simple one is SeqScan, which scans the tables in order (usually in storage order).

1. Page structure

PostgreSQL page storage is similar to that of most databases, including: page headers, ItemId arrays, and Item (tuples), with the following layout:

Where PageHeader contains the page offset (pd_lower) of the last element of the page LSN,ItemId array, the first tuple offset within the page (pd_upper), and other fields.

2. Sequential scanning algorithm

The entry function of PostgreSQL's sequential scan is SeqNext. Each execution of this function returns a tuple, which is mainly done by heapgettup:

1) initialize the scanning process

Initializing the scanning process is to set the HeapScanDesc object, mainly to set the initial scanned page, usually starting from the first tuple of page 0, that is, scan- > rs_startblock is 0.

There is an optimization in the scanning process of PostgreSQL, namely sync_scan, this feature allows the current scan to start from the middle page of the table, which is filled by other scanning processes into the shared memory and completed by ss_report_location, indicating that these pages have just been visited. If the current scanning starts from these pages, it can be accessed directly in memory, thus reducing the number of IO pages read by storage and improving performance.

Each time you update the sync start page of a table, you need to traverse the entire list. To reduce visits to this list, update the list every SYNC_SCAN_REPORT_INTERVAL page, which is 128x1024 / BLCKSZ.

2) read the page for scanning

Scan the page according to its structure. First read the pd_linp member of the page header (PageHeaderData), which is an Offset array (ItemIdData) that records the lp_off of the tuple on the page.

The subsequent main logic is to traverse the pd_linp array and get the tuple memory address through the offset+page address. Then make a visibility judgment on the tuple. The logic is as follows:

HeapTupleSatisfiesVisibility determines the visibility of tuples. PostgreSQL is the transaction isolation implemented by MVCC. This function is the entry logic of MVCC.

3) read the next page and continue scanning

Continue to read subsequent pages for scanning.

All scan states are saved in HeapScanDesc, and the next scan can start from the last state.

Thank you for reading, the above is the content of "what is the query processing process of PostgreSQL". After the study of this article, I believe you have a deeper understanding of what the query processing process of PostgreSQL is, and the specific use needs to be verified in practice. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.