How to master SQL subquery optimization 07/19 Update SLTechnology News&Howtos

How to master SQL subquery optimization

2025-07-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/03 Report--

This article mainly introduces "how to master SQL sub-query optimization". In daily operation, I believe many people have doubts about how to master SQL sub-query optimization. The editor consulted all kinds of data and sorted out simple and easy-to-use operation methods. I hope it will be helpful for you to answer the doubt of "how to master SQL sub-query optimization"! Next, please follow the editor to study!

Introduction to subquery

A subquery is a syntax defined in the SQL standard, which can appear almost anywhere in SQL, including SELECT, FROM, WHERE clauses, and so on.

Generally speaking, subqueries can be divided into related subqueries (Correlated Subquery) and unrelated subqueries (Non-correlated Subquery). The latter non-associative subquery is a very simple problem. The simplest thing is to execute it first, get the result set and materialize it, and then execute the outer query. Here is an example:

SELECT c_count, count (*) AS custdist FROM (SELECT c_custkey, count (o_orderkey) AS c_count FROM CUSTOMER LEFT OUTER JOIN ORDERS ON c_custkey = o_custkey AND o_comment NOT LIKE'% pending%deposits%' GROUP BY c_custkey) c_orders GROUP BY c_count ORDER BY custdist DESC, c_count DESC

▲ TPCH-13 is an unrelated subquery

Non-related subqueries are outside the scope of this article unless otherwise stated that the subqueries we are talking about below refer to related subqueries.

What is special about the associated subquery is that it is incomplete in itself: its closure contains some parameters provided by the outer query. Obviously, you can only run the query if you know these parameters, so we can't do it like a non-associative subquery.

Classified according to the generated data, subqueries can be divided into the following categories:

Scalar-valued subquery: outputs a result table with only one row and one column, and this scalar value is its result. If the result is empty (0 lines), an NULL is output. Note, however, that more than 1 row of results are not allowed and will generate a run-time exception.

Scalar quantum queries can appear anywhere that contains scalars, such as SELECT, WHERE, and so on. Here is an example:

SELECT c_custkey FROM CUSTOMER WHERE 1000000

< ( SELECT SUM(o_totalprice) FROM ORDERS WHERE o_custkey = c_custkey ) ▲ Query 1: 一个出现在 WHERE 子句中的标量子查询，关联参数用红色字体标明了 SELECT o_orderkey, ( SELECT c_name FROM CUSTOMER WHERE c_custkey = o_custkey ) AS c_name FROM ORDERS ▲ Query 2: 一个出现在 SELECT 子句中的标量子查询存在性检测（Existential Test）子查询：特指 EXISTS 的子查询，返回一个布尔值。如果出现在 WHERE 中，这就是我们熟悉的 Semi-Join。当然，它可能出现在任何可以放布尔值的地方。 SELECT c_custkey FROM CUSTOMER WHERE c_nationkey = 86 AND EXISTS( SELECT * FROM ORDERS WHERE o_custkey = c_custkey ) ▲ Query 3: 一个 Semi-Join 的例子集合比较（Quantified Comparision）子查询：特指 IN、SOME、ANY 的查询，返回一个布尔值，常用的形式有：x = SOME(Q) （等价于 x IN Q）或 X ALL(Q)（等价于 x NOT IN Q）。同上，它可能出现在任何可以放布尔值的地方。 SELECT c_name FROM CUSTOMER WHERE c_nationkey ALL (SELECT s_nationkey FROM SUPPLIER) ▲ Query 4: 一个集合比较的非关联子查询原始执行计划我们以 Query 1 为例，直观地感受一下，为什么说关联子查询的去关联化是十分必要的。下面是 Query 1 的未经去关联化的原始查询计划（Relation Tree）。与其他查询计划不一样的是，我们特地画出了表达式树（Expression Tree），可以清晰地看到：子查询是实际上是挂在 Filter 的条件表达式下面的。 img实际执行时，查询计划执行器（Executor）在执行到 Filter 时，调用表达式执行器（Evaluator）；由于这个条件表达式中包含一个标量子查询，所以 Evaluator 又会调用 Executor 计算标量子查询的结果。这种 Executor - Evaluator - Executor 的交替调用十分低效！考虑到 Filter 上可能会有上百万行数据经过，如果为每行数据都执行一次子查询，那查询执行的总时长显然是不可接受的。 Apply 算子上文说到的 Relation - Expression - Relation 这种交替引用不仅执行性能堪忧，而且，对于优化器也是个麻烦的存在--我们的优化规则都是在匹配并且对 Relation 进行变换，而这里的子查询却藏在 Expression 里，令人无从下手。为此，在开始去关联化之前，我们引入 Apply 算子： Apply 算子（也称作 Correlated Join）接收两个关系树的输入，与一般 Join 不同的是，Apply 的 Inner 输入（图中是右子树）是一个带有参数的关系树。 Apply 的含义用下图右半部分的集合表达式定义：对于 Outer Relation RR 中的每一条数据 rr，计算 Inner Relation E(r)E(r)，输出它们连接（Join）起来的结果 r⊗E(r)r⊗E(r)。Apply 的结果是所有这些结果的并集（本文中说的并集指的是 Bag 语义下的并集，也就是 UNION ALL）。 " Apply 是 SQL Server 的命名，它在 HyPer 的文章中叫做 Correlated Join。它们是完全等价的。考虑到 SQL Server 的文章发表更早、影响更广，本文中都沿用它的命名。根据连接方式（⊗⊗）的不同，Apply 又有 4 种形式： Cross Apply A×A×：这是最基本的形式，行为刚刚我们已经描述过了； Left Outer Apply ALOJALOJ：即使 E(r)E(r) 为空，也生成一个 r∘{NULLs}r∘{NULLs}。 Semi Apply A∃A∃：如果 E(r)E(r) 不为空则返回 rr，否则丢弃； Anti-Semi Apply A∄A∄：如果 E(r)E(r) 为空则返回 rr，否则丢弃；我们用刚刚定义的 Apply 算子来改写之前的例子：把子查询从 Expression 内部提取出来。结果如下：上面的例子中，我们可以肯定 Scalar Agg 子查询有且只有一行结果，所以可以直接转成 Apply。但某些情况下，可能无法肯定子查询一定能返回 0 或 1 行结果（例如，想象一下 Query 2 如果 c_custkey 不是唯一的），为了确保 SQL 语义，还要在 Apply 右边加一个 Max1RowMax1Row 算子： Max1Row(E)=⎧⎩⎨⎪⎪Null,E,error,if |E|=0if |E|=1otherwiseMax1Row(E)={Null,if |E|=0E,if |E|=1error,otherwise 理论上，我们可以将所有的子查询转换成 Apply 算子，一个通用的方法如下：如果某个算子的表达式中出现了子查询，我们就把这个子查询提取到该算子下面（留下一个子查询的结果变量），构成一个 ALOJALOJ 算子。如果不止一个子查询，则会产生多个 ALOJALOJ。必要的时候加上 Max1RowMax1Row 算子。然后应用其他一些规则，将 ALOJALOJ 转换成 A×A×、A∃A∃、A∄A∄。例如上面例子中的子查询结果 XX 被用作 Filter 的过滤条件，NULL 值会被过滤掉，因此可以安全地转换成 A×A×。下面这个例子中，Filter 条件表达式中包含 Q1Q1、Q2Q2 两个子查询。转换之后分别生成了对应的 Apply 算子。其中 Q2Q2 无法确定只会生成恰好一条记录，所以还加上了 Max1RowMax1Row 算子。基本消除规则第一组规则是最基本的规则，等式中的 ⊗⊗ 说明它不限制连接类型，可以是 {×,LOJ,∃,∄}{×,LOJ,∃,∄} 中的任意一个。

These two rules are very obvious, translated into vernacular: if the right side of the Apply does not contain parameters from the left, it is equivalent to the direct Join.

Here is an example of applying rule (2) to Query 3:

Disassociation of Project and Filter

The second set of rules describes how to deal with Project and Filter in a subquery. The idea can be described in one sentence: push Apply down as much as possible and lift up the operators under Apply.

Note that these rules only deal with the Cross Apply case. The other three variants of Apply can theoretically be converted into Cross Apply, and for the time being we just need to know this fact.

You might ask: usually we push Filter and Project down as much as possible, so why do we do the opposite here? The point is: Filter and Project originally contain expressions with associated variables, but after raising it above Apply, the associated variables become normal variables! That's exactly what we want.

We will see the huge benefits of doing so later: when Apply is pushed to the bottom, you can apply the first set of rules to directly change Apply into Join, which completes the optimization process of disassociating subqueries.

The following is an example of applying rule (3) to Query 2. After that, the disassociation process is completed by applying the rule (1).

Disassociation of Aggregate

The third set of rules describes how to handle Aggregate (that is, Group By) in a subquery. As with the previous group, our guiding idea is to push Apply down as much as possible and lift the operators under Apply up as much as possible.

In the following equation, GA,FGA,F represents Group Agg with Group By grouping, where AA represents the column of grouping, FF represents the column of aggregate function, and G1FGF1 represents aggregation without grouping (Scalar Agg).

Img this set of rules is not as simple and straightforward as before, let's take a look at an example to find out. Here is the result of applying Rule (9) to Query 1:

Rule (9) while pushing down Apply, it also turns ScalarAgg into GroupAgg, where the grouping column is the key of R, in this case, the primary key c_custkey of CUSTOMER.

If R doesn't have a primary key or a unique key, theoretically, we can generate one at Scan.

Why is it equivalent before and after transformation? Before the transformation, we do a ScalarAgg aggregation calculation for the rows of each R, and then merge the aggregate results; after the transformation, we first prepare all the data to be aggregated (this is called augment), and then use GroupAgg to do all the aggregation at once.

This also explains why we use ALOJALOJ instead of the original A × A ×: on the original ScalarAgg, a NULL is output even if the input is empty. If we use ALOJALOJ here, we will get exactly the same behavior (*); conversely, if we use A × A ×, there will be a problem-customers without corresponding ORDERS will disappear in the result!

Rule (8) deals with GroupAgg, and the principle is the same, except that the original grouping column should be retained.

Details in ScalarAgg conversion *

Careful readers may have noticed that the aggregate function generated on the right side of rule (9) is Fairchild, with an extra single quotation mark, suggesting that it may be a little different from the original aggregate function FF. So under what circumstances would it be different? This topic is more in-depth, students who are not interested can skip it.

First of all, let's think, is the behavior of GroupAgg and ALOJALOJ exactly the same as before the change? Actually this is not so. Take a counterexample:

SELECT c_custkey, (SELECT COUNT (*) FROM ORDERS WHERE o_custkey = c_custkey) AS count_orders FROM CUSTOMER

Imagine that the customer Eric does not have any orders, then this query should return a line of ['Eric', 0]. However, when we apply the rule (9) to do the transformation, we get a value of ['Eric', 1], which is wrong!

Why is this happening? After the transformation, we first prepare the intermediate data (augment) with LeftOuterJoin, and then aggregate with GroupAgg. LeftOuterJoin generates a ['Eric', NULL, NULL,...] for the customer Eric. In GroupAgg, the aggregate function COUNT (*) assumes that the packet Eric has one row of data, so it outputs ['Eric', 1].

Here is a more complex example, with similar problems:

SELECT c_custkey FROM CUSTOMER WHERE 200000 < (SELECT MAX (IF_NULL (o_totalprice, 42))-o_totalprice may be NULL FROM ORDERS WHERE o_custkey = c_custkey)

To sum up, the root of the problem lies in: F (∅) ≠ F ({NULL}) F (∅) ≠ F ({NULL}), which is the same as the aggregate function FF.

The transformed GroupAgg can not tell whether the NULL data it sees is generated by OuterJoin or already exists. Sometimes, the two situations will produce different results in the pre-transformed ScalarAgg.

Fortunately, the aggregate function F (col) F (col) defined in the SQL standard is OK-they all satisfy F (∅) = F ({NULL}) F (∅) = F ({NULL}), and we can solve this problem with a slight transformation of FF.

For example 1, replace COUNT (*) with a Count for a non-null column (such as a primary key), for example: COUNT (o_orderkey)

For example 2, MIN (IF_NULL (o_totalprice, 42)) needs to be divided into two steps: define the intermediate variable X, first calculate X = IF_NULL (o_totalprice, 42) with Project, and then disassociate the aggregate function MIN (X).

Disassociation of set operation

The last set of optimization rules is used to handle subqueries with Union (corresponding to UNION ALL), Subtract (corresponding to EXCEPT ALL), and Inner Join operators. Again, our guiding idea is to push Apply down as much as possible and lift up the operators under Apply.

In the following equation, × × means Cross Join, and ⋈ R.key ⋈ R.key means to connect naturally according to the Key of RR: r ∘ E1 ∘ e2R ∘ E1 ∘ e2. As before, we assume that there is a primary key or a unique key in RR, and if not, we can add one to Scan.

Notice that these rules are significantly different from those we've seen before: RR appears twice on the right side of the equation. In this way, either we make a copy of the subtree or make an implementation plan for DAG, which will be a lot of trouble.

In fact, this set of rules is rarely useful. As mentioned in [2], it is even difficult to write a meaningful subquery with Union All under TPC-H 's Schema.

Other

There are a few points that I think are important, listed below in the form of FAQ.

Can any related subquery of ► be disassociated?

It can be said that after adding a small amount of qualification, it can be proved theoretically that any associated subquery can be disassociated.

The method of proof is mentioned in [1] and [3]. Take [1] as an example, the idea is roughly as follows:

For any query relation tree, the associated subquery is extracted from the expression and expressed by Apply operator.

The non-basic relation operators are removed step by step. First, Union and Subtract are removed by equivalent transformation.

Further reduce the set of operators by removing OuterJoin, ALOJALOJ, A ∃ A ∃, A ∄ A ∄

Finally, after removing all A × A ×, the remaining relational trees contain only some basic relational operators, that is, the disassociation is completed.

On the other hand, most of the subqueries used by users in the real world are relatively simple, and the rules described in this article may already cover 99% of the scenarios. Although in theory any subquery can be processed, in practice, no known DBMS implements all of these transformation rules.

What are the similarities and differences between ► HyPer and SQL Server?

HyPer's theory covers more discorrelation scenarios. For example, for all kinds of Join operators, the corresponding equivalent transformation rules are given in [3] (as an example, the following figure is the transformation of Outer Join). In [1], it is only proved that these cases can be regulated to manageable situations (in fact, it is conceivable that they must not be dealt with).

Another detail is that there is another rule in HyPer:

Where D = definite F (T2) ∩ A (T1) (T1) D = exact F (T2) ∩ A (T1) (T1), which represents the Distinct Project result for T1T1 (so-called _ Magic Set_). Looking directly at the equation is more obscure, and it is easy to understand by looking at the following example:

In the figure, before doing Apply, get the set of Distinct values of the columns that need Apply, take these values as Apply, and then connect the results of Apply with normal Join.

The advantage of this is that if there is a large number of duplicates in the data being Apply, the number of rows that need Apply after Distinct Project is greatly reduced. In this way, even if the Apply is not optimized later, the cost of iterative execution will be greatly reduced.

At this point, the study on "how to master SQL subquery optimization" is over. I hope to be able to solve your doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.