Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to understand SQL subquery optimization

2025-02-27 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Share

Shulou(Shulou.com)05/31 Report--

This article mainly introduces "how to understand SQL subquery optimization". In daily operation, I believe many people have doubts about how to understand SQL subquery optimization. Xiaobian consulted all kinds of materials and sorted out simple and easy operation methods. I hope to help you answer the doubts about "how to understand SQL subquery optimization"! Next, please follow the small series to learn together!

Subquery optimization has always been one of the difficulties in SQL query optimization. The basic execution of associative subqueries is similar to Nested-Loop, but this execution is often unacceptably inefficient. When the data volume is slightly larger, it must be decorrelated or Unnesting in the optimizer, rewriting it to a more efficient operator like Semi-Join.

Introduction to Subquery

A subquery is a syntax defined in the SQL standard that can appear almost anywhere in SQL, including SELECT, FROM, WHERE clauses, etc.

Generally speaking, subqueries can be divided into correlated subqueries and non-correlated subqueries. The latter is a very simple problem, the simplest is to execute it first, get the result set and materialize, and then execute the outer query. Here is an example:

SELECT c_count, count(*) AS custdist FROM ( SELECT c_custkey, count(o_orderkey) AS c_count FROM CUSTOMER LEFT OUTER JOIN ORDERS ON c_custkey = o_custkey AND o_comment NOT LIKE '%pending%deposits%' GROUP BY c_custkey ) c_orders GROUP BY c_count ORDER BY custdist DESC, c_count DESC;

TPCH-13 is an uncorrelated subquery.

Non-associative subqueries are outside the scope of this article, and unless otherwise stated, the subqueries we refer to below are associative subqueries.

What is special about associative subqueries is that they are incomplete: their closures contain parameters supplied by the outer query. Obviously, we have to know these parameters to run the query, so we can't do the same with non-associative subqueries.

Classified according to the data generated, subqueries can be divided into the following categories:

Scalar-valued subquery: Outputs a result table with only one row and one column, and this scalar value is its result. If the result is null (row 0), a NULL is output. Note, however, that more than one line result is not allowed and generates a runtime exception.

Scalar queries can appear anywhere they contain scalars, such as SELECT, WHERE, etc. clauses. Here is an example:

SELECT c_custkey FROM CUSTOMER WHERE 1000000

< ( SELECT SUM(o_totalprice) FROM ORDERS WHERE o_custkey = c_custkey ) ▲ Query 1: 一个出现在 WHERE 子句中的标量子查询,关联参数用红色字体标明了 SELECT o_orderkey, ( SELECT c_name FROM CUSTOMER WHERE c_custkey = o_custkey ) AS c_name FROM ORDERS ▲ Query 2: 一个出现在 SELECT 子句中的标量子查询 存在性检测(Existential Test) 子查询:特指 EXISTS 的子查询,返回一个布尔值。如果出现在 WHERE 中,这就是我们熟悉的 Semi-Join。当然,它可能出现在任何可以放布尔值的地方。 SELECT c_custkey FROM CUSTOMER WHERE c_nationkey = 86 AND EXISTS( SELECT * FROM ORDERS WHERE o_custkey = c_custkey ) ▲ Query 3: 一个 Semi-Join 的例子 集合比较(Quantified Comparision) 子查询:特指 IN、SOME、ANY 的查询,返回一个布尔值,常用的形式有:x = SOME(Q) (等价于 x IN Q)或 X ALL(Q)(等价于 x NOT IN Q)。同上,它可能出现在任何可以放布尔值的地方。 SELECT c_name FROM CUSTOMER WHERE c_nationkey ALL (SELECT s_nationkey FROM SUPPLIER) ▲ Query 4: 一个集合比较的非关联子查询 原始执行计划 我们以 Query 1 为例,直观地感受一下,为什么说关联子查询的去关联化是十分必要的。 下面是 Query 1 的未经去关联化的原始查询计划(Relation Tree)。与其他查询计划不一样的是,我们特地画出了表达式树(Expression Tree),可以清晰地看到:子查询是实际上是挂在 Filter 的条件表达式下面的。 img实际执行时,查询计划执行器(Executor)在执行到 Filter 时,调用表达式执行器(Evaluator);由于这个条件表达式中包含一个标量子查询,所以 Evaluator 又会调用 Executor 计算标量子查询的结果。 这种 Executor - Evaluator - Executor 的交替调用十分低效!考虑到 Filter 上可能会有上百万行数据经过,如果为每行数据都执行一次子查询,那查询执行的总时长显然是不可接受的。 Apply 算子 上文说到的 Relation - Expression - Relation 这种交替引用不仅执行性能堪忧,而且,对于优化器也是个麻烦的存在——我们的优化规则都是在匹配并且对 Relation 进行变换,而这里的子查询却藏在 Expression 里,令人无从下手。 为此,在开始去关联化之前,我们引入 Apply 算子: Apply 算子(也称作 Correlated Join)接收两个关系树的输入,与一般 Join 不同的是,Apply 的 Inner 输入(图中是右子树)是一个带有参数的关系树。 Apply 的含义用下图右半部分的集合表达式定义:对于 Outer Relation RR 中的每一条数据 rr,计算 Inner Relation E(r)E(r),输出它们连接(Join)起来的结果 r⊗E(r)r⊗E(r)。Apply 的结果是所有这些结果的并集(本文中说的并集指的是 Bag 语义下的并集,也就是 UNION ALL)。 " Apply 是 SQL Server 的命名,它在 HyPer 的文章中叫做 Correlated Join。它们是完全等价的。考虑到 SQL Server 的文章发表更早、影响更广,本文中都沿用它的命名。 根据连接方式(⊗⊗)的不同,Apply 又有 4 种形式: Cross Apply A×A×:这是最基本的形式,行为刚刚我们已经描述过了; Left Outer Apply ALOJALOJ:即使 E(r)E(r) 为空,也生成一个 r°{NULLs}r°{NULLs}。 Semi Apply A∃A∃:如果 E(r)E(r) 不为空则返回 rr,否则丢弃; Anti-Semi Apply A∄A∄:如果 E(r)E(r) 为空则返回 rr,否则丢弃; 我们用刚刚定义的 Apply 算子来改写之前的例子:把子查询从 Expression 内部提取出来。结果如下: 上面的例子中,我们可以肯定 Scalar Agg 子查询有且只有一行结果,所以可以直接转成 Apply。但某些情况下,可能无法肯定子查询一定能返回 0 或 1 行结果(例如,想象一下 Query 2 如果 c_custkey 不是唯一的),为了确保 SQL 语义,还要在 Apply 右边加一个 Max1RowMax1Row 算子: Max1Row(E)=⎧⎩⎨⎪⎪Null,E,error,if |E|=0if |E|=1otherwiseMax1Row(E)={Null,if |E|=0E,if |E|=1error,otherwise 理论上,我们可以将所有的子查询转换成 Apply 算子,一个通用的方法如下: 1. 如果某个算子的表达式中出现了子查询,我们就把这个子查询提取到该算子下面(留下一个子查询的结果变量),构成一个 ALOJALOJ 算子。如果不止一个子查询,则会产生多个 ALOJALOJ。必要的时候加上 Max1RowMax1Row 算子。 2. 然后应用其他一些规则,将 ALOJALOJ 转换成 A×A×、A∃A∃、A∄A∄。例如上面例子中的子查询结果 XX 被用作 Filter 的过滤条件,NULL 值会被过滤掉,因此可以安全地转换成 A×A×。 下面这个例子中,Filter 条件表达式中包含 Q1Q1、Q2Q2 两个子查询。转换之后分别生成了对应的 Apply 算子。其中 Q2Q2 无法确定只会生成恰好一条记录,所以还加上了 Max1RowMax1Row 算子。 基本消除规则 第一组规则是最基本的规则,等式中的 ⊗⊗ 说明它不限制连接类型,可以是 {×,LOJ,∃,∄}{×,LOJ,∃,∄} 中的任意一个。 这两条规则是非常显而易见的,翻译成大白话就是:如果 Apply 的右边不包含来自左边的参数,那它就和直接 Join 是等价的。 下面是对 Query 3 应用规则 (2) 的例子: Project 和 Filter 的去关联化 第二组规则描述了如何处理子查询中的 Project 和 Filter,其思想可以用一句话来描述:尽可能把 Apply 往下推、把 Apply 下面的算子向上提。 注意这些规则仅处理 Cross Apply 这一种情况。其他 3 种 Apply 的变体,理论上都可以转换成 Cross Apply,暂时我们只要知道这个事实就可以了。 你可能会问:通常我们都是尽可能把 Filter、Project 往下推,为什么这里会反其道而行呢?关键在于:Filter、Project 里面原本包含了带有关联变量的表达式,但是把它提到 Apply 上方之后,关联变量就变成普通变量了!这正是我们想要的。 我们稍后就会看到这样做的巨大收益:当 Apply 被推最下面时,就可以应用第一组规则,直接把 Apply 变成 Join,也就完成了子查询去关联化的优化过程。 下面是对 Query 2 应用规则 (3) 的例子。之后再应用规则 (1),就完成了去关联化过程。

De-correlation of aggregates

The third set of rules describes how to handle Aggregates in subqueries (i.e., Group By). As with the previous group, our guiding principle is still: push Apply down as much as possible and lift the operators below Apply up.

In the following equations, GA,FGA,F denotes an aggregation with Group By grouping (Group Agg), where AA denotes a grouped column and FF denotes a column of an aggregation function; G1FGF1 denotes an aggregation without grouping (Scalar Agg).

img This set of rules is not as simple and straightforward as before. Let's look at an example first to find out how it feels. Here is the result of applying rule (9) to Query 1:

Rule (9) also changes ScalarAgg to GroupAgg while pushing down Apply, where the grouping column is the key of R, in this case the primary key c_custkey of CUSTOMER.

"If R doesn't have a primary key or unique key, theoretically we can generate one at Scan time.

Why is it equivalent before and after transformation? Before the transformation, we do a ScalarAgg aggregation for each row of R, and then combine the results of the aggregation; after the transformation, we first prepare all the data to be aggregated (this is called an augmentation), and then use GroupAgg to do all the aggregation at once.

This also explains why we use ALOJALOJ instead of the original A×A×: the original ScalarAgg outputs NULL even if the input is empty. If we use ALOJALOJ here, we get exactly the same behavior (*); conversely, if we use A×A×, there is a problem-customers without corresponding ORDERS disappear from the result!

Rule (8) deals with GroupAgg, the same principle, but the original grouping column should also be retained.

Details in ScalarAgg transformation *

The careful reader may notice that the aggregate function produced on the right hand side of rule (9) is F′F′, with an extra single quote, suggesting that it may be somewhat different from the original aggregate function FF. And under what circumstances would it be different? This topic is more in-depth, not interested students can skip.

First, let's think about whether GroupAgg and ALOJALOJ behave exactly as before the transformation. It's not true. Here's a counterexample:

SELECT c_custkey, ( SELECT COUNT(*) FROM ORDERS WHERE o_custkey = c_custkey ) AS count_orders FROM CUSTOMER

Imagine that customer Eric doesn't have any orders, so this query should return a row for "'Eric ', 0]. However, when we apply rule (9) to the transformation, we get a value of ['Eric', 1], and the result is wrong!

Why is this so? After the transformation, we first prepare the intermediate data (augmentation) with LeftOuterJoin, and then aggregate it with GroupAgg. LeftOuterJoin generates a ['Eric ', NULL, NULL,...] for customer Eric In GroupAgg, the aggregate function COUNT(*) assumes that Eric has 1 row of data, so it outputs ['Eric', 1].

Here's a more complicated example with similar problems:

SELECT c_custkey FROM CUSTOMER WHERE 200000

< ( SELECT MAX(IF_NULL(o_totalprice, 42)) -- o_totalprice may be NULL FROM ORDERS WHERE o_custkey = c_custkey ) 作为总结,问题的根源在于:F(∅)≠F({NULL})F(∅)≠F({NULL}),这样的聚合函数 FF 都有这个问题。 变换后的 GroupAgg 无法区分它看到的 NULL 数据到底是 OuterJoin 产生的,还是原本就存在的,有时候,这两种情形在变换前的 ScalarAgg 中会产生不同的结果。 幸运的是,SQL 标准中定义的聚合函数 F(col)F(col) 都是 OK 的——它们都满足 F(∅)=F({NULL})F(∅)=F({NULL}),我们只要对 FF 稍加变换就能解决这个问题。 对于例子一,将 COUNT(*) 替换成一个对非空列(例如主键)的 Count 即可,例如:COUNT(o_orderkey); 对于例子二,需要把 MIN(IF_NULL(o_totalprice, 42)) 分成两步来做:定义中间变量 X,先用 Project 计算 X = IF_NULL(o_totalprice, 42),再对聚合函数 MIN(X) 进行去关联化即可。 集合运算的去关联化 最后一组优化规则用来处理带有 Union(对应 UNION ALL)、Subtract(对应 EXCEPT ALL) 和 Inner Join 算子的子查询。再强调一遍,我们的指导思想是:尽可能把 Apply 往下推、把 Apply 下面的算子向上提。 下面的等式中,×× 表示 Cross Join,⋈R.key⋈R.key 表示按照 RR 的 Key 做自然连接:r°e1°e2r°e1°e2 。和之前一样,我们假设 RR 存在主键或唯一键,如果没有也可以在 Scan 的时候加上一个。 注意到,这些规则与之前我们见过的规则有个显著的不同:等式右边 RR 出现了两次。这样一来,要么我们把这颗子树拷贝一份,要么做成一个 DAG 的执行计划,总之会麻烦许多。 事实上,这一组规则很少能派上用场。在 [2] 中提到,在 TPC-H 的 Schema 下甚至很难写出一个带有 Union All 的、有意义的子查询。 其他 有几个我认为比较重要的点,用 FAQ 的形式列在下面。 ► 是否任意的关联子查询都可以被去关联化? 可以说是这样的,在加上少量限定之后,理论上可以证明:任意的关联子查询都可以被去关联化。 证明方法在 [1]、[3] 中都有提及。以 [1] 中为例,思路大致是: 鸿蒙官方战略合作共建--HarmonyOS技术社区 对于任意的查询关系树,首先将关联子查询从表达式中提取出来,用 Apply 算子表示; 一步步去掉其中非基本关系算子,首先,通过等价变换去掉 Union 和 Subtract; 进一步缩小算子集合,去掉 OuterJoin、ALOJALOJ、A∃A∃、A∄A∄; 最后,去掉所有的 A×A×,剩下的关系树仅包含基本的一些关系算子,即完成了去关联化。 另一方面,现实世界中用户使用的子查询大多是比较简单的,本文中描述的这些规则可能已经覆盖到 99% 的场景。虽然理论上任意子查询都可以处理,但是实际上,没有任何一个已知的 DBMS 实现了所有这些变换规则。 ► HyPer 和 SQL Server 的做法有什么异同? HyPer 的理论覆盖了更多的去关联化场景。例如各种 Join 等算子,[3] 中都给出了相应的等价变换规则(作为例子,下图是对 Outer Join 的变换)。而在 [1] 中仅仅是证明了这些情况都可以被规约到可处理的情形(实际上嘛,可想而知,一定是没有处理的)。 另一个细节是,HyPer 中还存在这样一条规则:

where D=ΠF(T2)<$A(T1)(T1)D=ΠF(T2)<$A(T1)(T1) denotes the Distinct Project result for T1T1 (so-called Magic Set). It is easier to understand the equation by looking at the following example:

In the figure, before doing Apply, first get the Distinct value set of the column to be applied, take these values to do Apply, and then use the ordinary Join to connect the results of Apply.

The benefit of this is that if there is a lot of duplication in the applied data, the number of rows to Apply after Distinct Project is greatly reduced. In this way, even if Apply is not optimized later, the cost of iterative execution will be reduced.

Should these rules be used in RBO or CBO? In other words, is the execution plan necessarily better after decorrelation than before decorrelation?

The answer is, not necessarily.

Intuitively, if the left side of Apply has a small amount of data (for example, only one piece of data), it is better to carry it directly into the right side of Apply. Another case is when there is a suitable index on the right, in which case the cost of multiple applications is not unacceptable.

So it is more appropriate to put these rules into a CBO optimizer, which selects the optimal plan based on cost estimates. Even, in some cases, we apply these equations from right to left, doing "add correlations."

This is the same as HashJoin or NestedLoopJoin. In fact, NestedLoopJoin is a special case of Apply. It is common for NestedLoopJoin to be more efficient than HashJoin, if appropriate indexes exist.

At this point, the study of "how to understand SQL subquery optimization" is over, hoping to solve everyone's doubts. Theory and practice can better match to help you learn, go and try it! If you want to continue learning more relevant knowledge, please continue to pay attention to the website, Xiaobian will continue to strive to bring more practical articles for everyone!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Database

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report