IN and EXISTS are common complex conditions in SQL, and they must also be handled when converting SQL (or stored procedures) into outside-database computation for higher performance. Based on the model defined by TPC-H, this article introduces how to implement IN and EXISTS with the aggregator's syntax and how to optimize them.
TPC-H is a benchmark developed by the Transaction Processing Performance Council (TPC) for OLAP database management systems; it simulates a real business environment to evaluate the performance of decision-support systems in business analysis. The TPC-H model defines eight tables (PART, SUPPLIER, PARTSUPP, CUSTOMER, ORDERS, LINEITEM, NATION, REGION); the table structures and their relationships are shown in the following figure:
IN with a constant set

SQL example (1):

select P_SIZE, P_TYPE, P_BRAND, count(1) as P_COUNT
from PART
where P_SIZE in (2, 3, 8, 15, 17, 25, 27, 28, 30, 38, 41, 44, 45)
  and P_TYPE in ('SMALL BRUSHED NICKEL', 'SMALL POLISHED STEEL')
  and P_BRAND not in ('Brand#12', 'Brand#13')
group by P_SIZE, P_TYPE, P_BRAND

Optimization ideas:
If the constant set has fewer than three elements, the IN can be translated into (f==v1 || f==v2), and NOT IN correspondingly into (f!=v1 && f!=v2). More generally, define the constant set as a sequence in the outer layer and use A.contain(f) to check whether the field value is in the sequence. Experience shows that binary search is noticeably faster than sequential search once there are more than about 10 elements; to use binary search, sort the sequence first and then search it with A.contain@b(f). NOT IN corresponds to !A.contain(f). Note that the constant sequence must be defined outside the loop function, otherwise it will be evaluated repeatedly.
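As a minimal sketch of these two forms, complementing the full implementation below, the following illustrative cell expressions assume the same PART set file and columns as example (1); the NOT IN sequence in B1 is only an example:

A1  =["SMALL BRUSHED NICKEL","SMALL POLISHED STEEL"]   / a small constant set; with at most a few elements a direct || comparison also works
B1  =["Brand#12","Brand#13"]   / an illustrative NOT IN constant set, defined outside the loop function
A2  =file(PART).cursor@b(P_SIZE,P_TYPE,P_BRAND)   / cursor on the PART set file, as in example (1)
A3  =A2.select((P_TYPE=="SMALL BRUSHED NICKEL" || P_TYPE=="SMALL POLISHED STEEL") && !B1.contain(P_BRAND))   / IN written as ||, NOT IN written as !A.contain(f)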
If the constant set has a very large number of elements, join filtering can be used instead; see the code below.
Aggregator implementation:
A1  =[2,3,8,15,17,25,27,28,30,38,41,44,45].sort()   / sort the constant set so that its 13 elements can be searched with ordered (binary) search
A2  =file(PART).cursor@b(P_SIZE,P_TYPE,P_BRAND)   / define a cursor on the set file of the PART table; the parameters are the selected columns
A3  =A2.select(A1.contain@b(P_SIZE) && (P_TYPE=="SMALL BRUSHED NICKEL" || P_TYPE=="SMALL POLISHED STEEL") && (P_BRAND!="Brand#12" && P_BRAND!="Brand#13"))   / attach the filter to the cursor; the constant sequence must be defined outside the filter function, otherwise it would be evaluated repeatedly
A4  =A3.groups(P_SIZE,P_TYPE,P_BRAND; count(1):P_COUNT)   / group the cursor to get the final result
If A1 has a particularly large number of elements, hash-join filtering can be used instead, replacing A3 with the following:
A3  =A2.select((P_TYPE=="SMALL BRUSHED NICKEL" || P_TYPE=="SMALL POLISHED STEEL") && (P_BRAND!="Brand#12" && P_BRAND!="Brand#13")).join@i(P_SIZE,A1)   / attach the filter to the cursor, followed by a join filter against the constant sequence

IN with a subquery: the field selected by the subquery is the primary key

SQL example (2):

select PS_SUPPKEY, count(1) as S_COUNT
from PARTSUPP
where PS_PARTKEY in (select P_PARTKEY from PART where P_NAME like 'bisque%%')
group by PS_SUPPKEY

Optimization idea:
Filter the subquery and read it into memory, then hash-join the outer table with this in-memory table (the subquery result) to filter it. The aggregator provides two functions for hash-join filtering, switch@i() and join@i(). switch is a foreign-key join: it turns the foreign-key field into a reference field, so fields of the referenced table can be accessed directly through the foreign-key field. The join function does not change the value of the foreign-key field and can be used when only filtering is needed.
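To make the difference concrete, here is a minimal sketch (not part of the original solution) contrasting the two functions on example (2); the dereference of P_NAME after switch@i is an illustrative assumption based on the description above:

A1  =file(PART).cursor@b(P_PARTKEY,P_NAME).select(like(P_NAME,"bisque*")).fetch()   / the filtered subquery result, held in memory
A2  =file(PARTSUPP).cursor@b(PS_SUPPKEY,PS_PARTKEY).join@i(PS_PARTKEY,A1:P_PARTKEY)   / join@i only filters; PS_PARTKEY keeps its original value
A3  =file(PARTSUPP).cursor@b(PS_SUPPKEY,PS_PARTKEY).switch@i(PS_PARTKEY,A1:P_PARTKEY)   / switch@i also filters, but turns PS_PARTKEY into a reference field
A4  =A3.groups(PS_PARTKEY.P_NAME:P_NAME; count(1):P_COUNT)   / after switch@i, fields of PART can be referenced directly through PS_PARTKEY (illustrative)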
Aggregator implementation:
A1  =file(PART).cursor@b(P_PARTKEY,P_NAME)   / define a cursor on the set file of the PART table; the parameters are the selected columns
A2  =A1.select(like(P_NAME,"bisque*")).fetch()   / attach the filter to the cursor and fetch the result into memory
A3  =file(PARTSUPP).cursor@b(PS_SUPPKEY,PS_PARTKEY)   / define a cursor on the set file of the PARTSUPP table
A4  =A3.join@i(PS_PARTKEY,A2:P_PARTKEY)   / join-filter the PARTSUPP cursor; the @i option means inner join
A5  =A4.groups(PS_SUPPKEY; count(1):S_COUNT)   / group the cursor to get the final result

The field selected by the subquery is not the primary key

SQL example (3):

select O_ORDERPRIORITY, count(*) as O_COUNT
from ORDERS
where O_ORDERDATE >= date '1995-10-01'
  and O_ORDERDATE < date '1995-10-01' + interval '3' month
  and O_ORDERKEY in (
    select L_ORDERKEY from LINEITEM where L_COMMITDATE < L_RECEIPTDATE
  )
group by O_ORDERPRIORITY
Optimization idea:

Filter the subquery, deduplicate it by the join field and read it into memory; the problem then becomes similar to the primary-key case, and the switch@i() / join@i() functions described above can again be used for hash-join filtering.

Aggregator implementation:

A1  1995-10-01
B1  =after@m(A1,3)   / the end of the 3-month window starting at A1
A2  =file(LINEITEM).cursor@b(L_ORDERKEY,L_COMMITDATE,L_RECEIPTDATE)   / define a cursor on the set file of the LINEITEM table; the parameters are the selected columns
A3  =A2.select(L_COMMITDATE<L_RECEIPTDATE)   / attach the filter to the cursor
A4  =A3.groups(L_ORDERKEY)   / deduplicate L_ORDERKEY with groups
A5  =file(ORDERS).cursor@b(O_ORDERKEY,O_ORDERDATE,O_ORDERPRIORITY)   / define a cursor on the set file of the ORDERS table
A6  =A5.select(O_ORDERDATE>=A1 && O_ORDERDATE<B1)   / attach the filter to the cursor
A7  =A6.join@i(O_ORDERKEY,A4:L_ORDERKEY)   / join-filter the ORDERS cursor; the @i option means inner join
A8  =A7.groups(O_ORDERPRIORITY; count(1):O_COUNT)   / group the cursor to get the final result

The subquery result set cannot fit in memory

SQL example (3) (the same query as above):

select O_ORDERPRIORITY, count(*) as O_COUNT
from ORDERS
where O_ORDERDATE >= date '1995-10-01'
  and O_ORDERDATE < date '1995-10-01' + interval '3' month
  and O_ORDERKEY in (
    select L_ORDERKEY from LINEITEM where L_COMMITDATE < L_RECEIPTDATE
  )
group by O_ORDERPRIORITY
Optimization idea:

An IN subquery is equivalent to deduplicating the subquery result set and then inner-joining it with the outer table. The more efficient ways to join are hash join and ordered merge join, so the problem becomes how to translate IN into an efficient join. Below we analyze how to do this under different data distributions.

(1) The outer table is small enough to fit in memory: read the outer table first, deduplicating by the join field if it is not a logical primary key; then use the join-field values obtained in the previous step to hash-join-filter the subquery; finally, use the resulting subquery join-field values to hash-join-filter the outer table.

(2) The outer table and the inner table are both ordered by the join field: in this case the joinx() function can merge-join the ordered cursors; if the inner table's join field is not a logical primary key, deduplicate it first. In this example the ORDERS and LINEITEM tables are stored in the same ORDERKEY order, so this method can be used.

(3) The inner table is a large dimension table stored ordered by its primary key: the aggregator provides the A.joinx function for joining against an ordered large dimension table file; the rest of the handling is similar to the in-memory case and is not repeated here.

Aggregator implementation (1):

A1  1995-10-01
B1  =after@m(A1,3)   / the end of the 3-month window starting at A1
A2  =file(ORDERS).cursor@b(O_ORDERKEY,O_ORDERDATE,O_ORDERPRIORITY)   / define a cursor on the set file of the ORDERS table; the parameters are the selected columns
A3  =A2.select(O_ORDERDATE>=A1 && O_ORDERDATE<B1).fetch()   / attach the filter to the cursor and fetch the result into memory
A4  =file(LINEITEM).cursor@b(L_ORDERKEY,L_COMMITDATE,L_RECEIPTDATE)   / define a cursor on the set file of the LINEITEM table
A5  =A4.select(L_COMMITDATE<L_RECEIPTDATE).join@i(L_ORDERKEY,A3:O_ORDERKEY)   / attach the filter to the cursor, followed by a join filter
A6  =A5.groups(L_ORDERKEY)   / deduplicate L_ORDERKEY
A7  =A3.join@i(O_ORDERKEY,A6:L_ORDERKEY)   / join-filter the in-memory table
A8  =A7.groups(O_ORDERPRIORITY; count(1):O_COUNT)   / group to get the final result

Aggregator implementation (2):

A1  1995-10-01
B1  =after@m(A1,3)   / the end of the 3-month window starting at A1
A2  =file(ORDERS).cursor@b(O_ORDERKEY,O_ORDERDATE,O_ORDERPRIORITY)   / define a cursor on the set file of the ORDERS table; the parameters are the selected columns
A3  =A2.select(O_ORDERDATE>=A1 && O_ORDERDATE<B1)   / attach the filter to the cursor
A4  =file(LINEITEM).cursor@b(L_ORDERKEY,L_COMMITDATE,L_RECEIPTDATE)   / define a cursor on the set file of the LINEITEM table
A5  =A4.select(L_COMMITDATE<L_RECEIPTDATE)   / attach the filter to the cursor
A6  =A5.group@1(L_ORDERKEY)   / deduplicate by L_ORDERKEY
A7  =joinx(A3:ORDER,O_ORDERKEY; A6,L_ORDERKEY)   / merge-join the ordered cursors (inner join)
A8  =A7.groups(ORDER.O_ORDERPRIORITY:O_ORDERPRIORITY; count(1):O_COUNT)   / group the cursor to get the final result

EXISTS with equivalence conditions

The optimization ideas in this section are the same as those for IN subqueries; in fact, this kind of EXISTS can always be written with IN (and, conversely, IN can be written with EXISTS).

The subquery's join field is the primary key

SQL example (4):

select PS_SUPPKEY, count(1) as S_COUNT
from PARTSUPP
where exists (
  select * from PART where P_PARTKEY = PS_PARTKEY and P_NAME like 'bisque%%'
)
group by PS_SUPPKEY

Optimization idea:

Filter the subquery and read it into memory, then hash-join the outer table with this in-memory table (the subquery result) to filter it. The aggregator provides switch@i() and join@i() for hash-join filtering: switch is a foreign-key join that turns the foreign-key field into a reference field so that fields of the referenced table can be accessed directly through it, while join does not change the value of the foreign-key field and can be used when only filtering is needed.

Aggregator implementation:

A1  =file(PART).cursor@b(P_PARTKEY,P_NAME)   / define a cursor on the set file of the PART table; the parameters are the selected columns
A2  =A1.select(like(P_NAME,"bisque*")).fetch()   / attach the filter to the cursor and fetch the result into memory
A3  =file(PARTSUPP).cursor@b(PS_SUPPKEY,PS_PARTKEY)   / define a cursor on the set file of the PARTSUPP table
A4  =A3.join@i(PS_PARTKEY,A2:P_PARTKEY)   / join-filter the PARTSUPP cursor; the @i option means inner join
A5  =A4.groups(PS_SUPPKEY; count(1):S_COUNT)   / group the cursor to get the final result

The subquery's join field is not the primary key

SQL example (5):

select O_ORDERPRIORITY, count(*) as O_COUNT
from ORDERS
where O_ORDERDATE >= date '1995-10-01'
  and O_ORDERDATE < date '1995-10-01' + interval '3' month
  and exists (
    select * from LINEITEM where L_ORDERKEY = O_ORDERKEY and L_COMMITDATE < L_RECEIPTDATE
  )
group by O_ORDERPRIORITY
Optimization idea:

Filter the subquery, deduplicate it by the join field and read it into memory; the problem then becomes similar to the primary-key case, and the switch@i() / join@i() functions described above can again be used for hash-join filtering.

Aggregator implementation:

A1  1995-10-01
B1  =after@m(A1,3)   / the end of the 3-month window starting at A1
A2  =file(LINEITEM).cursor@b(L_ORDERKEY,L_COMMITDATE,L_RECEIPTDATE)   / define a cursor on the set file of the LINEITEM table; the parameters are the selected columns
A3  =A2.select(L_COMMITDATE<L_RECEIPTDATE)   / attach the filter to the cursor
A4  =A3.groups(L_ORDERKEY)   / deduplicate L_ORDERKEY
A5  =file(ORDERS).cursor@b(O_ORDERKEY,O_ORDERDATE,O_ORDERPRIORITY)   / define a cursor on the set file of the ORDERS table
A6  =A5.select(O_ORDERDATE>=A1 && O_ORDERDATE<B1)   / attach the filter to the cursor
A7  =A6.join@i(O_ORDERKEY,A4:L_ORDERKEY)   / join-filter the ORDERS cursor; the @i option means inner join
A8  =A7.groups(O_ORDERPRIORITY; count(1):O_COUNT)   / group the cursor to get the final result

The subquery result set cannot fit in memory

SQL example (5) (the same query as above):

select O_ORDERPRIORITY, count(*) as O_COUNT
from ORDERS
where O_ORDERDATE >= date '1995-10-01'
  and O_ORDERDATE < date '1995-10-01' + interval '3' month
  and exists (
    select * from LINEITEM where L_ORDERKEY = O_ORDERKEY and L_COMMITDATE < L_RECEIPTDATE
  )
group by O_ORDERPRIORITY
Optimization idea:

Equi-EXISTS is equivalent to deduplicating the inner table's join field and then inner-joining it with the outer table. The more efficient ways to join are hash join and ordered merge join, so the problem becomes how to translate EXISTS into an efficient join. Below we analyze how to do this under different data distributions.

1. The outer table is small enough to fit in memory: read the outer table first, deduplicating by the join field if it is not a logical primary key; then use the join-field values obtained in the previous step to hash-join-filter the subquery; finally, use the resulting subquery join-field values to hash-join-filter the outer table.

2. The outer table and the inner table are both ordered by the join field: in this case the joinx() function can merge-join the ordered cursors; if the inner table's join field is not a logical primary key, deduplicate it first. In this example the ORDERS and LINEITEM tables are stored in the same ORDERKEY order, so this method can be used.

3. The inner table is a large dimension table stored ordered by its primary key: the aggregator provides the A.joinx function for joining against an ordered large dimension table file; the rest of the handling is similar to the in-memory case and is not repeated here.

Aggregator implementation (1):

A1  1995-10-01
B1  =after@m(A1,3)   / the end of the 3-month window starting at A1
A2  =file(ORDERS).cursor@b(O_ORDERKEY,O_ORDERDATE,O_ORDERPRIORITY)   / define a cursor on the set file of the ORDERS table; the parameters are the selected columns
A3  =A2.select(O_ORDERDATE>=A1 && O_ORDERDATE<B1).fetch()   / attach the filter to the cursor and fetch the result into memory
A4  =file(LINEITEM).cursor@b(L_ORDERKEY,L_COMMITDATE,L_RECEIPTDATE)   / define a cursor on the set file of the LINEITEM table
A5  =A4.select(L_COMMITDATE<L_RECEIPTDATE).join@i(L_ORDERKEY,A3:O_ORDERKEY)   / attach the filter to the cursor, followed by a join filter
A6  =A5.groups(L_ORDERKEY)   / deduplicate L_ORDERKEY
A7  =A3.join@i(O_ORDERKEY,A6:L_ORDERKEY)   / join-filter the in-memory table
A8  =A7.groups(O_ORDERPRIORITY; count(1):O_COUNT)   / group to get the final result

Aggregator implementation (2):

A1  1995-10-01
B1  =after@m(A1,3)   / the end of the 3-month window starting at A1
A2  =file(ORDERS).cursor@b(O_ORDERKEY,O_ORDERDATE,O_ORDERPRIORITY)   / define a cursor on the set file of the ORDERS table; the parameters are the selected columns
A3  =A2.select(O_ORDERDATE>=A1 && O_ORDERDATE<B1)   / attach the filter to the cursor
A4  =file(LINEITEM).cursor@b(L_ORDERKEY,L_COMMITDATE,L_RECEIPTDATE)   / define a cursor on the set file of the LINEITEM table
A5  =A4.select(L_COMMITDATE<L_RECEIPTDATE)   / attach the filter to the cursor
A6  =A5.group@1(L_ORDERKEY)   / deduplicate by L_ORDERKEY
A7  =joinx(A3:ORDER,O_ORDERKEY; A6,L_ORDERKEY)   / merge-join the ordered cursors (inner join)
A8  =A7.groups(ORDER.O_ORDERPRIORITY:O_ORDERPRIORITY; count(1):O_COUNT)   / group the cursor to get the final result

EXISTS with non-equivalence conditions: self-association on the same table

SQL example (6):

select L_SUPPKEY, count(*) as numwait
from LINEITEM L1
where L1.L_RECEIPTDATE > L1.L_COMMITDATE
  and exists (
    select * from LINEITEM L2
    where L2.L_ORDERKEY = L1.L_ORDERKEY and L2.L_SUPPKEY <> L1.L_SUPPKEY
  )
  and not exists (
    select * from LINEITEM L3
    where L3.L_ORDERKEY = L1.L_ORDERKEY and L3.L_SUPPKEY <> L1.L_SUPPKEY
      and L3.L_RECEIPTDATE > L3.L_COMMITDATE
  )
group by L_SUPPKEY

Optimization ideas:
Let's first look at the data characteristics of the LINEITEM table. Its primary key is (L_ORDERKEY, L_LINENUMBER): one order corresponds to multiple LINEITEM records, which share the same L_ORDERKEY and are adjacent in the data file. With that in mind, the SQL above looks for orders that are supplied by more than one supplier and in which exactly one supplier failed to deliver on time. Because the data is stored in order of the order key, we can group the records by order as we scan, and then, for each group, check whether it contains items that were not delivered on time, whether the order involves more than one supplier, and whether only one supplier delivered late.
Aggregator implementation:
A1  =file(LINEITEM).cursor@b(L_ORDERKEY,L_SUPPKEY,L_RECEIPTDATE,L_COMMITDATE)   / define a cursor on the set file of the LINEITEM table; the parameters are the selected columns
A2  =A1.group(L_ORDERKEY)   / attach a grouping to the ordered cursor; the result is still a cursor, one group per order
A3  =A2.conj((t=~.select(L_RECEIPTDATE>L_COMMITDATE), if(t.len()>0 && t.select@1(L_SUPPKEY!=t(1).L_SUPPKEY)==null && ~.select@1(L_SUPPKEY!=t(1).L_SUPPKEY)!=null, t, null)))   / in each group, select the items not delivered on time into the temporary variable t; if t is non-empty, t contains only one supplier, and the group itself contains more than one supplier, return t, otherwise null; conj concatenates the groups back, i.e. it is the inverse of group
A4  =A3.groups@u(L_SUPPKEY; count(1):numwait)   / group the cursor to get the final result

Summary
When there are no null values, IN with a subquery can always be expressed with EXISTS, and the same query requirement, whether described with IN or with EXISTS, translates into the same aggregator code. So once we know how to translate and optimize EXISTS, we also know how to handle IN.
Equi-EXISTS is essentially a join. The two more efficient ways to join two tables are hash join and ordered merge join. To translate SQL of the form select … from A where exists (select * from B where …), we first need to find out the following:
(1) whether the join field is the primary key (or a logical primary key) of each table;
(2) the sizes of tables A and B, and whether either can be loaded into memory after the other filter conditions are applied;
(3) if neither table can be loaded into memory, whether the two tables are ordered by the join field.
If one of the tables can be loaded into memory, choose a hash join. The relevant aggregator functions are cs.switch() and cs.join(); both provide the @i and @d options, corresponding to EXISTS and NOT EXISTS respectively. The table passed as a parameter must be unique on the join-field values; if the field is not a logical primary key, deduplicate it first, for example with A.groups(). If both tables are too large to fit in memory, check whether they are ordered by the join field; if not, sort them with cs.sortx(), and then join the two ordered tables with joinx().
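As a hedged sketch of the @d option, which this article does not otherwise demonstrate, the NOT IN / NOT EXISTS counterpart of example (2) might be written as follows (the column choice and the S_COUNT name simply mirror that example):

A1  =file(PART).cursor@b(P_PARTKEY,P_NAME).select(like(P_NAME,"bisque*")).fetch()   / subquery result in memory, unique on the primary key P_PARTKEY
A2  =file(PARTSUPP).cursor@b(PS_SUPPKEY,PS_PARTKEY)   / cursor on the PARTSUPP set file
A3  =A2.join@d(PS_PARTKEY,A1:P_PARTKEY)   / @d keeps only the rows whose PS_PARTKEY has no match in A1, i.e. NOT EXISTS / NOT IN
A4  =A3.groups(PS_SUPPKEY; count(1):S_COUNT)   / group the cursor to get the final result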
For non-equivalence conditions, analyze the logic of the operation to see whether it can be converted into grouping followed by computation, as in example (6). If not, the only option is a nested-loop join, for which the corresponding function is xjoin().
Once we have this information and have mastered these aggregator functions, we can write efficient code.