2025-01-18 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Seemingly simple set operations become challenging at big-data scale: to achieve high performance, you must fully understand the characteristics of the data and of the computation in order to design an efficient algorithm. Making full use of ordered data is one good way to do so.

Intersection, union and difference are common set operations, and the corresponding intersect/union/minus calculations in SQL are simple to write. However, when the data volume is large, such set operations often perform poorly, and performance becomes very bad once the data involved in the computation exceeds memory capacity.

This article discusses how to use the SPL language to significantly improve the performance of intersection, union and difference operations. The discussion is based on an actual user evaluation case from a database selection process: the data comes from two database tables totaling 10.5 billion rows, and the time taken by each product to output the first 500 records of the result is used to measure which performs better.
Test environment

Cluster hardware:
- Model: X86
- CPU: E5-2680 v4 @ 2.40GHz
- Memory: 512GB
- Data disk: SAS 3TB
- Number of nodes: 4
- Network: Gigabit Ethernet

MPP database resource configuration (single node):
- Disk: SSD 1.9TB
- Memory: 20GB (JVM) + 12GB (shard library)

Aggregator resource configuration (single node):
- Disk: SAS 1TB
- Memory: 120GB

Operating system: CentOS 6.8 (64-bit)
Data description

Table A:
- Rows: 10.368 billion
- Structure: 52 fields; a1 is a timestamp, and the other fields are character type with length 10
- Content: a1 is evenly distributed over a 5-day span at 24,000 rows per second; the other fields are randomly generated

Table B:
- Rows: 17,280,480
- Structure: 2 character fields with length 10
- Content: the two fields are generated from the a3 and a7 fields of table A
Indexes

Table A: btree indexes on the a1, a3 and a7 fields, and a btree index with varchar_pattern_ops on a4.
Table B: btree indexes on the b1 and b2 fields.
Sample data

a1-a52 column values:
2018-01-07 00:00:00,8888888888,MQJqxnXMLM,ccTTCC7755,aaa******8,ppppaaaavv,gggggttttt,MQJqxnXMLM,ccTTCC7755,aaa******8,ppppaaaavv,gggggttttt,MQJqxnXMLM,ccTTCC7755,aaa******8,ppppaaaavv,gggggttttt,MQJqxnXMLM,ccTTCC7755,aaa******8,ppppaaaavv,gggggttttt,MQJqxnXMLM,ccTTCC7755,aaa******8,ppppaaaavv,gggggttttt,MQJqxnXMLM,ccTTCC7755,aaa******8,ppppaaaavv,gggggttttt,MQJqxnXMLM,ccTTCC7755,aaa******8,ppppaaaavv,gggggttttt,MQJqxnXMLM,ccTTCC7755,aaa******8,ppppaaaavv,gggggttttt,MQJqxnXMLM,ccTTCC7755,aaa******8,ppppaaaavv,gggggttttt,MQJqxnXMLM,ccTTCC7755,aaa******8,ppppaaaavv,gggggttttt
2018-01-07 00:00:00,4444444444,dv@bi-lyMF,qqoovv22ww,)))777FFF4,jjjjIIIIVV,aaaaaRRRRR,dv@bi-lyMF,qqoovv22ww,)))777FFF4,jjjjIIIIVV,aaaaaRRRRR,dv@bi-lyMF,qqoovv22ww,)))777FFF4,jjjjIIIIVV,aaaaaRRRRR,dv@bi-lyMF,qqoovv22ww,)))777FFF4,jjjjIIIIVV,aaaaaRRRRR,dv@bi-lyMF,qqoovv22ww,)))777FFF4,jjjjIIIIVV,aaaaaRRRRR,dv@bi-lyMF,qqoovv22ww,)))777FFF4,jjjjIIIIVV,aaaaaRRRRR,dv@bi-lyMF,qqoovv22ww,)))777FFF4,jjjjIIIIVV,aaaaaRRRRR,dv@bi-lyMF,qqoovv22ww,)))777FFF4,jjjjIIIIVV,aaaaaRRRRR,dv@bi-lyMF,qqoovv22ww,)))777FFF4,jjjjIIIIVV,aaaaaRRRRR,dv@bi-lyMF,qqoovv22ww,)))777FFF4,jjjjIIIIVV,aaaaaRRRRR
2018-01-07 00:00:00,9999999999,bxk3J/2YDd,ppvv**--88,uuuNNNBBBA,BBBBhhhhjj,_____PPPPP,bxk3J/2YDd,ppvv**--88,uuuNNNBBBA,BBBBhhhhjj,_____PPPPP,bxk3J/2YDd,ppvv**--88,uuuNNNBBBA,BBBBhhhhjj,_____PPPPP,bxk3J/2YDd,ppvv**--88,uuuNNNBBBA,BBBBhhhhjj,_____PPPPP,bxk3J/2YDd,ppvv**--88,uuuNNNBBBA,BBBBhhhhjj,_____PPPPP,bxk3J/2YDd,ppvv**--88,uuuNNNBBBA,BBBBhhhhjj,_____PPPPP,bxk3J/2YDd,ppvv**--88,uuuNNNBBBA,BBBBhhhhjj,_____PPPPP,bxk3J/2YDd,ppvv**--88,uuuNNNBBBA,BBBBhhhhjj,_____PPPPP,bxk3J/2YDd,ppvv**--88,uuuNNNBBBA,BBBBhhhhjj,_____PPPPP,bxk3J/2YDd,ppvv**--88,uuuNNNBBBA,BBBBhhhhjj,_____PPPPP
Test cases

Intersection (intersect):

select * from A, B where a1>'2018-01-07 02:14:43' and a1<'2018-01-07 04:14:43' and a3=b1 or a7=b2
intersect
select * from A, B where a1>'2018-01-07 12:14:43' and a1<'2018-01-07 14:14:43' and a3=b1 or a7=b2

Union (union):

select * from A, B where a1>'2018-01-07 02:14:43' and a1<'2018-01-07 04:14:43' and a3=b1 or a7=b2
union
select * from A, B where a1>'2018-01-07 12:14:43' and a1<'2018-01-07 14:14:43' and a3=b1 or a7=b2

Difference (minus):

select * from A, B where a1>'2018-01-07 02:14:43' and a1<'2018-01-07 04:14:43' and a3=b1 or a7=b2
minus
select * from A, B where a1>'2018-01-07 12:14:43' and a1<'2018-01-07 14:14:43' and a3=b1 or a7=b2

Use case analysis

Analyzing the SQL above shows that this scenario is a many-to-many set operation over a large data volume. The first half of the query condition (a1>'2018-01-07 02:14:43' and a1<'2018-01-07 04:14:43' and a3=b1) joins 2 hours of table A data with table B many-to-many, while the second half (or a7=b2) joins the full table A with table B many-to-many. The test case therefore mainly examines set-operation performance after a many-to-many join between the big table A and the small table B.

In the actual test, the MPP database could not return a result for this SQL (running time exceeded 1 hour). The data volume is so large that it cannot fit in memory, so the database exchanges data with disk frequently during the computation, and overall performance is extremely low.

Following our usual approach, high-performance computation requires an algorithm designed for the data and computation characteristics, rather than simply applying a generic algorithm. Here, to avoid excessive disk interaction (the primary concern in big-data computation), it is best to complete the calculation with as few traversals of table A as possible. Observing the data, table A has a time field (a1), with indexes on the time field (a1) and on the join fields (a3, a7); likewise, table B has indexes on its two fields (b1, b2). We can therefore design the following algorithm:

1) Based on how table A's data is generated, read table A second by second (24,000 rows per second);
2) Loop over each second's data, joining each row with table B according to the filter condition, and return the joined results;
3) Perform the set operation on the two parts of the data, i.e. the two operand sets of the intersection/union/difference.

These three steps complete the whole calculation, and the entire process traverses table A only twice (once for each operand set). Because the data volume is too large, esProc merges the results through deferred cursors, and cursor merging requires the data to be sorted in advance; so between steps 1) and 2), each second's 24,000 rows must also be sorted by the join fields and the remaining fields, which adds some overhead. The concrete esProc SPL scripts follow.

SPL implementation

There are a main script and a subroutine. The main script calls the subroutine to obtain the two operand sets for the intersection/union/difference and performs the set operation; the subroutine computes one set according to its parameters. In other words, the same subroutine script serves all three operations in this use case.

Subroutine script (case1_cursor.dfx):

  A1: =otherCols=["a2","a4","a5","a6"]|(to(8,52).("a"/~))
  A2: =connect("apptest")
  A3: =sql="select a1,a3,a7,"+otherCols.string("||")+" aall from A where a1=?"
  A4: =B=A2.query("select b1,b2,count(1) b3 from B group by b1,b2 order by b1,b2")
  A5: for 5*24*60*60
  B5:   =A2.query(sql,elapse@s(datetime("2018-01-07 00:00:00"),A5-1))
  B6:   =B5.sort(a3,a7,aall)
  B7:   for B6
  C7:     =B.select(B7.a1>begin && B7.a1<end && B7.a3==B.b1 || B7.a7==B.b2)
  C8:     return C7.news(b3;B7.a1:a1,B7.a3:a3,B7.a7:a7,B7.aall:aall,C7.b1:b1,C7.b2:b2)
  A9: =A2.close()

A1: otherCols records the names of all of table A's 52 fields except a1, a3 and a7, which participate in the computation; it is used to generate the SQL query.
A2: Connect to the database.
A3: The SQL statement string used to query all column data of table A by condition.
A4: Query table B, grouping and counting by b1, b2 (to reduce the number of comparisons in later computation), sorted by b1, b2 (for the later ordered merge).
A5: Loop over the number of seconds in the 5-day period.
B5: In each iteration, add the corresponding number of seconds to the start time (2018-01-07 00:00:00) and query the data generated in that second (24,000 rows).
B6: Sort by the join fields and the remaining fields.
B7: Loop over each table A row within the second.
C7: For a single table A row, find the matching records in table B.
C8: Return the result set containing all field values of tables A and B. The A.news() function is used here to generate a new table sequence from the merged field values; see http://doc.raqsoft.com.cn/esproc/func/news.html for details.

Main script

Intersection (intersect):

  A1: =cursor("case1_cursor.dfx",datetime("2018-01-07 02:14:43"),datetime("2018-01-07 04:14:43"))
  A2: =cursor("case1_cursor.dfx",datetime("2018-01-07 12:14:43"),datetime("2018-01-07 14:14:43"))
  A3: =[A1,A2].mergex@i(a1,a3,a7,aall,b1,b2)
  A4: =A3.fetch@x(500)
  A5: return A4

A1, A2: Call the subroutine script through the cursor() function, passing in the two time-period parameters. For the principle of cursor(), see "Development and Presentation of Big Reports with Million-Level Grouping".
A3: Merge the cursor sequence returned by the subroutine; the @i option performs the intersection.
A4: Fetch 500 records from the cursor and close it (the @x option).

Union (union):

  A1: =cursor("case1_cursor.dfx",datetime("2018-01-07 02:14:43"),datetime("2018-01-07 04:14:43"))
  A2: =cursor("case1_cursor.dfx",datetime("2018-01-07 12:14:43"),datetime("2018-01-07 14:14:43"))
  A3: =[A1,A2].mergex@u(a1,a3,a7,aall,b1,b2)
  A4: =A3.fetch@x(500)
  A5: return A4

A3: The @u option performs the union; the rest of the SPL script is identical.

Difference (minus):

  A1: =cursor("case1_cursor.dfx",datetime("2018-01-07 02:14:43"),datetime("2018-01-07 04:14:43"))
  A2: =cursor("case1_cursor.dfx",datetime("2018-01-07 12:14:43"),datetime("2018-01-07 14:14:43"))
  A3: =[A1,A2].mergex@d(a1,a3,a7,aall,b1,b2)
  A4: =A3.fetch@x(500)
  A5: return A4

A3: The @d option performs the difference; the rest of the SPL script is identical.

Performance

The table below compares the time for esProc SPL and the database SQL to output the first 500 records of each result set:

Computation type            esProc SPL     Database SQL
Intersection (intersect)    3.8 seconds    >1 hour
Union (union)               3.9 seconds    >1 hour
Difference (minus)          >1 hour        >1 hour
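One detail worth noting in the subroutine is the pre-aggregation of table B in A4: duplicate (b1, b2) pairs are collapsed into a single row with a count b3, each table A row is compared against the collapsed rows only once, and the matches are then re-expanded by the count, which is what news(b3;...) does in C8. A minimal Python sketch of this collapse-match-expand idea follows; the function and field names are hypothetical, and simple in-memory lists stand in for the database tables:

```python
from collections import Counter

def collapse(pairs):
    """Collapse duplicate (b1, b2) pairs into sorted (b1, b2, count) rows,
    so each distinct pair is compared only once during the join."""
    return [(b1, b2, n) for (b1, b2), n in sorted(Counter(pairs).items())]

def match_and_expand(a_row, b_rows):
    """For one A row, find matching collapsed B rows and re-expand each
    match by its count -- the role played by news(b3; ...) in the SPL script."""
    a3, a7 = a_row["a3"], a_row["a7"]
    out = []
    for b1, b2, n in b_rows:
        if a3 == b1 or a7 == b2:              # the join condition a3=b1 or a7=b2
            out.extend([{**a_row, "b1": b1, "b2": b2}] * n)
    return out

b = collapse([("x", "p"), ("x", "p"), ("y", "q")])
rows = match_and_expand({"a3": "x", "a7": "z"}, b)
# ("x", "p") was compared once but contributes two result rows, since its count is 2
```

The comparison count now depends on the number of distinct (b1, b2) pairs rather than the number of B rows, which is what makes the per-row lookups cheap.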
Obviously, the performance of intersection and union computations is greatly improved.
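The speed comes from the ordered merge: once both cursors are sorted on the same keys, a mergex@i/@u/@d-style operation is a single linear pass over the two streams and delivers rows lazily, so fetching the first 500 rows touches only a prefix of each stream. The following Python generators are a simplified model of such a merge over two sorted streams (an illustrative sketch, not the actual mergex implementation):

```python
def merge_intersect(a, b):
    """Lazily yield items present in both sorted streams (matched pairs)."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x < y:
            x = next(a, None)
        elif y < x:
            y = next(b, None)
        else:                                  # matched pair: emit once, advance both
            yield x
            x, y = next(a, None), next(b, None)

def merge_union(a, b):
    """Lazily yield the union: matched pairs appear once, everything else as-is."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None or y is not None:
        if y is None or (x is not None and x < y):
            yield x; x = next(a, None)
        elif x is None or y < x:
            yield y; y = next(b, None)
        else:
            yield x
            x, y = next(a, None), next(b, None)

def merge_minus(a, b):
    """Lazily yield items of the first sorted stream with no match in the second."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None:
        if y is None or x < y:
            yield x; x = next(a, None)
        elif y < x:
            y = next(b, None)
        else:                                  # matched pair cancels out
            x, y = next(a, None), next(b, None)

list(merge_intersect([1, 2, 2, 3, 5], [2, 3, 4]))  # [2, 3]
list(merge_union([1, 2, 2, 3, 5], [2, 3, 4]))      # [1, 2, 2, 3, 4, 5]
list(merge_minus([1, 2, 2, 3, 5], [2, 3, 4]))      # [1, 2, 5]
```

Each stream is read exactly once and nothing is buffered, which is why sorting each second's 24,000 rows up front (B6 in the subroutine) is a price worth paying.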
Why is the difference computation still slow?

The difference operation remains slow because of the characteristics of the data. The many-to-many join produces many duplicate records, so table A must still be traversed in full to find the difference rows that satisfy the conditions, whereas the other two operations can return their first 500 result rows early. The time is therefore mainly spent on I/O for fetching data.
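This behavior can be seen with a toy model: when two sorted streams are nearly identical, a merge-based difference produces almost nothing, so even the first output row forces the merge to consume nearly the entire first stream. A small self-contained Python sketch (illustrative only, not the esProc implementation; the names are invented for this example):

```python
class Counted:
    """Iterator wrapper that counts how many items were pulled from it."""
    def __init__(self, it):
        self.it, self.n = iter(it), 0
    def __iter__(self):
        return self
    def __next__(self):
        v = next(self.it)
        self.n += 1
        return v

def sorted_minus(a, b):
    """Lazily yield items of sorted stream a with no matching item in sorted b."""
    b = iter(b)
    y = next(b, None)
    for x in a:
        while y is not None and y < x:
            y = next(b, None)
        if y is None or x < y:
            yield x                    # unmatched: part of the difference
        else:
            y = next(b, None)          # matched pair cancels out

n = 100_000
a = Counted(range(n))                  # stream A: 0 .. n-1
b = range(n - 1)                       # stream B: identical except the last value
first = next(sorted_minus(a, b))       # first difference row
# first == n - 1, and a.n == n: all of A had to be read to produce one output row
```

With intersection or union over the same streams, the first row would be produced after reading just one element from each stream; the asymmetry is inherent to the difference operation, not to the implementation.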
Summary

High-performance algorithms require targeted design based on the data and computation characteristics. This demands that programmers first be able to come up with a high-performance algorithm, and then implement it without excessive complexity; otherwise the approach is not feasible in practice.

For the SQL system, its closed nature means that some efficient algorithms are difficult or even impossible to implement. SPL greatly improves on this: users can design high-performance algorithms and implement them quickly on the SPL system.