
Align JOIN field types and use arrays instead of strings to reduce string processing overhead and column storage to reduce scanning overhead

2025-02-23 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Background

Greenplum is typically used for OLAP workloads. For some users, poor performance is caused by data structure design, problematic SQL, and similar issues. Although such problems can be papered over by adding nodes, optimizing them instead can save a great deal of hardware resources.

For example

1. Align JOIN field types. If the types on the two sides of an equi-JOIN do not match, HASH JOIN cannot be used.

2. Align WHERE-condition field types. As above, mismatched types prevent HASH JOIN and index scans.

3. Use arrays instead of strings to reduce string-processing overhead. If a string must be heavily parsed and reformatted for each FILTER, arrays perform much better.

4. Use column storage to reduce scanning overhead. Because statistical SQL touches a limited set of columns, column storage performs much better than row storage.
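As a minimal sketch of point 1 (the orders/customers schema here is hypothetical, not from the example below): if one side of an equi-join was declared as a string while the other is int8, casting the column once at DDL time lets the planner use HASH JOIN in every subsequent query.

```sql
-- Hypothetical schema: orders.customer_id was created as varchar,
-- while customers.id is int8, so "on o.customer_id = c.id" cannot hash-join.
-- Aligning the types once fixes every query that joins the two tables.
alter table orders
    alter column customer_id type int8
    using customer_id::int8;
```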

Examples

1. The original query takes 230 seconds:

SELECT col4, count(DISTINCT c.col1) ptnum from tbl1 a INNER JOIN tbl2 b on b.col2=a.id inner join tbl3 t2 on t2.ID'0' and c.col1 not in ('xxxxxx') GROUP BY col4;

2. Use EXPLAIN ANALYZE to locate the bottleneck.

3. Problems:

3.1. The JOIN field types are inconsistent, so HASH JOIN is not used.

3.2. When the two tables are joined, a Cartesian product is evaluated to apply the comparison, and after multiplying out, the expression must be computed several trillion times.

The tbl2.col3 string has the following format (tens of trillions of calculations are required):

{"2": "1", "10": "1", "13": "1", "16": "1", "21": "1", "26": "1", "28": "1", "30": "1", "32": "1", "33": "1", "34": "1", "35": "1", "36": "1", "37": "1", "39": "1", "40": "1", "99": "2", "100": "2", "113": "1", "61": "1", "63": "4", "65": "2"}

3.3. Row storage is used, so queries scan a large volume of data and vectorized execution cannot be used.
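The difference behind problem 3.2 can be sketched on a literal value (simplified data, not the user's real rows): when col3 is plain text, every comparison in the Cartesian product has to parse the whole string, whereas a text[] column lets the executor subscript straight into one element and split only that.

```sql
-- One element is fetched by subscript, then a single split_part applies,
-- instead of parsing the whole JSON-like string for every row pair.
select ('{2: 1,10: 1,13: 1}'::text[])[3];                      -- element '13: 1'
select split_part(('{2: 1,10: 1,13: 1}'::text[])[3], ':', 1);  -- key '13'
```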

Optimization

1. Use column storage instead of row storage (except for tbl3, the inner table of the nested loop, which continues to use an index FILTER).

create table tmp_tbl1 (like tbl1) WITH (APPENDONLY=true, ORIENTATION=column);
insert into tmp_tbl1 select * from tbl1;
create table tmp_tbl4 (like tbl4) WITH (APPENDONLY=true, ORIENTATION=column);
insert into tmp_tbl4 select * from tbl4;
create table tmp_tbl5 (like tbl5) WITH (APPENDONLY=true, ORIENTATION=column);
insert into tmp_tbl5 select * from tbl5;
create table tmp_tbl2 (like tbl2) WITH (APPENDONLY=true, ORIENTATION=column) distributed by (col2);
insert into tmp_tbl2 select * from tbl2;
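Assuming Greenplum's catalog layout (pg_class in Greenplum carries a relstorage column, where 'c' marks append-only column-oriented tables), the copies can be spot-checked after loading:

```sql
-- relstorage: 'h' = heap, 'a' = append-only row, 'c' = append-only column
select relname, relstorage
from pg_class
where relname in ('tmp_tbl1', 'tmp_tbl2', 'tmp_tbl4', 'tmp_tbl5');
```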

2. Use arrays instead of text.

alter table tmp_tbl2 alter column col3 type text[] using (case col3 when '[]' then '{}' else replace(col3, '"', '') end)::text[];
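The USING expression can be tried on a sample literal first (a sketch using the string format shown earlier; the inner SELECT stands in for a real row):

```sql
-- dry-run the conversion on one value before altering the table
select (case col3 when '[]' then '{}' else replace(col3, '"', '') end)::text[]
from (select '{"2": "1", "10": "1"}'::text as col3) s;
```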

The modified type and content are as follows:

digoal=> select col3 from tmp_tbl2 limit 2;
[sample output unrecoverable from the source; col3 now holds text[] values]
(2 rows)

3. Keep the JOIN field types consistent.

alter table tmp_tbl2 alter column col2 type int8;
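After the change, both sides of the join condition can be confirmed to have the same type with a standard catalog query (valid in both PostgreSQL and Greenplum):

```sql
-- both rows should now report int8 (bigint)
select attrelid::regclass as tab, attname, format_type(atttypid, atttypmod) as typ
from pg_attribute
where (attrelid = 'tmp_tbl1'::regclass and attname = 'id')
   or (attrelid = 'tmp_tbl2'::regclass and attname = 'col2');
```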

4. Modify the original query SQL as follows (string processing becomes array indexing).

(In this example, a two-dimensional array could also be used to avoid string processing entirely.)

SELECT col4, count(DISTINCT c.col1) ptnum from tmp_tbl1 a INNER JOIN tmp_tbl2 b on b.col2=a.id inner join tbl3 t2 on t2.ID'0' and c.col1 not in ('xxxxxx') GROUP BY col4;

Execution plan

QUERY PLAN
Gather Motion 32:1 (slice7; segments: 32) (cost=543258065.87..543259314.50 rows=41621 width=12)
  -> GroupAggregate (cost=543258065.87..543259314.50 rows=1301 width=12)
       Group By: a.col4
       -> Sort (cost=543258065.87..543258169.93 rows=1301 width=12)
            Sort Key: a.col4
            -> Redistribute Motion 32:32 (slice6; segments: 32) (cost=542355803.38..543254872.50 rows=1301 width=12)
                 Hash Key: a.col4
                 -> GroupAggregate (cost=542355803.38..543254040.08 rows=1301 width=12)
                      Group By: a.col4
                      -> Sort (cost=542355803.38..542655042.19 rows=3740486 width=11)
                           Sort Key: a.col4
                           -> Redistribute Motion 32:32 (slice5; segments: 32) (cost=6247.23..518770960.13 rows=3740486 width=11)
                                Hash Key: c.col1
                                -> Hash Join (cost=6247.23..516377049.63 rows=3740486 width=11)
                                     Hash Cond: split_part(b.col3[t2.id::integer], ':'::text, 1) = c.id::text
                                     -> Nested Loop (cost=5494.14..476568597.41 rows=3852199 width=491)
                                          Join Filter: split_part(b.col3[t2.id::integer], ':'::text, 2) > '0'::text
                                          -> Broadcast Motion 32:32 (slice3; segments: 32) (cost=5494.14..115247.73 rows=277289 width=483)
                                               -> Hash Join (cost=5494.14..23742.36 rows=8666 width=483)
                                                    Hash Cond: b.col2 = a.id
                                                    -> Seq Scan on tmp_tbl2 b (cost=0.00..14088.89 rows=8666 width=487)
                                                    -> Hash (cost=4973.86..4973.86 rows=1301 width=12)
                                                         -> Redistribute Motion 32:32 (slice2; segments: 32) (cost=2280.93..4973.86 rows=1301 width=12)
                                                              Hash Key: a.id
                                                              -> Hash Join (cost=2280.93..4141.42 rows=1301 width=12)
                                                                   Hash Cond: s.id = a.col4
                                                                   -> Append-only Columnar Scan on tmp_tbl5 s (cost=0.00..1220.97 rows=1491 width=4)
                                                                   -> Hash (cost=1760.66..1760.66 rows=1301 width=12)
                                                                        -> Redistribute Motion 32:32 (slice1; segments: 32) (cost=0.00..1760.66 rows=1301 width=12)
                                                                             Hash Key: a.col4
                                                                             -> Append-only Columnar Scan on tmp_tbl1 a (cost=0.00..928.22 rows=1301 width=12)
                                          -> Index Scan using idx_codeid on tbl3 t2 (cost=0.00..23.69 rows=42 width=8)
                                               Index Cond: t2.id
                                     -> Hash (cost=364.69..364.69 rows=972 width=11)
                                          -> Broadcast Motion 32:32 (slice4; segments: 32) (cost=0.00..364.69 rows=972 width=11)
                                               -> Append-only Columnar Scan on tmp_tbl4 c (cost=0.00..44.26 rows=31 width=11)
                                                    Filter: col1 <> 'xxxxxx'::text
Settings: effective_cache_size=8GB; enable_nestloop=off; gp_statistics_use_fkeys=on
Optimizer status: legacy query optimizer
(39 rows)

Performance improvement

Original SQL response time: 230s

Modified SQL response time: < 16 seconds

Summary: bottleneck analysis

1. When the JOIN condition is not an equality, rows must be compared one by one as a Cartesian product, so if the FILTER condition is CPU-expensive, performance is bound to suffer.

2. The original query performed a huge number of reverse, split, and replace string operations, which are very time-consuming; landing on top of the Cartesian product, they were computed trillions of times.

3. The JOIN field types were inconsistent, so HASH JOIN was not used.

4. The analytic SQL did not use column storage.

Optimization techniques

1. Replace strings with arrays.

2. Rewrite the SQL.

3. Align the JOIN type.

4. Use column storage.

5. Keep the NESTLOOP JOIN; its inner table keeps row storage and uses an index scan. (If it is a small table, a materialized scan can be used, which is faster.)

6. ANALYZE the tables.
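For step 6, refreshing optimizer statistics after the bulk loads keeps the row estimates in the plan honest (table names taken from the example above):

```sql
analyze tmp_tbl1;
analyze tmp_tbl2;
analyze tmp_tbl4;
analyze tmp_tbl5;
```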

Original address: https://github.com/digoal/blog/blob/master/201809/20180904_05.md



