
How to calculate count(distinct) in a distributed database


This article mainly introduces how count(distinct) is calculated in a distributed database, a question that comes up often in daily operation. The editor has gathered the relevant material and organized it into simple, easy-to-follow steps, which I hope will answer that question. Let's get started.

Background

In a distributed database, computing count(distinct xxx) on a field requires the following steps:

1. Deduplicate the field locally on each node.

2. Redistribute the deduplicated data (if the number of distinct values is very large, this step is time-consuming).

3. Deduplicate again after redistribution.

4. Run count(xxx) on each node.

5. Sum the counts from all nodes (a logical SQL equivalent of these steps is sketched below).
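As a rough illustration (not the engine's actual plan), these steps correspond to the logical rewrite of count(distinct x) into a count over a grouped subquery, which a distributed planner can split into a per-node phase and a global phase. A minimal sketch, using the customer table from the Greenplum example below:

-- logically equivalent to: select count(distinct c_acctbal) from customer;
-- the inner group by runs per node, rows are redistributed by the grouping key
-- and deduplicated again, then the per-node counts are summed globally
select count(*)
from (
    select c_acctbal
    from customer
    group by c_acctbal
) t;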

For example, here is an execution plan from Greenplum:

postgres=# explain analyze select count(distinct c_acctbal) from customer;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Aggregate  (cost=182242.41..182242.42 rows=1 width=8)
   Rows out:  1 rows with 0.006 ms to first row, 69 ms to end, start offset by 23 ms.
   ->  Gather Motion 16:1  (slice2; segments: 16)  (cost=53392.85..173982.82 rows=660767 width=8)
         Rows out:  818834 rows at destination with 3.416 ms to first row, 447 ms to end, start offset by 23 ms.
         ->  HashAggregate  (cost=53392.85..61652.43 rows=41298 width=8)
               Group By: customer.c_acctbal
               Rows out:  Avg 51177.1 rows x 16 workers.  Max 51362 rows (seg3) with 0.004 ms to first row, 33 ms to end, start offset by 25 ms.
               ->  Redistribute Motion 16:16  (slice1; segments: 16)  (cost=30266.00..43481.34 rows=41298 width=8)
                     Hash Key: customer.c_acctbal
                     Rows out:  Avg 89865.6 rows x 16 workers at destination.  Max 90305 rows (seg3) with 18 ms to first row, 120 ms to end, start offset by 25 ms.
                     ->  HashAggregate  (cost=30266.00..30266.00 rows=41298 width=8)
                           Group By: customer.c_acctbal
                           Rows out:  Avg 89865.6 rows x 16 workers.  Max 89929 rows (seg2) with 0.007 ms to first row, 33 ms to end, start offset by 26 ms.
                           ->  Append-only Columnar Scan on customer  (cost=0.00..22766.00 rows=93750 width=8)
                                 Rows out:  Avg 93750.0 rows x 16 workers.  Max 93751 rows (seg4) with 20 ms to first row, 30 ms to end, start offset by 26 ms.
 Slice statistics:
   (slice0)    Executor memory: 387K bytes.
   (slice1)    Executor memory: 6527K bytes avg x 16 workers, 6527K bytes max (seg0).
   (slice2)    Executor memory: 371K bytes avg x 16 workers, 371K bytes max (seg0).
 Statement statistics:
   Memory used: 1280000K bytes
 Optimizer status: legacy query optimizer
 Total runtime: 723.143 ms
(23 rows)

Here is an example from Citus:

postgres=# explain analyze select count(distinct bid) from pgbench_accounts;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Aggregate  (cost=0.00..0.00 rows=0 width=0) (actual time=31.748..31.749 rows=1 loops=1)
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0) (actual time=31.382..31.510 rows=1280 loops=1)
         Task Count: 128
         Tasks Shown: One of 128
         ->  Task
               Node: host=172.24.211.224 port=1921 dbname=postgres
               ->  HashAggregate  (cost=231.85..231.95 rows=10 width=4) (actual time=3.700..3.702 rows=10 loops=1)
                     Group Key: bid
                     ->  Seq Scan on pgbench_accounts_106812 pgbench_accounts  (cost=0.00..212.48 rows=7748 width=4) (actual time=0.017..2.180 rows=7748 loops=1)
             Planning time: 0.445 ms
             Execution time: 3.781 ms
 Planning time: 1.399 ms
 Execution time: 32.159 ms
(13 rows)

For scenarios where an estimate is acceptable, that is, where exact distinct counts are not required, PostgreSQL provides a plug-in called hll that can estimate the number of distinct elements.
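For reference, here is a minimal sketch of using the hll type directly, based on the aggregate functions shipped with the postgresql-hll extension (hll_hash_integer, hll_add_agg, hll_cardinality); the facts table and its columns are placeholders, not objects from this article:

-- one hll sketch per day, then read back the approximate distinct user count
create table daily_uniques (d date, users hll);

insert into daily_uniques
select d, hll_add_agg(hll_hash_integer(user_id))
from facts
group by d;

select d, hll_cardinality(users) as approx_unique_users
from daily_uniques
order by d;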

Combined with hll, Citus can run count(distinct xxx) at very high speed, and it stays fast even when the number of distinct values is very large.

set citus.count_distinct_error_rate to 0.005;   -- 0.005 is the allowed error rate

Example: accelerating Citus count(distinct xxx) with hll

Deployment

1. On all nodes (coordinator and workers), install the hll software:

yum install -y gcc-c++
cd ~/
git clone https://github.com/citusdata/postgresql-hll
cd postgresql-hll
. /var/lib/pgsql/.bash_profile
USE_PGXS=1 make
USE_PGXS=1 make install

2. On all nodes (coordinator and workers), create the extension in every database that needs HLL:

su - postgres -c "psql -d postgres -c 'create extension hll;'"
su - postgres -c "psql -d newdb -c 'create extension hll;'"

Usage example

1. Create a test table with 128 shards

create table test (id int primary key, a int, b int, c int);
set citus.shard_count = 128;
select create_distributed_table('test', 'id');

2. Write 1 billion rows of test data: column a has 10 unique values, column b has 100 unique values, and column c has 1 million unique values

insert into test select id, random()*9, random()*99, random()*999999 from generate_series(1, 1000000000) t(id);

3. (On the coordinator node) set citus.count_distinct_error_rate globally or at the session level to specify the allowed error rate. The smaller the value, the smaller the error (step 6 below compares performance at different settings).

newdb=# explain select count(distinct bid) from pgbench_accounts group by bid;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 HashAggregate  (cost=0.00..0.00 rows=0 width=0)
   Group Key: remote_scan.worker_column_2
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0)
         Task Count: 128
         Tasks Shown: One of 128
         ->  Task
               Node: host=172.24.211.224 port=8001 dbname=newdb
               ->  GroupAggregate  (cost=97272.79..105102.29 rows=1000 width=36)
                     Group Key: bid
                     ->  Sort  (cost=97272.79..99227.04 rows=781700 width=4)
                           Sort Key: bid
                           ->  Seq Scan on pgbench_accounts_102008 pgbench_accounts  (cost=0.00..20759.00 rows=781700 width=4)
(12 rows)

4. Compare with and without HLL acceleration (with only a few unique values there is no bottleneck to begin with, so HLL brings no performance improvement)

4.1 Without hll

newdb=# set citus.count_distinct_error_rate to 0;
newdb=# select count(distinct bid) from pgbench_accounts;
 count
-------
  1000
(1 row)
Time: 423.364 ms

postgres=# set citus.count_distinct_error_rate to 0;
postgres=# select count(distinct a) from test;
 count
-------
    10
(1 row)
Time: 2392.709 ms (00:02.393)

4.2 With hll

newdb=# set citus.count_distinct_error_rate to 0.005;
newdb=# select count(distinct bid) from pgbench_accounts;
 count
-------
  1000
(1 row)
Time: 444.287 ms

postgres=# set citus.count_distinct_error_rate to 0.005;
postgres=# select count(distinct a) from test;
 count
-------
    10
(1 row)
Time: 2375.473 ms (00:02.375)

5. Compare with and without HLL acceleration (with a large number of unique values, HLL brings a significant performance improvement)

5.1 Without hll

postgres=# set citus.count_distinct_error_rate to 0;
postgres=# select count(distinct (a,c)) from test;
  count
----------
 10000000
(1 row)
Time: 5826241.205 ms (01:37:06.241)

In this case each node sends up to 1 billion / 128 (about 7.8 million) deduplicated rows to the coordinator, so the long run time is understandable. On the other hand, the coordinator could deduplicate while it receives the data (PostgreSQL 11 adds parallel gather, parallel merge sort and similar capabilities that the Citus coordinator could borrow from), instead of waiting for all the data to arrive before deduplicating.
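To see why hll avoids most of that data movement, here is a hand-written analogue of the idea: each group of rows collapses into a small fixed-size hll sketch, and only the sketches have to be merged to obtain the approximate distinct count. This is a sketch only, not what Citus literally executes; it assumes the postgresql-hll functions hll_hash_integer, hll_add_agg, hll_union_agg and hll_cardinality are installed, and id % 128 merely stands in for the 128 shards.

-- per-"shard" hll sketches instead of millions of deduplicated rows
select hll_cardinality(hll_union_agg(shard_sketch)) as approx_distinct_c
from (
    select hll_add_agg(hll_hash_integer(c)) as shard_sketch
    from test
    group by id % 128      -- stand-in for the per-shard split
) per_shard;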

5.2 With hll

postgres=# set citus.count_distinct_error_rate to 0.005;
postgres=# select count(distinct (a,c)) from test;
  count
---------
 9999995
(1 row)
Time: 4468.749 ms (00:04.469)

6. Performance comparison across different precision settings

newdb=# set citus.count_distinct_error_rate to 0.1;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts;
  count
----------
 94778491
(1 row)
Time: 545.301 ms

newdb=# set citus.count_distinct_error_rate to 0.01;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts;
   count
-----------
 100293937
(1 row)
Time: 554.333 ms

The recommended setting is 0.005:

newdb=# set citus.count_distinct_error_rate to 0.005;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts;
   count
-----------
 100136086
(1 row)
Time: 1053.070 ms (00:01.053)

newdb=# set citus.count_distinct_error_rate to 0.001;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts;
   count
-----------
 100422107
(1 row)
Time: 9287.934 ms (00:09.288)

This concludes the study of "how to calculate count(distinct) in a distributed database". I hope it has answered your questions. Pairing theory with practice is the best way to learn, so go and try it yourself. To keep learning more on related topics, please continue to follow the site, where the editor will keep bringing you practical articles.
