
Citus Series 7 - topn acceleration (count(*) ... group by ... order by count(*) desc limit x)


Background

count(*) ... group by ... order by count(*) desc limit x is the usual query pattern for computing top-n.

Top-n is an important operational metric, for example the top 10 most active users.
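For illustration, a conventional top-10 query of this shape might look as follows (the table and column names are hypothetical, not from the test below):

-- hypothetical activity table: full scan + hash aggregate over all rows
select user_id, count(*)
from user_activity
group by user_id
order by count(*) desc
limit 10;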

If the data volume is very large, this kind of statistic can be time-consuming. Citus provides a topn plug-in, similar in spirit to HLL: the core idea is to use a small amount of space to hold the state of the aggregation and to return a fixed-size JSONB (its size is controlled by the parameter topn.number_of_counters), which can itself be fed into further aggregation. (Note that since PostgreSQL 11 supports more powerful parallel hash aggregation, aggregating large amounts of data is no longer as much of a problem.)

The topn plug-in aggregation process is shown in the figure.

The results of topn can be aggregated again using topn_union_agg.

postgres=# \df topn*
                               List of functions
 Schema |       Name       | Result data type  | Argument data types |  Type
--------+------------------+-------------------+---------------------+--------
 public | topn             | SETOF topn_record | jsonb, integer      | normal
 public | topn_add         | jsonb             | jsonb, text         | normal
 public | topn_add_agg     | jsonb             | text                | agg
 public | topn_add_trans   | internal          | internal, text      | normal
 public | topn_pack        | jsonb             | internal            | normal
 public | topn_union       | jsonb             | jsonb, jsonb        | normal
 public | topn_union_agg   | jsonb             | jsonb               | agg
 public | topn_union_trans | internal          | internal, jsonb     | normal
(8 rows)

-- starting from nothing, record that we saw an "a"
select topn_add('{}', 'a');
-- => {"a": 1}

-- record the sighting of another "a"
select topn_add(topn_add('{}', 'a'), 'a');
-- => {"a": 2}

-- for normal_rand
create extension tablefunc;

-- count values from a normal distribution
SELECT topn_add_agg(floor(abs(i))::text)
FROM normal_rand(1000, 5, 0.7) i;
-- => {"2": 1, "3": 74, "4": 420, "5": 425, "6": 77, "7": 3}
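As a small illustration of that re-aggregation, topn_union (the non-aggregate form listed above) merges two counters by summing their frequencies; a minimal sketch building on the topn_add examples above:

-- merge two independently built counters into one
select topn_union(topn_add('{}', 'a'),
                  topn_add(topn_add('{}', 'b'), 'a'));
-- => {"a": 2, "b": 1}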

Get the top-n values directly from a topn jsonb:

postgres=# select (topn(topn_union_agg(agg_prodid), 5)).* from reviews_by_prodid;
  item  | frequency
--------+-----------
 509594 |        66
 497599 |        59
 505217 |        58
 461257 |        58
 403111 |        57
(5 rows)

Use topn

1. Install the topn software on all nodes (coordinator and workers)

. /var/lib/pgsql/.bash_profile
git clone https://github.com/citusdata/postgresql-topn
cd postgresql-topn
USE_PGXS=1 make
USE_PGXS=1 make install

2. Install the plug-in (coordinator)

postgres=# create extension topn;
CREATE EXTENSION

3. Install the plug-in on the workers: call run_command_on_workers on the coordinator to execute the statement on all workers.

postgres=# select run_command_on_workers('create extension topn;');
           run_command_on_workers
---------------------------------------------
 (xxx.xxx.xxx.224,1921,t,"CREATE EXTENSION")
 (xxx.xxx.xxx.225,1921,t,"CREATE EXTENSION")
 (xxx.xxx.xxx.226,1921,t,"CREATE EXTENSION")
 (xxx.xxx.xxx.227,1921,t,"CREATE EXTENSION")
 (xxx.xxx.xxx.229,1921,t,"CREATE EXTENSION")
 (xxx.xxx.xxx.230,1921,t,"CREATE EXTENSION")
 (xxx.xxx.xxx.231,1921,t,"CREATE EXTENSION")
 (xxx.xxx.xxx.232,1921,t,"CREATE EXTENSION")
(8 rows)

Test

1. Test table

create table tbl (id serial8, gid int, prodid int, c1 int, c2 int);

postgres=# \d tbl
                            Table "public.tbl"
 Column |  Type   | Collation | Nullable |             Default
--------+---------+-----------+----------+---------------------------------
 id     | bigint  |           | not null | nextval('tbl_id_seq'::regclass)
 gid    | integer |           |          |
 prodid | integer |           |          |
 c1     | integer |           |          |
 c2     | integer |           |          |

postgres=# alter sequence tbl_id_seq cache 10000;
ALTER SEQUENCE

2. Write 200 million rows of test data

vi test.sql

\set gid random_gaussian(1, 1000, 2.5)
\set prodid random_gaussian(1, 1000000, 2.5)
\set c1 random(1, 3000)
\set c2 random(1, 100000000)
insert into tbl (gid,prodid,c1,c2) values (:gid,:prodid,:c1,:c2);

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 1200

postgres=# select count(*) from tbl;
   count
-----------
 216524755
(1 row)

Time: 421.860 ms

3. Several sets of real TOPN data

postgres=# select gid, count(*) from tbl group by gid order by count(*) desc limit 10;
 gid | count
-----+--------
 494 | 438102
 499 | 438017
 514 | 437929
 506 | 437852
 511 | 437546
     | 437469
 495 | 437458
 490 | 437320
 496 | 437257
 500 | 437239
(10 rows)

postgres=# select c1, count(*) from tbl group by c1 order by count(*) desc limit 10;
  c1  | count
------+-------
 1370 | 73175
      | 73121
 1016 | 73114
 1816 | 73045
 1463 | 73020
  585 | 72986
 1529 | 72974
 1857 | 72944
 2580 | 72930
  298 | 72917
(10 rows)

postgres=# select prodid, count(*) from tbl group by prodid order by count(*) desc limit 10;
 prodid | count
--------+-------
 516916 |   534
 481914 |   534
 520680 |   527
 530544 |   526
 449685 |   523
 493560 |   523
 520464 |   523
 502098 |   522
 495170 |   522
 501695 |   522
(10 rows)

4. Estimate top-n on the gid dimension (the number of distinct gid values is less than or equal to topn.number_of_counters)

The result is accurate.

CREATE TABLE reviews_by_gid (agg jsonb);
SELECT create_reference_table('reviews_by_gid');
INSERT INTO reviews_by_gid SELECT topn_add_agg(gid::text) FROM tbl;

postgres=# select (topn(agg, 5)).* from reviews_by_gid;
 item | frequency
------+-----------
 494  |    438102
 499  |    438017
 514  |    437929
 506  |    437852
 511  |    437546
(5 rows)

5. Estimate top-n on the prodid dimension (the number of distinct prodid values is far greater than topn.number_of_counters)

The result deviates greatly from the true counts.

CREATE TABLE reviews_by_prodid (agg_prodid jsonb);
SELECT create_reference_table('reviews_by_prodid');
INSERT INTO reviews_by_prodid SELECT topn_add_agg(prodid::text) FROM tbl;

postgres=# select (topn(agg_prodid, 5)).* from reviews_by_prodid;
  item  | frequency
--------+-----------
 470098 |        36
 531880 |        35
 451724 |        34
 420093 |        34
 522676 |        33
(5 rows)

6. Estimate top-n on the c1 dimension (the number of distinct c1 values is slightly greater than topn.number_of_counters)

The result is not accurate.

CREATE TABLE reviews_by_c1 (aggc1 jsonb);
SELECT create_reference_table('reviews_by_c1');
INSERT INTO reviews_by_c1 SELECT topn_add_agg(c1::text) FROM tbl;

postgres=# select (topn(aggc1, 5)).* from reviews_by_c1;
 item | frequency
------+-----------
 2580 |     37073
 1016 |     36162
 1983 |     35311
 1752 |     35285
 2354 |     34740
(5 rows)

Precision and truncation

The reason for the accuracy deviation above:

When the topn hash table is full and new values arrive, half of the (item, count) pairs in the hash table are evicted (the half with the smaller counts). A column such as prodid, with far more distinct values than counters, triggers this eviction constantly, which is why its estimated frequencies (33-36) are far below the true counts (522-534).

The TopN approximation algorithm keeps a predefined number of frequent items and counters. If a new item already exists among these frequent items, the algorithm increases the item's frequency counter. Otherwise, the algorithm inserts the new item into the counter list when there is enough space. If there isn't enough space, the algorithm evicts the bottom half of all counters. Since we typically keep counters for many more items (e.g. 100*N) than we are actually interested in, the actual top N items are unlikely to get evicted and will typically have accurate counts.

You can increase the algorithm's accuracy by increasing the predefined number of frequent items/counters.

Corresponding code

/*
 * PruneHashTable removes some items from the HashTable to decrease its size. It finds
 * minimum and maximum frequencies first and removes the items which have lower frequency
 * than the average of them.
 */
static void
PruneHashTable(HTAB *hashTable, int itemLimit, int numberOfRemainingElements)
{
	Size topnArraySize = 0;
	int topnIndex = 0;
	FrequentTopnItem *sortedTopnArray = NULL;
	bool itemAlreadyHashed = false;
	HASH_SEQ_STATUS status;
	FrequentTopnItem *currentTask = NULL;
	FrequentTopnItem *frequentTopnItem = NULL;
	int index = 0;
	int hashTableSize = hash_get_num_entries(hashTable);

	if (hashTableSize <= itemLimit)
	{
		return;
	}

	/* copy the (item, frequency) pairs into an array so they can be sorted */
	topnArraySize = hashTableSize * sizeof(FrequentTopnItem);
	sortedTopnArray = (FrequentTopnItem *) palloc0(topnArraySize);

	hash_seq_init(&status, hashTable);
	while ((currentTask = (FrequentTopnItem *) hash_seq_search(&status)) != NULL)
	{
		frequentTopnItem = palloc0(sizeof(FrequentTopnItem));
		memcpy(frequentTopnItem->key, currentTask->key,
			   sizeof(frequentTopnItem->key));
		frequentTopnItem->frequency = currentTask->frequency;
		sortedTopnArray[topnIndex] = *frequentTopnItem;
		topnIndex++;
	}

	/* sort by frequency and drop everything past the first numberOfRemainingElements */
	qsort(sortedTopnArray, hashTableSize, sizeof(FrequentTopnItem),
		  compareFrequentTopnItem);

	for (index = numberOfRemainingElements; index < hashTableSize; index++)
	{
		FrequentTopnItem *topnItem = &(sortedTopnArray[index]);

		hash_search(hashTable, (void *) topnItem->key, HASH_REMOVE,
					&itemAlreadyHashed);
	}
}

How to modify the hash table size

postgres=# load 'topn';
LOAD
postgres=# show topn.number_of_counters;
 topn.number_of_counters
-------------------------
 1000
(1 row)

set topn.number_of_counters = 20000;

This needs to be set on all nodes (coordinator and workers), for example:

postgresql.conf

shared_preload_libraries='citus,topn,pg_stat_statements'
topn.number_of_counters=10000

Summary and best practices

1. It is recommended to aggregate in stages and to ensure that the number of distinct values of the field aggregated in each stage is less than topn.number_of_counters; otherwise the result will be distorted.

For example, if there are 10,000 active users per hour, set topn.number_of_counters to 10,000 or more and aggregate by hour, saving one aggregated jsonb result per hour. When you need the result for the whole day, aggregate the hourly jsonb values of that day (see the sketch after this list).

2. When the number of distinct elements is greater than topn.number_of_counters, the topn result will be distorted.
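A minimal sketch of the staged aggregation recommended in point 1, assuming a hypothetical source table user_activity(user_id, ts) and an hourly pre-aggregation table (all names here are illustrative):

-- hourly pre-aggregation table
create table topn_active_users_hourly (
  stat_hour timestamp primary key,
  agg       jsonb
);

-- run once per hour: fold that hour's activity into one fixed-size topn jsonb
insert into topn_active_users_hourly
select date_trunc('hour', now() - interval '1 hour'),
       topn_add_agg(user_id::text)
from   user_activity
where  ts >= date_trunc('hour', now() - interval '1 hour')
and    ts <  date_trunc('hour', now());

-- top 10 active users for the whole day: merge the hourly jsonb values
select (topn(topn_union_agg(agg), 10)).*
from   topn_active_users_hourly
where  stat_hour >= date_trunc('day', now());

Each hourly jsonb has a bounded size (at most topn.number_of_counters entries), so merging a day or a week of them stays cheap.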

Reference

https://github.com/citusdata/postgresql-topn

https://docs.citusdata.com/en/v7.5/develop/reference_sql.html

"PostgreSQL count-min sketch top-n probability calculation plug-in cms_topn (combining windows to achieve year-on-year, ring ratio, sliding window analysis, etc.)-one of the core functions of stream computing"

Original address: https://github.com/digoal/blog/blob/master/201809/20180914_01.md
