What are the advantages of the PostgreSQL search plug-in 07/02 Update SLTechnology News&Howtos

What are the advantages of the PostgreSQL search plug-in

2025-07-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article focuses on "what are the advantages of the PostgreSQL search plug-in", interested friends may wish to take a look. The method introduced in this paper is simple, fast and practical. Let's let the editor take you to learn "what are the advantages of the PostgreSQL search plug-in?"

Git clone https://github.com/postgrespro/rum cd rum. / var/lib/pgsql/.bash_profile USE_PGXS=1 make USE_PGXS=1 make install create extension rum

1. UDF API for generating random floating-point arrays

Create or replace function gen_rand_float4 (int,int) returns float4 [] as $$select array (select (random () * $1):: float4 from generate_series (1meme 2)); $$language sql strict

2. Build a table and index

Create unlogged table t_rum (id int primary key, arr float4 []); create index idx_t_rum_1 on t_rum using rum (arr)

4. Write random floating point array

Vi test.sql\ set id random (1m 2000000000) insert into t_rum values (: id, gen_rand_float4 (10je 16)) on conflict (id) do nothing;pgbench-M prepared-n-r-P 1-f. / test.sql-c 64-j 64-t 10000000postgres=# select * from t_rum limit 2 Id | arr-+-- -182025544 | {5.07998, 6.80827, 5.42024, 2.53619, 4.10843, 0. 5321980.33886pr 9.60262pr 6.68369pr 8.01305PME 9.60298pr 8.087pr 1.25819pr 6.54424pr 5.3444} 51515704 | {0.123099pr 9.266260.00549683pr 9.01483pr 0.91166pr 3.44338pr 4.55135pr 4.65002pr 0.820029je 1.9343pr 3.00254PM1.28121998831.85269pr 6.39579} (2 rows) postgres=# select count (*) from t_rum Count-3244994 (1 row)

5. Use the array similarity search provided by rum (element overlap rate calculation)

Postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_rum order by arr'{5.07998 select * from t_rum order by arr'{5.0799851 * from t_rum order by arr'{5.0799856.808271.4202452.53619 4.1084331 limit 6.33886 6.60262pm 9.602988.0871.25819Power6.54424,6.04902 5.3444} 'MIT 1 QUERY PLAN-- - -Limit (cost=22435.67..22435.68 rows=1 width=97) (actual time=12527.447..12527.450 rows=1 loops=1) Output: id Arr, (arr'{5.07998meme 6.80827 rep. 5.42024) Buffers: shared hit=50450-> Sort (cost=22435.67..29469.15 rows=3244994 width=97) (actual time=12527.445..12527.446 rows=1 loops=1) Output: id, arr ((arr'{5.07998Power6.80827) Sort Key: (t_rum.arr'{5.0799886 6.04902) Sort Key: (t_rum.arr'{5.0799886 6.808275242.808275242.808272652.10843author0.5198338866.6836966.013059.6029888.01305, 9.60298, 8.0871951, 1.25819re6.54424, 6.04902, 5.3444}': real2): top-: (Realms): N heapsort Memory: 25kB Buffers: shared hit=50450-> Seq Scan on public.t_rum (cost=0.00..8368.72 rows=3244994 width=97) (actual time=0.054..11788.483 rows=3244994 loops=1) Output: id Arr, (arr'{5.07998meme 6.80827): real []) Buffers: shared hit=50447 Planning Time: 0.115 ms Execution Time: 12527.498 ms (13 rows)

You will find that the index is gone, but not fast. A large number (50447) of index PAGE were scanned.

The reason is that we do not care about the threshold, resulting in scanning a large number of index BLOCK. The default threshold is 0.5, which is too low.

Postgres=# show rum.array_similarity_threshold postgres-#; rum.array_similarity_threshold-- 0.5 (1 row)

Set to 0.9 to output only arrays that are more than 90% similar (overlap). Performance skyrocketed and fewer blocks of data were scanned.

Postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_rum where arr%'{5.07998 select * from t_rum where arr%'{5.0799836 * from t_rum where arr%'{5.0799831 @ 0.5321886 4.10843 @ 0.5321981.33886pr 9.6026986 1.0369pr 6.01305MAL 9.6029881 5.0841991 6.04902pr 5.3444} 'order by arr' {5.07998MIT 6.80827PM5.42024WAY 4.33886MAX 9.6026261 6.68369pc8.087MAE 9.6087MAL 1.258196.5444902and5.3444} 'limit 1 QUERY PLAN-- - -Limit (cost=1.54..1.56 rows=1 width=97) (actual time=0.664..0.664 rows=0 loops=1) Output: id Arr, (arr'{5.07998meme 6.80827 rep. 5.42024) Buffers: shared hit=128 read=40-> Index Scan using idx_t_rum_1 on public.t_rum (cost=1.54..87.65 rows=3245 width=97) (actual time=0.662..0.662 rows=0 loops=1) Output: id, arr (arr'{5.07998Magol 6.80827 5.42024 2.53619MAY 4.10843 re 0.5321984.33886Power9.60269With 8.01305Met 9.60298Min 8.087MIT 1.25819pr 6.04902Med 5.3444}':: real [] Index Cond: (t_rum.arr%'{5.07998MAL 6.80827MIT 5.42024jue 2.5361964.1084MAX 0.53219843886je 9.6026.6868369je 8.01305Mo 9.6029858): real []: Order By: (t_rum) .arr'{5.07998meme 6.80827 Buffers 5.42024 2.53619 repartee 4.10843 Maxim 0.5321980.33886meme 9.60262pr 6.68369Met 8.01305pr 9.60298Med 8.087MIT 1.25819pr 6.54424pr 6.04902m5.3444}':: real []) Buffers: shared hit=128 read=40 Planning Time: 0.184 ms Execution Time: 0.691 ms (10 rows) element overlap search optimization

1. Adjust the threshold and solve the problem step by step.

"PostgreSQL similarity search Design and performance-millisecond similarity search practice for address, QA, POI and other texts"

In fact, the image eigenvalue is approximately searched, and there is also room for optimization, so let's get to the point.

Deploy imgsmlr (on PG 11)

1. Assume that PG 11 is installed by yum

2. Clone source code

Yum install-y git git clone https://github.com/postgrespro/imgsmlr cd imgsmlr

3. Modify the header file

Vi imgsmlr.h / / add # ifndef FALSE # define FALSE (0) # endif # ifndef TRUE # define TRUE (! FALSE) # endif

4. Install dependent image conversion package

Yum install-y gd-devel

5. Compile and install the IMGSMLR plug-in

. / var/lib/pgsql/.bash_profile USE_PGXS=1 make USE_PGXS=1 make install single node single table image search (400m images)

1. Create a UDF that generates signature of random image eigenvalues.

Create or replace function gen_rand_img_sig (int) returns signature as $$select ('('| | rtrim (array (select (random () * $1):: float4 from generate_series (1je 16)):: text,' {'),'}')):: signature; $language sql strict;postgres=# select * from gen_rand_img_sig (10) Gen_rand_img_sig-- -(6.744310 5.105020, 0.087113, 3.808010, 8.129480, 2.834540, 2.495250, 0.940481, 0.033208, 6.583490, 2.840330, 1.422440,6.683830,0.080847,8.327730,2.471430) (1 row) postgres=# select * from gen_rand_img_sig (10) Gen_rand_img_sig-- -(3.013650 6.170690, 0.601905, 2.692030, 1.268540, 7.803740, 9.757770, 5.537750, 0.391753, 4.440790, 1.201580, 5.501380,6.166980,0.240686,9.768680,2.911290) (1 row)

2. build a table and an index of image eigenvalues

Create table t_img_sig (id int primary key, sig signature); create index idx_t_img_sig_1 on t_img_sig using gist (sig)

3. Write about 400 million random image eigenvalues

Vi testsig.sql\ set id random (12000000000) insert into t_img_sig values (: id, gen_rand_img_sig (10)) on conflict (id) do nothing;pgbench-M prepared-n-r-P 1-f. / testsig.sql-c 32-j 32-t 20000000postgres=# select * from t_img limit 10 Id | sig-+- - -- 47902935 | (5.861920 1.062770, 8.318020, 2.205840, 0.202951, 6.956610, 1.413190, 2.898480, 8.961630, 6.377800, 1.110450, 6.684520,2.286290, 7.850760,1.832650,0.074348) 174656795 | (2.165030,0.183753,9.913950,9.208260, 5.165660,6.603510,2.008380,8.117910, 2.358590,5.466330, 9.139280,8.893700, 4.664190,9.361670,9.016990 2.271000) 96186891 | (9.605980, 4.395920, 4.336720, 3.174360, 8.706960, 0.155107, 9.408940, 4.531100, 2.783530, 5.681780,9.792380,6.428320,2.983760,9.733290, 7.635160,7.035780) 55061667 | (7.567960,5.874530,5.222040,5.638520,3.488960,8.770750, 7.054610,7.239630, 9.202280,9.465020,4.079080,5.729770,0.475227) 8.434800, 6.873730, 5.140080) 64659434 | (4.860650, 3.984440, 3.009900, 5.116680, 6.489150, 4.224800, 0.609752, 8.731120, 6.577390, 8.542540,9.096120,8.976700, 8.9360002.836270,7.1862506.264300) 87143098 | (4.801570,7.870150,0.939599,3.666670,1.1023405.819580,6.511330,6.430760,0.584531,3.024190,6.255460) 8.823820, 5.076960, 0.181344, 8.137380, 1.230360) 109245945 | (7.541850, 7.201460, 6.858400, 2.605210, 1.283090, 7.525200,4.213240,8.413760,9.707390,1.916970,1.719320,1.255280,9.006780,4.851420,2.168250,5.997360) 4979218 | (8.463000, 4.051410,9.057320,1.367980,3.344340,7.032640,8.583770,1.873090,5.524810) 0.187254, 5.783270, 6.141040, 2.479410, 6.406450, 9.371700, 0.050690) 72846137 | (7.018560, 4.039150, 9.114800, 2.911170, 5.531180,8.557330, 6.739050,0.103649,3.691390,7.584640,8.184180,0.599390,9.037130,4.090610, 4.369770,6.480000) 36813995 | (4.643480,8.704640,1.073880,2.665530,3.298300,9.2442805.768050) 0.887555, 5.990350, 2.991390, 6.186550, 6.464940, 6.187140, 0.150242, 2.123070, 2.932270) (10 rows) Time: 58.101 ms

Write about 439 million image eigenvalues.

Postgres=# select count (*) from tweeimg sign; count-438924137 (1 row)

4. Enter an image eigenvalue to search for the most similar image.

Explain (analyze,verbose,timing,costs,buffers) select * from t_img_sig order by sig'(5.07998, 6.80827, 5.42024, 2.53619, 4.10843, 0.532198, 4.33886, 9.60262, 6.68369, 8.01305, 9.60298, 8.087, 1.25819, 6.54424, 6.04902, 5.3444) 'limit 1 Postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_img_sig where signature_distance (sig,' (5.07998, 6.80827, 5.80827, 5.42024, 2.53619, 4.10843,) 0.532198, 4.33886, 9.60262, 6.68369, 8.01305, 9.60298, 8.087, 1.25819, 6.04424, 6.04902, 5.3444') > 0.9 order by sig'(5.07998, 6.80827, 5.808272, 2.52024, 4.10843, 6.602, 6.68369, 8.01305, 9.60298, 8.08771, 196.54424, 6.0902, 5.3444) 1 'limit'. QUERY PLAN - -Limit (cost=0.48..0.51 rows=1 width=72) (actual time=4094.810..4094.812 rows=1 loops=1) Output: id Sig, (sig'(5.079980, 6.808270, 5.420240, 2.536190, 4.108430, 0.532198, 4.338860, 9.602620, 6.683690, 8.013050, 9.602980,8.087000,1.258190,6.544240,6.049020) 5.344400): signature)) Buffers: shared hit=205999-> Index Scan using idx_t_img_sig_1 on public.t_img_sig (cost=0.48..5361351.06 rows=146395778 width=72) (actual time=4094.808..4094.808 rows=1 loops=1) Output: id, sig, (sig'(5.079980, 6.808270, 5.420240, 2.536190, 4.108430, 0.532198, 4.338860, 9.602620,6.683690,8.013050,9.602980,8.087000 1.258190, 6.544240, 6.049020, 5.344400):: signature) Order By: (t_img_sig.sig'(5.079980, 6.808270, 5.420240, 2.536190, 4.108430, 0.532198, 4.338860, 9.602620,6.683690,8.013050,9.602980,8.087000,1.258190,6.544240,6.049020,5.344400): signature) Filter: (signature_distance (t_img_sig.sig) '(5.079980, 6.808270, 5.420240, 2.536190, 4.108430, 0.532198, 4.338860, 9.602620, 6.683690, 8.013050, 9.602980,8.087000,1.258190,6.544240,6.049020,5.344400):: 0.9'::double precision) Buffers: shared hit=205999 Planning Time: 4194.485 ms (10 rows) performance and bottleneck

Performance: 439 million image eigenvalues, search the image in about 4.2 seconds.

Bottleneck:

1. A large number of index pages were scanned (205999).

Optimization idea

1. Compression accuracy, such as using 3 decimal places. According to users, there is a 10-fold improvement in performance.

The accuracy is optimized as follows, using a new function to generate image eigenvalues and using 3 decimal places.

Create or replace function gen_rand_img_sig3 (int) returns signature as $$select ('('| | rtrim (array (select trunc ((random () * $1):: numeric,3) from generate_series (1mai 16)):: text,' {'),'}')):: signature; $$language sql strict

Examples are as follows

Postgres=# select gen_rand_img_sig3 (10) Gen_rand_img_sig3-- -(2.984000 3.323000, 4.083000, 6.292000, 5.008000, 9.029000, 6.208000, 1.141000, 1.796000, 9.257000, 1.397000, 1.235000, 7.157000, 3.745000, 0.112000, 7.723000) (1 row)

2. Use partition table + dblink asynchronous interface to call in parallel. (it is better to directly support imgsmlr gist index scan parallelism at the kernel level)

The next introduction

3. Use citus sharding. Multi-computer, improve the overall computing power. Because a large number of index pages are scanned, even if CPU does not have a bottleneck, memory bandwidth will become a bottleneck in the future. Multiple computers can solve this problem. )

The next introduction

4. At the kernel level, the support for lower-dimensional signature is now 16. For example, if the support is reduced to 4, the performance can also be improved.

Precision phenomenon

1. Scan a small amount of INDEX PAGE when there are records that can be matched exactly.

Postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_img_sig order by sig'(3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053 000, 2.566000, 4.522 000, 6.929 000, 1.58 2000, 2.179 000)':: signature limit 1 QUERY PLAN- - -Limit (cost=0.48..0.49 rows=1 width=72) (actual time=1.596..1.598 rows=1 loops=1) Output: id Sig, (sig'(3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053 000, 2.566000, 4.522 000, 6.929 000, 1.582000) 2.179000):: signature)) Buffers: shared hit=125-> Index Scan using t_img_sig1_sig_idx on public.t_img_sig (cost=0.48..7318159.22 rows=785457848 width=72) (actual time=1.594..1.595 rows=1 loops=1) Output: id, sig, (sig'(3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000 4.522000, 6.929000, 1.582000, 2.179000):: signature) Order By: (t_img_sig.sig'(3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000, 4.522000, 6.929000, 1.582000) 2.179000)': signature) Buffers: shared hit=125 Planning Time: 0.072 ms Execution Time: 1.621 ms (9 rows)

2. When a small amount of content is modified, a small number of values match exactly, and other values do not match exactly, the scanned INDEX PAGE increases.

Postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_img_sig order by sig'(3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053 000, 2.566000, 4.522 000, 6.929000, 1.582000, 2.179001)':: signature limit 1 QUERY PLAN- - -Limit (cost=0.48..0.49 rows=1 width=72) (actual time=7.051..7.052 rows=1 loops=1) Output: id Sig, (sig'(3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053 000, 2.566000, 4.522 000, 6.929 000, 1.582000) 2.179001):: signature)) Buffers: shared hit=454-> Index Scan using t_img_sig1_sig_idx on public.t_img_sig (cost=0.48..7324626.56 rows=786152016 width=72) (actual time=7.049..7.049 rows=1 loops=1) Output: id, sig, (sig'(3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000 4.522000, 6.929000, 1.582000, 2.179001):: signature) Order By: (t_img_sig.sig'(3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000, 4.522000, 6.929000, 1.582000) 2.179001)': signature) Buffers: shared hit=454 Planning Time: 0.074 ms Execution Time: 7.076 ms (9 rows)

3. When a large number of modified values do not match exactly, a large number of INDEX PAGE needs to be scanned.

Postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_img_sig order by sig'(7.727000, 3.594000, 1.185000, 4.996000, 6.950000, 7.129000, 5.429000, 1.520000, 8.219000, 6.222000, 2.013000, 2.536000, 4.532000, 6.939000, 1.538000, 2.178000)':: signature limit 1 QUERY PLAN- - -Limit (cost=0.47..0.48 rows=1 width=72) (actual time=2528.890..2528.891 rows=1 loops=1) Output: id Sig, (sig'(7.727000, 3.594000, 1.185000, 4.996000, 6.950000, 7.129000, 5.429000, 1.520000, 8.219000, 6.222000, 2.013000, 2.536000, 4.532000, 6.939000, 1.538000) 2.178000):: signature)) Buffers: shared hit=121510-> Index Scan using t_img_sig1_sig_idx on public.t_img_sig (cost=0.47..1361409.21 rows=146121007 width=72) (actual time=2528.887..2528.888 rows=1 loops=1) Output: id, sig, (sig'(7.727000, 3.594000, 1.185000, 4.996000, 6.950000, 7.129000, 5.429000, 1.520000, 8.219000, 6.222000, 2.013000, 2.536000 4.532000, 6.939000, 1.538000, 2.178000):: signature) Order By: (t_img_sig.sig'(7.727000, 3.594000, 1.185000, 4.996000, 6.950000, 7.129000, 5.429000, 1.520000, 8.219000, 6.222000, 2.013000, 2.536000, 4.532000, 6.939000, 1.538000 2.178000)': signature) Buffers: shared hit=121510 Planning Time: 0.092 ms Execution Time: 2582.558 ms (9 rows) I believe that everyone on the "PostgreSQL search plug-in what are the advantages" have a deeper understanding, might as well to the actual operation of it! Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.