Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What is the function of GIN index in PostgreSQL

2025-04-04 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Share

Shulou(Shulou.com)05/31 Report--

This article mainly explains "what is the function of GIN index in PostgreSQL". Interested friends may wish to take a look. The method introduced in this paper is simple, fast and practical. Let's let the editor take you to learn "what is the function of GIN index in PostgreSQL"?

The main use of GIN index is to speed up the speed of full-text retrieval full-text search.

Full-text retrieval

The purpose of full-text retrieval full-text search is to find documents that match the retrieval criteria (document) from the document set. In the search engine, if there are many matching documents, then you need to find the most matching documents, but in the database query, you can find the ones that meet the criteria.

In PG, documents are converted to a specific type of tsvector, including lexemes and their location in the document, for search purposes. Morphemes Lexemes are those that transform the word forms (i.e. participles) that are suitable for query. For example:

Testdb=# select to_tsvector ('There was a crooked man, and he walked a crooked mile'); to_tsvector -' crook':4,10 'man':5' mile':11 'walk':8 (1 row)

From this example, we can see that after word segmentation, crook/man/mile and walk appear, and their positions are 4, 10, 10, 5, and 11, respectively. At the same time, you can also see that words such as there are ignored because they are stop words (from a search engine point of view, these words are too common to be recorded), which is configurable.

The query in PG full-text retrieval is represented by tsquery. The query condition consists of one or more queries using and (\ &) / or (|) / not (!) A morpheme connected by an equal operator. Similarly, use parentheses to clarify the priority of the operation.

Testdb=# select to_tsquery ('man & (walking | running)'); to_tsquery-'man' & (' walk' | 'run') (1 row)

Operator @ @ for full-text retrieval

Testdb=# select to_tsvector ('There was a crooked man, and he walked a crooked mile') @ @ to_tsquery (' man & (walking | running)');? column?-t (1 row) select to_tsvector ('There was a crooked man, and he walked a crooked mile') @ @ to_tsquery (' man & (going | running)'); column?-f (1 row)

Introduction to GIN

GIN is the abbreviation of Generalized Inverted Index universal inverted index, such as familiar with search engine, this concept is not difficult to understand. The values of the data types it operates on are made up of elements rather than atoms. Such a data type becomes a compound data type. The elements in the data value are indexed.

For example, the index at the end of the book provides each term with a list of pages that contain where the term appears. The access method (AM) needs to ensure quick access to index elements, so these elements are stored in a similar Btree, referencing an ordered collection of data rows containing compound values (containing elements) linked to each element. Ordering is not important for data retrieval (such as TIDs sorting), but it is important for the internal structure of the index.

Elements are not removed from the GIN index, and some people may think that the values that contain elements can disappear / add / change, but the set of elements that make up these elements is mostly stable. This way of processing greatly simplifies the algorithm for multi-processes to use indexes.

If the TIDs is small, it can be stored in the same page as the element (called posting list), but if the linked list is large, a more efficient data structure such as Btree will be used and stored in separate data pages (called posting tree).

Therefore, the GIN index contains an Btree,TIDs Btree containing elements or a normal linked list is linked to the leaf row of the Btree.

Like the GiST and SP-GiST indexes discussed earlier, GIN provides an interface for application developers to support various operations on composite data types.

For example, here is the table ts, which creates a GIN index for ts:

Testdb=# drop table if exists ts;psql: NOTICE: table "ts" does not exist, skippingDROP TABLEtestdb=# create table ts (doc text, doc_tsv tsvector); CREATE TABLEtestdb=# truncate table ts; slitter.'), ('I am a sheet slitter.'), ('I slit sheets.'), ('I am the sleekest sheet slitter that ever slit sheets.'), ('She slits the sheet she sits on.'); update ts set doc_tsv = to_tsvector (doc); create index on ts using gin (doc_tsv) TRUNCATE TABLEtestdb=# insert into ts (doc) valuestestdb-# ('Can a sheet slitter slit sheets?'), testdb-# (' How many sheets could a sheet slitter slit?'), testdb-# ('I slit a sheet, a sheet I slit.'), testdb-# ('Upon a slitted sheet I sit.'), testdb-# (' Whoever slit the sheets is a good sheet slitter.'), testdb-# ('I am a sheet slitter.'), testdb-# ('I slit sheets.') Testdb-# ('I am the sleekest sheet slitter that ever slit sheets.'), testdb-# ('She slits the sheet she sits on.') INSERT 0 9testdb=# testdb=# update ts set doc_tsv = to_tsvector (doc); UPDATE 9testdb=# testdb=# create index on ts using gin (doc_tsv); CREATE INDEX

Here, a black background (page number + page inner offset) is used instead of an arrow to indicate a reference to TIDs.

Unlike regular Btree, because there is only one way to traverse, GIN indexes are joined by one-way linked lists, not two-way linked lists.

Testdb=# select ctid, left (doc,20), doc_tsv from ts Ctid | left | doc_tsv-+-- -(0Jing 10) | Can a sheet slitter | 'sheet':3,6' slit':5 'slitter':4 (0Magazine 11) | How many sheets coul |' could':4 'mani':2' sheet':3,6 'slit':8' slitter':7 (0Jol 12) | I slit a sheet A sh | 'sheet':4,6' slit':2,8 (0Magne13) | Upon a slitted sheet | 'sheet':4' sit':6 'slit':3' upon':1 (0Magne14) | Whoever slit the she | 'good':7' sheet':4,8 'slit':2' slitter':9 'whoever':1 (0Magne15) | I am a sheet slitter |' sheet':4 'slitter':5 (0Magne16) | I slit sheets. | | 'sheet':3' slit':2 (0rows 17) | I am the sleekest sh | 'ever':8' sheet':5,10 'sleekest':4' slit':9 'slitter':6 (0 rows 18) | She slits the sheet |' sheet':4 'sit':6' slit':2 (9 rows) |

In this example, sheet/slit/slitter uses Btree storage and other elements use a simple linked list.

If we want to know the number of elements, how do we get them?

Testdb=# select (unnest (doc_tsv)) .lexeme, count (*) from tstestdb-# group by 1 order by 2 desc; lexeme | count-+-sheet | 9 slit | 8 slitter | 5 sit | 2 upon | 1 mani | 1 whoever | 1 sleekest | 1 good | 1 could | 1 ever | 1 (11 rows)

Here is an example of how to scan through an GIN index:

Testdb=# explain (costs off) testdb-# select doc from ts where doc_tsv @ @ to_tsquery ('many & slitter') QUERY PLAN-Seq Scan on ts Filter: (doc_tsv @ @ to_tsquery ('many & slitter'::text)) (2 rows) testdb=# set enable_seqscan=off SETtestdb=# explain (costs off) select doc from ts where doc_tsv @ @ to_tsquery ('many & slitter') QUERY PLAN-Bitmap Heap Scan on ts Recheck Cond: (doc_tsv @ @ to_tsquery ( 'many & slitter'::text))-> Bitmap Index Scan on ts_doc_tsv_idx Index Cond: (doc_tsv @ @ to_tsquery (' many & slitter'::text)) (4 rows)

To execute this query, you first need to extract a single morpheme (lexeme): there is a special API function in mani/slitter.PG, which takes into account the data types and usage scenarios determined by op class.

Testdb=# select amop.amopopr::regoperator, amop.amopstrategytestdb-# from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amoptestdb-# where opc.opcname = 'tsvector_ops'testdb-# and opf.oid = opc.opcfamilytestdb-# and am.oid = opf.opfmethodtestdb-# and amop.amopfamily = opc.opcfamilytestdb-# and am.amname =' gin'testdb-# and amop.amoplefttype = opc.opcintype Amopopr | amopstrategy-- +-@ @ (tsvector,tsquery) | 1 @ (tsvector,tsquery) | 2 (2 rows)

Going back to this example, in the morpheme Btree, the next step is to retrieve the key at the same time and enter the TIDs linked list to get:

Mani-(0Pol 2)

Slitter-(0pr 1), (0je 2), (1pm 2), (1pm 3), (2je 2)

For each TID found, consistency function API is called, and the function determines whether the found row matches the retrieval key. Because the query is AND, it only returns (0 minute 2).

Testdb=# select doc from ts where doc_tsv @ @ to_tsquery ('many & slitter'); doc-How many sheets could a sheet slitter slit? (1 row)

Slow Update

DML the columns of GIN index (mainly insert & update) is quite slow, and each document usually contains many morphemes that need to be indexed. Therefore, although only one document is added or updated, a large number of index trees need to be updated. In other words, if multiple documents are updated at the same time, the morphemes in those documents may be the same, so the total consumption may be less than updating documents one by one.

PG provides the fastupdate option. When this parameter is turned on, the update will be processed in a separate unordered linked list, and the index will be updated when the linked list exceeds the threshold (parameter: gin_pending_list_limit or index storage parameter with the same name). This technique also has negative effects, one is that it reduces the query efficiency (additional scanning of the linked list), and the other is that an update happens to encounter an index update, so the update will take a relatively long time.

Limiting the query result

One of the features of GIN AM usually returns bitmap instead of TID one by one, so the execution plan is bitmap scan.

This characteristic Hu causes the LIMIT clause to be less effective:

Testdb=# explain verbose select doc from ts where doc_tsv @ @ to_tsquery ('many & slitter') QUERY PLAN-Bitmap Heap Scan on public. Ts (cost=12.25..16.51 rows=1 width=32) Output: doc Recheck Cond: (ts.doc_tsv @ @ to_tsquery ('many & slitter'::text))-> Bitmap Index Scan on ts_doc_tsv_idx (cost=0.00..12.25 rows=1 width=0) Index Cond: (ts.doc_tsv @ @ to_tsquery (' many & slitter'::text)) (5 rows) testdb=# explain verboseselect doc from ts where doc_tsv @ @ To_tsquery ('many & slitter') limit 1 QUERY PLAN -- Limit (cost=12.25..16.51 rows=1 width=32) Output: doc-> Bitmap Heap Scan on public.ts (cost=12.25..16.51 rows=1 width=32) Output: doc Recheck Cond: (ts.doc_tsv @ @ to_tsquery ('many & slitter'::text))-> Bitmap Index Scan on ts_doc_tsv_idx (cost=0.00..12.25 rows=1 width=0) Index Cond : (ts.doc_tsv @ @ to_tsquery ('many & slitter'::text)) (7 rows)

This is because the startup cost of Bitmap Heap Scan is not much different from that of Bitmap Index Scan.

Based on this situation, PG provides the gin_fuzzy_search_limit parameter to control the number of result rows returned (default is 0, that is, all are returned).

Testdb=# show gin_fuzzy_search_limit; gin_fuzzy_search_limit-0 (1 row) so far, I believe you have a deeper understanding of "what is the role of GIN index in PostgreSQL". You might as well do it in practice. Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Database

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report