Shulou, SLTechnology News & Howtos > Database
2025-02-27 Update
Shulou (Shulou.com) 05/31 Report
How do you implement full-text search in PostgreSQL? This article analyzes the question in detail and walks through a working solution, in the hope of helping readers find a simple, practical approach.
CREATE TABLE author (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE post (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    author_id INT NOT NULL REFERENCES author (id)
);
CREATE TABLE tag (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE posts_tags (
    post_id INT NOT NULL REFERENCES post (id),
    tag_id INT NOT NULL REFERENCES tag (id)
);

INSERT INTO author (id, name) VALUES
    (1, 'Pete Graham'),
    (2, 'Rachid Belaid'),
    (3, 'Robert Berry');

INSERT INTO tag (id, name) VALUES
    (1, 'scifi'),
    (2, 'politics'),
    (3, 'science');

INSERT INTO post (id, title, content, author_id) VALUES
    (1, 'Endangered species', 'Pandas are an endangered species', 1),
    (2, 'Freedom of Speech', 'Freedom of speech is a necessary right missing in many countries', 2),
    (3, 'Star Wars vs Star Trek', 'Few words from a big fan', 3);

INSERT INTO posts_tags (post_id, tag_id) VALUES
    (1, 3), (2, 2), (3, 1);
This is a blog-like application. It has a post table with title and content fields. A post is associated with an author through a foreign key, and each post can have multiple tags.
What is full-text search?
First, let's look at the definition:
In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection of documents in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.
Wikipedia
This definition importantly introduces the concept of a document. When you search your data, you are looking for the meaningful entities you want to find; these are your documents. The PostgreSQL documentation explains it well:

A document is the unit of searching in a full text search system; for example, a magazine article or email message.

-- PostgreSQL documentation
The document here can span multiple tables and represent the logical entity we want to search.
Building our document

In the previous section we introduced the concept of a document. A document is unrelated to the table schema; it is about data, combining fields into a single meaningful entity. Based on the table schema in our example, our document consists of:

post.title
post.content
the post's author.name
the name of every tag associated with the post

To generate a document from these requirements, the SQL query looks like this:
SELECT post.title || ' ' ||
       post.content || ' ' ||
       author.name || ' ' ||
       coalesce(string_agg(tag.name, ' '), '') as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

                              document
--------------------------------------------------------------------
 Endangered species Pandas are an endangered species Pete Graham science
 Freedom of Speech Freedom of speech is a necessary right missing in many countries Rachid Belaid politics
 Star Wars vs Star Trek Few words from a big fan Robert Berry scifi
(3 rows)
Since we group by post and author, and because a post can have multiple tags, we use string_agg() as an aggregate function. Even though author is a foreign key and a post cannot have more than one author, author must either be wrapped in an aggregate or added to the GROUP BY.

We also used coalesce(). When a value can be NULL, it is a good idea to use coalesce(); otherwise the result of the whole string concatenation would be NULL.

So far, our document is just a long string, which by itself is not very useful. We need to convert it to the right format with to_tsvector().
SELECT to_tsvector(post.title) ||
       to_tsvector(post.content) ||
       to_tsvector(author.name) ||
       to_tsvector(coalesce(string_agg(tag.name, ' '), '')) as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

                              document
--------------------------------------------------------------------
 'endang':1,6 'graham':9 'panda':3 'pete':8 'scienc':10 'speci':2,7
 'belaid':16 'countri':14 'freedom':1,4 'mani':13 'miss':11 'necessari':9 'polit':17 'rachid':15 'right':10 'speech':3,6
 'berri':13 'big':10 'fan':11 'robert':12 'scifi':14 'star':1,4 'trek':5 'vs':3 'war':2 'word':7
(3 rows)
This query returns the documents as tsvector, the format suitable for full-text search. Let's try converting a plain string to a tsvector.
SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');
This query will return the following result:
                               to_tsvector
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)
Something strange happened: there are fewer words than in the original sentence, some words have changed (try became tri), and each is followed by numbers. What is going on?

A tsvector is a sorted list of distinct lexemes. A lexeme is a word normalized so that different variants of the same word are made alike.

The normalization process almost always replaces uppercase letters with lowercase and usually removes suffixes (such as the English s, es, and ing). This lets you search for any variant of a word without tediously entering every possible form.

The numbers indicate the position of the lexeme in the original string; for example, "man" appears at positions 6 and 15. You can count them yourself.

The default text search configuration used by to_tsvector in Postgres is 'english'. It ignores English stop words (words such as a, the, am, is, are).

This explains why the tsvector has fewer words than the original sentence. We will see more about languages and text search configurations later.
Query
We know how to build a document, but our goal is to search documents. For searching a tsvector we can use the @@ operator; see the Postgres documentation for details. Here are a few examples of querying a document.
> SELECT to_tsvector('If you can dream it, you can do it') @@ 'dream';
 ?column?
----------
 t
(1 row)

> SELECT to_tsvector('It''s kind of fun to do the impossible') @@ 'impossible';
 ?column?
----------
 f
(1 row)
The second query returns false. When we use the @@ operator with a plain string, the string is cast to a tsquery, and a cast does not normalize the words the way to_tsquery() does. The difference between the cast and to_tsquery() is shown below.
SELECT 'impossible'::tsquery, to_tsquery('impossible');
   tsquery    | to_tsquery
--------------+------------
 'impossible' | 'imposs'
(1 row)
For "dream", however, the lexeme is identical to the word itself.
SELECT 'dream'::tsquery, to_tsquery('dream');
 tsquery | to_tsquery
---------+------------
 'dream' | 'dream'
(1 row)
From now on, we will use to_tsquery to query documents.
SELECT to_tsvector('It''s kind of fun to do the impossible') @@ to_tsquery('impossible');
 ?column?
----------
 t
(1 row)
A tsquery stores the terms to search for and supports the logical operators & (AND), | (OR), and ! (NOT). Parentheses can be used to group operators.
> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('!fact');
 ?column?
----------
 f
(1 row)

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('theory & !fact');
 ?column?
----------
 f
(1 row)

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('fiction | theory');
 ?column?
----------
 t
(1 row)
We can also use :* to express a prefix query, matching words that begin with a given prefix.
> SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('theo:*');
 ?column?
----------
 t
(1 row)
Now that we know how to write a full-text search query, let's go back to our original table schema and try to query the document.
SELECT pid, p_title
FROM (SELECT post.id as pid,
             post.title as p_title,
             to_tsvector(post.title) ||
             to_tsvector(post.content) ||
             to_tsvector(author.name) ||
             to_tsvector(coalesce(string_agg(tag.name, ' '), '')) as document
      FROM post
      JOIN author ON author.id = post.author_id
      JOIN posts_tags ON posts_tags.post_id = post.id
      JOIN tag ON tag.id = posts_tags.tag_id
      GROUP BY post.id, author.id) p_search
WHERE p_search.document @@ to_tsquery('Endangered & Species');

 pid |      p_title
-----+--------------------
   1 | Endangered species
(1 row)
This query finds documents containing the words Endangered and Species, or words close to them.
Language support
Postgres's built-in text search function supports multiple languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
SELECT to_tsvector('english', 'We are running');
 to_tsvector
-------------
 'run':3
(1 row)

SELECT to_tsvector('french', 'We are running');
        to_tsvector
----------------------------
 'are':2 'running':3 'we':1
(1 row)
Based on our original model, the tsvector creation can be made language-aware. Suppose the post table contains content in different languages and has a language column:

ALTER TABLE post ADD language text NOT NULL DEFAULT('english');

To make use of the language column, let's now rebuild the document.
SELECT to_tsvector(post.language::regconfig, post.title) ||
       to_tsvector(post.language::regconfig, post.content) ||
       to_tsvector('simple', author.name) ||
       to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')) as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;
If the explicit cast ::regconfig were missing, the query would fail with:
ERROR: function to_tsvector (text, text) does not exist
regconfig is an object identifier type that refers to a Postgres text search configuration: http://www.postgresql.org/docs/9.3/static/datatype-oid.html

The document's lexemes are now built with the correct language configuration stored in post.language.

We also use simple, another text search configuration that Postgres provides. simple does not ignore stop words and does not try to find the stem of a word; with simple, every group of characters separated by spaces becomes a lexeme. The simple configuration is handy for data such as a person's name, where we may not want to stem the words.
SELECT to_tsvector('simple', 'We are running');
        to_tsvector
----------------------------
 'are':2 'running':3 'we':1
(1 row)
Accented characters

You also need to consider accents when building a search engine that supports multiple languages. In many languages accents are very important and can change the meaning of a word. Usefully, Postgres ships with an unaccent extension for removing accents from content.
CREATE EXTENSION unaccent;
SELECT unaccent('èéêë');
 unaccent
----------
 eeee
(1 row)
Let's add some accented content to our post table:

INSERT INTO post (id, title, content, author_id, language) VALUES
    (4, 'il était une fois', 'il était une fois un hôtel.', 2, 'french');

If we want to ignore accents when building the document, we can simply do the following:
SELECT to_tsvector(post.language::regconfig, unaccent(post.title)) ||
       to_tsvector(post.language::regconfig, unaccent(post.content)) ||
       to_tsvector('simple', unaccent(author.name)) ||
       to_tsvector('simple', unaccent(coalesce(string_agg(tag.name, ' '), ''))) as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

Working this way is a bit cumbersome and leaves more room for error. Instead, we can create a new text search configuration that handles unaccented characters.
CREATE TEXT SEARCH CONFIGURATION fr (COPY = french);
ALTER TEXT SEARCH CONFIGURATION fr
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;
When we use this new text search configuration, we can see the difference in the lexemes:

SELECT to_tsvector('french', 'il était une fois');
 to_tsvector
-------------
 'fois':4
(1 row)

SELECT to_tsvector('fr', 'il était une fois');
    to_tsvector
--------------------
 'etait':2 'fois':4
(1 row)
The fr configuration gives us the same result as first applying unaccent and building the tsvector from the result:

SELECT to_tsvector('french', unaccent('il était une fois'));
    to_tsvector
--------------------
 'etait':2 'fois':4
(1 row)

The number of lexemes differs because il, était, and une are French stop words. Is it a problem that etait ends up in our document? I don't think so: etait is not really a stop word but a misspelled word.
SELECT to_tsvector('fr', 'Hôtel') @@ to_tsquery('fr', 'hotels') as result;
 result
--------
 t
(1 row)
If we create an unaccented search configuration for each language our posts can be written in, and keep that configuration name in post.language, we can keep the earlier document query:

SELECT to_tsvector(post.language::regconfig, post.title) ||
       to_tsvector(post.language::regconfig, post.content) ||
       to_tsvector('simple', author.name) ||
       to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')) as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

If creating an unaccented text search configuration for every language supported by Postgres feels like a chore, you can use this gist.

Our document may grow slightly because it can now include unaccented stop words, but queries with accented characters are unaffected. This can be useful when someone searches French content from an English keyboard.
Ranking

When you build a search engine, you want it to return relevant results (ranked by relevance). Ranking can be based on many factors, which the documentation explains roughly as follows:

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones are shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document is where they occur.
-- PostgreSQL documentation
To get the relevance-ordered results we want, we use some of the functions PostgreSQL provides; in our example we will use two of them: ts_rank() and setweight().

The setweight function lets us assign a weight (degree of importance) to a tsvector; the value can be 'A', 'B', 'C', or 'D'.
SELECT pid, p_title
FROM (SELECT post.id as pid,
             post.title as p_title,
             setweight(to_tsvector(post.language::regconfig, post.title), 'A') ||
             setweight(to_tsvector(post.language::regconfig, post.content), 'B') ||
             setweight(to_tsvector('simple', author.name), 'C') ||
             setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')), 'B') as document
      FROM post
      JOIN author ON author.id = post.author_id
      JOIN posts_tags ON posts_tags.post_id = post.id
      JOIN tag ON tag.id = posts_tags.tag_id
      GROUP BY post.id, author.id) p_search
WHERE p_search.document @@ to_tsquery('english', 'Endangered & Species')
ORDER BY ts_rank(p_search.document, to_tsquery('english', 'Endangered & Species')) DESC;
In the query above we assign different weights to the different parts of the document. post.title is more important than post.content and the combined tags; author.name counts least.

This means that if we search for the keyword "Alice", documents containing the keyword in the title rank first in the search results, followed by documents containing it in the content, and finally documents containing it in the author's name.

Based on the weights assigned to each part of the document, ts_rank() returns a floating-point number representing the relevance of the document to the query keywords.
SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example | document')) as relevancy;
 relevancy
-----------
 0.0607927
(1 row)

SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example')) as relevancy;
 relevancy
-----------
 0.0607927
(1 row)

SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example | unkown')) as relevancy;
 relevancy
-----------
 0.0303964
(1 row)

SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example & document')) as relevancy;
 relevancy
-----------
 0.0985009
(1 row)

SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example & unknown')) as relevancy;
 relevancy
-----------
 1e-20
(1 row)
However, the concept of relevancy is vague and application-specific. Different applications may need additional information to get the desired ordering, for example the document's modification time. The built-in ranking functions such as ts_rank are only examples; you can write your own ranking functions and/or combine their results with other factors to fit your specific needs.

To make that concrete: if we wanted newer articles to rank higher than old ones, we could divide the value of ts_rank() by the age of the document + 1 (the + 1 prevents division by zero).
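As a sketch of that idea, assuming post had a published_at TIMESTAMP column (which is not part of the example schema and is purely illustrative), the ordering could look like this:

```sql
-- Hypothetical: post.published_at is NOT part of the example schema.
-- Newer posts get a boost because the rank is divided by (age in days + 1).
SELECT post.id, post.title
FROM post
WHERE to_tsvector(post.language::regconfig, post.title || ' ' || post.content)
      @@ to_tsquery('english', 'Endangered & Species')
ORDER BY ts_rank(to_tsvector(post.language::regconfig, post.title || ' ' || post.content),
                 to_tsquery('english', 'Endangered & Species'))
         / (extract(epoch FROM (now() - post.published_at)) / 86400 + 1) DESC;
```

Any decay function would do here; dividing by age is just the simplest choice.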
Optimization and indexing
Making searches on our table fast is straightforward. PostgreSQL supports indexes on expressions, so we can easily create a GIN index over the tsvector expression:

CREATE INDEX idx_fts_post ON post
USING gin((setweight(to_tsvector(language::regconfig, title), 'A') ||
           setweight(to_tsvector(language::regconfig, content), 'B')));
GIN or GiST index? These two index types could be the subject of blog posts of their own. GiST can produce false matches, which then require an extra table-row lookup to verify; GIN, on the other hand, is faster to query but bigger and slower to build.

As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes handle 100,000+ lexemes better but are slower to update.
-- Postgres documentation: Chapter 12, Full Text Search
In our example we choose GIN, but the choice is not clear-cut; make the decision based on your own data.

There is one problem with our schema example: the document is spread across several tables with different weights. For better performance it is necessary to denormalize the data with triggers or a materialized view.

We don't always need to denormalize, and sometimes an expression index like the one above is enough. Alternatively, you can denormalize data within the same table via the Postgres trigger functions tsvector_update_trigger(...) or tsvector_update_trigger_column(...). See the Postgres documentation for more details.
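As a minimal sketch of same-table denormalization (the column name tsv and trigger name tsvectorupdate are illustrative, not part of the article's schema):

```sql
-- Hypothetical sketch: keep a tsv column on post up to date with the
-- built-in trigger function. The names tsv and tsvectorupdate are made up.
ALTER TABLE post ADD COLUMN tsv tsvector;

CREATE TRIGGER tsvectorupdate
BEFORE INSERT OR UPDATE ON post
FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(tsv, 'pg_catalog.english', title, content);

-- The denormalized column can then be indexed and queried directly:
CREATE INDEX idx_post_tsv ON post USING gin(tsv);
```

Note that tsvector_update_trigger uses a single fixed configuration; for the per-row language column, tsvector_update_trigger_column would be the variant to look at.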
If some delay before new results appear is acceptable in our application, that is a good case for using a materialized view, to which we can add extra indexes.
CREATE MATERIALIZED VIEW search_index AS
SELECT post.id,
       post.title,
       setweight(to_tsvector(post.language::regconfig, post.title), 'A') ||
       setweight(to_tsvector(post.language::regconfig, post.content), 'B') ||
       setweight(to_tsvector('simple', author.name), 'C') ||
       setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')), 'A') as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;
Then reindexing the search engine is as simple as periodically running REFRESH MATERIALIZED VIEW search_index.
Now we can index the materialized view.
CREATE INDEX idx_fts_search ON search_index USING gin(document);
Querying becomes just as simple:
SELECT id as post_id, title
FROM search_index
WHERE document @@ to_tsquery('english', 'Endangered & Species')
ORDER BY ts_rank(document, to_tsquery('english', 'Endangered & Species')) DESC;

If the delay becomes unbearable, you should look into the trigger-based alternatives.
There is no single way to set up your document store; it depends on your documents: single table, multiple tables, multiple languages, amount of data.

Thoughtbot.com published the article "Implementing Multi-Table Full Text Search with Postgres in Rails", which I suggest you read.
Spelling mistakes
PostgreSQL provides a very useful extension, pg_trgm; see the pg_trgm documentation.

CREATE EXTENSION pg_trgm;

pg_trgm supports N-grams with N == 3. N-grams are useful because they allow searching for similar strings; in essence, that is what a misspelling is: a similar but incorrect word.
SELECT similarity('Something', 'something');
 similarity
------------
          1
(1 row)

SELECT similarity('Something', 'samething');
 similarity
------------
   0.538462
(1 row)

SELECT similarity('Something', 'unrelated');
 similarity
------------
          0
(1 row)

SELECT similarity('Something', 'everything');
 similarity
------------
   0.235294
(1 row)

SELECT similarity('Something', 'omething');
 similarity
------------
   0.583333
(1 row)
As the example above shows, similarity() returns a float representing how similar two strings are. Detecting misspellings then becomes a matter of collecting the lexemes used in our documents and comparing them for similarity against the input text. I found a similarity threshold of 0.5 to work well for misspellings. First, we need to create the list of unique lexemes used in our documents:
CREATE MATERIALIZED VIEW unique_lexeme AS
SELECT word FROM ts_stat(
    'SELECT to_tsvector(''simple'', post.title) ||
            to_tsvector(''simple'', post.content) ||
            to_tsvector(''simple'', author.name) ||
            to_tsvector(''simple'', coalesce(string_agg(tag.name, '' ''), ''''))
     FROM post
     JOIN author ON author.id = post.author_id
     JOIN posts_tags ON posts_tags.post_id = post.id
     JOIN tag ON tag.id = posts_tags.tag_id
     GROUP BY post.id, author.id');
The script above creates a view with a single word column built from the distinct lexemes of our documents. We use the simple configuration because the table can contain text in multiple languages. Once this materialized view exists, we add an index to make similarity queries faster:

CREATE INDEX words_idx ON unique_lexeme USING gin(word gin_trgm_ops);

Fortunately, the list of unique lexemes used in a search engine does not change quickly, so we should not need to refresh the materialized view too often:

REFRESH MATERIALIZED VIEW unique_lexeme;
Once we have set up this table, it is easy to find the closest match.
SELECT word
FROM unique_lexeme
WHERE similarity(word, 'samething') > 0.5
ORDER BY word <-> 'samething'
LIMIT 1;

This query returns lexemes similar enough (> 0.5) to the input samething, with the closest one first. The <-> operator returns the "distance" between its arguments, which is one minus the similarity() value.

When you decide to handle misspellings in your search, you may not want to check for them in every query. Instead, you can look for misspellings only when a search returns no results, and use the outcome of this query to suggest alternatives to the user. If your data comes from informal communication, for example social networks, it may itself contain misspellings; in that case you may get better results by appending the similar lexeme to your tsquery.
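A minimal sketch of that fallback flow, reusing the unique_lexeme view and search_index from earlier (the literal query terms are illustrative):

```sql
-- 1. The user's query, as typed, returns no rows:
SELECT id, title FROM search_index
WHERE document @@ to_tsquery('simple', 'samething');

-- 2. Look up the closest known lexeme to suggest or substitute
--    (suppose it returns 'something'):
SELECT word FROM unique_lexeme
WHERE similarity(word, 'samething') > 0.5
ORDER BY word <-> 'samething'
LIMIT 1;

-- 3. Retry, appending the suggestion with | (OR) so either form matches:
SELECT id, title FROM search_index
WHERE document @@ to_tsquery('simple', 'samething | something');
```

Whether to substitute silently or to show a "did you mean?" suggestion is an application decision.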
"Super Fuzzy Searching on PostgreSQL" is a good reference article on using trigrams for misspellings and searching in Postgres.

In my use case, the table of unique lexemes will never be larger than 2000 rows; my understanding is that you will run into performance problems with this method once you have more than about 1M unique lexemes.
About MySQL and RDS (Amazon Relational Database Service)
Does this work on Postgres RDS?
All of the examples above work on Postgres RDS. As far as I know, the only search-related limitation on RDS concerns features that need access to the filesystem, such as custom dictionaries, spell-checkers, synonyms, and thesauruses. For more information, see the Amazon AWS forums.

I am using MySQL. Can I use its built-in full-text search?

If it were me, I wouldn't. Without rehashing the arguments, MySQL's full-text search is very limited: by default it does not support stemming in any language. I did stumble upon a stemming function that can be installed, but MySQL does not support indexes on expressions.

So what can you do? Given everything discussed above, if Postgres suits all your use cases, consider switching databases to Postgres. Migration can be done easily with tools such as py-mysql2pgsql. Alternatively, you can look at more advanced solutions such as Solr (a Lucene-based full-text search server) and Elasticsearch (a Lucene-based open source, distributed, RESTful search engine).
Summary
We have seen how to build a well-behaved, multi-language text search engine based on a non-trivial document. This article is only an overview, but it should give you enough background and examples to start building your own search engine. I may have made some mistakes in this article, and I would appreciate it if you reported them to blog@lostpropertyhq.com.

Postgres's full-text search feature is very good, and searching is fast enough. It lets the data in your application grow without depending on another tool to process it. Is Postgres search a silver bullet? If your core business revolves around search, probably not.

Some features are missing, but in most use cases you won't need them. There is no doubt that you must analyze and understand your needs carefully to decide which search approach to use.
Personally, I hope the Postgres full-text search feature will continue to improve and add the following features:
Additional built-in language support: Chinese, Japanese, etc.
A foreign data wrapper around Lucene. Lucene is still the best tool for full-text search, and integrating it into Postgres would bring many benefits.
More ranking and scoring features would be first class. Elasticsearch and Solr already provide advanced solutions.
Fuzzy tsquery matching without trigrams would be great. Elasticsearch offers a very simple way to run fuzzy search queries.
The ability to create and edit dictionaries, synonyms, and thesauruses dynamically via SQL, instead of adding files to the filesystem.

Postgres is not as advanced as Elasticsearch and Solr, but after all, Elasticsearch and Solr are dedicated full-text search tools, while full-text search is only one excellent feature of PostgreSQL.