In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-04-05 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Share
Shulou(Shulou.com)05/31 Report--
This article mainly introduces "what is the InnoDB full-text index". In the daily operation, I believe many people have doubts about what the InnoDB full-text index is. The editor consulted all kinds of materials and sorted out simple and easy-to-use methods of operation. I hope it will be helpful to answer the doubts of "what is the InnoDB full-text index?" Next, please follow the editor to study!
InnoDB full text Index: N-gram Parser
March 26, 2015MySQLShaohua Wang
InnoDB's default full-text index, parser, is ideal for Latin because Latin is segmented by spaces. But for languages like Chinese, Japanese and Korean, there is no such separator. A word can be made up of multiple words, so we need to deal with it in different ways. In MySQL 5. 7. 6, we can use a new full-text indexing plug-in to handle them: n-gram parser.
What is N-gram?
In a full-text index, n-gram is a sequence of n consecutive words in a paragraph. For example, using n-gram to segment the word "information system", the result is as follows:
N-gram example
one
two
three
4N=1: 'letter', 'interest', 'Department', 'Tong'
Ninten2: 'information', 'information system', 'system'
Numb3: 'information system', 'information system'
Numb4: 'information system'
How to use N-gram Parser in InnoDB?
N-gram parser is loaded into MySQL by default and can be used directly. We only need to use WITH PARSER ngram when creating a full-text index in DDL. For example, the following SQL statement can be run on MySQL 5.7.6 and later.
N-gram DDL example
one
two
three
four
five
six
seven
eight
9mysql > CREATE TABLE articles
(
FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY
Title VARCHAR (100)
FULLTEXT INDEX ngram_idx (title) WITH PARSER ngram
) Engine=InnoDB CHARACTER SET utf8mb4
Query OK, 0 rows affected (0.06 sec)
Mysql > # ALTER TABLE articles ADD FULLTEXT INDEX ngram_idx (title) WITH PARSER ngram
Mysql > # CREATE FULLTEXT INDEX ngram_idx ON articles (title) WITH PARSER ngram
We introduce a new global variable called ngram_token_size. It determines the size of n in n-gram, that is, the size of words. Its default value is 2, and at this time, we are using bigram. Its legal values range from 1 to 10. Now, it's natural to think of a question: how to set the size of the ngram_token_ size value in practice? Of course, we recommend using 2. But you can also choose any legal value by following a simple rule: set to the size of the smallest word you want to query. If you want to find a single word, then we need to set it to 1. The smaller the value of ngram_token_size, the less space the full-text index takes up. In general, queries that are exactly equal to the words of ngram_token_size are faster, but queries for words or phrases that are longer than them are slower.
N-gram word segmentation processing
N-gram parser differs from the system's default full-text index parser in the following ways:
Word size check: because of ngram_token_size, innodb_ft_min_token_size and innodb_ft_max_token_size will not apply to n-gram.
Stopword processing: usually, for a new word, we look up the stopwords table to see if there are any matching words. If there is, the word will not be added to the full-text index. But in n-gram, we look up the stopwords table to see if it contains the words. The reason for this is that there are many meaningless characters, words and punctuation marks in the texts of China, Japan and South Korea. For example, if we add'of'to the stopwords table, then for the sentence 'information system', by default we get the result of word segmentation as' information', 'system'. Among them, 'information' and 'system' are filtered out.
We can query INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE and INFORMATION_SCHEMA.INNODB_FT_TABLE_TABLE to find out which words are in the full-text index. This is a very useful debugging tool. If we find that a document containing a word does not appear in the query results as we expected, then the word may not be in the full-text index for some reason. For example, it contains stopword, or its size is smaller than ngram_token_size, and so on. At this time, we can confirm by querying these two tables. Here is a simple example:
Simple debugging example
one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
15mysql > INSERT INTO articles (title) VALUES ('Information Systems')
Query OK, 1 row affected (0.01sec)
Mysql > SET GLOBAL innodb_ft_aux_table= "test/articles"
Query OK, 0 rows affected (0.00 sec)
Mysql > SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE
+-+ +
| | WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION | |
+-+ +
| | Information | 1 | 1 | 1 | 1 | 0 | |
| | Information system | 1 | 1 | 1 | 1 | 3 | |
| | system | 1 | 1 | 1 | 1 | 6 |
+-+ +
3 rows in set (0.00 sec)
N-gram query processing
Text query (Text Searches)
In natural language mode (NATURAL LANGUAGE MODE), the query of text is transformed into the union of n-gram participle query. For example, ('information system') is converted to ('information system'). Here is an example:
one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
fifteen
16mysql > INSERT INTO articles (title) VALUES ('information system'), ('information'), ('system'), ('information system')
Query OK, 6 rows affected (0.01sec)
Records: 6 Duplicates: 0 Warnings: 0
Mysql > SELECT * FROM articles WHERE MATCH (title) AGAINST ('Information Systems' IN NATURAL LANGUAGE MODE)
+-+ +
| | FTS_DOC_ID | title |
+-+ +
| | 1 | Information system |
| | 6 | Information system |
| | 2 | Information system |
| | 3 | Information system |
| | 4 | Information |
| | 5 | system |
+-+ +
6 rows in set (0.01 sec)
In the Boolean pattern (BOOLEAN MODE), the text query is transformed into a phrase query of n-gram participle. For example, ('information system') is converted to ('information system').
one
two
three
four
five
six
7mysql > SELECT * FROM articles WHERE MATCH (title) AGAINST ('Information Systems' IN BOOLEAN MODE)
+-+ +
| | FTS_DOC_ID | title |
+-+ +
| | 1 | Information system |
+-+ +
1 row in set (0.00 sec)
Wildcard query (Wildcard Searches)
If the prefix length is smaller than ngram_token_size, the query results will return all n-gram words prefixed with this word in the full-text index.
one
two
three
four
five
six
seven
eight
nine
10mysql > SELECT * FROM articles WHERE MATCH (title) AGAINST ('letter *' IN BOOLEAN MODE)
+-+ +
| | FTS_DOC_ID | title |
+-+ +
| | 1 | Information system |
| | 2 | Information system |
| | 3 | Information system |
| | 4 | Information |
+-+ +
4 rows in set (0.00 sec)
If the length of the prefix is greater than or equal to ngam_token_size, the query is converted to a phrase (phrase search) and the wildcard is ignored. For example, ('information *') is converted to ('information'), and ('information system *') to ('information information system').
one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
fifteen
sixteen
seventeen
18mysql > SELECT * FROM articles WHERE MATCH (title) AGAINST ('Information *' IN BOOLEAN MODE)
+-+ +
| | FTS_DOC_ID | title |
+-+ +
| | 1 | Information system |
| | 2 | Information system |
| | 3 | Information system |
| | 4 | Information |
+-+ +
4 rows in set (0.00 sec)
Mysql > SELECT * FROM articles WHERE MATCH (title) AGAINST ('Information system *' IN BOOLEAN MODE)
+-+ +
| | FTS_DOC_ID | title |
+-+ +
| | 1 | Information system |
+-+ +
1 row in set (0.00 sec)
Phrase query (Phrase Searches)
Phrase query is transformed into phrase query of n-gram participle. For example, ('information system') is converted to ('information information system').
one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
15mysql > SELECT * FROM articles WHERE MATCH (title) AGAINST ('Information Systems' IN BOOLEAN MODE)
+-+ +
| | FTS_DOC_ID | title |
+-+ +
| | 1 | Information system |
+-+ +
1 row in set (0.00 sec)
Mysql > SELECT * FROM articles WHERE MATCH (title) AGAINST ('Information Systems' IN BOOLEAN MODE)
+-+ +
| | FTS_DOC_ID | title |
+-+ +
| | 2 | Information system |
+-+ +
1 row in set (0.01sec) at this point, the study on "what is the InnoDB full-text index" is over. I hope to be able to solve your doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.