How to implement a search engine in MySQL 07/13 Update SLTechnology News&Howtos

How to implement a search engine in MySQL

2025-07-13 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article shows you how to achieve a search engine in MySQL, the content is concise and easy to understand, it can definitely brighten your eyes. I hope you can get something through the detailed introduction of this article.

Preface

Only Innodb and myisam storage engines can use full-text indexing (innodb supports full-text indexing starting with mysql5.6)

Char, varchar, text type fields can create full-text index (fulltext index type)

Full-text indexing is based on keywords, how to distinguish different keywords, it is necessary to use participle (stopword)

English words are segmented with spaces and commas; Chinese word segmentation is not convenient (a sentence does not know how to distinguish different keywords)

The built-in word segmentation parser ngram supports Chinese, Japanese and Korean (dividing sentences into fixed-number phrases)

When writing a large amount of data to a table, it is faster to create a full-text index after writing the data (reducing the overhead of maintaining the index)

The principle of full-text indexing is an inverted index (a data structure) that generally uses associative arrays to store the mapping of words to the location in the document in an auxiliary table.

Use

Use MATCH ()... Search in AGAINST mode

Match () indicates that the column is searched, and against indicates that the string is searched

Check the default participle (use these words to distinguish different keywords); you can also customize the participle to distinguish different keywords.

SELECT * FROM information_schema.INNODB_FT_DEFAULT_STOPWORD

Such as

+-+ | value | +-+ | a | | about | | an | | are | | as | | at | | be | | by | | com | | de | | en | | for | | from |

Three types of full-text search

Natural language search (Natural language search)

Pass a specific string through MATCH AGAINST for checking, default mode

Boolean search (Boolean search)

Add operators to the retrieved string, such as "+" means it must be included, "-" does not contain, "*" means wildcard, even if the passed string is small or appears in the stop word, it will not be filtered out.

Query expansion search (query extended search)

The search string is used to perform a natural language search, and then the word of the most relevant line returned by the search is added to the search string and searched again, and the query returns the line from the second search

Related parameters

Configure related parameters

Innodb_ft_min_token_size

Default 3, which means a minimum of 3 characters as a keyword. Increasing this value reduces the size of the full-text index.

Innodb_ft_max_token_size

The default is 84, which means a maximum of 84 characters as a keyword. Limiting this value reduces the size of the full-text index.

Ngram_token_size

Default 2, which indicates that 2 characters are used as a keyword in the built-in word segmentation parser, such as establishing a full-text index on "abcd". The key word is' ab','bc','cd',. When using the ngram word segmentation parser, innodb_ft_min_token_size and innodb_ft_max_token_size are invalid

Note: none of these three parameters can be modified dynamically. If you modify these parameters, you need to restart the MySQL service and re-establish the full-text index.

Here is a copy of the "complete MySQL development specifications" into the big factory must see, recommended to see. Follow the official account Java technology stack and reply to mysql for more tutorials.

Testing the innodb engine using full-text indexing

Prepare for

1. Goal

Query whether an article contains a keyword; the number of times a keyword appears in a series of articles

Query whether the title of the article contains a keyword

2. Set the following parameters to reduce disk IO pressure

SET GLOBAL sync_binlog=100; SET GLOBAL innodb_flush_log_at_trx_commit=2

3. Import 1kw data for test full-text indexing

The data source is searched on the Internet.

Extraction code: 60l7

4. The structure of an article table

CREATE TABLE `originle` (`id`bigint (10) NOT NULL, `url`varchar (1024) CHARACTER SET latin1 NOT NULL DEFAULT'', `title`varchar (1024) NOT NULL DEFAULT'', `source`varchar''COMMENT' COMMENT', `keywords` varchar (32) DEFAULT NULL, `publish_ time`timestamp NULL DEFAULT NULL, PRIMARY KEY (`id`), KEY `title_ idx` (`title`) ENGINE=InnoDB

Use myloader multithreading to import test data, and decompress the test data first

Tar-zxf mydumper_dump_article.tar.gz time myloader-u $user-p $passwd-S $socket-t 32-d / datas/dump_article-v 3

5. The total amount of data and the size of data files and index files after data import

SELECT COUNT (*) FROM `roomle` +-+ | COUNT (*) | +-+ | 10000000 | +-+ 1 row in set (7.85 sec) SELECT table_name, CONCAT (FORMAT (SUM (data_length) / 1024 / 1024),'M') AS dbdata_size, CONCAT (FORMAT (SUM (index_length) / 1024 / 1024),'M') AS dbindex_size CONCAT (FORMAT (SUM (data_length + index_length) / 1024 / 1024 / 1024 and table_name='article' 2),'G') AS `db_size (G) `, AVG_ROW_LENGTH,table_rows,update_time FROM information_schema.tables WHERE table_schema = DATABASE () and table_name='article' +-+ | table_name | dbdata_size | dbindex_size | | db_size (G) | AVG_ROW_LENGTH | table_rows | update_time | +-| -+ | article | 3710.00m | 1003.00m | 4.60g | 414g | 9388739 | 2019-07-05 15:31:37 | +- -+

Create a full-text index using the default method

1. The table is already related to the key word field (a brief description of the content of the article), and uses "," as the word separator.

2. Search for a keyword without a full-text index

A full table scan is required

Select count (*) from article where keywords like'% Fashion%'; +-+ | count (*) | +-+ | 163 | +-+ 1 row in set (7.56 sec)

3. Create a full-text index of the keyword field (as a participle)

Set innodb_ft_min_token_size in the my.cnf configuration file and restart the MySQL service (minimum two characters as one keyword, default three characters as one keyword)

[mysqld] innodb_ft_min_token_size=2

3.1 set custom stopwords (i.e. participle)

USE mysql; CREATE TABLE my_stopwords (VALUE VARCHAR (30)) ENGINE = INNODB; INSERT INTO my_stopwords (VALUE) VALUE (','); SET GLOBAL innodb_ft_server_stopword_table = 'mysql/my_stopwords'; SHOW GLOBAL VARIABLES WHERE Variable_name IN (' innodb_ft_min_token_size','innodb_ft_server_stopword_table') +-- +-+ | Variable_name | Value | +-- -+ | innodb_ft_min_token_size | 2 | | innodb_ft_server_stopword_table | mysql/my_stopwords | +-+-+

3.2 create a full-text index

Alter table article add fulltext index idx_full_keyword (keywords); * [] Query OK, 0 rows affected, 1 warning (1 min 27.92 sec) * [] Records: 0 Duplicates: 0 Warnings: 1

3.3 the remaining disk space needs to be sufficient. The original table is 4.6g and the remaining 5.7g disk. Adding a full-text index will also fail.

Df-h Filesystem Size Used Avail Use% Mounted on / dev/vda1 7.8G 6.3G 1.2G 85% / tmpfs 1.9G 0 1.9G 0% / dev/shm / dev/mapper/vg_opt-lvol0 19G 12G 5.7G 68% / datas creates a temporary file 8.6k Jul 5 with the original table size 16:19 # sql-5250_3533.frm 4.4G Jul 5 16:20 # sql-ib117-1768830977.ibd alter table article add fulltext index idx_full_keyword (keywords) ERROR 1114 (HY000): The table 'article' is full

3.4 query the number of occurrences of a keyword using the created full-text index

Query response time has been greatly improved, only 0.05s; using where keywords like'% Fashion% 'takes 7.56s

Select count (*) from article where match (keywords) against ('% Fashion%'); +-+ | count (*) | +-+ | 163,163 | +-+ 1 row in set (0.05 sec)

3.5 if you need to match multiple keywords at the same time, use Boolean full-text search

Indicates the number of records that exactly match "Sanlitun, Beijing":

Select count (*) from article where match (keywords) against ('+ Sanlitun, Beijing'in boolean mode) +-+ | count (*) | +-+ | 1 | +-+ 1 row in set (0.06 sec) indicates the number of records matching "Sanlitun" or "Beijing" select count (*) from article where match (keywords) against ("Sanlitun, Beijing") +-+ | count (*) | +-+ | 8 | +-+ 1 row in set (0.06 sec)

3.6 after the full-text index is created, some other files are created

96K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_1.ibd 96K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_2.ibd 96K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_3.ibd 96K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_4.ibd 128K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_5.ibd 256K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_6.ibd 96K Jul 5 16:29 FTS_00000000000000a7_BEING_DELETED_CACHE.ibd 96K Jul 5 16:29 FTS_00000000000000a7_BEING_DELETED.ibd 96K Jul 5 16:30 FTS_00000000000000a7_CONFIG.ibd 96K Jul 5 16:29 FTS_00000000000000a7_DELETED_CACHE.ibd 96K Jul 5 16:29 FTS_00000000000000a7_DELETED.ibd

The first 6 represent inverted indexes (secondary index tables)

The 7th and 8th represents a document ID (DOC_ID) that contains deleted documents, whose data is currently being deleted from the full-text index

The 9th message that represents the internal state of the FULLTEXT index

10th and 11th represent documents that contain deleted data but have not yet been deleted from the full-text index

Create a full-text index using the ngram word segmentation parser

1. Set up a full-text index to the title field (there is no fixed stopwords participle in this field, use the ngram participle parser)

You need to set ngram_token_size in the my.cnf configuration file (2 characters by default as the keyword of ngram), and restart the mysql service.

The default 2 is used here

Select title from article limit 10 +-- + | title | + -+ | worth IT | | Launchpad Jiangnanpi Leather factory small show | | A "madman" is carried back to the background at a rare moment behind the scenes of Raw | | Raw: the prince scolded your father that you are a green tea boy with a dozen of four | | four groups of 30 square meters hardcover small households. A large number of pictures Attached picture | | Sexy smoked Cat Eye make-up of Queen of Night Club | | Big Boulder Johnson | | Cui Xiuying clothing Science Pu Lin of Girl Age Yun'er Huang Meiying, Jin Tae-yeon, Zheng Xiujing | | Deyang outdoor outing Huatian self-help BBQ | +-+

2. Create a full-text index on the title field

Alter table article add fulltext index ft_index_title (title) with parser ngram; Query OK, 0 rows affected (3 min 29.22 sec) Records: 0 Duplicates: 0 Warnings: 0

3. An inverted index will be created (the longer the title field, the larger the inverted index will be created)

112M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_1.ibd 28M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_2.ibd 20M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_3.ibd 140M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_4.ibd 128M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_5.ibd 668M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_6.ibd

4. Do not set up a full-text index to search for a keyword in title

Select count (*) from article where title like'% Outdoor%'; +-+ | count (*) | +-+ | 22058 | +-+ 1 row in set (8.60 sec) select count (*) from article where title like'% background%' +-+ | count (*) | +-+ | 1142 | +-+

5. Search for a keyword using a full-text index

Response time has been greatly improved

Select count (*) from article where match (title) against ('Outdoor'); +-+ | count (*) | +-+ | 22058 | +-+ 1 row in set (0.07 sec) select count (*) from article where title like'% background%' +-+ | count (*) | +-+ | 1142 | +-+ 1 row in set (8.31 sec)

6. Note that inconsistencies will occur when the number of keywords searched is greater than 2 (ngram_token_size definition size).

General search, the actual number of records that appear this keyword is 6 select count (*) from article where title like'% Prince big%' +-+ | count (*) | +-+ | 6 | +-+ 1 row in set (8.40 sec) full-text search. The number of records with keywords is 9443 select count (*) from article where match (title) against ('Gongzi Da') +-+ | count (*) | +-+ | 9443 | +-+ 1 row in set (0.06 sec) the number of records in which the keyword actually appears is 1 select count (*) from article where title like'% Huatian self-help%' +-+ | count (*) | +-+ | 1 | +-+ 1 row in set (8.33 sec) full-text search the number of records with this keyword is 3202 select count (*) from article where match (title) against ("Huatian self-help") +-+ | count (*) | +-+ | 3202 | +-+ 1 row in set (0.06 sec)

Conclusion

When there is a fixed stopword participle (English space character, Chinese ","-", etc.) in a field of mysql, the full-text index of this field can be established to quickly search the relevant record information of a certain keyword and achieve the effect of a simple search engine * *

When a mysql field does not have a fixed stopword participle, the built-in parser ngram can be used to quickly search the field value into a fixed number of keywords (ngram_token_size defined size). When the number of characters of the searched keyword is not equal to the ngram_token_size defined size, there will be problems that are inconsistent with the actual situation.

The full-text index can be searched quickly, and there is also the overhead of maintaining the index; the larger the length of the field, the larger the full-text index created, which will affect the throughput of DML statements. You can use a special full-text search engine ES to do this.

The above is how to implement a search engine in MySQL. Have you learned any knowledge or skills? If you want to learn more skills or enrich your knowledge reserve, you are welcome to follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.