

Sphinx combined with scws to implement full-text search for MySQL




The following walks through how to combine Sphinx with scws to implement full-text search for MySQL. Without further ado, let's get straight into it; after reading this article you should have everything you need to build the setup yourself.

System environment

Hostname    IP address        Related services
SQL         172.169.18.128    MySQL 5.6 (master)
Sphinx      172.169.18.210    MySQL 5.6 (slave), PHP 5.6, Apache 2.4

Sphinx version: 2.2.10

Sphinx PHP extension version: 1.3.3

scws word segmentation version: 1.2.3

I. Brief introduction

1.1 What is Sphinx

Reference address: http://www.sphinxsearch.org/sphinx-tutorial

Sphinx is a full-text search engine developed by Andrew Aksyonoff. It is intended to provide other applications with full-text search that is fast, light on disk space, and highly relevant. Sphinx integrates easily with SQL databases and scripting languages. The current release has built-in support for MySQL and PostgreSQL data sources, and it can also read XML data in a specific format from standard input.

Sphinx builds indexes quickly: indexing 1 million records takes only 3-4 minutes, an index of 10 million records can be created in about 50 minutes, and an incremental index containing only the latest 100,000 records can be rebuilt in just a few seconds.

1.2 The features of Sphinx are as follows:

(1) High-speed indexing (on modern CPUs, peak performance can reach 10 MB/s)

(2) High-performance search (on 2-4 GB of text data, the average response time per query is under 0.1 s)

(3) Handles large amounts of data (it is known to handle more than 100 GB of text and 100 million documents on a single-CPU system)

(4) Provides an excellent relevance algorithm, a compound ranking method based on phrase proximity and statistics (BM25)

(5) Supports distributed search

(6) Supports phrase search

(7) Provides document summary (excerpt) generation

(8) Can act as a MySQL storage engine to provide search services

(9) Supports multiple retrieval modes such as Boolean, phrase, and word-similarity search

(10) Documents support multiple full-text retrieval fields (up to 32)

(11) Documents support multiple additional attributes (for example, grouping information, timestamps, etc.)

(12) Supports word segmentation

1.3 Summary

Advantages: high efficiency and good scalability

Disadvantages: it is not responsible for data storage

The Sphinx search engine indexes the data, loads the index in one pass, and then keeps it in memory, so users only query the Sphinx server when searching. Moreover, Sphinx performs better because it avoids the overhead of MySQL's random disk I/O.

2.1 What is SCWS

SCWS is the acronym of Simple Chinese Word Segmentation (a simple Chinese word segmentation system).

Reference address: http://www.xunsearch.com/scws/index.php

This is a mechanical Chinese word segmentation engine based on a word-frequency dictionary; it can correctly split a whole passage of Chinese text into words. A word is the smallest meaningful unit in Chinese, but unlike in English, words are not separated by spaces in writing, so segmenting words accurately and quickly has always been a hard problem in Chinese text processing.

2.2 Features

SCWS is developed in pure C, depends on no external libraries, can be embedded in applications directly as a dynamic link library, and supports Chinese encodings such as GBK and UTF-8. A PHP extension module is also provided, making it quick and easy to use the word segmentation functions from PHP.

There is nothing particularly innovative in the segmentation algorithm: it uses a self-collected word-frequency dictionary, supplemented by rules for proper nouns, person names, place names, numbers, dates and so on, to achieve basic word segmentation. In small-scale tests the accuracy was between 90% and 95%, which is basically sufficient for small search engines, keyword extraction and similar scenarios. The first prototype version was released at the end of 2005.

II. Environment preparation

1. Temporarily turn off the firewall

2. Turn off SELinux

3. System environment

CentOS 7.4, MySQL 5.6, PHP 5.6

III. Set up the Sphinx service

1. Install the dependency package

# yum -y install make gcc gcc-c++ libtool autoconf automake mysql-devel

2. Install Sphinx

# yum install expat expat-devel

# wget -c http://sphinxsearch.com/files/sphinx-2.2.10-release.tar.gz

# tar -zxvf sphinx-2.2.10-release.tar.gz

# cd sphinx-2.2.10-release/

# ./configure --prefix=/usr/local/sphinx --with-mysql --with-libexpat --enable-id64

# make && make install

3. Install libsphinxclient (needed by the PHP extension)

# cd api/libsphinxclient/

# ./configure --prefix=/usr/local/sphinx/libsphinxclient

# make && make install

After the installation, check whether the three directories bin, etc and var exist under /usr/local/sphinx. If so, the installation is correct.

4. Install the Sphinx PHP extension: my PHP is 5.6, so sphinx-1.3.3.tgz is needed. For PHP below 5.4 you can use sphinx-1.3.0.tgz; for PHP 7.0 you need sphinx-339e123.tar.gz (http://git.php.net/?p=pecl/search_engine/sphinx.git;a=snapshot;h=339e123acb0ce7beb2d9d4f9094d6f8bcf15fb54;sf=tgz).

# wget -c http://pecl.php.net/get/sphinx-1.3.3.tgz

# tar zxvf sphinx-1.3.3.tgz

# cd sphinx-1.3.3/

# phpize

# ./configure --with-sphinx=/usr/local/sphinx/libsphinxclient/ --with-php-config=/usr/bin/php-config

# make && make install

# after success, you will be prompted:

Installing shared extensions: /usr/lib64/php/modules/

# modify php configuration

# echo "[Sphinx]" >> /etc/php.ini

# echo "extension = sphinx.so" >> /etc/php.ini

# restart httpd service

# systemctl restart httpd.service
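Before moving on, it may be worth confirming that PHP actually picked up the extension. The small script below is my own suggestion, not part of the original article; the file name check_sphinx_ext.php is arbitrary:

<?php
// check_sphinx_ext.php - verify that the sphinx PECL extension is available
var_dump(extension_loaded('sphinx'));    // expect bool(true)
var_dump(class_exists('SphinxClient'));  // the client class used later for searching
?>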

5. Create test data (database: jiangjj)

CREATE TABLE IF NOT EXISTS `items` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) NOT NULL,
  `content` text NOT NULL,
  `created` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='full-text search test table' AUTO_INCREMENT=11;

INSERT INTO `items` (`id`, `title`, `content`, `created`) VALUES
(1, 'linux mysql Cluster installation', 'MySQL Cluster is a highly practical, scalable, high-performance, highly redundant version of MySQL suitable for distributed computing environments', '2016-09-07 00:00:15'),
(2, 'mysql Master-Slave replication', 'basic principles of mysql master-slave backup (replication): mysql supports one-way, asynchronous replication, in which one server acts as the master while one or more other servers act as slaves', '2016-09-06 00:00:00'),
(3, 'hello', 'can you search me?', '2016-09-05 00:00:00'),
(4, 'mysql', 'mysql is the best database?', '2016-09-03 00:00:00'),
(5, 'mysql index', 'as for the benefits of the MySQL index: if a MySQL with a correctly and reasonably designed index is a Lamborghini, then a MySQL without indexes is a human-powered tricycle', '2016-09-01 00:00:00'),
(6, 'Cluster', 'with regard to the benefits of the MySQL index: if a MySQL with a correctly designed index is a Lamborghini, then a MySQL without indexes is a human-powered tricycle', '0000-00-00 00:00:00'),
(9, 'replication principle', 'redis also has replication', '0000-00-00 00:00:00'),
(10, 'redis cluster', 'cluster technology is an important means of building a high-performance website architecture. Imagine that while a website is under high-concurrency pressure it also has to query the qualifying data from a massive data set and respond quickly; the natural approach is to shard the data and distribute it across multiple server nodes according to certain rules, so as to reduce the pressure on any single node', '0000-00-00 00:00:00');

CREATE TABLE IF NOT EXISTS `sph_counter` (
  `counter_id` int(11) NOT NULL,
  `max_doc_id` int(11) NOT NULL,
  PRIMARY KEY (`counter_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COMMENT='counter table for the incremental index';

6. Sphinx configuration: pay attention to modifying the data source configuration information.

Configuration address: http://www.cnblogs.com/yjf512/p/3598332.html

The index strategy below is "Main + Delta" ("main index" + "incremental index"), using Sphinx's built-in unigram (single-character) tokenization.

# vim /usr/local/sphinx/etc/sphinx.conf

source items
{
    type = mysql
    sql_host = 172.169.18.128
    sql_user = jiangjj
    sql_pass = 123456
    sql_db = jiangjj
    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET SESSION query_cache_type = OFF
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM items
    sql_query_range = SELECT MIN(id), MAX(id) FROM items \
        WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
    sql_query = SELECT id, title, content FROM items \
        WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1) \
        AND id >= $start AND id <= $end
}

# incremental data source: only rows added after the counter was last updated
source items_delta : items
{
    sql_query_pre = SET NAMES utf8
    sql_query_range = SELECT MIN(id), MAX(id) FROM items \
        WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
    sql_query = SELECT id, title, content FROM items \
        WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1) \
        AND id >= $start AND id <= $end
}

# main index
index items
{
    source = items
    path = /usr/local/sphinx/var/data/items
    ngram_len = 1
    ngram_chars = U+3000..U+2FA1F
    charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
    preopen = 1
    min_infix_len = 1
}

# incremental index
index items_delta : items
{
    source = items_delta
    path = /usr/local/sphinx/var/data/items_delta
}

# distributed index
index master
{
    type = distributed
    local = items
    local = items_delta
}

indexer
{
    mem_limit = 256M
}

searchd
{
    listen = 9312
    listen = 9306:mysql41
    log = /usr/local/sphinx/var/log/searchd.log
    query_log = /usr/local/sphinx/var/log/query.log
    # compat_sphinxql_magics = 0
    attr_flush_period = 600
    mva_updates_pool = 16M
    read_timeout = 5
    max_children = 0
    dist_threads = 2
    pid_file = /usr/local/sphinx/var/log/searchd.pid
    # max_matches = 1000
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old = 1
    workers = threads
    binlog_path = /usr/local/sphinx/var/data
}

7. Create the Sphinx indexes

# cd /usr/local/sphinx/bin/

# build the indexes for the first time
# ./indexer -c /usr/local/sphinx/etc/sphinx.conf --all

# start sphinx
# ./searchd -c /usr/local/sphinx/etc/sphinx.conf

# view the process
# ps -ef | grep searchd

# view status
# ./searchd -c /usr/local/sphinx/etc/sphinx.conf --status

# stop sphinx
# ./searchd -c /usr/local/sphinx/etc/sphinx.conf --stop

8. Index update and usage instructions

The "incremental index" is updated every N minutes. An index merge is usually done every night when load is low, after which the "incremental index" is rebuilt. Of course, if the "main index" does not hold much data, you can also simply rebuild the "main index" directly.

When searching through the API, both the main index and the incremental index are queried, giving quasi-real-time results. The Sphinx configuration in this article puts the "main index" and "incremental index" into the distributed index master, so you only need to query the distributed index "master" to get all matching data (including the latest data).

Updating and merging indexes can be done by cron job:

8.1. Edit

# crontab -e

*/1 * * * * /usr/local/sphinx/shell/index_update.sh

0 3 * * * /usr/local/sphinx/shell/merge_index.sh

// View

# crontab -l

8.2. The scripts are as follows

# Update script

# vim /usr/local/sphinx/shell/index_update.sh

#!/bin/sh
/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta > /dev/null 2>&1

# Merge script

[root@Sphinx shell]# vim merge_index.sh

#!/bin/bash
indexer=/usr/local/sphinx/bin/indexer
mysql=`which mysql`
# host=172.169.18.128
# mysql_user=jiangjj
# mysql_password=123456

QUERY="use jiangjj;select max_doc_id from sph_counter where counter_id = 2 limit 1;"
index_counter=$($mysql -h172.169.18.128 -ujiangjj -p123456 -sN -e "$QUERY")

# merge the "main + delta" indexes
$indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate --merge items items_delta --merge-dst-range deleted 0 0 >> /usr/local/sphinx/var/index_merge.log 2>&1

if [ "$?" -eq 0 ]; then
    ## update sphinx counter
    if [ ! -z $index_counter ]; then
        $mysql -h172.169.18.128 -ujiangjj -p123456 -Djiangjj -e "REPLACE INTO sph_counter VALUES (1, '$index_counter')"
    fi
    ## rebuild the delta index to avoid confusion with the main index
    $indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta >> /usr/local/sphinx/var/rebuild_deltaindex.log 2>&1
fi

# Grant execute permission
# chmod u+x *.sh

Once the above tests pass, proceed to the next step.

IV. Set up the scws (Chinese word segmentation) service

Download address: http://www.xunsearch.com/scws/download.php

1. Download and install scws; make sure the extension version matches your PHP version.

# wget -c http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2

# tar jxvf scws-1.2.3.tar.bz2

# cd scws-1.2.3/

# ./configure --prefix=/usr/local/scws

# make && make install

2. Install the PHP extension for scws

# cd ./phpext/

# phpize

# ./configure

# make && make install


# modify php configuration file

[root@Sphinx ~]# echo "[scws]" >> /etc/php.ini

[root@Sphinx ~]# echo "extension = scws.so" >> /etc/php.ini

[root@Sphinx ~]# echo "scws.default.charset = utf-8" >> /etc/php.ini

[root@Sphinx ~]# echo "scws.default.fpath = /usr/local/scws/etc/" >> /etc/php.ini

3. Dictionary (word list) installation

Download address: http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2

# wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2

# tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/sphinx/etc/

# chown -R apache:apache /usr/local/sphinx/etc/dict.utf8.xdb
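As a quick sanity check of the scws extension configured above, the minimal script below segments one sentence. It is only an illustrative sketch (the file name test_scws.php and the sample sentence are my own), and it assumes the dictionary file ends up in the directory pointed to by scws.default.fpath in php.ini:

<?php
// test_scws.php - minimal sketch: segment a Chinese sentence with the scws extension
$so = scws_new();
$so->set_charset('utf8');              // match the UTF-8 dictionary installed above
$so->send_text('Sphinx结合scws实现mysql全文检索');
while ($words = $so->get_result()) {   // returns batches of words, false when finished
    foreach ($words as $w) {
        echo $w['word'], ' ';          // each item also carries off / len / idf / attr
    }
}
echo "\n";
$so->close();
?>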

V. Testing Sphinx + scws with PHP

1. The Sphinx source code ships API clients for several languages under its api/ directory; one of them is sphinxapi.php.

Create a new Search.php file and a front-end page index.php.

Code omitted.
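Since the author's code is omitted, the following is only a hypothetical sketch of what such a Search.php might look like: it segments the user's keyword with scws, queries the distributed index "master" through the SphinxClient class from the sphinx PECL extension installed earlier, and then fetches the matching rows back from MySQL. The match mode, limits and output format are my own illustrative choices, not taken from the original article.

<?php
// Search.php - hypothetical sketch, not the author's omitted code
$keyword = isset($_GET['q']) ? trim($_GET['q']) : '';
if ($keyword === '') {
    exit('empty query');
}

// 1. Segment the keyword with scws so Chinese input becomes searchable terms
$so = scws_new();
$so->set_charset('utf8');
$so->send_text($keyword);
$terms = array();
while ($res = $so->get_result()) {
    foreach ($res as $w) {
        $terms[] = $w['word'];
    }
}
$so->close();
$query = implode(' ', $terms);

// 2. Query searchd (listening on 9312, see sphinx.conf above) against the "master" index
$cl = new SphinxClient();
$cl->setServer('127.0.0.1', 9312);
$cl->setMatchMode(SPH_MATCH_ANY);        // illustrative choice of match mode
$cl->setLimits(0, 20);
$result = $cl->query($query, 'master');
if ($result === false) {
    exit('search failed: ' . $cl->getLastError());
}

// 3. Fetch the matched rows back from MySQL by document id
if (!empty($result['matches'])) {
    $ids  = implode(',', array_keys($result['matches']));
    $db   = new mysqli('172.169.18.128', 'jiangjj', '123456', 'jiangjj');
    $rows = $db->query("SELECT id, title, content FROM items WHERE id IN ($ids)");
    while ($row = $rows->fetch_assoc()) {
        echo $row['id'], ' - ', $row['title'], "\n";
    }
}
?>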

Visit the index.php page in a browser to verify the search results.

2. SphinxQL test

To use SphinxQL, you need to add the corresponding listening port to the searchd configuration (see the configuration above).

# mysql -h127.0.0.1 -P9306 -ujiangjj -p

mysql> show global variables;

mysql> desc items;

mysql> select * from master where match('mysql*') limit 10;

mysql> show meta;
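The same check can also be done from PHP over the SphinxQL port instead of the native API. The snippet below is a small illustrative sketch using mysqli against searchd on port 9306 (the file name and the empty user/password are placeholders of mine, not from the original article):

<?php
// sphinxql_test.php - query searchd through its MySQL-protocol listener (port 9306)
$sphinx = mysqli_connect('127.0.0.1', '', '', '', 9306);
if (!$sphinx) {
    exit('cannot connect to searchd: ' . mysqli_connect_error());
}

$res = mysqli_query($sphinx, "SELECT * FROM master WHERE MATCH('mysql') LIMIT 10");
while ($row = mysqli_fetch_assoc($res)) {
    print_r($row);                          // SphinxQL returns ids and attributes, not the original text
}

$meta = mysqli_query($sphinx, 'SHOW META'); // statistics about the last search
while ($row = mysqli_fetch_assoc($meta)) {
    echo $row['Variable_name'], ' = ', $row['Value'], "\n";
}
mysqli_close($sphinx);
?>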

That covers combining Sphinx with scws to implement full-text search for MySQL. Is there anything you still don't understand, or would you like to learn more? You can continue to follow our industry information section.



