
How to implement full-text search with Sphinx

2025-03-28 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/02

This article introduces how to implement full-text search with Sphinx. Many people run into difficulties with this in real-world cases, so the following walks through how to handle them. I hope you read it carefully and get something out of it!

1. Brief introduction to Sphinx

1.1. What is Sphinx

Sphinx is a full-text search engine developed by Andrew Aksyonoff in Russia. It aims to give other applications full-text search that is fast, space-efficient, and highly relevant. Sphinx integrates easily with SQL databases and scripting languages. It currently has built-in support for MySQL and PostgreSQL data sources, and can also read XML data in a specific format from standard input. By modifying the source code, users can add new data sources themselves (for example, native support for other types of DBMS).

1.2. Characteristics of Sphinx

High-speed indexing (peak performance up to 10 MB/s on modern CPUs)

High-performance search (on 2-4 GB of text data, the average response time per query is less than 0.1 s)

Can handle large amounts of data (it is known to handle more than 100 GB of text data and 100 million documents on a single-CPU system)

Provides good relevance ranking, combining phrase proximity ranking with statistical (BM25) ranking

Support for distributed search

Support phrase search

Provides document excerpt (snippet) generation

Can serve searches as a MySQL storage engine

Supports Boolean, phrase, word proximity, and other query modes

Supports multiple full-text fields per document (up to 32)

Supports multiple additional attributes per document (e.g. grouping information, timestamps, etc.)

Supports word breaking

1.3. Chinese word segmentation in Sphinx

Chinese full-text search differs from search in English and other Latin-script languages. The latter split text into words on special characters such as spaces, while Chinese has to be segmented into words based on meaning. At present, most databases, MySQL included, do not support Chinese full-text search. As a result, several Chinese full-text search plug-ins for MySQL have appeared in China, and one of the better ones is hightman's Chinese word segmentation. If Sphinx is to perform full-text search on Chinese, it likewise needs a plug-in. The ones I know of are coreseek and sfc.

Coreseek is the most widely used Chinese full-text search solution for Sphinx. It provides LibMMSeg, a Chinese word segmentation package designed for Sphinx. It also provides binary distributions for several systems, including rpm and deb packages and Windows binaries. In addition, coreseek has contributed the following to Sphinx:

Data source support for GBK encoding

Chinese word segmentation based on Chih-Hao Tsai's MMSEG algorithm

A Chinese user manual (a great convenience for new Sphinx users in China, especially those who are not comfortable reading English)

Sfc (sphinx-for-chinese) is another Chinese word segmentation plug-in, provided by a community member known as happy. Its Chinese dictionary uses xdict. According to its introduction, in tests on a Linux platform the current version indexes at roughly half the speed of indexing UTF-8 English text, that is, half the officially claimed speed. Most of the time is spent on word segmentation. sphinx-for-chinese-0.9.10-dev-r2006.tar.gz is now available and tracks the latest version of Sphinx (sphinx 0.9.10). sql_attr_string has been added in this version, and I have tested it myself. It is very easy to install and configure. happy has also made another contribution to word segmentation, php-mmseg, a Chinese word segmentation extension library for PHP.

Here, I would like to pay my greatest respect to the two authors above.

In addition, if you are not interested in Chinese word segmentation, or you only need something like SQL's LIKE, for example select * from product where prodName like '%Mobile%', Sphinx will not let you down. This may be what the official website describes as the simple way to handle Chinese: indexing individual characters directly. And the search speed is not bad ^_^.

This article tests the three Chinese-language approaches above and documents the results, which may be the main focus of this document.
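As a rough sketch of what character-level indexing looks like in practice, the shell function below prints an index section that uses Sphinx's ngram_len and ngram_chars options so that every CJK character is indexed on its own. The index name, source name, data path, and the short charset_table are assumptions made for illustration, not values from this article.

```shell
#!/bin/sh
# print_ngram_index NAME SOURCE PATH
# Prints a Sphinx index section that indexes each CJK character separately.
# NAME, SOURCE and PATH are placeholders chosen by the caller (hypothetical here).
print_ngram_index() {
    index_name=$1
    source_name=$2
    index_path=$3
    cat <<EOF
index $index_name
{
    source        = $source_name
    path          = $index_path
    charset_type  = utf-8
    charset_table = 0..9, A..Z->a..z, _, a..z
    ngram_len     = 1
    ngram_chars   = U+3000..U+2FA1F
}
EOF
}

# Example call with made-up names; redirect the output into your own config file.
print_ngram_index product_idx product_src /usr/local/coreseek/var/data/product_idx
```

With ngram_len = 1, a phrase query for a Chinese word matches documents that contain its characters in sequence, which is what gives the LIKE '%...%' style behaviour without a dictionary.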

2. Installation and configuration example

2.1. Installing on a GNU/Linux/Unix system

There are two ways to use Sphinx with MySQL:

① API calls, for example through the API functions or methods for PHP, Java, and so on. The advantage is that MySQL does not need to be recompiled, the server processes are loosely coupled, and programs can call the search flexibly and conveniently. The disadvantage is that if a search program already exists, parts of it need to be modified. Recommended for programmers.

② Plug-in mode (sphinxSE): compile Sphinx into a MySQL plug-in and search with specific SQL statements. This makes it easy to combine with other conditions on the SQL side, and the data can be returned directly to the client with no second query (see the note below). Only the corresponding SQL in the program needs to change, but that is very inconvenient for programs built on a framework, such as those using an ORM. In addition, MySQL must be recompiled, and MySQL 5.1 or later, which supports pluggable storage engines, is required. System administrators may prefer this method.

Note on the second query: up to the current release, sphinx-0.9.9, Sphinx returns only the IDs of the matching records after a search, not the SQL data you want, so you need to query the database again using those IDs.

Sphinx 0.9.10, which is still under development, can already store this text data. I have tried it, and the performance and storage results are not good; after all, it has not been officially released yet.

The first method is used in this article.
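The second query described in the note can be sketched as a small helper that turns a list of matching IDs into a SELECT statement. This is only an illustration: the function name ids_to_sql, the table name product, and the primary key column id are assumptions, and the IDs would come from whatever search call your program makes.

```shell
#!/bin/sh
# ids_to_sql TABLE ID...
# Prints a SELECT that fetches the rows whose id is in the given list.
# Returns 1 when no IDs are given and 2 when an ID is not a plain integer.
ids_to_sql() {
    table=$1
    shift
    [ "$#" -gt 0 ] || return 1          # no matches, so nothing to fetch
    id_list=
    for id in "$@"; do
        case $id in
            ''|*[!0-9]*) return 2 ;;    # reject anything that is not digits only
        esac
        id_list="${id_list:+$id_list,}$id"
    done
    printf 'SELECT * FROM %s WHERE id IN (%s);\n' "$table" "$id_list"
}

# Example with made-up IDs, as if they had been returned by a search.
ids_to_sql product 3 17 42
```

Note that rows selected with IN (...) do not come back in the order the search returned the IDs, so the program has to restore the relevance order itself.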

To install on a *nix system, the following software is required first.

Software environment:

Operating system: CentOS-5.2

Database: mysql-5.0.77-3.el5, mysql-devel (use mysql-5.1 or later if you want to use the sphinxSE plug-in)

Build tools: gcc gcc-c++ autoconf automake
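Before building, it can help to confirm that the tools listed above are actually on the PATH. The following is a minimal sketch; the function name missing_tools is made up for this example, and checking for mysql_config as a sign that mysql-devel is installed is an assumption about how the package is laid out on your system.

```shell
#!/bin/sh
# missing_tools TOOL...
# Prints one "missing: TOOL" line for every tool not found on PATH.
# Returns 0 when everything is present, 1 otherwise.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return $status
}

# g++ is the compiler installed by the gcc-c++ package.
missing_tools gcc g++ autoconf automake mysql_config \
    || echo "install the missing packages before building"
```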

Chinese full-text search differs from search in English and other Latin-script languages. The latter split text into words on special characters such as spaces, while Chinese has to be segmented into words based on meaning. There are two main plug-ins for Chinese word segmentation.

Coreseek is the most widely used Chinese full-text search solution for Sphinx. It provides LibMMSeg, a Chinese word segmentation package designed for Sphinx, and is built on top of Sphinx.

Sfc (sphinx-for-chinese) is another Chinese word segmentation plug-in, provided by a community member known as happy. Its Chinese dictionary uses xdict.

This article mainly covers how to install Coreseek.

4. Installation of Coreseek (sphinx that supports Chinese search)

Because coreseek requires autoconf version 2.64 or later, you need to upgrade autoconf first; otherwise the build reports an error. Download autoconf-2.64.tar.bz2 from http://download.chinaunix.net/download.php?id=29328&ResourceID=648 and install it as follows:

tar -jxvf autoconf-2.64.tar.bz2

cd autoconf-2.64

./configure

make

make install
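To check whether the upgrade is needed at all, or whether it took effect, the installed autoconf version can be compared against 2.64. The sketch below assumes the version is the last word on the first line of autoconf --version output and that it consists only of numeric fields separated by dots; version_ge is a helper name invented for this example.

```shell
#!/bin/sh
# version_ge A B
# Succeeds when dotted version A is greater than or equal to B.
# Only numeric fields are supported; a missing field counts as 0.
version_ge() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Take the leading field of each version string.
        fa=${a%%.*}
        fb=${b%%.*}
        fa=${fa:-0}
        fb=${fb:-0}
        [ "$fa" -gt "$fb" ] && return 0
        [ "$fa" -lt "$fb" ] && return 1
        # Fields are equal: drop them and compare the next pair.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

if command -v autoconf >/dev/null 2>&1; then
    installed=$(autoconf --version | awk 'NR == 1 { print $NF }')
    if version_ge "$installed" 2.64; then
        echo "autoconf $installed is new enough"
    else
        echo "autoconf $installed is too old; upgrade to 2.64 or later"
    fi
else
    echo "autoconf is not installed"
fi
```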

The new version of coreseek puts dictionaries and sphinx source programs in a package, so you only need to download the coreseek package.

http://pan.baidu.com/s/1dEK4x3r

tar xzvf coreseek-3.2.14.tar.gz

cd mmseg-3.2.14

./bootstrap    # warnings in the output can be ignored; errors must be resolved

./configure --prefix=/usr/local/mmseg3

make && make install

cd ..

cd csft-3.2.14

sh buildconf.sh    # warnings in the output can be ignored; errors must be resolved

./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql

make && make install

cd ..
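Once both builds finish, a quick check that the expected programs landed under the two prefixes can catch a failed make install early. This is a sketch under assumptions: the helper name check_build is invented here, and the binary names (mmseg under the mmseg3 prefix; indexer, searchd, and search under the coreseek prefix) reflect the usual layout of these packages rather than anything stated in this article.

```shell
#!/bin/sh
# check_build PREFIX BINARY...
# Reports whether each BINARY exists and is executable under PREFIX/bin.
# Returns 0 when all are present, 1 otherwise.
check_build() {
    prefix=$1
    shift
    status=0
    for bin in "$@"; do
        if [ -x "$prefix/bin/$bin" ]; then
            echo "ok: $prefix/bin/$bin"
        else
            echo "not found: $prefix/bin/$bin"
            status=1
        fi
    done
    return $status
}

# Prefixes match the --prefix values used in the configure commands.
check_build /usr/local/mmseg3 mmseg || true
check_build /usr/local/coreseek indexer searchd search || true
```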

If the error "config.status: error: cannot find input file: src/Makefile.in" occurs, run the following commands before running configure:

aclocal

libtoolize --force

automake --add-missing

autoconf

autoheader

make clean

That concludes this introduction to full-text search with Sphinx. Thank you for reading.
