This article explains the principle of the Lucene inverted index. The approach introduced here is simple, fast, and practical, so let's walk through it together.
I. Introduction to search engines
1.1 What is a search engine
Here is an introduction from Baidu Encyclopedia:
A search engine (Search Engine) is a system that collects information from the Internet according to certain strategies, using specific computer programs; after organizing and processing the information, it provides retrieval services and displays relevant results to users.
1.2 Search engine classification
Search engines include full-text indexes, directory indexes, meta-search engines, vertical search engines, aggregated search engines, portal search engines, free link lists, and so on.
This article focuses on the full-text index, the category used by search engines such as Baidu.
Full-text index
The first step is collecting data into the database. Search engines gather information automatically in two ways:
One is periodic crawling: every so often (for Google, typically every 28 days), the search engine actively dispatches a "spider" program to crawl websites within a certain range of IP addresses. Whenever it finds a new site, it automatically extracts the site's content and URL and adds them to its database.
The other is site submission: the site owner submits a URL to the search engine, which then sends a "spider" to the site within a certain period (anywhere from 2 days to several months), scans the site, and stores the relevant information in the database for users to query.
When a user searches with keywords, the engine looks them up in its database. For each page that matches the user's request, it computes a relevance score with a dedicated algorithm, usually based on the match quality, position, and frequency of the keywords in the page, plus link quality, and then returns the page links to the user ranked by relevance. Engines of this kind are characterized by high recall.
1.3 What problems do search engines solve?
Efficient querying (a variety of algorithms answer queries at the millisecond level, whether over tens of millions or hundreds of millions of records)
Migrating from an ordinary database to a search engine is relatively easy
Large data volumes, timeliness, high concurrency, and so on
1.4 Application scenarios of search engines
The database holds data on the order of millions of records or more
Timely retrieval, high performance, and millisecond-level response are required
1.5 Solr
Next, let's look at Solr, a common application of search engines on the Internet. So what is Solr, and what are its advantages and disadvantages compared with ES?
Let's start with a brief introduction to Solr:
Solr is a full-text search server based on Lucene. It extends Lucene with a richer, user-oriented query language, achieves configurable, extensible, and optimized query performance, and provides a complete administration interface. It supports an XML/HTTP protocol and a JSON API.
It has the following characteristics:
Scalability: Solr can distribute indexing and query processing across multiple servers in a cluster.
Rapid deployment: Solr is open-source software that is easy to install and configure. It can be started directly with the sample configuration shipped in the installation package, in either standalone or cluster mode.
Massive data: Solr is designed for processing massive data sets of more than 100 million documents and handles retrieval over them well.
Optimized search: Solr searches fast enough that even complex queries are processed at the millisecond level, usually completing within a few dozen milliseconds.
II. Introduction to word segmentation
Next, let's see how word segmentation is implemented. Why do we need word segmentation at all, and what does it have to do with search engines? How do the words or paragraphs we type into the search box get broken into multiple keywords?
Have you heard of any tokenizers? For example, Lucene ships its own Chinese tokenizer, smartcn, and there is the widely used IK tokenizer, among others. Today we will focus on the IK tokenizer.
2.1 The IK tokenizer
The IK tokenizer first maintains several dictionaries of commonly used words, such as the main word list main2012.dic, the quantifier list quantifier.dic, and the stop-word list stopword.dic.
Dictionary is the dictionary management class that loads these dictionaries into an in-memory structure. The concrete dictionary code lives in org.wltea.analyzer.dic.DictSegment. This class implements the tokenizer's core data structure, the Trie tree.
A Trie tree (dictionary tree) is a relatively simple tree structure used to build dictionaries; it finds words quickly by matching prefix characters one at a time, which is why it is also called a prefix tree. A concrete example follows.
An example
Take the sentence: I am a Chinese person from Zhongguancun, Haidian District, Beijing (我是北京市海淀区中关村的中国人).
Suppose our dictionary contains: Beijing (北京), Haidian District (海淀区), Zhongguancun (中关村), China (中国), and Chinese people (中国人). The dictionary tree built from these entries is shown in the figure:
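To make the structure concrete, here is a minimal Trie sketch in Java. It illustrates the idea only and is not the actual DictSegment implementation from IK; the dictionary entries are the ones from the example above.

import java.util.HashMap;
import java.util.Map;

// A minimal dictionary tree (Trie): each node maps one character to a child,
// and a flag marks nodes where a dictionary word ends.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean isWordEnd;
}

public class Trie {
    private final TrieNode root = new TrieNode();

    // Insert a dictionary word character by character.
    public void insert(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isWordEnd = true;
    }

    // Check whether a word is in the dictionary by walking its prefix path.
    public boolean contains(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWordEnd;
    }

    public static void main(String[] args) {
        Trie dict = new Trie();
        for (String w : new String[]{"北京", "海淀区", "中关村", "中国", "中国人"}) {
            dict.insert(w);
        }
        System.out.println(dict.contains("中国人")); // true
        System.out.println(dict.contains("中关"));   // false: a prefix, but not a word
    }
}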
Then we segment the sentence against the dictionary tree. The IK tokenizer basically has two modes: smart mode and non-smart mode, selectable at initialization in the code.
Rather than explaining the two modes literally, we can simply print the results of both:
I am a Chinese person from Zhongguancun, Haidian District, Beijing.
Smart mode: Beijing, Haidian District, Zhongguancun, Chinese people
Non-smart mode: Beijing, Haidian District, Zhongguancun, China, Chinese people
Clearly, non-smart mode outputs every word that can be split out, while smart mode uses internal rules to output a single reasonable segmentation, which involves ambiguity resolution.
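For reference, here is a minimal sketch of toggling the two modes, assuming the classic IK Analyzer API (org.wltea.analyzer.core.IKSegmenter), whose boolean constructor argument switches smart mode on and off:

import java.io.IOException;
import java.io.StringReader;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IkDemo {
    // Print the segmentation of a sentence; useSmart toggles smart mode.
    static void segment(String text, boolean useSmart) throws IOException {
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
        Lexeme lex;
        while ((lex = ik.next()) != null) {
            System.out.print(lex.getLexemeText() + " | ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        String text = "我是北京市海淀区中关村的中国人"; // the example sentence above
        segment(text, true);  // smart mode
        segment(text, false); // non-smart mode
    }
}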
Another example
Take a more representative sentence: What Zhang San said is indeed reasonable (张三说的确实在理).
The possible word chains based on forward matching are:
L1: {张三 (Zhang San), 张 (Zhang), 三 (San)}
L2: {说 (said)}
L3: {的确 (indeed), 确实 (really), 实在 (truly), 在理 (reasonable)}
First, take a look at the most basic element class:
public class Lexeme implements Comparable {
    ...
    // initial offset of the lexeme
    private int offset;
    // relative starting position of the lexeme
    private int begin;
    // length of the lexeme
    private int length;
    // text of the lexeme
    private String lexemeText;
    // type of the lexeme
    private int lexemeType;
    ...
}
A Lexeme here can be understood as a word or a character. begin is its starting position in the input text. Note that Lexeme implements Comparable, giving priority first to an earlier starting position and then to a longer length. This ordering determines a lexeme's place in the lexeme chain of a segmentation result, and it yields the order of the words in the segmentation results shown above.
/**
 * Comparison algorithm for lexemes in a sorted set
 * @see java.lang.Comparable#compareTo(java.lang.Object)
 */
public int compareTo(Lexeme other) {
    // earlier starting position first
    if (this.begin < other.getBegin()) {
        return -1;
    } else if (this.begin == other.getBegin()) {
        // longer lexeme first
        if (this.length > other.getLength()) {
            return -1;
        } else if (this.length == other.getLength()) {
            return 0;
        } else { // this.length < other.getLength()
            return 1;
        }
    } else { // this.begin > other.getBegin()
        return 1;
    }
}
Smart mode's ambiguity elimination algorithm:
Compare the positional weight of each candidate chain (the sum of position × lexeme length), which amounts to preferring the chain whose longer lexemes sit at later positions.
L31: {的, 确实, 在理}  weight = 1×1 + 2×2 + 3×2 = 11
L32: {的确, 实, 在理}  weight = 1×2 + 2×1 + 3×2 = 10
L33: {的确, 实在, 理}  weight = 1×2 + 2×2 + 3×1 = 9
The final segmentation result: 张三 / 说 / 的 / 确实 / 在理 (Zhang San / said / de / indeed / reasonable)
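As a quick illustration of the weighting rule above (a sketch of my reading of it, not IK's exact arbitration code), the following scores each candidate chain by the sum of position × length and keeps the highest-scoring one:

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ChainWeight {
    // Weight of a chain: sum over lexemes of (1-based position) * (lexeme length).
    static int weight(List<String> chain) {
        int w = 0;
        for (int i = 0; i < chain.size(); i++) {
            w += (i + 1) * chain.get(i).length();
        }
        return w;
    }

    public static void main(String[] args) {
        List<List<String>> candidates = Arrays.asList(
            Arrays.asList("的", "确实", "在理"),   // 1*1 + 2*2 + 3*2 = 11
            Arrays.asList("的确", "实", "在理"),   // 1*2 + 2*1 + 3*2 = 10
            Arrays.asList("的确", "实在", "理"));  // 1*2 + 2*2 + 3*1 = 9
        List<String> best = candidates.stream()
            .max(Comparator.comparingInt(ChainWeight::weight))
            .orElseThrow();
        System.out.println(best + " wins with weight " + weight(best));
    }
}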
III. Introduction to the inverted index algorithm
3.1 Overview
We can think of the inverted index algorithm as the table of contents used when consulting a dictionary: once we know where the word we need is listed, we find it quickly. Put in more formal language:
The inverted index originates from the practical need to look up records by the value of one of their attributes. Each entry in such an index table contains an attribute value and the addresses of all records that have that attribute value. Because it is not the records that determine the attribute values, but the attribute values that determine the records' positions, it is called an inverted index. A file with an inverted index is called an inverted index file, or inverted file for short.
An inverted file indexes objects such as documents or collections of words; it stores the positions of those words within a document or group of documents and is one of the most common indexing mechanisms for documents and document collections.
The key step for a search engine is building the inverted index. It is generally organized as a keyword followed by its frequency (number of occurrences) and positions (which article or web page it appears in, plus the related date, author, and so on). It is effectively an index over hundreds of billions of web pages, like the table of contents and tabs of a book: a reader who wants the chapters on a topic goes straight to the relevant pages via the contents, instead of scanning the book page by page from first to last.
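As a minimal sketch of this idea (an illustration only, not Lucene's actual data structures), the following builds a map from each term to the documents and positions where it occurs:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndex {
    // term -> list of "docId:position" postings; a TreeMap keeps terms sorted.
    private final Map<String, List<String>> postings = new TreeMap<>();

    // Tokenize naively on whitespace and record each term's doc id and position.
    public void addDocument(int docId, String text) {
        String[] terms = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < terms.length; pos++) {
            postings.computeIfAbsent(terms[pos], t -> new ArrayList<>())
                    .add(docId + ":" + (pos + 1));
        }
    }

    public List<String> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        // The two articles below, already reduced to keywords.
        index.addDocument(1, "jack live beijing i live beijing");
        index.addDocument(2, "he live taiyuan");
        System.out.println(index.lookup("live")); // [1:2, 1:5, 2:2]
    }
}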
3.2 The principle of the Lucene inverted index
Lucene is an open-source, high-performance, Java-based full-text retrieval toolkit. It is not a complete full-text search engine but rather a full-text search engine architecture: it provides a complete query engine and indexing engine, and part of a text analysis engine. Its purpose is to offer software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.
Suppose there are two articles, 1 and 2:
Article 1: Jack lives in BeiJing, I live in BeiJing too. Article 2: He lived in Taiyuan.
1) Extracting the keywords
First we apply the word segmentation discussed earlier. Then, because the text is English, we filter out words with no real meaning for search, such as in, on, and of; normalize inflected forms, restoring the third-person singular -s and the past tense -ed (lives becomes live, lived becomes live); and strip unnecessary punctuation. After this processing, the remaining keywords are:
All the keywords in article 1 are: [Jack] [live] [BeiJing] [I] [live] [BeiJing]
All the keywords in article 2 are: [he] [live] [Taiyuan]
2) Building the inverted index
Keyword    Article [Frequency]    Positions
BeiJing    1 [2]                  3, 6
he         2 [1]                  1
i          1 [1]                  4
jack       1 [1]                  1
live       1 [2], 2 [1]           2, 5; 2
Taiyuan    2 [1]                  3
This is the core of the Lucene index structure. Notice that the keywords are kept in character order (Lucene does not use a B-tree structure), so Lucene can locate a keyword quickly with a binary search algorithm.
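A minimal sketch of that lookup, assuming the term dictionary is simply a sorted in-memory array (an illustration of the idea, not Lucene's on-disk format):

import java.util.Arrays;

public class TermLookup {
    public static void main(String[] args) {
        // Term dictionary kept in lexicographic order, as in the table above.
        String[] terms = {"beijing", "he", "i", "jack", "live", "taiyuan"};
        // Binary search finds a term in O(log n) comparisons.
        int slot = Arrays.binarySearch(terms, "live");
        System.out.println("live -> slot " + slot); // live -> slot 4
        // In Lucene, this slot would carry pointers into the frequency
        // and position files holding the term's postings.
    }
}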
3.3 Implementation
In its implementation, Lucene stores the three columns above as separate files: the dictionary file (Term Dictionary), the frequency file (frequencies), and the position file (positions). The dictionary file stores not only each keyword but also pointers into the frequency file and position file, through which a keyword's frequency and position information can be found.
3.4 Compression algorithm
To reduce the size of the index files, Lucene also applies compression to the index.
First, the keywords in the dictionary file are prefix-compressed: for example, if the previous word is "Arab" and the current word is "Arabic", then "Arabic" can be stored as <4, ic>, i.e., the length of the prefix shared with the previous word plus the differing suffix.
The second heavily used technique is number compression: only the difference from the previous value is stored, which shortens the number and in turn the bytes needed to store it. For example, if the current article number is 16389 (3 bytes uncompressed) and the previous article number is 16382, then after compression only the difference 7 is stored, using a single byte.
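Here is a minimal sketch of the delta idea combined with a variable-length byte encoding, in the spirit of Lucene's VInt though not its exact file format:

import java.io.ByteArrayOutputStream;

public class DeltaVInt {
    // Encode a non-negative int as a variable-length byte sequence:
    // 7 payload bits per byte, high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    public static void main(String[] args) {
        int[] docIds = {16382, 16389, 16400};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int docId : docIds) {
            writeVInt(out, docId - prev); // store the gap, not the raw id
            prev = docId;
        }
        // 16382 needs 2 bytes, but the gaps 7 and 11 need only 1 byte each.
        System.out.println("encoded size: " + out.size() + " bytes"); // 4 bytes
    }
}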
3.5 Why use an inverted index
Suppose you query the word "live". Lucene first looks the word up in the dictionary; once found, it follows the pointer into the frequency file, reads out all the article numbers, and returns the results. The dictionary is usually very small, so the whole process completes in milliseconds.
With an ordinary sequential matching algorithm and no index, string-matching across all the articles is very slow, and when the number of articles is large, the time required is often unbearable.
IV. Basic configuration and use of Solr
We install Solr on a Windows system.
After decompressing the archive:
Open cmd, change to Solr's bin directory, and run the command solr start (for convenience, you can add Solr to the PATH environment variable; after that, the solr command can be used directly in cmd).
When you see this interface, the Solr service has started successfully. The port number is 8983; visiting http://localhost:8983 automatically redirects to http://localhost:8983/solr/#/.
This page shows basic Solr information, Lucene information, Java information, and so on.
One more Solr command: solr -h shows Solr's basic usage information.
Configure Solr
Configure a Core
Run solr create -c test -d basic_configs: the -c parameter specifies the core name, and the -d parameter specifies the configuration directory.
After the command executes, there is an extra test directory.
Visit the Solr Admin page at http://localhost:8983/ and check the cores: there is now an extra core named test.
Under the \solr-6.3.0\server\solr\test directory are conf and data directories, holding the configuration and the data respectively.
Add data to the core
Open the directory \solr-6.3.0\server\solr\test\conf and add a field to the schema configuration:
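The original does not show the exact field definition, but since the later queries search on name, a plausible entry in the core's managed-schema would look like this (the field type and attributes here are assumptions):

<!-- hypothetical field definition; adjust the type to your schema -->
<field name="name" type="text_general" indexed="true" stored="true"/>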
Then restart Solr: solr restart -p 8983
Go back to the Solr Admin page, select the test core, open Documents, and fill in the data under Document(s):
{"id": "1", "name": "BMW"}
Click Submit; a response of Status: success means the data was added successfully.
After adding a few more entries, click Query to query the data:
In the query interface, q is the query condition. For example, enter name:"BMW" and execute the query again.
You can also query directly via a GET request to the URL: http://localhost:8983/solr/test/select?q=name:BMW
At this point, you should have a deeper understanding of the principle of the Lucene inverted index. Why not try it out in practice?