What is the principle of Lucene full-text retrieval

2025-03-28 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31

This article explains the principle of Lucene full-text retrieval: how an index is created and how that index is then used to answer a search.

The data we deal with falls into three types:

Structured data: data with a fixed format or limited length, such as the rows in a database.

Unstructured data: data with no fixed format and no fixed length, such as the text content of web pages.

Semi-structured data: such as JSON or XML data.

So how do we query these different types of data?

Structured data in a database is queried with SQL statements.

Unstructured data can be searched in two ways: sequential scanning and full-text retrieval.

Sequential scanning reads the data from the first item to the last, checking each one for the search term. On large data sets this is obviously very costly in time and performance.
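To make the cost concrete, here is a minimal sketch of sequential scanning in Python. The function name and the three sample documents are assumptions made for illustration; the point is only that every document has to be read for every query.

```python
def sequential_scan(documents, keyword):
    """Return the IDs of all documents whose text contains keyword.

    Every document is read in full on every call, so the cost grows
    with the total amount of text, however rare the keyword is.
    """
    keyword = keyword.lower()
    hits = []
    for doc_id, text in documents.items():
        if keyword in text.lower():
            hits.append(doc_id)
    return hits


# Hypothetical sample data, used only for this illustration.
documents = {
    1: "My blog space",
    2: "HappyBKs's Lucene article",
    3: "HappBKs's Hadoop article",
}

print(sequential_scan(documents, "lucene"))   # [2]
print(sequential_scan(documents, "article"))  # [2, 3]
```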

So what is full-text search?

This is the job Lucene does. Instead of scanning the raw data on every query, full-text search builds an index from the data ahead of time and then searches that index.

An application sits on top of Lucene and uses it in two ways. The first is indexing: the application's structured, semi-structured and unstructured data is indexed by Lucene. The second is retrieval: users enter search keywords, the index database is searched, and the results are returned to the users.

So what is an index?

It works like the pinyin index and the radical index in the Xinhua Dictionary, which are used to look up characters.

Lucene works the same way: for each word, the full-text index records the documents in which that word appears. For example:

The keyword "lucene" appears in articles 1 and 3. The keyword "Solr" appears in articles 1, 3 and 5. The keyword "hadoop" appears in articles 3, 5, 7, 8 and 9.

This structure is called an "inverted index". The list of documents recorded for each keyword is called an inverted list.

What is an inverted index?

An inverted index maps strings to files, which is the reverse of the natural file-to-string mapping. In essence, it simply describes a mapping relationship.
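The reversal of the mapping can be sketched in a few lines of Python. The forward (file-to-string) mapping below is an assumption derived from the keyword example above, listing only the three keywords mentioned; the function name is likewise hypothetical.

```python
from collections import defaultdict

# Forward mapping: article ID -> keywords it contains.
# Assumed from the example above; real articles contain other words too.
forward = {
    1: ["lucene", "solr"],
    3: ["lucene", "solr", "hadoop"],
    5: ["solr", "hadoop"],
    7: ["hadoop"],
    8: ["hadoop"],
    9: ["hadoop"],
}


def invert(forward):
    """Turn a document -> terms mapping into a term -> documents mapping."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for term in forward[doc_id]:
            inverted[term].append(doc_id)
    return dict(inverted)


inverted = invert(forward)
print(inverted["lucene"])  # [1, 3]
print(inverted["solr"])    # [1, 3, 5]
print(inverted["hadoop"])  # [3, 5, 7, 8, 9]
```

Each value in the resulting dictionary plays the role of the list of documents recorded for one keyword.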

Create an index

So what are the steps for creating a full-text index?

Here we divide the creation of a full-text index into three steps, or three things you need:

Data to be retrieved (Document)

Word Segmentation Technology (Analyzer)

Index creation (Indexer)

Let's give an example.

The first step: sample document data.

Document 1: My blog space

Document 2: HappyBKs's Lucene article

Document 3: HappBKs's Hadoop article

The second step: word segmentation. We use the standard analyzer here.

I | of | blog | customer | empty | interval

happybks | of | lucene | text | chapter

happbks | of | hadoop | text | chapter

Note that after standard word segmentation, Chinese text is split into single characters and English uppercase letters are converted to lowercase. Each English gloss above stands for one character of the original Chinese text.

The third step is index creation.

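Term | ID | Term | ID | Term | ID
I | 1 | happybks | 2 | happbks | 3
of | 1 | of | 2 | of | 3
blog | 1 | lucene | 2 | hadoop | 3
customer | 1 | text | 2 | text | 3
empty | 1 | chapter | 2 | chapter | 3
interval | 1

We merge the indexes.

Term | ID | Term | ID | Term | ID
I | 1 | happybks | 2 | happbks | 3
of | 1, 2, 3
blog | 1 | lucene | 2 | hadoop | 3
customer | 1 | text | 2, 3
empty | 1 | chapter | 2, 3
interval | 1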

This table is what we call an index.
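The two stages of index creation, listing (term, ID) pairs and then merging them, can be sketched in Python as follows. The function names are hypothetical, and the token lists reuse the English glosses from the segmentation step as stand-ins for the original tokens.

```python
# Tokens per document ID, as produced by the segmentation step.
tokenized_docs = {
    1: ["I", "of", "blog", "customer", "empty", "interval"],
    2: ["happybks", "of", "lucene", "text", "chapter"],
    3: ["happbks", "of", "hadoop", "text", "chapter"],
}


def term_id_pairs(tokenized_docs):
    """Stage 1: one (term, document ID) pair per token occurrence."""
    return [
        (term, doc_id)
        for doc_id, tokens in sorted(tokenized_docs.items())
        for term in tokens
    ]


def merge_pairs(pairs):
    """Stage 2: merge pairs that share a term into one list of IDs."""
    index = {}
    for term, doc_id in pairs:
        ids = index.setdefault(term, [])
        if doc_id not in ids:
            ids.append(doc_id)
    return index


index = merge_pairs(term_id_pairs(tokenized_docs))
print(index["of"])      # [1, 2, 3]
print(index["text"])    # [2, 3]
print(index["lucene"])  # [2]
```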

Now, let's look at how to use the index for retrieval.

Index retrieval

There are four steps:

Search keywords (Keywords)

Word segmentation (Analyzer)

Index retrieval (Search)

Return the result

Let's walk through the steps with an example.

The first step: get the keywords entered by the user.

Lucene article

The second step: apply word segmentation.

lucene | text | chapter

The third step: retrieve from the index.

Looking up each token in the inverted table, "lucene" appears only in document 2, while "text" and "chapter" appear in documents 2 and 3. The only document containing all the tokens of the query is document 2, which is returned as the result.
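The lookup in this example amounts to intersecting the document lists of the query tokens. Below is a minimal Python sketch of that idea; the function name is hypothetical, the index is the merged table written out by hand with the English glosses as terms, and requiring every token to match is an assumption taken from this example rather than a statement about how Lucene ranks results.

```python
# The merged index from the example: term -> document IDs.
index = {
    "I": [1], "of": [1, 2, 3], "blog": [1], "customer": [1],
    "empty": [1], "interval": [1],
    "happybks": [2], "lucene": [2], "text": [2, 3], "chapter": [2, 3],
    "happbks": [3], "hadoop": [3],
}


def search(index, query_tokens):
    """Return the IDs of documents that contain every query token."""
    result = None
    for token in query_tokens:
        ids = set(index.get(token, []))
        result = ids if result is None else result & ids
    return sorted(result) if result else []


print(search(index, ["lucene", "text", "chapter"]))  # [2]
```

Each token costs one dictionary lookup, so the documents themselves are never re-read at query time.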
