What is the full-text search of Lucene

2025-04-16 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)05/31 Report--

This article introduces the background needed to answer the question "what is the full-text retrieval of Lucene": the kinds of data we search, why sequential scanning is slow, and how an index makes searching fast.

Definition: Lucene is an efficient, Java-based full-text search library. So it is worth taking the time to understand full-text search before learning Lucene.

So what is full-text search? It starts with the data in our lives. There are two kinds of data in our lives: structured data and unstructured data.

Structured data: data with a fixed format or limited length, such as database records and metadata.

Unstructured data: data of indefinite length or with no fixed format, such as email and Word documents.

Of course, some sources also mention a third kind, semi-structured data, such as XML and HTML. When needed it can be processed as structured data, or its plain text can be extracted and treated as unstructured data. Unstructured data is also called full-text data. Following this classification of data, search is also divided into two types:

Search for structured data: for example, searching a database with SQL statements, or searching metadata, as when Windows search looks up files by name, type, or modification time.

Search for unstructured data: for example, searching file contents with Windows search or with the grep command under Linux, or searching large amounts of content with Google or Baidu.
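
As a small illustration of the structured case, the sketch below runs a SQL query over file metadata using Python's built-in sqlite3 module. The table name, columns, and rows are hypothetical and exist only for this example; the point is that each field has a fixed meaning, so a query can filter on it directly.

```python
import sqlite3

# Hypothetical file-metadata table; the schema and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, type TEXT, modified TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("report.doc", "doc", "2024-01-15"),
        ("notes.txt", "txt", "2024-03-02"),
        ("budget.xls", "xls", "2023-11-20"),
    ],
)


def find_files(file_type, modified_after):
    """Structured search: filter on fields whose meaning is fixed in advance."""
    rows = conn.execute(
        "SELECT name FROM files WHERE type = ? AND modified > ? ORDER BY name",
        (file_type, modified_after),
    )
    return [name for (name,) in rows]


print(find_files("txt", "2024-01-01"))
```

The dates are stored as ISO-formatted strings, so plain string comparison orders them correctly.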

There are two main ways to search for unstructured data, that is, full-text data:

The first is sequential scanning (Serial Scanning). To find the files that contain a certain string, you look at the documents one by one, reading each from beginning to end. If a document contains the string, it is one of the files we are looking for; then you move on to the next file, until all of them have been scanned. Windows search can search file contents this way, but it is quite slow: finding a file that contains a given string on an 80 GB hard drive is likely to take several hours. The grep command under Linux works the same way.
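
A minimal sketch of sequential scanning, assuming the documents are already loaded into memory as strings (the function name and sample data are hypothetical). Every document is read in full for every query, which is why the cost grows with the total size of the collection.

```python
def sequential_scan(documents, needle):
    """Return the names of documents whose text contains `needle`.

    Each document is examined from start to end on every call;
    nothing is precomputed between searches.
    """
    matches = []
    for name, text in documents.items():
        if needle in text:
            matches.append(name)
    return matches


# Hypothetical sample documents.
docs = {
    "a.txt": "lucene is a full-text search library",
    "b.txt": "grep scans files line by line",
    "c.txt": "full-text search builds an index first",
}
print(sequential_scan(docs, "full-text"))
```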

You may think this method is rather primitive, but for a small amount of data it is still the most direct and convenient. For a large number of files, however, it is slow.

Sequential scanning of unstructured data is slow, whereas searching structured data is relatively fast, because its structure lets us apply search algorithms that speed things up. So why not find a way to give our unstructured data some structure? This natural idea is the basic idea of full-text retrieval: part of the information in the unstructured data is extracted and reorganized so that it has a certain structure, and the search then runs against that structured data, which makes it relatively fast. The information extracted from unstructured data and then reorganized is called an index.

This statement is rather abstract, but a few examples make it easy to understand. Take a dictionary: its pinyin table and radical lookup table are the dictionary's index, while the explanation of each character is unstructured. Without those tables, finding a character in the vast sea of characters would require a sequential scan. But some information about a character can be extracted and given structure. Pronunciation, for example, is quite structured, since there are only a limited number of initials and finals that can be enumerated, so the pronunciations are taken out and arranged in a fixed order, with each one pointing to the page that holds the detailed explanation of the character. To search, we look up the pronunciation in the structured pinyin table and then follow the page number it points to, which leads us to our unstructured data, that is, the explanation of the character.

This process of first building an index and then searching the index is called full-text search (Full-text Search). The accompanying picture is from "Lucene in Action"; it describes not only the retrieval process of Lucene but also the general process of full-text retrieval.
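
The dictionary analogy can be sketched in a few lines. The pinyin entries and page numbers below are invented for illustration. Because the pronunciation table is kept in sorted order, a lookup can use binary search instead of reading every page.

```python
import bisect

# Hypothetical pronunciation table: (pinyin, page number), kept sorted by pinyin.
pinyin_index = [("an", 3), ("bai", 12), ("chen", 45), ("du", 78), ("suo", 310)]
keys = [pinyin for pinyin, _ in pinyin_index]


def page_for(pinyin):
    """Binary-search the sorted pronunciation table and return the page it points to."""
    i = bisect.bisect_left(keys, pinyin)
    if i < len(keys) and keys[i] == pinyin:
        return pinyin_index[i][1]
    return None  # this pronunciation is not in the table


print(page_for("du"))
```

The structured part (the sorted table) is small and cheap to search; the unstructured part (the explanation on the page) is only read once the table says where it is.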

Full-text retrieval consists of two processes: index creation (Indexing) and index search (Search).

Index creation: the process of extracting information from all structured and unstructured data in the real world and creating an index.

Index search: the process of taking the user's query request, searching the created index, and then returning the results.
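
The two processes can be sketched as two functions. This is a toy under stated assumptions, not Lucene's implementation: it assumes the index maps each word to the set of documents containing it (an inverted index), splits text on whitespace only, and uses hypothetical function names and sample documents.

```python
from collections import defaultdict


def create_index(documents):
    """Indexing: map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Search: return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical sample documents.
docs = {
    1: "Lucene is a search library",
    2: "Full-text search needs an index",
    3: "An index speeds up search",
}
index = create_index(docs)
print(search(index, "index search"))
```

The index is built once and reused for every query, so a search only touches the entries for the query words rather than the full text of every document.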

Therefore, there are three important questions in full-text retrieval:

1. What exactly is stored in the index? (Index)

2. How do I create an index? (Indexing)

3. How do I search the index? (Search)
