Introduction to the Basic Concepts and Inverted Index of Solr


2025-04-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article introduces the basic concepts of Solr and the inverted index. The explanation is kept simple so that it is easy to follow.

1. Introduction to Solr

Solr is an open-source search server based on the Lucene Java library, and it is easy to add to web applications.

Solr provides faceted search (that is, statistics), hit highlighting, and support for a variety of output formats (including XML/XSLT and JSON). It is easy to install and configure, and it comes with an HTTP-based management interface. Solr is used on many large websites and is mature and stable.
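As a rough illustration of the features listed above, the following Python sketch builds a query URL that asks Solr for facet counts, hit highlighting, and JSON output. The host, port, core name ("articles"), and field names are assumptions made for this example, not values from the article; the request parameters (q, facet, facet.field, hl, hl.fl, wt) are standard Solr query parameters. The sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to your own installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/articles/select"


def build_query_url(text, facet_field, highlight_field):
    """Build a select URL asking for facet counts, highlighting, and JSON output."""
    params = [
        ("q", text),                  # the query itself
        ("facet", "true"),            # turn on faceted search (statistics)
        ("facet.field", facet_field), # field to count values for
        ("hl", "true"),               # turn on hit highlighting
        ("hl.fl", highlight_field),   # field to highlight matches in
        ("wt", "json"),               # response format
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


# "title" and "category" are made-up field names for this example.
url = build_query_url("title:solr", "category", "title")
print(url)
```

Sending this URL with any HTTP client returns the matching documents together with per-category counts and highlighted snippets, provided the core and fields exist in your schema.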

Solr wraps and extends Lucene, so Solr basically follows the relevant terminology of Lucene. More importantly, the indexes created by Solr are fully compatible with the Lucene search engine library.

With proper configuration (and, in some cases, some coding), Solr can read and use indexes built by other Lucene applications.

In addition, many Lucene tools (such as Nutch, Luke) can also use indexes created by Solr. You can use Solr's excellent basic search capabilities, or you can extend it to meet the needs of the enterprise.

2. Advantages of Solr

Advanced full-text search capabilities

Optimized for high-volume web traffic

Open interfaces based on standards (XML and HTTP)

Integrated HTML management interface

Scalability: the ability to replicate efficiently to other Solr search servers

Flexible and adaptable through XML configuration

Extensible plug-in system

3. Introduction to the index

The basic concepts in Lucene are the index, the document, the field, and the term.

An index contains a series of documents.

A document consists of a series of fields, similar to a record in a database.

A field consists of a series of terms.

A term is a string.

Note: the same string appearing in two different fields is treated as two different terms.
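The relationship between these concepts can be sketched in a few lines of Python. This is only a simplified model for illustration, not Lucene's actual classes: the Term class, the field names, and the sample document are all made up for the example. It shows that a term is identified by its field together with its string, so the same string in two fields yields two different terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A term is a string qualified by the field it occurs in (simplified model)."""
    field: str
    text: str


# A document is modeled as a mapping from field name to the strings in that field.
# The field names and contents are hypothetical.
document = {
    "title": ["solr", "basics"],
    "body": ["solr", "wraps", "lucene"],
}

# An index is modeled as a series of documents.
index = [document]


def terms_of(doc):
    """Collect the distinct terms of one document."""
    return {Term(field, text) for field, words in doc.items() for text in words}


title_term = Term("title", "solr")
body_term = Term("body", "solr")

# Same string, different fields: two different terms.
print(title_term == body_term)
print(len(terms_of(document)))
```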

4. Inverted index (the core of the index)

The index stores statistics about terms in order to make term-based retrieval more efficient. Where a database such as Oracle uses a B-tree structure, the Solr search engine uses an inverted index.

Inverted index: an inverted index is a specific storage form that implements the "word-document matrix". Given a word, the inverted index lets you quickly obtain the list of documents that contain it. An inverted index consists of two main parts: the "word dictionary" and the "inverted file".

Document: search engines generally deal with web pages, but the concept of a document is broader. It covers any storage object that exists in the form of text, so files in formats such as Word, PDF, HTML, and XML can all be called documents. An email, a text message, or a Weibo post can also be called a document. In the rest of this text, "document" is used in many cases to refer to text information in general.

Unlike English and similar languages, Chinese has no explicit separator between words, so a document must first be split into a word sequence by a word segmentation system. Each document is thus converted into a stream of words. To make later processing easier, each distinct word is given a unique word number, and the system records which documents contain that word. The result is the simplest form of inverted index. In the figure, the "word ID" column records the number of each word, the second column is the word itself, and the third column is the inverted list of that word. For example, the word "Google" has word ID 1 and the inverted list {1, 2, 3, 4, 5}, which means that every document in the document collection contains the word.
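The construction described above can be sketched in Python as follows. This is a minimal illustration, not how Lucene stores its index: the three sample documents are invented and are assumed to be already segmented into words.

```python
# Hypothetical documents, already split into words, keyed by document number.
docs = {
    1: ["google", "maps", "founder"],
    2: ["google", "search"],
    3: ["maps", "founder", "lars"],
}


def build_inverted_index(docs):
    """Return (word -> word ID, word -> sorted list of document numbers)."""
    word_ids = {}
    inverted = {}
    for doc_id in sorted(docs):
        for word in docs[doc_id]:
            if word not in word_ids:
                # Give each distinct word a unique number, starting at 1.
                word_ids[word] = len(word_ids) + 1
                inverted[word] = []
            # Record the document once, even if the word repeats in it.
            if not inverted[word] or inverted[word][-1] != doc_id:
                inverted[word].append(doc_id)
    return word_ids, inverted


word_ids, inverted = build_inverted_index(docs)
for word, word_id in word_ids.items():
    print(word_id, word, inverted[word])
```

Looking up a word is then a single dictionary access, which is what makes retrieval by word fast.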

The inverted index shown in the figure above is the simplest kind, because the index system records only which documents contain a word. In practice, the index system can record more information than that. The following figure shows a somewhat more complex inverted index. Compared with the basic index in the figure above, the inverted list of each word records not only the document number but also the word frequency (TF), that is, the number of times the word appears in that document. In the following figure, the word "founder" has word ID 7 and the inverted list (3:1), where 3 means that the document numbered 3 contains the word and 1 is the word frequency: the word appears only once in document 3. The inverted lists of the other words are read the same way.
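Adding word frequency to the inverted list takes only a small change, sketched below in Python. The sample documents are again invented; document 3 is chosen so that "founder" produces the entry (3, 1), mirroring the example in the text.

```python
from collections import Counter

# Hypothetical documents, already split into words, keyed by document number.
docs = {
    1: ["google", "maps", "google"],
    3: ["founder", "lars"],
}


def build_tf_index(docs):
    """Map each word to a list of (document number, word frequency) pairs."""
    index = {}
    for doc_id in sorted(docs):
        # Counter gives the number of occurrences of each word in this document.
        for word, tf in Counter(docs[doc_id]).items():
            index.setdefault(word, []).append((doc_id, tf))
    return index


tf_index = build_tf_index(docs)
print(tf_index["founder"])
print(tf_index["google"])
```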

A practical inverted index can record still more. Besides the document number and word frequency, the index system records two further kinds of information: the "document frequency" of each word, and the positions at which the word occurs, stored in the inverted list.

Document frequency records how many documents in the document collection contain a word.

Take the word "Lars" as an example: its word ID is 8 and its document frequency is 2, which means that two documents in the whole collection contain the word. The corresponding inverted list is {(3; 1; <4>), (5; 1; <4>)}, which means that the word appears in document 3 and document 5 with a word frequency of 1 in each, and that in both documents it appears at position 4, that is, it is the fourth word in the document.
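A sketch of such an index in Python follows. The two sample documents are invented so that the word "lars" is the fourth word of documents 3 and 5, matching the example above; the dictionary layout ("df", "postings") is an assumption made for readability, not Lucene's on-disk format.

```python
# Hypothetical documents, already split into words, keyed by document number.
docs = {
    3: ["google", "maps", "founder", "lars"],
    5: ["facebook", "hires", "engineer", "lars"],
}


def build_positional_index(docs):
    """Map each word to its document frequency and a list of
    (document number, word frequency, positions) entries."""
    index = {}
    for doc_id in sorted(docs):
        positions = {}
        # Positions are counted from 1, so the fourth word has position 4.
        for pos, word in enumerate(docs[doc_id], start=1):
            positions.setdefault(word, []).append(pos)
        for word, pos_list in positions.items():
            entry = index.setdefault(word, {"df": 0, "postings": []})
            entry["df"] += 1  # one more document contains this word
            entry["postings"].append((doc_id, len(pos_list), pos_list))
    return index


positional_index = build_positional_index(docs)
print(positional_index["lars"])
```

Position information is what allows phrase queries, because the engine can check that two words occur next to each other in the same document.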

5. The process of creating an index in Lucene

Collect the original documents to be indexed

Get the original documents from a database, the web, and so on.

Pass the original documents to the word segmentation component (Tokenizer)

This process is called tokenization, and each result is called a Token. The Tokenizer does the following:
a. Splits the document into separate words
b. Removes punctuation
c. Removes stop words
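The three tokenizer tasks can be sketched in Python as below. This is a simplified stand-in for illustration, not Lucene's Tokenizer: it handles only space-separated text such as English (Chinese would need a real word segmenter), and the stop word list is a small made-up sample.

```python
import re

# A small hypothetical stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and"}


def tokenize(text):
    """Split text into words, drop punctuation, and drop stop words."""
    # Keeping only runs of letters and digits both splits the text into
    # words and removes punctuation.
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Compare in lowercase so that "The" is also treated as a stop word.
    return [w for w in words if w.lower() not in STOP_WORDS]


tokens = tokenize("The students drove to the cars, and parked.")
print(tokens)
```

The tokens keep their original case here, because lowercasing belongs to the language processing step that follows.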

Pass the resulting Tokens to the language processing component (LinguisticProcessor)

The result of this process is called a Term. The language processing component does the following:
a. Converts words to lowercase
b. Reduces words to their root form by rule (stemming), such as cars --> car
c. Maps words to their root form by lookup (lemmatization), such as drove --> drive
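A minimal Python sketch of this step is shown below. The stemming rule and the lookup table are deliberately naive and invented for the example; real stemmers (such as the Porter algorithm) and lemmatizers use far larger rule sets and dictionaries.

```python
# Hypothetical lookup table for irregular forms (lemmatization by dictionary).
IRREGULAR_FORMS = {"drove": "drive", "driven": "drive"}


def stem(word):
    """Naive rule-based stemming: strip a plural "s" (illustration only)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def to_term(token):
    """Turn a Token into a Term: lowercase, then look up, then stem."""
    word = token.lower()
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    return stem(word)


terms = [to_term(t) for t in ["Cars", "Drove", "students", "class"]]
print(terms)
```

Because documents and queries go through the same processing, a search for "drive" can match a document that contains "drove".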

Pass the resulting Terms to the index component (Indexer)

The Indexer does the following:
a. Creates a dictionary from the resulting Terms
b. Sorts the dictionary alphabetically
c. Merges identical Terms into inverted lists
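The three indexer steps can be sketched in Python as follows. The input is assumed to be a list of (term, document number) pairs produced by the earlier steps; the sample pairs and the function name are made up for this illustration and do not reflect Lucene's actual implementation.

```python
def index_terms(term_doc_pairs):
    """Sort (term, document number) pairs and merge them into inverted lists
    of (document number, frequency) entries."""
    # a. The dictionary starts as the raw list of (term, document number) pairs.
    # b. Sort alphabetically by term, then by document number.
    ordered = sorted(term_doc_pairs)
    # c. Merge identical terms, counting occurrences per document.
    merged = {}
    for term, doc_id in ordered:
        counts = merged.setdefault(term, {})
        counts[doc_id] = counts.get(doc_id, 0) + 1
    return {term: sorted(counts.items()) for term, counts in merged.items()}


# Hypothetical output of the language processing step for two documents.
pairs = [("student", 1), ("drive", 1), ("car", 2), ("student", 2), ("student", 2)]
inverted_index = index_terms(pairs)
for term, postings in inverted_index.items():
    print(term, postings)
```

Sorting first means that identical terms end up next to each other, so merging them is a single pass over the list.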
