A brief introduction to Solr and to inverted index usage

2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article gives a brief introduction to Solr and to the inverted index behind it. It covers what Solr is, why it is used, how it relates to Lucene and Elasticsearch, and how an inverted index is structured.

I. Brief introduction to Solr

1. What is Solr?

Solr is an open source search platform written in Java and built on Lucene. The core of its search technology is the inverted index: keywords map to the documents that contain them (value to key), which is the reverse of the key-to-value lookup used in ordinary search.
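To make the value-to-key idea concrete, here is a minimal sketch in Python. It is not Solr code: the sample documents, the function name build_inverted_index and the whitespace tokenizer are assumptions chosen only for illustration.

```python
from collections import defaultdict

# Forward mapping (key -> value): document id -> text.
documents = {
    1: "solr is a search platform",
    2: "lucene is a search library",
    3: "solr is built on lucene",
}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it (value -> key)."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenizer, for illustration only.
        for term in text.split():
            inverted[term].add(doc_id)
    return inverted


inverted_index = build_inverted_index(documents)
print(sorted(inverted_index["solr"]))    # [1, 3]
print(sorted(inverted_index["search"]))  # [1, 2]
```

With the forward mapping, answering "which documents mention solr?" means reading every document; with the inverted mapping it is a single dictionary lookup.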

Solr stores resources as documents (Document). Each document is composed of multiple fields (Field) that represent the attributes of the resource, and each document is identified by a unique id field. Solr segments the field values into words and builds the index from the resulting terms. At query time, the keyword is matched against the sorted index by binary search, which leads to the corresponding documents. This is what gives Solr its high search performance.
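The binary-search step described above can be sketched as follows. This only illustrates the idea of searching a sorted term list; the term dictionary that Lucene actually keeps on disk is more elaborate. The names sorted_terms, postings and lookup are hypothetical.

```python
from bisect import bisect_left

# Sorted term list with a parallel list of postings (document ids).
sorted_terms = ["lucene", "platform", "search", "solr"]
postings = [[2, 3], [1], [1, 2], [1, 3]]


def lookup(term):
    """Binary-search the sorted terms; return matching document ids or []."""
    i = bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return postings[i]
    return []


print(lookup("solr"))    # [1, 3]
print(lookup("hadoop"))  # []
```

Because the terms are kept sorted, a lookup needs on the order of log2(n) comparisons for n terms rather than a scan of the whole list.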

2. Why use Solr?

Most traditional e-commerce systems rely on traditional search, which filters qualifying results out of a database by matching against its fields, so the result set is static. An e-commerce system, however, usually has to let users search by arbitrary keywords, and such free-form input cannot be answered by querying fixed database fields. A full-text search tool is needed instead: it segments the data into words in advance, and at query time it uses the segmented words to find the matching documents and return them to the user. Solr provides this kind of search through its inverted index, combined with the IKAnalyzer Chinese word segmenter.

3. The relationship and differences among Solr, Elasticsearch and Lucene

(1) Introduction to the three

Lucene is an information retrieval toolkit, not a complete search engine system. It provides the index structure, tools for reading and writing the index, relevance scoring, sorting and other functions. When using Lucene directly, we still have to take care of the rest of the search engine system ourselves, such as data acquisition, parsing and word segmentation.

Solr is a Lucene-based search platform with an HTTP interface that encapsulates many of Lucene's details. Applications can search, maintain and modify the index directly through HTTP GET/POST requests.
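As a rough sketch of what such requests look like, the Python snippet below builds a GET URL for Solr's select handler and a POST request for its update handler, without sending them. The host and port, the core name products and the field name are assumptions; adjust them to your own installation.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SOLR_BASE = "http://localhost:8983/solr"  # assumed local Solr instance
CORE = "products"                          # hypothetical core name


def build_select_url(query, rows=10):
    """Build the GET URL for a search against the /select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{CORE}/select?{params}"


def build_update_request(docs):
    """Build (but do not send) a POST that adds documents via /update."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{SOLR_BASE}/{CORE}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


url = build_select_url("name:phone")  # "name" is a hypothetical field
req = build_update_request([{"id": "1", "name": "phone"}])
print(url)
print(req.get_method(), req.full_url)
```

Passing url or req to urllib.request.urlopen would actually send the request, which requires a running Solr instance with a matching core.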

Elasticsearch is also a search engine built on the full-text search library Apache Lucene. It works as a distributed, real-time document store in which every field is indexed and searchable.

(2) Connection and differences

Connection: both Solr and Elasticsearch are built on top of the Lucene toolkit.

The differences between Solr and Elasticsearch:

A. Solr uses ZooKeeper for distributed management, while Elasticsearch has distributed coordination built in.

B. Solr's feature set is more comprehensive, and more functions are provided officially, while Elasticsearch concentrates on core functions and leaves most advanced functions to third-party plug-ins.

C. Solr performs better than Elasticsearch in traditional search applications, while Elasticsearch performs better than Solr in real-time search applications.

II. Introduction to the inverted index

1. Definition of index

An index contains a series of documents, a document is made up of a series of fields, and a field can be segmented into a series of terms (words or strings).
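The index, document, field and term hierarchy can be modelled with a couple of small Python classes. This is only a sketch of the containment relationship, not how Solr represents these objects internally; the class names and the lowercase-and-split tokenizer are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    # Field name -> raw field value.
    fields: Dict[str, str]

    def terms(self, field_name: str) -> List[str]:
        """Segment one field into terms (naive lowercase + whitespace split)."""
        return self.fields.get(field_name, "").lower().split()


@dataclass
class Index:
    # An index holds a series of documents.
    documents: List[Document] = field(default_factory=list)


index = Index()
index.documents.append(Document({"id": "1", "title": "Solr Quick Start"}))
index.documents.append(Document({"id": "2", "title": "Lucene in Action"}))

print(index.documents[0].terms("title"))  # ['solr', 'quick', 'start']
```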

2. Inverted index

An index stores statistics about terms (words) so that term-based retrieval is more efficient. An inverted index is a concrete storage form that implements the "word-document matrix". Given a word, the inverted index quickly returns the list of documents that contain it.
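A small Python sketch of the word-document matrix and of the inverted index derived from it follows. The three sample documents are invented for illustration; each matrix row belongs to a word, each column to a document, and a 1 means the word occurs in that document.

```python
docs = {1: "solr search", 2: "lucene search", 3: "solr lucene"}

doc_ids = sorted(docs)
terms = sorted({t for text in docs.values() for t in text.split()})

# Word-document matrix: one row per word, one column per document.
matrix = {
    t: [1 if t in docs[d].split() else 0 for d in doc_ids]
    for t in terms
}

# Inverted index: for each word keep only the documents whose cell is 1.
inverted = {
    t: [d for d, bit in zip(doc_ids, row) if bit]
    for t, row in matrix.items()
}

for t in terms:
    print(t, matrix[t], inverted[t])
# lucene [0, 1, 1] [2, 3]
# search [1, 1, 0] [1, 2]
# solr [1, 0, 1] [1, 3]
```

In a realistic collection the matrix is almost entirely zeros, so storing only the document ids per word is far more compact than storing the full matrix.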

(1) Inverted index: mainly composed of a word dictionary and an inverted file

a. Word dictionary:

A word dictionary is the collection of all the distinct words (strings) that have appeared in a document collection. In Solr, a document is any storage object that exists in text form: not only web pages but also files in formats such as Word, PDF, HTML and XML, and even e-mails and Weibo posts can be called documents. Each document has its own unique document ID. After the field values of a document are segmented by the word segmenter, duplicate words are removed, and the set of remaining words forms the word dictionary. Each word also has a unique word ID.
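Building a word dictionary amounts to tokenizing, removing duplicates and handing out IDs. A minimal Python sketch, assuming a lowercase-and-split tokenizer and sequential integer IDs (how real word IDs are assigned is an implementation detail not covered here):

```python
def build_dictionary(docs):
    """Assign a unique, sequential word ID to each distinct word."""
    dictionary = {}
    for text in docs.values():
        for term in text.lower().split():
            # Repeated words are skipped, so each word gets exactly one ID.
            if term not in dictionary:
                dictionary[term] = len(dictionary)
    return dictionary


docs = {1: "Solr is fast", 2: "Solr is open source"}
word_dictionary = build_dictionary(docs)
print(word_dictionary)
# {'solr': 0, 'is': 1, 'fast': 2, 'open': 3, 'source': 4}
```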

b. Inverted file

An inverted file is the physical file on disk that stores the inverted lists. The inverted list of a word records every document in which the word has appeared, together with the word's frequency and positions in each of those documents. Each such record is called an inverted item, and multiple inverted items form an inverted list.
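One inverted item can be pictured as a small record holding a document ID, the word's frequency in that document and its positions. The Python sketch below builds the inverted list for a single word in memory; the class name Posting, the function name inverted_list and the sample text are assumptions, and nothing is written to disk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Posting:
    """One inverted item: where and how often a word occurs in one document."""
    doc_id: int
    term_frequency: int
    positions: List[int]


def inverted_list(term: str, docs: Dict[int, str]) -> List[Posting]:
    """Collect the inverted items of `term` over all documents."""
    result = []
    for doc_id in sorted(docs):
        tokens = docs[doc_id].lower().split()
        positions = [i for i, tok in enumerate(tokens) if tok == term]
        if positions:
            result.append(Posting(doc_id, len(positions), positions))
    return result


docs = {
    1: "solr indexes documents and solr searches documents",
    2: "lucene powers solr",
}
for posting in inverted_list("solr", docs):
    print(posting)
# Posting(doc_id=1, term_frequency=2, positions=[0, 4])
# Posting(doc_id=2, term_frequency=1, positions=[2])
```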

(2) The principle of the inverted index

First, a word segmentation system automatically splits each document into a sequence of words, so that every document becomes a stream of words. Each distinct word is given a unique word ID. For each word we then record its inverted list (the ID of every document in which the word appears, plus the frequency and positions of the word in that document) and its "document frequency" (the number of documents in which the word has appeared). Once this processing is complete, we have the inverted index.
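The whole procedure above can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions (whitespace segmentation, sequential word IDs, a plain dict as the result), not Solr's or Lucene's implementation; build_index and the key names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Segment documents, assign word IDs, and record postings and doc frequency."""
    term_ids = {}
    postings = defaultdict(dict)  # word ID -> {doc ID: [positions]}

    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            # New words get the next free ID; known words keep theirs.
            tid = term_ids.setdefault(term, len(term_ids))
            postings[tid].setdefault(doc_id, []).append(pos)

    index = {}
    for term, tid in term_ids.items():
        # Each inverted item: (doc ID, frequency in that doc, positions).
        plist = [(d, len(ps), ps) for d, ps in sorted(postings[tid].items())]
        index[term] = {"term_id": tid, "doc_freq": len(plist), "postings": plist}
    return index


index = build_index({1: "solr search solr", 2: "lucene search"})
print(index["solr"])
# {'term_id': 0, 'doc_freq': 1, 'postings': [(1, 2, [0, 2])]}
print(index["search"])
# {'term_id': 1, 'doc_freq': 2, 'postings': [(1, 1, [1]), (2, 1, [1])]}
```

Looking up a word is then a single dictionary access that yields its document frequency and every document and position where it occurs.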

The inverted index storage structure is shown in the following figure.

Summary: the storage structure of the inverted index is generally as follows: term ID and term string + document frequency of the term + inverted list (for each document that contains the term: the document ID, the term frequency and the position information of the term). In other words, the inverted index records each word, the IDs of the documents in which the word appears, and an inverted list. With this index structure, finding the documents for a given word does not require scanning every document, which is why queries are fast.

This concludes the introduction to Solr and to inverted index usage. Thank you for reading.
