
How Lucene stores its inverted index

Updated 2025-02-28. From: SLTechnology News&Howtos (shulou)

Shulou (Shulou.com), 06/02

This article explains how Lucene stores its inverted index, a topic that often causes confusion when working with real indexes.

Lucene's inverted index storage has three parts: storage of the term index data, storage of term position data, and the write process. Each part is complex enough to need its own treatment, so this article starts with how position data is stored.

A term's position data comprises its frequency and positions in each document, plus any attached payload (storage of norm information is ignored here). Lucene writes these three kinds of data to three separate output streams, as follows:

1. For each term, Lucene records the ID of every document that contains it and the term frequency within that document. Because document IDs are sorted, they are delta-encoded when written (each ID is stored as its difference from the previous one), while term frequencies are stored as-is. Every 128 records are then compressed and stored as a block.

2. Let Doc = "abc|123 bcd def". Each time a document ID and term frequency are written, Lucene also writes, for each occurrence of the term in that document, its position, its start and end offsets, and any payload attached to it. The position is the ordinal of the token after the document has been tokenized (def is the third token). The start and end offsets are character positions in the untokenized document (bcd starts at 8 and ends at 11). Subtracting the start offset from the end offset gives the combined length of the term, the payload separator and the payload, so do not mistake it for the length of the term alone or of the payload alone: here 123 is the payload of abc, and the start and end offsets of abc are 0 and 7.

Because positions and offsets are sorted, they are delta-encoded. Positions are compressed in blocks of 128 records and stored in a separate file. Payload lengths and offsets are likewise compressed in blocks of 128 records and stored in a separate file, while the payload content itself is written uncompressed into that same file.
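The example document can be worked through in a few lines of Python. This is only a sketch under stated assumptions: tokens are split on whitespace, "|" separates a term from its payload, and positions are counted from 1 to match "def is the third token". The function `analyze` and its parameter `payload_sep` are hypothetical names for illustration, not Lucene APIs.

```python
def analyze(doc, payload_sep="|"):
    """Split doc on whitespace and return one dict per token.

    Hypothetical helper for illustration; not part of Lucene.
    """
    tokens = []
    cursor = 0
    for position, raw in enumerate(doc.split(), start=1):
        start = doc.index(raw, cursor)  # start offset in the untokenized text
        end = start + len(raw)          # end offset (exclusive)
        term, _, payload = raw.partition(payload_sep)
        tokens.append({
            "term": term,
            "position": position,       # 1-based ordinal of the token
            "start": start,
            "end": end,
            "payload": payload,         # empty string when there is no payload
        })
        cursor = end
    return tokens


for tok in analyze("abc|123 bcd def"):
    print(tok)
```

For abc the offsets are 0 and 7, so the offset length is 7: three characters of term, one separator and three characters of payload, which is why that length should not be read as the term length.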

Term frequencies are compressed in blocks of 128 records because many documents may contain the same term; in the extreme case, every document does. Besides compression, Lucene must also support random access to each document's position data, so it builds an index layer over the position data, with one set of index entries per term.
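Delta encoding and 128-record blocking can be sketched as follows. This is a minimal illustration, not Lucene's implementation: the helper names are hypothetical, and the actual compression of each block is left out, so the sketch only shows how sorted IDs become small gaps and how a list is divided into full blocks plus a leftover tail.

```python
BLOCK_SIZE = 128  # block size given in the text


def delta_encode(sorted_values):
    """Store each value as its gap from the previous one (first gap is from 0)."""
    deltas = []
    prev = 0
    for value in sorted_values:
        deltas.append(value - prev)
        prev = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original sorted values by summing the gaps."""
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def split_into_blocks(values, block_size=BLOCK_SIZE):
    """Return (full_blocks, tail): complete blocks and the leftover records."""
    full_end = len(values) - len(values) % block_size
    blocks = [values[i:i + block_size] for i in range(0, full_end, block_size)]
    return blocks, values[full_end:]


doc_ids = [3, 7, 8, 20, 21]
print(delta_encode(doc_ids))  # gaps between consecutive document IDs

blocks, tail = split_into_blocks(list(range(300)))
print(len(blocks), len(tail))  # number of full blocks, size of the tail
```

Sorted IDs produce small gaps, and small numbers need fewer bits, which is what makes the block compression effective.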

Each entry in the per-term index records the ID of the last document in the previous block, the file pointer of the last position block, the file pointer of the last payload block, the number of remaining uncompressed positions, and the length of the remaining uncompressed payload data. (The contents of this index will be explained in detail in the next section.)
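One way to picture such an index entry is as a small record with the five fields listed above, plus a lookup that finds how far a reader can jump ahead. The sketch below is an assumption-laden illustration: the class `SkipEntry`, its field names and the flat, single-level list searched by `find_skip_entry` are all hypothetical and do not mirror Lucene's actual on-disk layout.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class SkipEntry:
    last_doc_id: int         # ID of the last document in the previous block
    pos_block_fp: int        # file pointer of the last position block
    pay_block_fp: int        # file pointer of the last payload block
    pos_buffered: int        # number of positions not yet compressed
    pay_bytes_buffered: int  # length of payload data not yet compressed


def find_skip_entry(entries, target_doc_id):
    """Return the last entry whose last_doc_id is below target_doc_id.

    Everything up to that entry can be skipped; None means the reader
    must start from the beginning of the term's data.
    """
    keys = [entry.last_doc_id for entry in entries]
    index = bisect_left(keys, target_doc_id)  # count of keys < target
    return entries[index - 1] if index > 0 else None


entries = [
    SkipEntry(127, 1000, 2000, 5, 12),
    SkipEntry(300, 1800, 2900, 0, 0),
    SkipEntry(512, 2600, 3500, 9, 40),
]
print(find_skip_entry(entries, 301))
```

Because each entry stores the file pointers for its block, a reader that lands on an entry knows where to resume in the position and payload files without decoding the blocks it skipped.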

When a term is written, any trailing records that do not fill a block of 128 are handled differently: document IDs and term frequencies are encoded as VInts, positions, payload lengths and offsets are also encoded as VInts, and payload content is written directly.
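VInt is a variable-length integer encoding: each byte carries 7 bits of the value, low-order bits first, and the high bit of a byte is set when more bytes follow. The sketch below shows the encoding for non-negative integers; the function names `write_vint` and `read_vint` are chosen here for illustration and are not Lucene's Java API.

```python
def write_vint(value):
    """Encode a non-negative integer as VInt bytes (low-order 7 bits first)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # 7 payload bits + continuation bit
        value >>= 7
    out.append(value)                      # final byte, continuation bit clear
    return bytes(out)


def read_vint(data, offset=0):
    """Decode one VInt from data at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, offset
        shift += 7


encoded = write_vint(300)
print(encoded.hex())       # two bytes for 300
print(read_vint(encoded))  # decoded value and the offset after it
```

Small values such as delta-encoded document IDs fit in a single byte, which is why this encoding suits the short tail that is too small to form a full block.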

This concludes the discussion of how Lucene stores its inverted index. Thank you for reading.
