The storage mode of lucene inverted index 07/13 Update SLTechnology News&Howtos

The storage mode of lucene inverted index

2025-07-13 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/02 Report--

This article mainly introduces "the storage mode of lucene inverted index". In the daily operation, I believe that many people have doubts about the storage mode of lucene inverted index. The editor consulted all kinds of data and sorted out simple and easy-to-use operation methods. I hope it will be helpful to answer the doubts of "lucene inverted index storage mode". Next, please follow the editor to study!

Through the analysis of the storage mode of lucene inverted index (3-2) and the storage mode of lucene inverted index (3-1), we can know that when the number of words with common prefixes reaches the specified threshold, it will be written to a Block, but the details have not been analyzed. Now, through the given input abc, abcd, abcde, abcdef, abcdf, abcdg, abe, abf, abg, abh. And parameter thresholds MIN_BLOCK_SIZE = 3 and MAX_BLOCK_SIZE = 4 to further study the composition of the inverted index.

When entering abe, the number of abcd, abcde, abcdef, abcdf, and abcdg prefixed with abcd has exceeded the number of MIN_BLOCK_SIZE, so the condition for writing a Block is reached. Because of MAX_BLOCK_SIZE=4, these five item need to be further split into a group of MIN_BLOCK_SIZE3, that is, abcd, abcde, abcdef will be written to a Block (the Block is also a leafBlock, because its elements have only words), except for the prefix abcd. The rest of the suffix and postings metadata (compressed) are all written to the termsOut file stream. For the Block, there are the following important attributes prefix, startFP, hasTerms, isFloor, floorLeadLabel, subIndices, where prefix represents the common prefix, that is, abcd,startFP, which represents the termsOut file stream pointer (through which the information of each word in the Block can be parsed from the tim file), and hasTerms indicates whether the block contains words (it is obvious that the Block is full of words). IsFloor for True indicates that the block has been further partitioned due to more than the number of MAX_BLOCK_SIZE, floorLeadLabel the prefixLen character of the first word in the block (the block is-1), and subIndices represents the index of the subblock (the block has no subblocks). Then abcdf and abcdg are also written to a new block, but in this case prefix records abcdf,floorLeadLabel records f. After writing these two blocks, we have to figure out how to locate the two Block: first, the Floor of both blocks is True, first write the fp, hasTerm tag and isFloor of the first block, and then write the number of remaining Block to write floorLeadLabel for each Block (mainly used to query a word, if it does not exist, there is no need to load all the words under the block), the information will eventually be converted into a byte array and written into the FST together with the prefix abcd The FST is saved in the first Block, and finally the abcd, abcde, abcdef, abcdf, and abcdg in the pending list are replaced with the Block, so the elements in the pending list are pendingTerm:abc, pendingBlock:abcd, and the new pendingTerm:abe.

Continue to add abf, abg, and abh, and there are 6 entry in the pending list, of which 5 are term,1 block. Finally, you need to continue to divide the six entry into blocks according to the threshold set by MIN_BLOCK_SIZE and MAX_BLOCK_SIZE: first, term:abc, block:abcd, and term:abe will form a new Block, in which the file pointer of block will be recorded when writing block:abcd, and then term:abf, abg, and abh will also form a new Block. Finally, the two new Block will be written together to FST and saved in the first Block. At this point, the analysis of the storage mode of the inverted index is over, in fact, the logical view of the inverted index of lucene is still tree-shaped (the index data will eventually be saved in the form of FST, so FST is a graph because FST can share suffixes, but it can also be regarded as an optimization of the tree, we can think of FST as a prefix tree with shared suffixes), and the jump list implemented in the early days is also tree-like. The B-trees commonly used in the relational database are also tree-like, and they have certain differences in the efficiency of query and the real-time performance of reading and writing. Later, these three forms of trees are carefully analyzed. After that, it focuses on the analysis of index merging, multi-thread writing, real-time and transactional aspects of lucene!

At this point, the study on "the storage of lucene inverted indexes" is over. I hope to be able to solve your doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.