What are the contents of the lucene4.7 index file

2025-04-01 Update From: SLTechnology News&Howtos

This article describes the contents of the Lucene 4.7 index files: which files make up an index, what each one stores, and the basic concepts behind the index structure.

The following figure shows a typical Lucene 4.x index structure:

Index files in Lucene 4.x are organized as follows:

| Name | Extension | Description |
| --- | --- | --- |
| Segments File | segments.gen, segments_N | Stores commit point information for the segment files |
| Lock File | write.lock | Write lock; ensures that only one writer can modify the index at any time |
| Segment Info | .si | Stores the metadata of each segment |
| Compound File | .cfs, .cfe | Compound index file; an optional "virtual" file for systems that frequently run out of file handles |
| Fields | .fnm | Stores information about the fields |
| Field Index | .fdx | Stores pointers to the field data |
| Field Data | .fdt | Stores the stored fields of all documents |
| Term Dictionary | .tim | Term dictionary; stores term information |
| Term Index | .tip | Index into the term dictionary |
| Frequencies | .frq | Term frequency file; lists the documents that contain each term, along with the term's frequency |
| Positions | .prx | Position information; stores where each term occurs in the index |
| Norms | .nrm.cfs, .nrm.cfe | Encodes length and boost factors for documents and fields |
| Per-Document Values | .dv.cfs, .dv.cfe | Encodes additional scoring factors or other per-document information |
| Term Vector Index | .tvx | Term vector index; stores offsets into the document data file |
| Term Vector Documents | .tvd | Contains information about each document that has term vectors |
| Term Vector Fields | .tvf | Stores field-level term vector information |
| Deleted Documents | .del | Records which documents have been deleted |
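To make the file list above easier to apply when looking at an index directory, here is a minimal sketch that maps a file name to the kind of index file it is. The `FILE_KINDS` dictionary and the `classify` function are hypothetical helpers written for this illustration, not part of any Lucene API, and the sample file names are made up.

```python
# Toy lookup built from the file list above; not part of any Lucene API.
FILE_KINDS = {
    ".si": "Segment Info",
    ".cfs": "Compound File",
    ".cfe": "Compound File",
    ".fnm": "Fields",
    ".fdx": "Field Index",
    ".fdt": "Field Data",
    ".tim": "Term Dictionary",
    ".tip": "Term Index",
    ".frq": "Frequencies",
    ".prx": "Positions",
    ".tvx": "Term Vector Index",
    ".tvd": "Term Vector Documents",
    ".tvf": "Term Vector Fields",
    ".del": "Deleted Documents",
}


def classify(filename):
    """Return the kind of index file, based only on its name."""
    if filename == "write.lock":
        return "Lock File"
    if filename == "segments.gen" or filename.startswith("segments_"):
        return "Segments File"
    # Check the two-part suffixes before the generic one-part suffixes,
    # otherwise "x.nrm.cfs" would be reported as a plain compound file.
    if filename.endswith((".nrm.cfs", ".nrm.cfe")):
        return "Norms"
    if filename.endswith((".dv.cfs", ".dv.cfe")):
        return "Per-Document Values"
    for ext, kind in FILE_KINDS.items():
        if filename.endswith(ext):
            return kind
    return "Unknown"


for name in ["segments_2", "write.lock", "_0.fdt", "_0.nrm.cfs", "notes.txt"]:
    print(name, "->", classify(name))
```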

With a compound index, every index file except the segment info file, the lock file, and the deleted-documents file is packed into a single file with the suffix .cfs, so that all of those index files are stored as one Directory. A non-compound index is more flexible, because each index file can be accessed separately; the files inside a compound file cannot, because they have been merged into one. The compound format is therefore a good fit for scenarios where the index is queried frequently but updated rarely.
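The idea of packing several files into one can be sketched as follows. This is a toy illustration only, assuming a simple scheme of one data blob plus a table of (offset, length) entries; it does not reproduce Lucene's actual on-disk format, and `pack` and `read_entry` are hypothetical names.

```python
import io


def pack(files):
    """Concatenate files into one blob and return (blob, entry_table).

    files maps a file name to its bytes. The entry table maps each
    name to the (offset, length) of its data inside the blob.
    """
    blob = io.BytesIO()
    entries = {}
    for name, data in files.items():
        entries[name] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), entries


def read_entry(blob, entries, name):
    """Slice one packed file back out of the blob."""
    offset, length = entries[name]
    return blob[offset:offset + length]


files = {"_0.fdt": b"stored", "_0.tim": b"terms"}
blob, entries = pack(files)
print(entries)
print(read_entry(blob, entries, "_0.tim"))
```

Only one file handle is needed to read any packed entry, which is the benefit the compound format trades against the flexibility of separate files.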

The basic concepts of a Lucene index are the index, documents, fields, and terms. An index contains a sequence of documents, a document contains a sequence of fields, a field contains a sequence of terms, and a term is a sequence of bytes, the lowest level. "Sequence" here refers to the order within the index structure; the data is usually kept in this order, which in some cases allows the index structure to be optimized.
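The nesting described above can be written down as a few plain data types. This is only a conceptual sketch; the class names `Index`, `Document`, and `Field` are chosen for this illustration and are not Lucene's classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    name: str
    terms: List[bytes]  # each term is a sequence of bytes


@dataclass
class Document:
    fields: List[Field]


@dataclass
class Index:
    documents: List[Document] = field(default_factory=list)

    def add(self, doc):
        """Append a document and return its position (a toy docid)."""
        self.documents.append(doc)
        return len(self.documents) - 1


index = Index()
first = index.add(Document([Field("title", [b"lucene", b"index"])]))
second = index.add(Document([Field("title", [b"search"])]))
print(first, second)
```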

Lucene stores its index information in an inverted index, which greatly improves retrieval efficiency.

To take a familiar example: the natural approach is to store, for each article, the words it contains. An inverted index does the opposite and stores, for each word, the documents that contain it. This relationship is held in inverted lists, each storing a series of docids. At search time, a word leads directly to the articles in which it appears, which greatly improves retrieval performance.
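A minimal sketch of this idea, assuming whitespace tokenization and lowercasing (real analyzers do much more), might look like the following. The function name `build_inverted_index` and the sample documents are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of docids whose text contains it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = [
    "lucene stores an inverted index",
    "an index maps terms to documents",
    "lucene is a search library",
]
inverted = build_inverted_index(docs)
print(inverted["lucene"])  # [0, 2]
print(inverted["index"])   # [0, 1]
```

A lookup is a single dictionary access on the term, rather than a scan over every document.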

Lucene does not rely on the inverted index alone; it also has forward storage. The inverted index is the core of Lucene because it is what makes retrieval fast, but to return specific documents the information has to be read out in the forward direction. In code, this means that a search yields a set of docids, each docid is used to fetch the whole document, and the fields of that document, with their stored values, are then read from it.

This only works for fields that were stored. If a field is indexed but not stored, you can search on it but cannot get its original value back. This has to be decided before indexing, when designing the storage structure of the index: which fields are searchable, which fields are stored, and so on. If you also need to highlight content, you must store the offsets for that field as well, so that the matched keywords can be located and marked accurately in the text. If you do not plan to highlight in the front end, there is no need to store this information.
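The difference between indexing a field and storing it can be illustrated with a toy index. This is a sketch under simplifying assumptions (whitespace tokenization, in-memory structures); the `ToyIndex` class and its methods are hypothetical and do not mirror Lucene's API.

```python
from collections import defaultdict


class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> docids
        self.stored = []                  # docid -> {field: original value}

    def add(self, fields):
        """fields: {name: (value, indexed, stored)}; returns the new docid."""
        doc_id = len(self.stored)
        kept = {}
        for name, (value, indexed, stored) in fields.items():
            if indexed:
                # Inverted side: makes the field searchable.
                for term in value.lower().split():
                    self.postings[(name, term)].add(doc_id)
            if stored:
                # Forward side: keeps the original value for retrieval.
                kept[name] = value
        self.stored.append(kept)
        return doc_id

    def search(self, field_name, term):
        return sorted(self.postings.get((field_name, term), ()))

    def document(self, doc_id):
        return self.stored[doc_id]


idx = ToyIndex()
idx.add({
    "title": ("Lucene index files", True, True),       # indexed and stored
    "body": ("inverted index internals", True, False),  # indexed only
})
hits = idx.search("body", "inverted")
print(hits)                   # [0]
print(idx.document(hits[0]))  # {'title': 'Lucene index files'}
```

The search on `body` succeeds, but the retrieved document contains only `title`, because `body` was indexed without being stored.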

That concludes this overview of the contents of the Lucene 4.7 index files. Theory is easier to absorb alongside practice, so try building an index and inspecting its files yourself.
