How Lucene's Directory Is Implemented

Updated 2025-01-16. Source: SLTechnology News&Howtos (shulou), Internet Technology

This article looks at how Lucene's Directory is implemented. It explains the role Directory plays in index storage and then walks through a few core pieces of its source code.

First, let's look at the class hierarchy of the Directory family.

[Figure: class hierarchy of Directory and its subclasses]

In this hierarchy, Directory has 11 direct or indirect subclasses, each with its own role. As the top-level parent class of the inheritance graph, Directory plays an important part in Lucene. Just as HDFS is the foundation of Hadoop, Directory carries the responsibility for index storage, and without storage there is nothing to retrieve. We usually talk about full-text retrieval or search engines, but behind them Directory is the unsung hero (the original calls it the unknown "Lei Feng").

Here is a detailed analysis of the core implementation of Directory.

In Lucene, a Directory is a directory made up of a series of index files.

[Figure: screenshot of a typical index file layout]

The role of Directory is to manage these index files: reading and writing their data, and adding, deleting and merging index files. Seen this way, Directory acts much like a system administrator. The sections below look at some of its core members in detail.

Lucene's index supports shared reads and exclusive writes on an index directory: any number of threads may read concurrently, but only one writer may write to the directory at a time. (A single IndexWriter instance can be shared by several threads; the restriction is on opening more than one writer against the same directory.)

Why not allow several independent writers? At any moment the index directory has internal state of its own, such as file pointers. Uncoordinated concurrent writes would corrupt that state, damaging the index structure or losing data, so Lucene forbids them. Because writes to one directory are serialized in this way, the gain from multithreaded writing over single-threaded writing may be limited.

How does Lucene enforce that only one writer is active? The Directory source shows that it keeps a lock factory internally. Once a writer holds the lock, later writers are refused. The lock is not only there to prevent concurrent writes: the lock name can also be used to judge whether two indexes are the same index. If we want to use multiple threads to speed up writing, one compromise is to let each thread write to its own directory and merge the directories at the end.

The lock-related members in the Directory source look like this:

    // lock implementation; can only be overridden by subclasses
    protected LockFactory lockFactory;

    // create a lock with the given name
    public Lock makeLock(String name) {
        return lockFactory.makeLock(name);
    }

    // clear the lock
    public void clearLock(String name) throws IOException {
        if (lockFactory != null) {
            lockFactory.clearLock(name);
        }
    }
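To make the exclusive-write idea concrete, here is a minimal sketch using only the Java standard library. It is not Lucene's implementation: `SimpleLockFactory` and `SimpleLock` are hypothetical stand-ins for the LockFactory and Lock types, and the lock name "write.lock" is used only as an illustrative name. The sketch assumes an in-memory set of held lock names instead of lock files on disk.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for the LockFactory/Lock idea; NOT Lucene's classes.
class WriteLockSketch {

    /** Hypothetical factory that hands out named locks for one directory. */
    static final class SimpleLockFactory {
        // Names of the locks that are currently held.
        private final Set<String> held = ConcurrentHashMap.newKeySet();

        SimpleLock makeLock(String name) {
            return new SimpleLock(name, held);
        }

        /** Forcibly drops a lock, e.g. one left behind by a crashed writer. */
        void clearLock(String name) {
            held.remove(name);
        }
    }

    /** Hypothetical lock handle; each handle is meant to be used by one writer. */
    static final class SimpleLock {
        private final String name;
        private final Set<String> held;
        private boolean owned; // true only while this handle holds the lock

        SimpleLock(String name, Set<String> held) {
            this.name = name;
            this.held = held;
        }

        boolean obtain() {
            // Set.add is atomic and returns false if the name is already held.
            if (!owned && held.add(name)) {
                owned = true;
            }
            return owned;
        }

        void release() {
            // Only the handle that obtained the lock may release it.
            if (owned) {
                held.remove(name);
                owned = false;
            }
        }
    }

    public static void main(String[] args) {
        SimpleLockFactory factory = new SimpleLockFactory();
        SimpleLock first = factory.makeLock("write.lock");
        SimpleLock second = factory.makeLock("write.lock");

        System.out.println(first.obtain());  // true: the first writer gets the lock
        System.out.println(second.obtain()); // false: the second writer is refused
        first.release();
        System.out.println(second.obtain()); // true: the lock is free again
    }
}
```

Two handles with the same name compete for the same entry, which mirrors how the lock name identifies the index being protected.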

Next, consider another field in the Directory source, isOpen:

    // note that the field is declared with the volatile keyword
    volatile protected boolean isOpen = true;

isOpen records whether the current Directory instance is still open. Because the field is declared volatile, a read of it observes the most recent write made by any thread, rather than a stale value cached by the reading thread. As a result, once a Directory instance is closed, every concurrent reader thread sees the new state right away, and a reader that then tries to use the closed instance gets an exception saying the directory is closed (AlreadyClosedException in Lucene). In short, isOpen keeps the view of the Directory state consistent across threads when the index is read concurrently.
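The effect of the volatile flag can be sketched with plain Java, as follows. This is a simplified model, not Lucene's code: `ClosableStore`, its `read` method and the file name "segments" are hypothetical, and `IllegalStateException` stands in for the exception Lucene would throw. A reader thread keeps reading until another thread closes the store.

```java
// Simplified model of a resource guarded by a volatile flag; NOT Lucene's code.
class VolatileFlagSketch {

    /** Hypothetical store that refuses reads once it has been closed. */
    static class ClosableStore {
        // volatile: a write by the closing thread is visible to later reads
        // of this field in every other thread.
        private volatile boolean isOpen = true;

        void close() {
            isOpen = false;
        }

        // Check performed before every read, in the spirit of an ensureOpen() guard.
        private void ensureOpen() {
            if (!isOpen) {
                throw new IllegalStateException("this store is closed");
            }
        }

        int read(String fileName) {
            ensureOpen();
            return fileName.length(); // placeholder for real file access
        }
    }

    /** Returns true if the reader thread noticed the close and stopped. */
    static boolean readerStopsAfterClose() throws InterruptedException {
        ClosableStore store = new ClosableStore();
        Thread reader = new Thread(() -> {
            try {
                while (true) {
                    store.read("segments");
                }
            } catch (IllegalStateException closed) {
                // Expected once the other thread has closed the store.
            }
        });
        reader.setDaemon(true); // do not keep the JVM alive if something goes wrong
        reader.start();

        Thread.sleep(50);   // let the reader spin for a moment
        store.close();      // volatile write
        reader.join(5000);  // wait for the reader to observe it
        return !reader.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(readerStopsAfterClose());
    }
}
```

Without volatile, the Java memory model would not guarantee that the spinning reader ever observes the change made by `close()`.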

    private static final class SlicedIndexInput extends BufferedIndexInput {
        IndexInput base;
        long fileOffset;
        long length;

        SlicedIndexInput(final String sliceDescription, final IndexInput base,
                         final long fileOffset, final long length) {
            this(sliceDescription, base, fileOffset, length, BufferedIndexInput.BUFFER_SIZE);
        }

        SlicedIndexInput(final String sliceDescription, final IndexInput base,
                         final long fileOffset, final long length, int readBufferSize) {
            super("SlicedIndexInput(" + sliceDescription + " in " + base
                    + " slice=" + fileOffset + ":" + (fileOffset + length) + ")", readBufferSize);
            this.base = base.clone(); // work on a private clone of the base input
            this.fileOffset = fileOffset;
            this.length = length;
        }
        // (the rest of the class is not shown in this excerpt)
    }

Next, consider the role of SlicedIndexInput, the static inner class of Directory shown above. Lucene's index files are loosely coupled: different types of data are stored in different files. We can therefore read the contents of one specific index file by its file name, and likewise, when writing, write only the part of the data that is affected. Working on demand in this way avoids touching the entire directory. The design improves performance and helps keep the data stable and reliable. It does add some complexity to Directory management, but that cost is minor.

SlicedIndexInput is what lets Lucene read one part of the index data on its own. Note that it does not operate on the original input directly: its constructor stores a clone of the base IndexInput. This is a real advantage under concurrent reads, because each reader works on its own copy with its own read position. The excerpt itself contains no clone method, but following the inheritance chain SlicedIndexInput ==> BufferedIndexInput ==> IndexInput ==> DataInput shows that the parent classes implement the Cloneable and Closeable interfaces. That is what allows SlicedIndexInput to be cloned correctly and to release the IO resources it holds when it is closed.
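The slice-over-a-clone idea can be illustrated with a small standard-library sketch. It is a simplified model under stated assumptions, not Lucene's IndexInput classes: `ArrayInput` and `SlicedInput` are hypothetical names, the "file" is a byte array in memory, and the sample contents are made up.

```java
import java.nio.charset.StandardCharsets;

// Simplified model of a sliced, cloneable input; NOT Lucene's IndexInput classes.
class SliceSketch {

    /** Hypothetical reader over a shared byte array with its own read position. */
    static class ArrayInput implements Cloneable {
        private final byte[] data; // shared between clones, never modified
        private int pos;           // per-instance read position

        ArrayInput(byte[] data) {
            this.data = data;
        }

        void seek(int newPos) {
            pos = newPos;
        }

        byte readByte() {
            return data[pos++];
        }

        @Override
        public ArrayInput clone() {
            try {
                // Shallow copy: same byte array, independent position field.
                return (ArrayInput) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    /** Exposes the window [offset, offset + length) of a base input. */
    static final class SlicedInput {
        private final ArrayInput base;
        private final int offset;
        private final int length;
        private int pos; // position relative to the start of the slice

        SlicedInput(ArrayInput base, int offset, int length) {
            this.base = base.clone(); // private clone: caller's position is untouched
            this.offset = offset;
            this.length = length;
        }

        byte readByte() {
            if (pos >= length) {
                throw new IndexOutOfBoundsException("read past end of slice");
            }
            base.seek(offset + pos);
            pos++;
            return base.readByte();
        }
    }

    /** Reads a whole slice and returns it as an ASCII string. */
    static String readSlice(byte[] file, int offset, int length) {
        SlicedInput slice = new SlicedInput(new ArrayInput(file), offset, length);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append((char) slice.readByte());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] file = "HEADERpostingsFOOTER".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readSlice(file, 6, 8)); // prints "postings"
    }
}
```

Because the slice reads through its own clone, several threads can each hold a slice over the same underlying data without disturbing one another's read positions.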

That concludes this look at how Lucene's Directory is implemented. Theory is easier to absorb alongside practice, so it is worth opening the source and trying things out yourself.
