This article analyzes the combined use of Lucene and HBase, together with an analysis of HBasene, and presents the problem and its solution in detail, in the hope of helping readers facing the same problem find a simple, feasible approach.
Introduction to Lucene
In Lucene, the unit of search is the document. A document consists of fieldName/fieldValue pairs, and each fieldValue is broken into one or more terms. Depending on the analyzer and indexing rules, the set of terms that can be used to search a fieldValue may be smaller than the set of terms that make up that fieldValue. Lucene's search is built on an inverted index that records which fields' terms appear in which documents. With Lucene you can look up a document forward to see what field information it contains, or go through the inverted index to find the documents that contain a given term of a field.
[Figure 1] Overall architecture of Lucene
As shown in figure 1, IndexSearcher implements the search logic, IndexWriter handles document insertion and builds the inverted index, and IndexReader is called by IndexSearcher to read the contents of the index. Both IndexReader and IndexWriter depend on the abstract class Directory, which provides the API for manipulating index data.
Standard Lucene ships with file-system-based and memory-based backends.
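To make these moving parts concrete, here is a minimal, self-contained sketch written against the Lucene 8.x API (older releases used RAMDirectory and a different IndexWriter constructor); the field names and text are illustrative only and are not taken from any project discussed here. It builds a document from fieldName/fieldValue pairs, indexes it through IndexWriter into a Directory, then uses IndexSearcher to find it by term and read back its stored fields:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneBasics {
    public static void main(String[] args) throws Exception {
        // Directory is the pluggable storage abstraction; this uses the
        // memory-based backend, FSDirectory.open(path) is the file-system one.
        Directory dir = new ByteBuffersDirectory();

        // IndexWriter analyzes fieldValues into terms and builds the inverted index.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        Document doc = new Document();
        doc.add(new StringField("id", "doc-1", Field.Store.YES)); // kept as a single term
        doc.add(new TextField("body", "Lucene stores documents as fields", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // IndexSearcher answers term queries through an IndexReader.
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            // Forward lookup: fetch the stored fields of the matching document.
            System.out.println(searcher.doc(sd.doc).get("id"));
        }
        reader.close();
    }
}

Since everything below IndexWriter and IndexSearcher goes through the Directory handle, swapping in a different storage backend is, in principle, a matter of supplying another Directory implementation.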
The drawback of the standard file-system-based backend is that performance degrades as the index grows. A variety of techniques have been used to work around this, including load balancing and index sharding (splitting the index across multiple Lucene instances). Although sharding is powerful, it complicates the overall architecture and requires considerable foreknowledge of the expected documents in order to partition the Lucene index properly. Moreover, with large volumes of data, merging segments is very expensive, and frequent updates put heavy pressure on disk I/O: inserting a single new document can cause indexes that have not changed at all to be rewritten many times, and can leave behind many small index segments, degrading search performance.
Lucene's strength lies in the speed of index lookup rather than in document storage. To address the problems above, backends that store the index in a NoSQL database have emerged.
Below we analyze such a design based on an HBase implementation.
Implementation approach
Lucene manipulates two separate datasets:
The document dataset stores all documents, including their stored fields.
The index dataset stores all field/term/frequency/position information, together with the documents that contain each field.
If you intend to port Lucene's backend to HBase, building an implementation of Directory directly is not the easiest route. Among existing open-source projects, both Lucandra and HBasene instead rewrite IndexReader and IndexWriter, bypassing the Directory API entirely. Neither implementation overrides Lucene's index query mechanism; by overloading IndexSearcher, the existing query mechanism can be kept while performance is enhanced through the capabilities of the backend.
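For illustration, here is a deliberately incomplete, hypothetical skeleton of what a direct Directory port to HBase might look like (again, neither Lucandra nor HBasene takes this route); the one-row-per-index-file layout, the "f" column family, and the "length" qualifier are all assumptions of this sketch. The methods left abstract at the bottom are precisely the expensive part that rewriting IndexReader/IndexWriter avoids:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.store.Directory;

// Hypothetical sketch only: one HBase row per Lucene index file.
public abstract class HBaseDirectory extends Directory {
    private static final byte[] CF  = Bytes.toBytes("f");      // assumed column family
    private static final byte[] LEN = Bytes.toBytes("length"); // assumed qualifier
    private final Table table; // e.g. a table named "lucene_directory"

    protected HBaseDirectory(Table table) { this.table = table; }

    @Override
    public String[] listAll() throws IOException {
        // Every row key is an index file name.
        List<String> names = new ArrayList<>();
        try (ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result r : scanner) names.add(Bytes.toString(r.getRow()));
        }
        return names.toArray(new String[0]);
    }

    @Override
    public long fileLength(String name) throws IOException {
        Result r = table.get(new Get(Bytes.toBytes(name)));
        return Bytes.toLong(r.getValue(CF, LEN));
    }

    @Override
    public void deleteFile(String name) throws IOException {
        table.delete(new Delete(Bytes.toBytes(name)));
    }

    // createOutput, openInput, sync, rename, close, ... remain abstract here:
    // each needs a streaming IndexInput/IndexOutput bridge to HBase cells,
    // which is the bulk of the work a full Directory port entails.
}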
[Figure 2] Redesign of the Lucene backend
The design in figure 2 integrates the Lucene backend with HBase, storing the index data in HBase and thereby gaining HBase's big-data storage and distributed capabilities.
Architecture design
In this architecture, HBase serves as the persistent backend for the index, and a memory-based caching layer can be added on top of it to speed up reads, as has been suggested elsewhere. An efficient cache-synchronization mechanism also helps improve the read and write rates.
[Figure 3] HBase backend implementation with an in-memory cache and cache synchronization
Every HBase access crosses the network, so network conditions strongly affect the system, while index maintenance is expected to be real-time and highly responsive. To balance these conflicting requirements, a cache can greatly improve performance by keeping in memory the data most often read for search and document retrieval, minimizing what must be fetched from HBase; and multiple Lucene instances can be run as needed to serve a growing number of search clients. The latter requires keeping cache lifetimes short so that each instance's cache stays synchronized with the content of the HBase instance (the authoritative copy mentioned above). A workable compromise is a configurable time-to-live for cached entries together with a bound on the cache held in each Lucene instance.
Under this structure, a read operation first checks whether the required data is in memory and unexpired; if it is valid it is used directly, otherwise the data is fetched from HBase and the in-memory copy is refreshed. A write operation can simply go straight to HBase, with no need to create or update cache entries, which also improves the system's real-time responsiveness.
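A minimal sketch of that read/write policy, assuming a single backing HBase table and a configurable time-to-live; the class and every name in it are invented for illustration:

import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical read-through cache with a configurable time-to-live.
public class TtlIndexCache {
    private static final class Entry {
        final byte[] value;
        final long loadedAt;
        Entry(byte[] value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final ConcurrentMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final Table hbase;    // backing HBase table holding the index data
    private final long ttlMillis; // configurable lifetime of a cached row

    public TtlIndexCache(Table hbase, long ttlMillis) {
        this.hbase = hbase;
        this.ttlMillis = ttlMillis;
    }

    // Read path: serve from memory if present and unexpired, else reload from HBase.
    public byte[] read(String rowKey) throws IOException {
        Entry e = cache.get(rowKey);
        if (e != null && System.currentTimeMillis() - e.loadedAt < ttlMillis) {
            return e.value; // still valid, no network round trip
        }
        byte[] fresh = hbase.get(new Get(Bytes.toBytes(rowKey))).value(); // refresh from HBase
        cache.put(rowKey, new Entry(fresh, System.currentTimeMillis()));
        return fresh;
    }

    // Write path: go straight to HBase; expiry will refresh any stale cached copy.
    public void write(Put put) throws IOException {
        hbase.put(put);
    }
}

Because writes bypass the cache, a stale in-memory copy survives at most ttlMillis before the read path refreshes it from HBase, which is the compromise described above.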
HBase table design
Two reference implementations are known so far: the HBasene style and the Lucandra style.
1. HBasene
Its index table consists of the following column families (a sketch of writing into this layout follows the list):
fm.sequence: records the sequenceId, i.e. how many documents have been added so far. The row is created when createLuceneIndexTable is executed, with rowKey = segmentId, Column.qualifier = qual.sequence, Column.value = -1. Each time a document is added, the Column.value for the current segmentId is incremented by 1.
fm.doc2int: each stored document is assigned a unique id. If the document's Field.Store = YES, all of the document's information can be retrieved through this id.
fm.fields: records the Field values. RowKey = documentId, Column.qualifier = FieldName, Column.value = FieldValue.
fm.termVector: term-vector offset data used for fuzzy search, recording offsets and related information. RowKey is the combination FieldName/Term, Column.qualifier = documentId, and Column.value holds all the position offsets in the document the qualifier points to, structured as [size][position]...[position].
fm.termFrequencies: the frequency with which each term appears in each document. RowKey = zfm/FieldName/Term, Column.qualifier = documentId, Column.value = the number of occurrences.
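As a rough, hedged illustration of this layout, the following sketch writes one document's field value and one term frequency using the HBase 1.x+ client API; the table handle, document id, and row/qualifier values are invented, and the real HBasene byte encodings differ in detail:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseneLayoutSketch {
    // Hypothetical: 'table' is the single HBasene index table described above.
    static void addDocument(Table table) throws Exception {
        byte[] docId = Bytes.toBytes(17L); // id as assigned via fm.doc2int

        // fm.fields: row = documentId, qualifier = fieldName, value = fieldValue
        Put fields = new Put(docId);
        fields.addColumn(Bytes.toBytes("fm.fields"), Bytes.toBytes("title"),
                Bytes.toBytes("lucene on hbase"));
        table.put(fields);

        // fm.termFrequencies: row = fieldName/term (simplified, without the
        // zfm prefix), qualifier = documentId, value = occurrence count
        Put freq = new Put(Bytes.toBytes("title/lucene"));
        freq.addColumn(Bytes.toBytes("fm.termFrequencies"), docId, Bytes.toBytes(1L));
        table.put(freq);
    }
}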
2. Lucandra
Lucandra uses only two column families to store its data: TermInfo and Documents.
TermInfo
TermInfo stores Lucandra's inverted-index information, i.e. the Field information of the index. Its structure is as follows:
RowKey: field/term
SuperColumn.name: documentId
[SubColumn.name: "frequencies", Column.value: count]
[SubColumn.name: "position", Column.value: position vector]
[SubColumn.name: "offsets", Column.value: offsets vector]
[SubColumn.name: "norms", Column.value: norms vector]
Since the concepts of SuperColumn and SubColumn do not exist in HBase (they come from Cassandra, on which Lucandra is built), we can simplify this to:
RowKey: field/term
Column.qualifier: documentId
Column.value: fieldInfo [this fieldInfo can be defined via an Avro class]
Documents
Documents stores the document data, structured as follows (a combined sketch for both tables follows):
RowKey: documentId
Column.name: fieldName
Column.value: fieldValue
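Putting the two simplified tables together, a hedged sketch of indexing one field occurrence with the HBase 1.x+ client might look like this; the column-family names ("info", "f") and the serialized fieldInfo string are assumptions standing in for the Avro-encoded record mentioned above:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LucandraStyleSketch {
    // Hypothetical handles for the two tables described above.
    static void index(Table termInfo, Table documents) throws Exception {
        byte[] docId = Bytes.toBytes("doc-42");

        // TermInfo: row = field/term, qualifier = documentId,
        // value = fieldInfo (frequencies/positions/offsets/norms; Avro-encoded in practice)
        Put term = new Put(Bytes.toBytes("body/lucene"));
        term.addColumn(Bytes.toBytes("info"), docId, Bytes.toBytes("{freq:1,pos:[3]}"));
        termInfo.put(term);

        // Documents: row = documentId, qualifier = fieldName, value = fieldValue
        Put doc = new Put(docId);
        doc.addColumn(Bytes.toBytes("f"), Bytes.toBytes("body"),
                Bytes.toBytes("lucene runs on hbase"));
        documents.put(doc);
    }
}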
Comparison:
As the HBasene table structure shows, every document update and every term query must touch several rows, which makes the logic relatively complex, whereas Lucandra only has to deal with two rows (one in Documents and one in TermInfo). HBasene's advantage is that its per-row storage structure is small, so each interaction with HBase transfers only a little data, and unlike the Lucandra structure it needs no external Avro component to serialize and deserialize the data.
Defects in HBasene:
1. The HBasene project stopped being updated three years ago, and its open-source support has lapsed.
2. Analysis of the latest HBasene source shows, first, that its design retains the concept of a segment, but in a way that hinders document queries: a documentId is composed of segmentId/sequenceId, and on each segment commit the sequenceId is reset to -1. The TopDocs returned by a search contain only the sequenceId (an int).
3. When a document is added, only the fm.fields and fm.doc2int data are flushed to HBase in real time; the other data is committed only once the segment reaches a certain size (1000 documents by default), which makes data loss easy. Until the segment is committed, newly added documents cannot be queried.
4. The HBase table design does not account for documents containing duplicate fieldNames. When fieldNames collide, a later addition overwrites the earlier data, and only the last copy is kept.
5. In fm.termVector, Column.value does not store positionVector information but rather the number of occurrences across all documents in the segment.
6. fm.termFrequencies is merely stored and never used anywhere.
7. The rewritten IndexWriter does not implement all of the functionality; only the most basic addDocument is provided.
8. The rewriting and extension of IndexSearcher is insufficient.
This concludes the analysis of the combined use of Lucene and HBase and of HBasene. I hope the content above has been of some help to you.