
Solutions for HBase Secondary Indexing

2025-07-12 Update From: SLTechnology News&Howtos


This article explains the available solutions for secondary indexing in HBase and is shared here for reference.

One frustration with HBase is that it does not natively support secondary indexes. As a result, the community has produced a number of complementary solutions to fill this gap.

Below we look at which secondary indexing schemes are available, compare the advantages and disadvantages of each, and then choose a scheme based on our specific scenarios.

1. Why do I need a secondary index?

HBase was designed to solve real-time reads and writes over big data, so it focuses on the scalability, fault tolerance, and read/write performance of distributed storage. In exchange, it gives up many features of traditional relational databases, such as transactions, SQL, and analytical queries.

This is the original idea behind NoSQL: make real-time access to big data the primary goal and meet users' large-scale storage needs through a simple Get, Put, and Scan interface. HBase is therefore a very good real-time access engine for big data, and it removes the capacity limits of traditional databases.

Official HBase currently does not support secondary indexes; the rowkey is the only (primary) index. Retrieving or querying data by non-rowkey fields usually means running a distributed computing framework such as MapReduce or Spark over the table, which consumes considerable hardware resources and adds high latency.
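The difference between a rowkey lookup and a query on a non-rowkey field can be sketched with a toy in-memory model. This is only an illustration, not the HBase client API; the table contents and the helper names `get` and `scan_where` are made up for the example.

```python
from bisect import bisect_left

# A toy stand-in for an HBase table: rows kept sorted by rowkey.
rows = sorted([
    ("user001", {"name": "alice", "city": "Hangzhou"}),
    ("user002", {"name": "bob", "city": "Beijing"}),
    ("user003", {"name": "carol", "city": "Hangzhou"}),
])
rowkeys = [key for key, _ in rows]


def get(rowkey):
    """Point lookup on the rowkey (the only index): binary search."""
    i = bisect_left(rowkeys, rowkey)
    if i < len(rows) and rowkeys[i] == rowkey:
        return rows[i][1]
    return None


def scan_where(column, value):
    """Query on a non-rowkey column: every row has to be examined."""
    hits, examined = [], 0
    for key, cols in rows:
        examined += 1
        if cols.get(column) == value:
            hits.append(key)
    return hits, examined
```

In this model `get("user002")` touches only a logarithmic number of keys, while `scan_where("city", "Hangzhou")` always examines the whole table, which is the cost a secondary index is meant to avoid.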

To make HBase queries more efficient and suitable for more scenarios, such as returning results within seconds when retrieving by non-rowkey fields, or supporting fuzzy queries and multi-field combined queries on arbitrary fields, a secondary index has to be built on top of native HBase to meet the more complex and varied business requirements found in practice. There are generally three types of schemes:

Coprocessor-based schemes on HBase (typically represented by Phoenix)

Secondary indexes developed in-house by cloud vendors (Alibaba Cloud currently offers its own enhanced secondary index)

Indexing schemes based on search platforms (e.g. Solr, Elasticsearch)
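Whatever the scheme, the underlying idea is the same: keep a second structure that maps an indexed value back to rowkeys and update it on every write. A minimal sketch of that idea, assuming a single indexed column and an in-memory store (the class `IndexedTable` is hypothetical and is not part of HBase, Phoenix, or Solr):

```python
class IndexedTable:
    """Toy table that maintains one secondary index on every put."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.data = {}   # rowkey -> {column: value}
        self.index = {}  # indexed value -> set of rowkeys

    def put(self, rowkey, columns):
        old = self.data.get(rowkey, {})
        merged = {**old, **columns}
        old_value = old.get(self.indexed_column)
        new_value = merged.get(self.indexed_column)
        if old_value != new_value:
            # Remove the stale index entry before adding the new one.
            if old_value is not None:
                self.index[old_value].discard(rowkey)
                if not self.index[old_value]:
                    del self.index[old_value]
            if new_value is not None:
                self.index.setdefault(new_value, set()).add(rowkey)
        self.data[rowkey] = merged

    def query(self, value):
        """Look up rowkeys by indexed value without scanning the data."""
        return sorted(self.index.get(value, ()))
```

For example, after `t = IndexedTable("city")`, `t.put("u1", {"city": "A"})` and `t.put("u1", {"city": "B"})`, the call `t.query("A")` returns an empty list because the stale entry was removed during the second put. The three schemes differ mainly in where this bookkeeping runs: inside HBase coprocessors, inside a vendor's HBase build, or in an external search engine.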

2. How to choose a secondary indexing scheme

We compare the three types of solutions in terms of read/write performance, usage constraints, learning cost, and community activity.

Coprocessor-based scheme on HBase (typically represented by Phoenix)

Official Document: phoenix.apache.org/secondary_indexing.html

Read/write performance: read/write performance takes some hit. The more indexes, the greater the impact on write performance.

TTL function: well supported

Index usage limit: the number of indexes on a table should not exceed 10

Index type: global index, local index, covered index

Learning cost: SQL syntax accessed through JDBC; refer to the official grammar (http://phoenix.apache.org/language/index.html)

Open source/community activity: open source; the community is currently not very active

Advantages: plenty of community documentation, easy to use, real-time queries without delay

Disadvantages: non-commercial solution, no dedicated technical support
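Two of the points above, the write cost of each extra index and the benefit of a covered index, can be sketched with a toy model. This is not the Phoenix API; the table and index names are hypothetical, and stale index rows left behind by updates are ignored for brevity. In Phoenix SQL the two indexes would roughly correspond to `CREATE INDEX idx_city ON users (city)` and `CREATE INDEX idx_city_cov ON users (city) INCLUDE (name)`.

```python
data = {}  # rowkey -> columns (the data table)

# Each index table is keyed by (indexed value, rowkey).
# "covered" lists extra columns copied into the index row.
indexes = {
    "idx_city": {"covered": (), "rows": {}},
    "idx_city_cov": {"covered": ("name",), "rows": {}},
}


def put(rowkey, columns):
    """Write the data row plus one row per index; return rows written."""
    data[rowkey] = dict(columns)
    written = 1
    for idx in indexes.values():
        key = (columns["city"], rowkey)
        idx["rows"][key] = {c: columns[c] for c in idx["covered"]}
        written += 1
    return written


def names_in_city(index_name, city):
    """Return (names, number of extra reads against the data table)."""
    names, data_reads = [], 0
    for (c, rowkey), covered in sorted(indexes[index_name]["rows"].items()):
        if c != city:
            continue
        if "name" in covered:
            names.append(covered["name"])  # answered from the index alone
        else:
            names.append(data[rowkey]["name"])  # back-lookup into the data table
            data_reads += 1
    return names, data_reads
```

In this model every `put` writes one row per index in addition to the data row, which is why more indexes slow down writes, while the covered index answers the query without reading the data table at all.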

Secondary index developed in-house by a cloud vendor (typically represented by Alibaba Cloud's enhanced secondary index)

Official Document: help.aliyun.com/document_detail/144577.html

Read/write performance: the secondary index is built into HBase; the vendor's performance evaluation documents report better performance than Phoenix

TTL function: valid only on single-column indexes

Index usage limit: a table may have at most 5 indexes, and a combined index may have at most 3 columns

Index type: global index, local index, covered index

Learning cost: the index is encapsulated internally and is used through the native HBase API

Open source/community activity: not open source; proprietary to Alibaba Cloud

Advantages: good performance, real-time queries without delay

Disadvantages: vendor lock-in
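The usage limits quoted in the list above (at most 5 indexes per table, at most 3 columns per combined index) are easy to check before committing to a schema. A small sketch, where `validate_index_plan` is a hypothetical helper and not part of any Alibaba Cloud SDK:

```python
MAX_INDEXES_PER_TABLE = 5   # limit quoted in the list above
MAX_COLUMNS_PER_INDEX = 3   # limit quoted in the list above


def validate_index_plan(index_plan):
    """index_plan maps index name -> list of indexed columns.

    Returns a list of human-readable problems (empty if the plan fits).
    """
    problems = []
    if len(index_plan) > MAX_INDEXES_PER_TABLE:
        problems.append(
            f"{len(index_plan)} indexes exceed the limit of {MAX_INDEXES_PER_TABLE}"
        )
    for name, columns in index_plan.items():
        if len(columns) > MAX_COLUMNS_PER_INDEX:
            problems.append(
                f"index {name} has {len(columns)} columns, limit is {MAX_COLUMNS_PER_INDEX}"
            )
    return problems
```

For example, `validate_index_plan({"idx_a": ["c1", "c2", "c3", "c4"]})` reports one problem, because the combined index has four columns.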

Secondary indexing scheme based on a search platform (using Solr as an example)

Official Document: solr.apache.org/guide/

Read/write performance: read/write performance takes some hit, and data synchronization delay has to be taken into account

TTL function: HBase expires individual KVs, while Solr can only expire whole Documents (each corresponding to one HBase row), so the expiration times do not match exactly

Index usage limit: no restrictions; depends entirely on Solr

Index type: very flexible

Learning cost: familiarity with Solr query syntax is required

Open source/community activity: open source with an active community

Advantages: more flexible query modes

Disadvantages: introduces a search engine component, which makes the system heavy, and brings significant latency
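The synchronization delay mentioned above comes from the fact that the search index is usually updated asynchronously after the write to HBase. A minimal sketch of that behavior, assuming a simple in-memory queue between the two stores (the class `AsyncIndexedStore` is hypothetical; a real deployment would use a replication or indexing pipeline instead):

```python
from collections import deque


class AsyncIndexedStore:
    """Toy model: writes land in 'HBase' at once, in the search index later."""

    def __init__(self):
        self.hbase = {}          # rowkey -> document, updated synchronously
        self.search_index = {}   # rowkey -> document, updated on sync()
        self.pending = deque()   # changes not yet applied to the index

    def put(self, rowkey, doc):
        self.hbase[rowkey] = doc
        self.pending.append((rowkey, doc))

    def sync(self):
        """Apply all pending changes to the search index; return how many."""
        applied = 0
        while self.pending:
            rowkey, doc = self.pending.popleft()
            self.search_index[rowkey] = doc
            applied += 1
        return applied

    def search(self, field, value):
        return sorted(
            key for key, doc in self.search_index.items()
            if doc.get(field) == value
        )
```

Between `put` and `sync`, a row is already readable by rowkey but does not yet show up in `search`. That window is the latency that makes this scheme a poor fit for strict real-time queries.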

To sum up (this technology selection strategy is especially important):

To avoid being locked in by a cloud vendor, do not adopt a vendor-specific secondary index scheme.

For scenarios with high real-time requirements and a small number of indexes, Phoenix alone is enough. It is simple and smooth to use.

For scenarios with low real-time requirements and complex search needs, a search engine such as Solr or ES has to be introduced to build the index.

In general, to meet real-time requirements, we use Phoenix.
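The selection strategy above can be written down as a small decision function. This is only a sketch of the rules as stated; the function name and the fallback for combinations the text does not cover are assumptions.

```python
def choose_scheme(needs_real_time, needs_complex_search):
    """Return the scheme the strategy above would pick.

    The cloud-vendor index is never returned, because the strategy
    rules it out to avoid vendor lock-in.
    """
    if needs_complex_search and not needs_real_time:
        return "search engine (Solr/ES)"
    # Real-time requirements take priority. Falling back to Phoenix for
    # combinations the text does not spell out is an assumption.
    return "Phoenix"
```

For instance, `choose_scheme(needs_real_time=False, needs_complex_search=True)` selects the search engine, while any real-time requirement leads to Phoenix.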

3. Learn more about Phoenix

Phoenix was created to make HBase more powerful, easier to use, and accessible to more users, so that it can help them solve the real-world problems they encounter. It does this by bringing SQL to HBase. SQL is the standard language of data processing: simple, easy to use, expressive, and widely adopted. The implementation of SQL on HBase is, of course, very different from that of a traditional single-node database; to distinguish the two, we call it NewSQL. This was also the community's original motivation for building secondary indexes on HBase: if HBase is a powerful storage engine, then with NewSQL support it becomes a far more capable, new-generation tool.

Phoenix acts as middleware between the application layer and HBase. The following features give it a distinct advantage for simple queries over large data volumes:

Secondary index support (global index + local index)

Compiles SQL into native HBase scans that can be executed in parallel

Performs computation at the data layer; aggregation is done by server-side coprocessors

Pushes WHERE filter conditions down to scan filters on the server side

Skip scan improves scanning speed
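Three of these features (parallel scans, filter push-down, and server-side aggregation) can be illustrated together with a toy model of a query such as `SELECT COUNT(*), SUM(amount) FROM t WHERE city = 'Hangzhou'`. The region layout, column names, and function names below are made up, and threads merely stand in for region servers; this is not how Phoenix is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "regions": each holds a contiguous, sorted slice of the rowkey space.
regions = [
    [("a1", {"city": "Hangzhou", "amount": 10}),
     ("a2", {"city": "Beijing", "amount": 5})],
    [("b1", {"city": "Hangzhou", "amount": 7})],
    [("c1", {"city": "Hangzhou", "amount": 3}),
     ("c2", {"city": "Shanghai", "amount": 8})],
]


def region_scan(region, city):
    """Runs 'server side': filter rows, return only a partial (count, sum)."""
    count = total = 0
    for _rowkey, cols in region:
        if cols["city"] == city:  # the pushed-down WHERE condition
            count += 1
            total += cols["amount"]
    return count, total


def count_and_sum(city):
    """Client side: one scan per region in parallel, then merge partials."""
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        partials = list(pool.map(lambda r: region_scan(r, city), regions))
    return sum(c for c, _ in partials), sum(t for _, t in partials)
```

Only one small tuple per region travels back to the client instead of every matching row, which is the point of filtering and aggregating where the data lives.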
