
Why using too many column families in HBase is not recommended

Updated 2025-04-13. Source: SLTechnology News&Howtos (shulou)


This article explains in detail why using too many column families in HBase is not recommended.

An HBase table contains one or more column families. The official HBase documentation makes two statements about the number of column families: "A typical schema has between 1 and 3 column families per table. HBase tables should not be designed to mimic RDBMS tables." and "HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low."

Both statements say the same thing: keep the number of column families in each table between 1 and 3. HBase itself imposes no limit on the number of column families, so why does the documentation recommend 1 to 3? The sections below explain the reasons from several angles.

The influence of the number of column families on Flush

In HBase, data inserted into a table through the API is first written to a MemStore, which is an in-memory structure. Each column family has its own MemStore (and zero or more HFiles). If a table has two column families, the corresponding Region holds two MemStores.

The more column families a table has, the more MemStores exist in memory. When certain conditions are met, the data held in a MemStore is flushed, and each flush produces an HFile on disk.

As a result, more column families mean more HFiles persisted to disk. What is more, in the HBase versions this article describes, flushing is done at the Region level: when one MemStore in a Region is flushed, the other MemStores in the same Region are flushed as well. When a table has many column families and the data is unevenly distributed among them, for example one column family with 1 million rows and another with only 10 rows, this produces a large number of files on disk, many of them very small, and every flush also costs some I/O.
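The effect described above can be sketched with a toy model. This is not HBase code: the function name, the 128 MB threshold, and the per-MemStore trigger are assumptions made for illustration, following the article's description that a flush of one MemStore flushes every MemStore in the Region.

```python
# Toy model of a Region-level flush; not HBase code.
# Assumption: when any single MemStore reaches the threshold, every
# non-empty MemStore in the Region is flushed to its own HFile.

FLUSH_THRESHOLD_MB = 128  # hypothetical threshold, chosen for illustration


def write_and_flush(writes, families, threshold=FLUSH_THRESHOLD_MB):
    """Apply (family, size_mb) writes; return the HFile sizes produced per family."""
    memstores = {cf: 0 for cf in families}
    hfiles = {cf: [] for cf in families}
    for cf, size_mb in writes:
        memstores[cf] += size_mb
        if memstores[cf] >= threshold:
            # Region-level flush: all MemStores are written out together.
            for name, used in memstores.items():
                if used > 0:
                    hfiles[name].append(used)
                    memstores[name] = 0
    return hfiles


# "big" receives 64 MB per write, "tiny" receives 1 MB per write.
writes = [("big", 64), ("tiny", 1)] * 4
result = write_and_flush(writes, ["big", "tiny"])
print(result)
```

In this sketch the small column family never comes close to the threshold on its own, yet it still gets an HFile every time the large one flushes, which is how the many small files arise.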

The influence of the number of column families on Split

When a Region in an HBase table grows too large, it is split in two. The threshold is configured by hbase.hregion.max.filesize, but a split is not triggered by the total size of the Region exceeding that value: it is triggered when the largest Store/StoreFile in the Region is larger than hbase.hregion.max.filesize. If a table has many column families and the amount of data varies greatly among them, for example some column families have 1 million rows while others have only 10 rows, then a Region split further divides HFiles that already hold very little data, producing even more small files. Note that a Region split applies to all column families, so that all data of a given row stays in the same Region after the split.
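The split rule above can be illustrated with a small sketch. It is a simplified model, not HBase's split policy code: the function names and the 10 GB threshold are hypothetical, and the model assumes each Store's data divides evenly at the split point.

```python
# Toy model of the split trigger and its effect on every column family.
# Assumptions: sizes are in MB, and each Store divides evenly at the split point.

MAX_FILESIZE_MB = 10240  # hypothetical value for hbase.hregion.max.filesize


def should_split(store_sizes_mb, max_filesize_mb=MAX_FILESIZE_MB):
    """A split is triggered by the largest Store, not by the Region total."""
    return max(store_sizes_mb.values()) > max_filesize_mb


def split_region(store_sizes_mb):
    """Both daughter Regions keep every column family, so every Store is divided."""
    left = {cf: size / 2 for cf, size in store_sizes_mb.items()}
    right = {cf: size - left[cf] for cf, size in store_sizes_mb.items()}
    return left, right


region = {"big": 12000.0, "tiny": 0.5}
if should_split(region):
    left, right = split_region(region)
    print(left, right)
```

Here the 0.5 MB Store is divided into two 0.25 MB pieces only because the large Store crossed the threshold. A Region with two 6000 MB Stores would not split in this model, even though its total size is the same as the large Store above.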

The influence of the number of column families on Compaction

Like flushes, compactions in the HBase versions this article describes operate at the Region level, so too many column families can lead to unnecessary I/O.

The influence of the number of column families on HDFS

HDFS limits the number of items in a single directory (dfs.namenode.fs-limits.max-directory-items). With N column families and M Regions, persisting the table to HDFS produces at least N*M files. Each column family usually has more than one underlying HFile; assuming K HFiles per column family, the table ends up with N*M*K files in HDFS, which may exceed the HDFS limit.
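The N*M*K estimate is simple arithmetic, sketched below. The function name and the sample numbers are hypothetical, and the limit value is an assumption about the default of dfs.namenode.fs-limits.max-directory-items; check the value configured on your cluster.

```python
# Rough estimate of the number of HFiles a table produces on HDFS.
# Assumption: every column family in every Region holds the same number of HFiles.

ASSUMED_MAX_DIRECTORY_ITEMS = 1_048_576  # assumed default; verify on your cluster


def estimated_hfile_count(column_families, regions, hfiles_per_store):
    """N column families * M Regions * K HFiles per column family."""
    return column_families * regions * hfiles_per_store


few = estimated_hfile_count(column_families=3, regions=1000, hfiles_per_store=5)
many = estimated_hfile_count(column_families=30, regions=1000, hfiles_per_store=5)
print(few, many)
```

With the same number of Regions and HFiles per Store, the file count grows linearly with the number of column families, so ten times as many column families means ten times as many files.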

The influence of the number of column families on RegionServer memory

As mentioned earlier, each column family corresponds to one MemStore in the RegionServer. Since version 0.90.1, HBase has provided MSLAB (MemStore-Local Allocation Buffers, see HBASE-3455), which is enabled by default (controlled by hbase.hregion.memstore.mslab.enabled). With MSLAB, each MemStore takes up a 2 MB buffer in memory (configured through hbase.hregion.memstore.mslab.chunksize). With many column families, the MemStore buffers alone take up a lot of memory.
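A back-of-the-envelope calculation shows how this adds up. The function name and the sample Region counts are hypothetical, and the sketch assumes every MemStore holds exactly one 2 MB MSLAB chunk, which is a lower bound once each MemStore has received data.

```python
# Lower-bound estimate of MSLAB buffer memory on one RegionServer.
# Assumption: each MemStore (one per column family per Region) holds one chunk.

MSLAB_CHUNK_MB = 2  # chunk size stated in the article


def min_mslab_memory_mb(regions, column_families, chunk_mb=MSLAB_CHUNK_MB):
    """Regions * column families gives the MemStore count; each holds one chunk."""
    return regions * column_families * chunk_mb


two_families = min_mslab_memory_mb(regions=500, column_families=2)
twenty_families = min_mslab_memory_mb(regions=500, column_families=20)
print(two_families, twenty_families)
```

Under these assumptions, a RegionServer hosting 500 Regions reserves about 2 GB for MSLAB chunks with 2 column families, and about 20 GB with 20 column families.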

Suggestions on setting the number of column families

Before defining column families, think about whether the columns really need to be placed in different column families. If it is not necessary, use a single column family. If you do need multiple column families but their data volumes differ greatly, for example 10 million rows in one versus 100 rows in another, consider storing the relatively small column families in a separate table.

That concludes the discussion of why using too many column families in HBase is not recommended. I hope the content above is of some help to you.
