2025-03-29 Update — From: SLTechnology News & Howtos > Servers
How should we analyze the principles of HBase, its basic mode of operation, and its optimization? This article walks through that question in detail, hoping to help readers who want an answer find a simple and practical approach.
HBase is a distributed, column-oriented storage system built on HDFS.
HBase follows Google's Bigtable model and is essentially a key/value store.
HBase is an important member of the Apache Hadoop ecosystem and is mainly used to store massive amounts of unstructured data.
Logically, HBase organizes data by table, row, and column.
Like Hadoop, HBase relies mainly on scale-out: computing and storage capacity are increased by adding cheap commodity servers.
In short, HBase is an HDFS-based column-oriented database.
Characteristics of HBase:
Big table: a single table can have billions of rows and millions of columns.
Column-oriented: storage and permission control are organized by column (family), and column families are retrieved independently.
Sparse: null columns take up no storage space, so tables can be designed to be very sparse.
Multiple data versions: each cell can hold multiple versions of the data. By default the version number is assigned automatically and is the timestamp at which the cell was written. (There is therefore no in-place modification in HBase: a "modification" just inserts new data under a newer timestamp, and subsequent queries return the new value.)
Single data type: data in HBase is stored as strings (byte arrays) and has no type.
Note on strings: the data best suited to HBase is very sparse data (unstructured or semi-structured). HBase is good at storing this kind of data because it uses a column-oriented storage mechanism, whereas the familiar RDBMS uses a row-oriented one.
Structured data: well-organized information, such as what the databases we usually deal with manage — records of production, business, transactions, customer information, and so on.
Unstructured data: data without a fixed schema, including office documents of all formats, plain text, pictures, XML, HTML, various reports, images, and audio/video.
For example, at large video sites such as iQiyi, Sohu, Tencent Video, and Youku, most of the resources are unstructured data.
The HBase storage model:
The basic elements of HBase:
Tables, rows, columns, cells: the basic elements of a table.
Key: the key of a row, i.e. the element that uniquely identifies it. Rows in a table are sorted by key, and access to the table is also by key.
Column families: all members of a column family share the same prefix. Column family members normally need to be defined in advance, but new members can also be appended directly. Members of a column family are stored together; HBase's column-oriented storage is really column-family-oriented, and both data storage and tuning happen at this level. An HBase table is similar to an RDBMS table: rows are sorted, and the client can add columns to a column family.
Cell: a cell holds an indivisible byte array, and each cell carries version information. HBase sorts versions in reverse order (newest first).
Region: a horizontal partition of a table and the smallest unit by which an HBase cluster distributes data. All of the online regions together make up the table's contents.
How HBase stores data:
Automatic partitioning (very similar to HDFS in Hadoop):
1. A table in HBase is divided into many regions, which can be expanded dynamically to keep the whole system load-balanced.
2. When a region reaches its size limit, it automatically splits into two roughly equal regions (via HBase's split and compaction mechanisms).
3. Each region is managed by one RegionServer, and a RegionServer can manage multiple regions.
4. It is appropriate for a RegionServer to manage 100-1000 regions, and a region is generally 1-20 GB.
Table design optimization:
HBase is a highly reliable, high-performance, column-oriented, scalable distributed database, but when concurrency is too high or the existing data volume is too large, read and write performance declines. The following techniques can progressively improve HBase's retrieval speed.
Pre-splitting
By default, a single region is created when an HBase table is created, and all clients write to that region until it grows large enough to split. One way to speed up batch writes is to create some empty regions in advance, so that when data is written, HBase load-balances the writes across the cluster according to the region boundaries.
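As a minimal sketch of the idea (not code from the original article): if rowkeys start with a fixed-width hex prefix, evenly spaced split keys for the empty regions can be computed like this. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitKeys {
    // Compute numRegions - 1 evenly spaced split points over a 4-digit
    // hex prefix space ("0000".."ffff"); creating a table with these
    // split keys pre-creates numRegions empty regions.
    static List<String> hexSplits(int numRegions) {
        List<String> splits = new ArrayList<>();
        int keySpace = 0x10000; // 4 hex digits
        for (int i = 1; i < numRegions; i++) {
            splits.add(String.format("%04x", i * keySpace / numRegions));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 4 regions need 3 split points
        System.out.println(hexSplits(4)); // [4000, 8000, c000]
    }
}
```

The resulting strings would then be converted to byte arrays and passed as the split keys when creating the table (for example via HBaseAdmin.createTable with split keys in the Java API).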
Rowkey optimization
Rowkeys in HBase are stored in lexicographic order, so when designing a rowkey, make full use of this sorting: store data that is often read together next to each other, and keep data that is likely to be accessed soon close together.
In addition, if rowkeys are generated incrementally, it is recommended not to write them directly in ascending order, but to reverse the rowkey first, so that rowkeys end up distributed roughly evenly. The benefit of this design is that load is balanced across RegionServers; otherwise all new data accumulates on a single RegionServer. This can also be combined with pre-splitting of the table.
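The reversal itself is trivial; a minimal sketch (the class name and sample key are illustrative):

```java
public class RowkeyUtil {
    // Reverse a monotonically increasing rowkey (e.g. a timestamp-based
    // id) so that consecutive writes land on different regions instead
    // of all piling up on the newest one.
    static String reverseKey(String rowkey) {
        return new StringBuilder(rowkey).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseKey("20240531120000")); // 00002113504202
    }
}
```

Note that reversing destroys the original sort order, so it suits write-heavy workloads where range scans over the original key order are not needed.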
Reduce the number of column families
Do not define too many column families in one table. HBase currently does not handle tables with more than two or three column families well: when one column family flushes, its neighboring column families are triggered to flush as well by association, which ultimately causes more I/O in the system.
Caching Policy (setCaching)
When creating a table, you can use HColumnDescriptor.setInMemory(true) to keep the table in the RegionServer cache, so that reads are more likely to hit the cache.
Set a storage lifetime
When creating a table, you can set the time-to-live of the data in the table via HColumnDescriptor.setTimeToLive(int timeToLive); expired data is deleted automatically.
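For reference, the same two column-family properties (in-memory caching and TTL, in seconds) can also be set from the HBase shell when creating a table; the table and family names here are examples, and 604800 seconds is 7 days:

```
create 'mytable', {NAME => 'cf', IN_MEMORY => 'true', TTL => '604800'}
```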
Hard disk configuration
Suppose each RegionServer manages 100-1000 regions of 1-2 GB each; then each server needs at least 100 GB, and at most 1000 × 2 GB = 2 TB. Accounting for 3 replicas, 6 TB is required. One scheme is to use three 2 TB hard drives; another is to use twelve 500 GB disks. When bandwidth is sufficient, the latter provides greater throughput, finer-grained redundant backup, and faster recovery from a single-disk failure.
Allocate appropriate memory to the RegionServer service
Without affecting other services, the more the better. For example, add export HBASE_REGIONSERVER_OPTS="-Xmx16000m $HBASE_REGIONSERVER_OPTS" at the end of hbase-env.sh under HBase's conf directory, where 16000m is the amount of memory allocated to the RegionServer.
Number of replicas for written data
The replica count is proportional to read performance and inversely proportional to write performance, and it also affects availability. There are two ways to configure it. The first is to copy hdfs-site.xml into HBase's conf directory and then add or modify the dfs.replication configuration item to the desired replica count; this change takes effect for all HBase user tables. The second is to modify HBase's code so that HBase supports setting the replica count per column family, and to set it when creating the table; the default is 3, and such a replica count applies only to the configured column families.
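For the first method, the entry added to the hdfs-site.xml copied into HBase's conf directory would look like this (2 replicas is just an example value):

```xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```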
WAL (write-ahead log)
A switch can be set so that HBase does not write the log before writing data. It is on by default; turning it off improves write performance, but if the system fails (the RegionServer responsible for the insert dies), data may be lost. To configure the WAL when writing through the Java API, set the WAL flag on the Put instance by calling Put.setWriteToWAL(boolean).
Batch writes
HBase's Put supports both single-row and batch inserts. Generally, batch writes are faster and save network overhead. When using the Java API, the client first collects the Puts into a list and then calls HTable's put(List<Put>) method to write them in one batch.
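The pattern can be sketched without an HBase cluster; this illustrative class counts round trips rather than performing real Puts (a real client would call table.put(buffer) where the flush happens):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchWriter {
    // Illustrates the batching pattern behind HTable.put(List<Put>):
    // buffer writes client-side and flush one batch per round trip.
    // Here a "flush" just records the batch size.
    static List<Integer> writeBatched(int totalPuts, int batchSize) {
        List<Integer> flushedBatchSizes = new ArrayList<>();
        int buffered = 0;
        for (int i = 0; i < totalPuts; i++) {
            buffered++;
            if (buffered == batchSize) {
                flushedBatchSizes.add(buffered);
                buffered = 0;
            }
        }
        if (buffered > 0) {
            flushedBatchSizes.add(buffered); // final partial batch
        }
        return flushedBatchSizes;
    }

    public static void main(String[] args) {
        // 2500 puts in batches of 1000: 3 round trips instead of 2500
        System.out.println(writeBatched(2500, 1000)); // [1000, 1000, 500]
    }
}
```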
How much data the client pulls from the server at a time
The time the client spends fetching data can be reduced by configuring it to pull a large amount at once, at the cost of client memory. There are three places to configure this:
1) set hbase.client.scanner.caching in HBase's configuration file;
2) call HTable.setScannerCaching(int scannerCaching);
3) call Scan.setCaching(int caching).
The priority of the three increases in that order: later settings override earlier ones.
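For the first of the three, the entry in hbase-site.xml might look like this (500 rows per fetch is an example value, not a recommendation from the article):

```xml
<property>
  <name>hbase.client.scanner.caching</name>
  <value>500</value>
</property>
```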
Number of request-handling IO threads on the RegionServer
Fewer IO threads suit scenarios where a single request consumes a lot of memory (a large single Put, or a Scan configured with a large cache — both are "big put" scenarios), or where RegionServer memory is tight.
More IO threads suit scenarios with low per-request memory consumption and very high TPS (transactions per second) requirements. When setting this value, use memory monitoring as the main reference.
The configuration item in hbase-site.xml is hbase.regionserver.handler.count.
Region size setting
The configuration item is hbase.hregion.max.filesize in hbase-site.xml; the default size is 256 MB.
This is the maximum storage size of a single region on a RegionServer; when a region exceeds this value, it is automatically split into smaller regions. Small regions are friendly to split and compaction, because splitting the StoreFiles in a region, or compacting a small region, is fast and has a low memory footprint. The disadvantage is that split and compaction become very frequent; in particular, a large number of small regions constantly splitting and compacting causes large fluctuations in cluster response time. Having too many regions is also troublesome to manage and can even trigger HBase bugs. Generally, regions under 512 MB are considered small. Large regions, by contrast, are not suited to frequent split and compaction, because a single compact or split causes a long pause, which badly affects the application's read/write performance.
In addition, a large region means large StoreFiles, and compaction then becomes a memory challenge as well. If your application's traffic is low at certain times, doing compaction and splits then lets them complete successfully while keeping read/write performance smooth most of the time. Compaction is unavoidable, but splits can be changed from automatic to manual. You can indirectly disable automatic splits by raising this parameter to a value that is hard to reach, such as 100 GB (the RegionServer will not split regions that have not reached 100 GB), and then use the RegionSplitter tool to split manually when needed. Manual splits are much more flexible and stable than automatic ones, the management cost does not rise much, and this approach is recommended for online real-time systems. In terms of memory, small regions allow flexible memstore sizing, whereas with a large region the memstore must be neither too big nor too small: too big and frequent flushes increase the application's I/O wait; too small and the excess of StoreFiles hurts read performance.
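Following the suggestion above, raising the limit in hbase-site.xml to an effectively unreachable 100 GB would look like this (107374182400 bytes = 100 × 1024³):

```xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value>
</property>
```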
That concludes this look at HBase's principles, basic mode of operation, and optimizations. I hope the above content is of some help; if you still have unresolved questions, you can follow the industry information channel to learn more.