
What are the classic big data HBase interview questions?


In this article, the editor shares a set of classic big data HBase interview questions. Most people are not very familiar with them, so this article is offered for your reference; I hope you learn a lot from reading it. Let's get started!

1. What is HBase? What are its characteristics?

HBase is a distributed, column-oriented database: it is built on Hadoop, uses HDFS for underlying storage, and is coordinated by ZooKeeper.

HBase is suited to storing semi-structured or unstructured data, that is, data whose structure is not fixed or is too disorganized to extract into a single schema.

HBase does not store records whose values are null.

Tables are organized by rowkey, timestamp, and column family. When new data is written, a version with a fresh timestamp is created, and earlier versions remain queryable.

HBase uses a master-slave architecture: HMaster is the master node and HRegionServer is the slave node.

2. How does HBase import data?

Write data in batches through the HBase client API (see the sketch after this list)

Bulk import into an HBase cluster using the Sqoop tool

Bulk import using MapReduce

The HBase BulkLoad method
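
As a hedged illustration of the first approach, here is a minimal batch-write sketch against the HBase 1.x client API; the table name "user_orders" and column family "cf" are hypothetical, not from the article.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchPutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user_orders"))) {
                List<Put> batch = new ArrayList<>();
                for (int i = 0; i < 1000; i++) {
                    Put put = new Put(Bytes.toBytes("row-" + i));
                    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"),
                                  Bytes.toBytes(String.valueOf(i)));
                    batch.add(put);
                }
                // Send the puts as one batch instead of one call per row.
                table.put(batch);
            }
        }
    }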

3. What is the storage structure of HBase?

Each table in HBase is divided by rowkey range into multiple sub-tables (HRegions). By default, when an HRegion exceeds 256 MB it is split in two.

HRegions are managed by HRegionServers, and HMaster decides which HRegionServer each HRegion is assigned to. When an HRegionServer serves a sub-table, it creates an HRegion object, and then a Store instance for each column family of the table. Each Store has zero or more StoreFiles, and each StoreFile corresponds to an HFile, the actual storage file. Each Store also holds one MemStore instance.

4. What is the difference between HBase and Hive? What is the underlying storage for each? Why does Hive exist, and which of Hadoop's shortcomings does HBase make up for?

What they have in common:

Both HBase and Hive are built on top of Hadoop, and both use Hadoop as the underlying storage.

Differences:

Hive is a batch processing system built on Hadoop to reduce the effort of writing MapReduce jobs, while HBase is a project that makes up for Hadoop's shortcomings in real-time operation.

Imagine you are operating an RDBMS: if the workload is full-table scans, use Hive + Hadoop; if it is indexed access, use HBase + Hadoop.

Hive queries run as MapReduce jobs that can take from 5 minutes to many hours; HBase is very efficient, definitely much more efficient than Hive.

Hive itself neither stores nor computes data: its tables are purely logical definitions over files in HDFS, and it depends entirely on HDFS and MapReduce. Hive borrows Hadoop's MapReduce to execute its commands.

HBase tables are physical tables, not logical tables. HBase provides a super-large in-memory hash table that search engines use to store indexes, making query operations convenient.

HBase is column-oriented storage.

HDFS is the underlying storage layer: HDFS is the system that stores the files, and HBase is responsible for organizing them.

Hive needs HDFS to store its files and the MapReduce computing framework to run queries.

5. Explain the principle of HBase real-time queries

Real-time queries can be thought of as queries served from memory, with typical response times under 1 second. HBase's mechanism is to write data to memory first and flush it to disk once the volume reaches a threshold (such as 128 MB). In memory there are no update or merge operations, only appends, so a user's write returns as soon as it enters memory; this is what gives HBase its high write I/O performance.

6. Describe the design principles of HBase's rowkey

Considering the relationship between regions and rowkeys, the design can follow the three principles below.

Rowkey length principle

A rowkey is a binary byte stream and can be any string up to 64 KB long. In practice it is usually 10-100 bytes, stored as a byte[], and generally given a fixed length. The shorter the better, ideally no more than 16 bytes, for the following reasons:

Data is persisted in HFiles as KeyValue pairs, so if the rowkey is too long it greatly hurts HFile storage efficiency. The MemStore caches part of the data in memory; if the rowkey field is too long, effective memory utilization drops, the system can cache less data, and retrieval efficiency falls.

Rowkey hashing principle

If the rowkey increases with a timestamp, do not put the time at the front of the binary layout. It is recommended to use the high-order bytes of the rowkey as a hash field, randomly generated by the program, with the time field in the low-order bytes. This raises the probability that data is evenly distributed across the RegionServers and achieves load balancing. Without the hash field, the first field is the time itself and all data concentrates on one RegionServer; during retrieval the load then piles up on individual RegionServers, causing hotspots and reducing query efficiency.

Rowkey uniqueness principle

The rowkey must be designed to guarantee uniqueness. Rowkeys are stored sorted in lexicographic order, so take full advantage of this when designing them: store frequently read data together, and keep data that is likely to be accessed together adjacent.
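
To make the length and hashing principles concrete, here is a minimal sketch. The layout (a 4-byte hash prefix plus an 8-byte time field, 12 bytes total, within the 16-byte recommendation) is an illustrative assumption, not a schema prescribed by the article.

    import java.security.MessageDigest;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyBuilder {
        // Build a short, fixed-length rowkey: a 4-byte hash prefix in the
        // high-order position (for even distribution across RegionServers)
        // followed by the time field in the low-order position.
        public static byte[] build(String userId, long timestamp) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] hashPrefix = new byte[4];
            System.arraycopy(md5.digest(Bytes.toBytes(userId)), 0, hashPrefix, 0, 4);
            // Hash high, time low: writes stay spread out, while one user's
            // rows still sort by time within their hash bucket.
            return Bytes.add(hashPrefix, Bytes.toBytes(timestamp));
        }
    }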

7. Describe the functions of scan and get in HBase, and their similarities and differences

Get (org.apache.hadoop.hbase.client.Get) retrieves a single record by the specified rowkey. A Get can be processed in two ways: with ClosestRowBefore set, or without it, relying on the row lock; this mainly guarantees the transactionality of the row, i.e. each get targets exactly one row, which may contain many column families and columns.

Scan (org.apache.hadoop.hbase.client.Scan) retrieves a batch of records matching the specified conditions; conditional queries are implemented with scan.

Scan can be sped up (trading space for time) through the setCaching and setBatch methods.

Scan can limit the range through setStartRow and setStopRow: the range is [start, stop), with start inclusive and stop exclusive. The smaller the range, the higher the performance.

Scan can attach filters through the setFilter method, which is the basis of paging and multi-condition queries. Without a row range or filter, scan degenerates into a full-table scan that reads every row in the table. A combined get/scan sketch follows.
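
A minimal sketch of both access paths, assuming the HBase 1.x client API (where the range methods are setStartRow and setStopRow); the table name and rowkeys are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetVsScanExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user_orders"))) {

                // get: exactly one row, located directly by rowkey.
                Get get = new Get(Bytes.toBytes("row-42"));
                Result single = table.get(get);
                System.out.println(single.isEmpty() ? "miss" : "hit");

                // scan: a batch of rows in [startRow, stopRow); the narrower
                // the range, the better the performance.
                Scan scan = new Scan();
                scan.setStartRow(Bytes.toBytes("row-100"));
                scan.setStopRow(Bytes.toBytes("row-200"));
                scan.setCaching(100); // rows fetched per RPC: space for time
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }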

8. Please describe the structure of a Cell in HBase in detail

The storage unit determined by row and column in HBase is called a cell. A cell is uniquely determined by {rowkey, column (= column family + qualifier), version}; the data in a cell has no type and is stored entirely as bytes.
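
A small sketch of reading those cell coordinates through the client API's CellUtil helpers; the Result is assumed to come from a prior get or scan.

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellInspector {
        // Print every cell in a Result: the {rowkey, family:qualifier,
        // timestamp} triple that uniquely identifies the cell, plus its
        // untyped byte value.
        static void dump(Result result) {
            for (Cell cell : result.rawCells()) {
                System.out.printf("%s/%s:%s/ts=%d -> %s%n",
                    Bytes.toString(CellUtil.cloneRow(cell)),
                    Bytes.toString(CellUtil.cloneFamily(cell)),
                    Bytes.toString(CellUtil.cloneQualifier(cell)),
                    cell.getTimestamp(),
                    Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }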

9. Briefly describe compaction in HBase: when is it triggered, what are its two types, how do they differ, and what are the relevant configuration parameters?

In HBase, every flush of MemStore data to disk forms a StoreFile. When the number of StoreFiles reaches a certain level, the StoreFiles need to be compacted. Compaction serves to:

Merge files

Clear out expired and redundant versions of data

Improve the efficiency of reading and writing data

10. Compaction in HBase comes in two forms, minor and major. The difference between the two is:

A Minor compaction merges only some of the files, handles the cleanup of minVersion=0 and TTL-expired versions, and does no deletion of data or cleanup of multi-version data.

A Major compaction merges all the StoreFiles of an HStore under a region; the final result is a single sorted, merged file.
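
For the configuration parameters the question asks about, the commonly cited ones (named here from general HBase knowledge, not from this article) are hbase.hstore.compactionThreshold, the StoreFile count that triggers a minor compaction, and hbase.hregion.majorcompaction, the interval between automatic major compactions. A major compaction can also be requested manually through the Admin API; a minimal sketch with a hypothetical table name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class ManualCompaction {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // Ask each region of the table to rewrite all of its
                // StoreFiles into one, dropping deleted and expired versions.
                // The call is asynchronous; servers compact in the background.
                admin.majorCompact(TableName.valueOf("user_orders"));
            }
        }
    }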

11. What is the implementation principle of HBase filters? Based on actual project experience, give several scenarios where filters are used.

HBase provides a set of filters through which data can be filtered along multiple dimensions (row, column, data version); what a filter ultimately selects can be as fine-grained as a specific storage cell (located by rowkey, column name, and timestamp).

Typical filters include RowFilter and PrefixFilter. HBase filters are set on a scan, so they filter the scan's query results. There are many filter types, but they fall into two categories: comparative filters and special-purpose filters. A filter's job is to determine on the server side whether the data meets the conditions and return only matching data to the client. For example, in order development, we used a rowkey filter to fetch all orders of a given user; a sketch follows.
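
A minimal sketch of the two filters named above, assuming the HBase 1.x filter API; the orders-by-user example follows the text's scenario with a hypothetical key layout where the rowkey starts with the user id.

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.filter.RowFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FilterExamples {
        // All rows whose rowkey starts with a user id: the server evaluates
        // the filter and only matching rows are returned to the client.
        static Scan ordersOfUser(String userId) {
            Scan scan = new Scan();
            scan.setFilter(new PrefixFilter(Bytes.toBytes(userId)));
            return scan;
        }

        // A comparative filter: rows whose key equals one exact value.
        static Scan exactRow(byte[] rowkey) {
            Scan scan = new Scan();
            scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
                                         new BinaryComparator(rowkey)));
            return scan;
        }
    }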

12. What is the internal mechanism of HBase?

In HBase, whether you add a new row or modify an existing one, the internal process is the same: after receiving the command, HBase either persists the change or fails the write and throws an exception. By default, a write goes to two places: the write-ahead log (WAL, also known as HLog) and the MemStore. Recording the write in both places is how HBase ensures durability; the write is considered complete only when the change has been written and confirmed in both.

The MemStore is an in-memory write buffer where HBase data accumulates before being permanently written to disk. When the MemStore fills up, its data is flushed to disk, generating an HFile. HFile is the underlying storage format of HBase. HFiles correspond to column families: a column family can have multiple HFiles, but one HFile never stores data for more than one column family. Each column family of each region has its own MemStore. Hardware failures are common in large distributed systems, and HBase is no exception.

Imagine the server crashing before the MemStore is flushed: the data in memory not yet written to disk would be lost. HBase's answer is to write to the WAL before the write action completes. Every server in an HBase cluster maintains a WAL, a file on the underlying file system, to record changes. The write action is not considered successful until the new WAL record has been written. This ensures that HBase, together with the file system backing it, is durable.

In most cases, HBase uses the Hadoop Distributed File System (HDFS) as the underlying file system. If an HBase server goes down, data not yet written from MemStore to HFile can be recovered by replaying the WAL. You do not have to do this by hand; the recovery process is part of HBase's internal mechanism. Each HBase server has one WAL, shared by all the tables (and their column families) on that server. You might think skipping the WAL would improve write performance; however, disabling the WAL is not recommended unless you are willing to lose data when something goes wrong. If you want to test it, code like the sketch below can disable the WAL. Note: not writing the WAL increases the risk of data loss in the event of a RegionServer failure. With the WAL off, HBase may be unable to recover data after a failure, and all written data not yet flushed to disk will be lost.
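
A minimal sketch of what such code might look like under the HBase 1.x API, where durability is set per mutation; the row and column names are hypothetical.

    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SkipWalExample {
        static Put unsafePut() {
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
            // Skip the write-ahead log for this mutation only. If the
            // RegionServer fails before the MemStore flushes, this write is gone.
            put.setDurability(Durability.SKIP_WAL);
            return put;
        }
    }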

13. How is HBase downtime handled?

Downtime falls into two cases: HMaster downtime and HRegionServer downtime.

If an HRegionServer goes down, HMaster redistributes the regions it managed to other live RegionServers. Because the data and logs are persisted in HDFS, this operation loses no data, so the consistency and safety of the data are guaranteed.

If HMaster goes down, there is no single point of failure: multiple HMasters can be started in HBase, and ZooKeeper's Master Election mechanism ensures that one Master is always running. That is, ZooKeeper guarantees there will always be an HMaster providing service.

14. How is an HRegionServer outage handled?

ZooKeeper monitors HRegionServers going online and offline. When ZooKeeper detects that an HRegionServer is down, it notifies HMaster to perform failover.

The failed HRegionServer stops providing service, which means the regions it was responsible for are temporarily unavailable.

HMaster transfers the regions the HRegionServer was responsible for to other HRegionServers and recovers the MemStore data on the failed server that had not yet been persisted to disk.

This recovery is done by WAL replay, as follows:

The WAL is actually a file, stored under the /hbase/WAL/ path corresponding to each RegionServer. When an outage occurs, the WAL file under the failed RegionServer's path is read and split by region into separate temporary files named recover.edits.

When a region is assigned to a new RegionServer, the RegionServer checks whether the region has a recover.edits file and, if so, replays it to restore the data.

15. HBase data write and read process

Getting the region location

Both writing and reading data generally begin by locating the relevant HBase region. The approximate steps are:

Get the location of the -ROOT- table from ZooKeeper; it is stored at /hbase/root-region-server in ZooKeeper.

From the information in the -ROOT- table, get the location of the .META. table.

The .META. table stores the storage location of every region.

Writing data into an HBase table

The cache in HBase has two layers: MemStore and BlockCache.

Data is first written to the WAL file so that it is not lost.

The data is then inserted into the MemStore cache; when the MemStore reaches its configured size threshold, a flush is performed.

During the flush, the storage location of each region must be obtained.

Reading data from HBase

The BlockCache mainly serves reads. A read request first checks the MemStore; if the data is not found there, it checks the BlockCache; if it is still not found, it reads from disk and puts the result into the BlockCache.

The BlockCache uses the LRU (least recently used) algorithm, so when it reaches its upper limit it starts evicting the oldest batch of data.

A RegionServer has one BlockCache and N MemStores, and the sum of their sizes must not be greater than or equal to heapsize * 0.8, otherwise HBase cannot start. The defaults are 0.2 for BlockCache and 0.4 for MemStore. For systems that focus on read response time, set the BlockCache larger, for example BlockCache = 0.4 and MemStore = 0.39, to increase the cache hit ratio.

16. HBase optimization methods

Optimization mainly covers the following four aspects.

(1) Reduce adjustments

How should "reducing adjustments" be understood? Several things in HBase adjust themselves dynamically, such as regions (partitions) and HFiles, and there are ways to reduce the I/O overhead these adjustments cause. Region: if there are no pre-built partitions, regions keep splitting as their number grows, which increases I/O overhead; the solution is to pre-build partitions according to your RowKey design, reducing dynamic region splitting (a sketch follows).
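
A minimal pre-splitting sketch, assuming the HBase 1.x Admin API; the table name and the split points (which should mirror your rowkey's hash-prefix distribution) are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                HTableDescriptor desc =
                    new HTableDescriptor(TableName.valueOf("user_orders"));
                desc.addFamily(new HColumnDescriptor("cf"));
                // Split points chosen to match the rowkey's hash prefix, so
                // each region owns its own slice of the keyspace and no
                // dynamic splitting is needed during the initial load.
                byte[][] splitKeys = {
                    Bytes.toBytes("4"), Bytes.toBytes("8"), Bytes.toBytes("c")
                };
                admin.createTable(desc, splitKeys);
            }
        }
    }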

HFile

HFile is the underlying data storage file. Each MemStore flush generates an HFile, and when the HFiles grow numerous enough, the HFiles belonging to one region are merged. This step incurs overhead but is unavoidable. However, if a merged HFile grows larger than the configured limit, it will split again. To reduce such unnecessary I/O overhead, it is recommended to estimate the project's data volume and set an appropriately large value.

(2) Reduce starts and stops

Database transaction mechanisms exist to batch writes and reduce the overhead of repeatedly opening and closing the database; HBase likewise has problems caused by frequent "open and close" operations.

Turn off automatic compaction and run compactions manually during off-peak hours.

HBase has Minor Compaction and Major Compaction, i.e. merging HFiles. Merging means I/O reads and writes, and a large number of HFile merges inevitably brings I/O overhead, even I/O storms. To avoid such uncontrolled incidents, it is recommended to turn off automatic compaction and compact manually during off-peak hours.

Use BulkLoad to write batch data.

Writing a large amount of data through puts in the HBase shell or the Java API is certain to perform poorly and may cause unexpected problems, so BulkLoad is recommended whenever a large volume of offline data needs to be written (a sketch follows).
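
A minimal BulkLoad sketch, assuming HBase 1.x and HFiles already produced by a MapReduce job (for example via HFileOutputFormat2); the table name and the HFile directory /tmp/hfiles are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class BulkLoadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            TableName name = TableName.valueOf("user_orders");
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin();
                 Table table = conn.getTable(name);
                 RegionLocator locator = conn.getRegionLocator(name)) {
                // Hand the pre-built HFiles to the RegionServers; they are
                // moved into place rather than replayed as individual puts.
                new LoadIncrementalHFiles(conf)
                    .doBulkLoad(new Path("/tmp/hfiles"), admin, table, locator);
            }
        }
    }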

(3) Reduce the amount of data

Although we are doing big data, why not reduce the amount of data where we can while still ensuring its accuracy?

Turn on filtering to improve query speed

Enable the BloomFilter. The BloomFilter is a column-family-level filter: when a StoreFile is generated, a MetaBlock is generated along with it, which is used to filter data at query time.

Use compression: Snappy and LZO compression are generally recommended (a sketch follows).
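
A minimal sketch of enabling both on a column family descriptor, assuming the HBase 1.x API and that the Snappy codec is installed on the cluster; the family name is hypothetical.

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.io.compress.Compression;
    import org.apache.hadoop.hbase.regionserver.BloomType;

    public class FamilyTuning {
        static HColumnDescriptor tunedFamily() {
            HColumnDescriptor cf = new HColumnDescriptor("cf");
            // Row-level Bloom filter: each StoreFile carries a MetaBlock that
            // lets reads skip files that cannot contain the requested rowkey.
            cf.setBloomFilterType(BloomType.ROW);
            // Compress StoreFiles to cut disk usage and I/O volume.
            cf.setCompressionType(Compression.Algorithm.SNAPPY);
            return cf;
        }
    }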

(4) Reasonable design

In an HBase table, the design of the RowKey and the ColumnFamily matters a great deal. Good design improves performance and ensures the accuracy of the data.

RowKey design should have the following properties:

Hashing: hashing keeps identical or similar rowkeys together while dispersing different ones, which benefits queries.

Brevity: the rowkey is stored in the HFile as part of every key; a rowkey designed too long for the sake of readability increases storage pressure.

Uniqueness: rowkeys must be clearly distinguishable.

Practicality, for example:

If there are many query conditions and they are not column conditions, the rowkey design should support multi-condition queries.

If queries need the most recently inserted data first, the rowkey can use Long.MAX_VALUE - timestamp, so that rowkeys sort in descending time order (see the sketch below).
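
A one-method sketch of that reverse-timestamp trick, assuming HBase's Bytes utility:

    import org.apache.hadoop.hbase.util.Bytes;

    public class ReverseTimestampKey {
        // Long.MAX_VALUE - timestamp makes newer writes sort first in the
        // lexicographic rowkey order, so a plain scan returns recent data first.
        static byte[] recentFirstKey(long timestamp) {
            return Bytes.toBytes(Long.MAX_VALUE - timestamp);
        }
    }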

Advantages of a multi-column-family design:

Since HBase stores data by column family, querying the columns of one column family does not require scanning the others; only that column family needs to be scanned, which reduces read I/O. In practice, however, the reduction from a multi-column-family design is not very pronounced, so it suits read-heavy, write-light scenarios.

Disadvantages:

It reduces write I/O performance. The reason: after data is written to a store, it is first cached in the MemStore, and the same region contains multiple stores, each with its own MemStore; when one MemStore is actually flushed, the MemStores of all stores belonging to the same region are flushed with it, which increases the cost of flushing.

That is all of the article "What are the classic big data HBase interview questions". Thank you for reading! I believe everyone now has some understanding, and I hope the content shared here helps you; if you want to learn more, welcome to follow the industry information channel!
