This article mainly introduces the difference between HBase and relational databases, a question many people have doubts about in daily work. The sections below collect simple, practical explanations of HBase's design ideas, architecture, read and write paths, tuning tips and common interview questions; hopefully they help answer that question. Let's get started.
1 Analysis of HBase
1.1 What is HBase
HBase is a column-oriented NoSQL database for storing and processing large amounts of data. Its theoretical prototype is Google's BigTable paper. You can think of HBase as a highly reliable, high-performance, column-oriented, scalable distributed storage system.
The storage of HBase is based on HDFS, which is highly fault-tolerant and designed to run on low-cost hardware. Being built on Hadoop means HBase is born with excellent scalability and throughput.
HBase uses a key/value storage model, which means query performance hardly degrades even as the data volume grows. HBase is also a column-oriented store: when a table has many fields, some of them can be placed on one group of machines and the rest on another, which spreads the load. The price of this more complex, distributed storage structure is that even a small amount of data is not retrieved especially quickly.
So HBase is not particularly fast for a single lookup, but that is barely noticeable once the data volume is huge. HBase is mainly used in the following two situations:
A single table holds tens of millions of rows or more and the concurrency is very high.
The need for data analysis is weak, or the analysis does not need to be especially real-time or flexible.
1.2 The Origin of HBase
We know that MySQL is a relational database, and for most people it is the first database they come into contact with. However, MySQL hits performance bottlenecks quickly: as a rule of thumb, a single table should not exceed about 5 million rows or 2 GB in size.
Take the core user table of an Internet company as an example. Once the data volume reaches tens of millions or even hundreds of millions of rows, you can speed up queries with various optimizations, but retrieving a single record still takes longer than you would expect. Take a look at this User table:
If we query the name for the row with id=1, the system returns aa. But because MySQL stores data row by row, looking up name reads the whole row, so age, email and every other column come along with it. With many columns, you can imagine what that does to query efficiency.
We call a table with too many columns a wide table, and the optimization method is to split the columns vertically:
Now, to look up name you only need to query the user_basic table; there are no extra fields, and the query is fast. If a table has too many rows, that also hurts query efficiency. We call such a table a tall table, and we can improve things by splitting it horizontally:
A typical scenario for horizontal splitting is a log table: a large amount of log data is produced every day, and it can be split horizontally by month or by day, so the tall table becomes several shorter ones.
The splits above seem to solve the wide-table and tall-table problems, but what if the business changes one day? For example, WeChat did not exist before, and now you need to add a WeChat field for every user, which means changing the table structure. The simplest idea is to add one more column, like this:
But not every user has a WeChat account, so you have to weigh whether the WeChat column gets a default value or is handled some other way. If you need to add many such columns, and not all users have those attributes, the extension gets even more complicated. At that point you can pack the optional attributes into a JSON-formatted string in a single field, so the attribute set can be extended dynamically, which leads to the following approach:
At this point you may feel this way of storing data works well enough, so why bother with HBase? Because MySQL has a fatal weakness: once the data reaches a certain threshold, no amount of optimization yields high performance. And data in the big data field is often at the PB level, which this kind of storage clearly cannot satisfy. HBase has good answers to all of the problems above.
1.3 Design Ideas of HBase
To recap, we had several problems: tall tables, wide tables, and dynamically extended columns; and several techniques: horizontal splitting, vertical splitting, and column extension. HBase essentially mixes these ideas together.
Suppose there is a table, and you worry that it will become too wide, too tall, and need dynamically added columns. Then at design time you split the table apart up front and store the dynamically extended columns directly in a JSON-like format:
This solves the wide-table and column-extension problems. What about the tall table? Split the table into partitions by row, with each partition served independently:
Having dealt with tall tables, wide tables and dynamic columns, you find that with a large data volume things are still not fast enough. So add a cache: keep queried data in the cache and serve the next read directly from it. What about inserts? Similarly, put the data to be inserted into the cache and move on; the storage layer then takes the data from the cache and writes it to disk. The program no longer has to wait for the insert to complete, which improves parallel throughput.
But what if the server crashes before the cached data has been written out, and data is lost? Borrowing from Redis's persistence strategy, you can add an operation log for inserts: the log persists every insert operation, and after a crash the system replays the log on restart to recover. So the design now looks like this:
This is roughly the idea behind HBase's implementation. Next, let's formally analyze HBase's design.
2 Introduction to HBase
HBase official website: http://hbase.apache.org
2.1 HBase Features
Mass storage
HBase is suitable for storing huge amounts of data at the PB level, and it can still return results within tens to hundreds of milliseconds.
Column storage
HBase stores data by column family. A column family can contain many columns, and the column families must be specified when the table is created.
High concurrency
Under high concurrency, the latency of a single IO in HBase does not increase much, so it can provide a high-concurrency, low-latency service.
Sparsity
HBase columns are flexible: within a column family you can have as many columns as you like, and columns with no data take up no storage space.
Very easy to expand
Compute-side scaling is based on RegionServers: adding RegionServer machines horizontally increases HBase's upper-layer processing capacity and lets the cluster serve more Regions.
Storage-side scaling relies on HDFS.
2.2 HBase Logical Structure
At the logical level, HBase's storage model looks like this:
Table:
A table consists of one or more column families. Data attributes such as name, age, and TTL (time-to-live) are defined on the column families. A table with only its column families defined is an empty table; it holds no data until rows are added.
Column:
Every column in HBase is qualified by a Column Family and a Column Qualifier, for example info:name or info:age. When creating a table you only specify the column families; column qualifiers do not need to be defined in advance.
Column Family:
Multiple columns are grouped into a column family. You do not create columns when building a table; columns can be added or removed freely in HBase. The only thing that must be settled up front is the set of column families, which are defined once when the table is created (as sketched below). Many table attributes, such as data expiration, block caching, and compression, are defined at the column-family level.
HBase tries to keep the columns of the same column family on the same machine.
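To make the "only column families are fixed up front" point concrete, here is a minimal sketch using the HBase 2.x Java client. The table name user, the family name info, and the TTL/version settings are illustrative assumptions, not values from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateUserTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Only the column family is declared; individual columns (qualifiers)
            // are created implicitly the first time data is written to them.
            TableDescriptor table = TableDescriptorBuilder.newBuilder(TableName.valueOf("user"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("info"))
                            .setMaxVersions(3)            // keep up to 3 versions per cell
                            .setTimeToLive(7 * 24 * 3600) // expire data after 7 days (seconds)
                            .build())
                    .build();
            if (!admin.tableExists(table.getTableName())) {
                admin.createTable(table);
            }
        }
    }
}
```

Individual qualifiers such as info:name are simply written later; nothing about them appears in the table definition.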
Row:
A row contains multiple columns, organized by column family; the column family of each value in the row must be one of the families defined on the table. Because HBase is a column-oriented database, the data of a single row can be distributed across different servers.
RowKey (row key):
The RowKey is similar to the primary key in MySQL. In HBase a RowKey must exist, and rows are sorted by RowKey in lexicographic (dictionary) order. If the user does not specify a RowKey, the system generates a unique string automatically. Data can only be retrieved by RowKey, so the RowKey design of a table is very important (see the sketch below).
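As a sketch of "retrieval happens by RowKey", the following HBase 2.x Java client snippet does a point Get and a lexicographic range Scan; the table, family and key names are placeholders.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyLookups {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {

            // Point lookup: fetch exactly one row by its RowKey.
            Result row = table.get(new Get(Bytes.toBytes("user_00001")));
            byte[] name = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + (name == null ? "<none>" : Bytes.toString(name)));

            // Range scan: RowKeys are stored in lexicographic order, so a
            // [startRow, stopRow) scan returns a contiguous slice of the table.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user_00001"))
                    .withStopRow(Bytes.toBytes("user_00100"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```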
Region:
A Region is a collection of rows. Regions in HBase are split dynamically according to the amount of data. Region storage is implemented on top of HDFS, and Region access operations go through the HDFS client. The data for a given row key is never split across multiple RegionServers.
A Region is a bit like a partition of a relational table. The data lives inside Regions, which in turn contain further structures: concretely, the data sits in a MemStore and in HFiles. When accessing HBase, the client first consults the HBase system table to find which Region the record belongs to, then locates the server hosting that Region, and finally reads the data from the corresponding Region on that server.
RegionServer:
A RegionServer is the container in which Regions live; intuitively, it is a service running on a server machine, responsible for managing and maintaining its Regions.
2.3 HBase Physical Storage
The above covers only the basic logical structure; the underlying physical storage structure is the real focus. See the figure below.
NameSpace:
A namespace is similar to a database in a relational DBMS; each namespace contains multiple tables. HBase has two built-in namespaces, hbase and default: hbase holds HBase's internal system tables, and default is the default namespace for user tables.
TimeStamp:
The timestamp identifies different versions of the same piece of data. If no timestamp is specified when data is written, HBase automatically records the write time. When reading, HBase generally returns only the data whose Type matches and whose timestamp is the newest. Reads have to match on Type because HBase's underlying HDFS supports create, append, delete and read, but not in-place modification.
Cell:
A cell is uniquely determined by {rowkey, column family:column qualifier, timestamp}. The data in a cell has no type; everything is stored as raw bytes.
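A hedged sketch of how cells and timestamps surface in the HBase 2.x Java client: asking a Get for several versions returns one Cell per version, each carrying its own timestamp. Table and column names are placeholders.

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadVersions {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {

            Get get = new Get(Bytes.toBytes("user_00001"))
                    .addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"))
                    .readVersions(3);   // ask for up to 3 versions instead of only the newest

            for (Cell cell : table.get(get).rawCells()) {
                // Each cell is {rowkey, family:qualifier, timestamp} -> value (raw bytes).
                System.out.printf("ts=%d value=%s%n",
                        cell.getTimestamp(),
                        Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```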
3 HBase Underlying Architecture
3.1 Client
The Client provides the interfaces for accessing HBase, and it also maintains a cache (for example of metadata) to speed up HBase access.
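A minimal connection sketch with the HBase Java client: the client is pointed only at the ZooKeeper quorum (hostnames below are placeholders) and discovers everything else from there, caching the metadata it learns.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class ClientSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client only needs the ZooKeeper quorum; from there it discovers
        // hbase:meta and the RegionServers, and caches that metadata locally.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            System.out.println("Connected, table: " + table.getName());
        }
    }
}
```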
3.2 Zookeeper
HBase uses ZooKeeper for Master high availability, RegionServer monitoring, the entry point to metadata, and maintenance of cluster configuration. Concretely, ZooKeeper is responsible for the following:
Ensuring that only one Master is active in the cluster: if the Master fails, a new Master is elected through the competition mechanism to take over the service.
Monitoring the status of RegionServers and notifying the Master of RegionServer online/offline events via callbacks when a RegionServer fails.
Storing the address of the unified metadata entry point, the hbase:meta table.
3.3 Master
The Master's position in HBase is much weaker than in other kinds of clusters. Data reads and writes do not involve it at all, and the cluster keeps running even if the Master dies. Still, the Master cannot stay down for too long, because many necessary operations depend on it: DDL such as creating tables and modifying column-family configuration, as well as Region splits and merges. Its duties are:
Assigning Regions to specific RegionServers at startup.
Discovering failed Regions and reassigning them to healthy RegionServers.
Managing load balancing across RegionServers and adjusting the Region distribution.
Assigning the new Regions produced by a Region split.
Multiple Masters can be started in HBase; ZooKeeper's Master election mechanism ensures that exactly one of them is active at any time.
3.4 RegionServer
The RegionServer handles users' read and write requests directly and is the real worker node. Its responsibilities are summarized as follows:
Manage the Region assigned to it by Master.
Process read and write requests from the client.
Responsible for interacting with the underlying HDFS and storing data to HDFS.
Responsible for splitting a Region when it grows too large.
Responsible for merging (compacting) StoreFiles.
ZooKeeper monitors the online/offline status of RegionServers. When ZooKeeper detects that a RegionServer is down, it notifies the Master so that failover can take place: the Regions served by the offline RegionServer temporarily stop serving, the Master reassigns them to other RegionServers, and any data on the failed RegionServer that had not yet been persisted to disk is recovered by replaying the WAL.
3.5 WAL
WAL (Write-Ahead Log) is the log that an HBase RegionServer uses to record operations while data is being inserted or deleted. Every mutation such as a Put or Delete is first written to the RegionServer's HLog file. Only after the WAL write succeeds is the client told that the data was committed; if the WAL write fails, the client is told the write failed. This is how the data is made durable.
The WAL is a persistent file stored on HDFS. When data arrives at a Region, it is written to the WAL first and then loaded into the MemStore. That way, even if the RegionServer crashes before the data is persisted, the operations can be replayed from the WAL on restart, similar to Redis's AOF.
All Regions on a RegionServer share one HLog. A write goes to the WAL first, and only after that succeeds is it written to the MemStore; when a MemStore reaches a certain size, it is flushed and StoreFiles are produced one by one.
The WAL is enabled by default. You can turn it off manually to make inserts, deletes and updates faster, but only at the expense of data safety. If you do not want to disable the WAL, yet also do not want to pay the cost of syncing to HDFS on every single change, you can choose to write the WAL asynchronously (the default sync interval is 1 second), as sketched below.
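The WAL trade-off can be chosen per mutation through the client's Durability setting; a small sketch (table and values are placeholders, and the default remains SYNC_WAL):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDurabilityDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {

            Put put = new Put(Bytes.toBytes("user_00002"))
                    .addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("bb"));

            // Default is SYNC_WAL: ack only after the WAL entry is written.
            // ASYNC_WAL batches WAL syncs for lower latency; SKIP_WAL trades
            // durability for speed and risks losing data on a crash.
            put.setDurability(Durability.ASYNC_WAL);

            table.put(put);
        }
    }
}
```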
If you have studied the rolling edits-file mechanism in Hadoop, you can guess that the WAL in HBase is also a rolling log structure: one WAL instance contains multiple WAL files, and a WAL roll is triggered under the following conditions.
The size of WAL exceeds a certain threshold.
The HDFS file block where the WAL file is located is almost full.
WAL archiving and deletion.
3.6 Region
Each Region has a start RowKey and an end RowKey that define the range of rows it stores. As the overall diagram shows, a Region contains multiple Stores, each Store corresponds to one column family, and a Store is composed of a MemStore and HFiles.
3.7 Store
A Store consists of two important parts: the MemStore and the HFiles.
3.7.1 MemStore
Each Store has one MemStore instance. Data is written to the WAL and then placed in the MemStore. The MemStore is an in-memory store; when it reaches a threshold (128 MB by default), it is flushed to a file, that is, a snapshot is generated. HBase has a thread responsible for performing MemStore flushes.
3.7.2 StoreFile
When the data in a MemStore is written out to a file, the result is a StoreFile; StoreFiles are saved on disk in HFile format. HBase decides whether a Region needs to be split based on the size of its Stores.
3.7.3 HFile
A Store contains multiple HFiles: each flush writes a new HFile to HDFS. HFiles are also merged dynamically (compaction); they are the physical files in which the data is actually stored.
Here is a question: by the time an operation reaches the Region, the data has already been persisted to the WAL before it ever enters an HFile, and the WAL lives on HDFS, so why load it from the WAL into the MemStore and then flush it into an HFile?
Because HDFS supports file creation, append and delete, but not in-place modification, and for a database the order of the data is very important.
Writing to the WAL first is purely about data safety; the WAL is unordered (append-only).
Reading the data into the MemStore afterwards is about sorting it before it is stored.
So the point of the MemStore is to keep data sorted by RowKey in dictionary order, not to act as a cache that improves write efficiency.
3.8 HDFS
HDFS provides the final underlying data storage for HBase. HBase's data is stored on HDFS in HFile format (similar to Hadoop's other underlying storage formats), and HDFS also underpins HBase's high availability (the HLog is stored on HDFS). Specifically, HDFS provides:
Provide underlying distributed storage services for metadata and table data
Multiple copies of data to ensure high reliability and high availability
4 HBase Read and Write
For DML operations on an HBase cluster we do not need to involve the HMaster at all; the client only needs to get the location of hbase:meta from ZooKeeper and can then read and write data through the RegionServers.
4.1 HBase Write Process
The client first visits ZooKeeper and reads the /hbase/meta-region-server znode to find out which Region Server hosts the hbase:meta table.
Access that Region Server, read the hbase:meta table, and use the namespace:table/rowkey of the request to find out which Region, on which Region Server, holds the target data. The table's Region information and the location of the meta table are cached in the client's meta cache to speed up the next access.
Communicate with the target Region Server.
Write (append) the data sequentially to WAL.
The data is written to the corresponding MemStore, and the data is sorted in MemStore.
Send an ack to the client; note that at this point the data does not yet have to be flushed to disk.
When the MemStore flush condition is reached, the data is flushed to an HFile.
When viewing the web page, a random number is randomly generated for each Region.
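From the client's side, the write path above ends at the MemStore acknowledgement. For bulk writes it is common to batch mutations client-side instead of sending one RPC per Put; a sketch using BufferedMutator (table, family, and row counts are made up for illustration):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedWrites {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("user"))) {

            byte[] info = Bytes.toBytes("info");
            for (int i = 0; i < 10_000; i++) {
                Put put = new Put(Bytes.toBytes(String.format("user_%08d", i)))
                        .addColumn(info, Bytes.toBytes("name"), Bytes.toBytes("name-" + i));
                // Mutations are buffered client-side and sent to the
                // RegionServers in batches instead of one RPC per Put.
                mutator.mutate(put);
            }
            mutator.flush();   // push any remaining buffered mutations
        }
    }
}
```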
4.2 HBase Read Process
Client first accesses ZooKeeper to get the Region Server in which the hbase:meta table is located.
Access that Region Server, read the hbase:meta table, and use the namespace:table/rowkey of the read request to find out which Region, on which Region Server, holds the target data. The table's Region information and the meta table's location are cached in the client's meta cache for subsequent access.
Communicate with the target Region Server.
Query the target data in the Block Cache (read cache), the MemStore and the StoreFiles (HFiles), and merge everything that is found. "Everything" here means all versions (timestamps) and all types (Put/Delete) of the same piece of data.
Blocks read from HFiles (a Block is HFile's unit of data storage, 64 KB by default) are cached in the Block Cache.
Merge the results and return the newest data to the client.
4.2.1 Block Cache
HBase's implementation provides two cache structures: the MemStore (write cache) and the BlockCache (read cache). The write cache was covered earlier and will not be repeated.
HBase caches the Blocks touched by a lookup in the BlockCache, so that later requests for the same or adjacent data can be served directly from memory, avoiding expensive disk IO.
The BlockCache is per Region Server.
A Region Server has only one Block Cache, and it is initialized when the Region Server starts.
HBase's management of Block Cache is divided into the following three types.
LRUBlockCache is the original and default implementation: all cached data lives on the JVM heap and is managed by the JVM.
SlabCache stores data off-heap, outside JVM memory management. It is generally used together with LRUBlockCache, but it did not really improve the GC drawbacks and its off-heap memory utilization is low.
With BucketCache, cache eviction is no longer managed by the JVM, which reduces the frequency of Full GC.
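Clients can also influence how the BlockCache is used. A common practice for large one-off scans is to disable block caching so the scan does not evict hot data; a sketch (table name and caching value are arbitrary):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class ScanCacheHints {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {

            Scan scan = new Scan();
            scan.setCaching(500);        // rows fetched per RPC round trip
            scan.setCacheBlocks(false);  // don't pollute the BlockCache with a one-off full scan

            try (ResultScanner scanner = table.getScanner(scan)) {
                long rows = 0;
                for (Result r : scanner) {
                    rows++;
                }
                System.out.println("scanned rows: " + rows);
            }
        }
    }
}
```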
Key points:
When reading, do not picture the process as "first read the MemStore, then the BlockCache, then the HFile, and then write the data into the BlockCache". If that were the order, then whenever settings or timing made the data on disk newer than the data in memory, you would read a stale value. The sources have to be merged.
Conclusion:
HBase merges the in-memory data with the disk data when it reads, and then puts the disk blocks into the BlockCache; the BlockCache is a cache of disk data. Overall, HBase is a system that reads more slowly than it writes.
4.3 Why does HBase write faster than read?
The main reason HBase can provide near-real-time service lies in its architecture and underlying data structures: the LSM tree (Log-Structured Merge-Tree) plus HTable Region partitioning plus caching.
Writes are fast because the data is not persisted to disk immediately: it is first written to memory and later flushed asynchronously into HFiles, so from the client's point of view writes return very quickly.
The data HBase keeps in memory is sorted, and it stays sorted when it is written to an HFile; multiple sorted HFiles are then merge-sorted into larger sorted HFiles. Performance tests show that sequential disk reads and writes are at least three orders of magnitude faster than random ones.
Reads remain acceptable thanks to the LSM-tree structure: disk seek time is far longer than sequential read time, and HBase's design keeps the number of disk seeks within an acceptable range.
The idea of the LSM tree is to split one big tree into N small trees. Data is written to the small trees in memory first; as they grow, they are flushed to disk, and the trees on disk are periodically merged into a bigger tree to optimize read performance.
4.3.1 A Query Example
The RowKey lets us quickly locate the Region a row belongs to. Suppose there are 1 billion records occupying about 1 TB, split across 500 Regions; each Region then holds roughly 2 GB, so we only need to search within about 2 GB to find the record.
The rest of the lookup happens inside that Region, in memory and on disk.
The data in memory and on disk is sorted, so the record we want may be near the front or near the back; assuming it sits in the middle, we only need to traverse about 2.5 HStoreFiles, roughly 300 MB in total.
Each HStoreFile (a wrapper around HFile) stores data as key-value pairs (KV); to locate the record we only need to scan the keys of each data block and check whether they match. Keys are generally short; assuming a key-to-value ratio of 1:19, only about 15 MB actually has to be read. At a disk throughput of 100 MB/s, that takes about 0.15 s, and with the Block Cache it can be even faster.
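A toy back-of-the-envelope version of the estimate above, with the article's numbers plugged in as assumptions:

```java
public class ReadEstimate {
    public static void main(String[] args) {
        double totalBytes   = 1e12;        // ~1 TB for 1 billion records (assumption from the article)
        int    regions      = 500;
        double perRegion    = totalBytes / regions;        // ~2 GB per Region
        double scannedBytes = 300 * 1e6;   // ~2.5 HStoreFiles traversed, ~300 MB (article's estimate)
        double keyFraction  = 1.0 / 20.0;  // assumed key:value ratio of 1:19 -> only keys are scanned
        double keyBytes     = scannedBytes * keyFraction;  // ~15 MB of keys
        double diskRate     = 100 * 1e6;   // 100 MB/s sequential disk read

        System.out.printf("per-Region data: %.1f GB%n", perRegion / 1e9);
        System.out.printf("key bytes scanned: %.0f MB%n", keyBytes / 1e6);
        System.out.printf("estimated scan time: %.2f s%n", keyBytes / diskRate);
    }
}
```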
Once you roughly understand how reads and writes work, you can see that with clever enough read and write paths, both can indeed be very fast.
5 HBase Flush
5.1 Flush
From the user's point of view a write is done once the data is in the MemStore, but from the system's point of view it is only truly done once the data has been flushed to disk. Since data is written to the WAL (HLog) first and then to the MemStore, flushes are triggered at the following times.
When the number of WAL files exceeds the configured limit, Regions are flushed in chronological order (oldest first) until the number of WAL files drops back below the limit.
When the total size of all MemStores on a Region Server reaches 40% of the heap, the Region Server flushes its Regions in descending order of MemStore size and blocks writes; flushing continues until the total drops below 0.95 times that threshold, after which clients can write again.
When a single MemStore reaches 128 MB, all the MemStores of the Region it belongs to are flushed.
A flush is also triggered periodically when the automatic flush interval is reached; the interval defaults to 1 hour.
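The thresholds above correspond to standard HBase configuration properties, normally set in hbase-site.xml on the servers; a sketch that simply shows the property names with the default values described above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushTuning {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Per-MemStore flush threshold (default 128 MB).
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        // Upper bound of all MemStores on a RegionServer as a fraction of heap (default 0.4).
        conf.setFloat("hbase.regionserver.global.memstore.size", 0.4f);
        // Flushing continues until usage drops below lower.limit * size (default 0.95).
        conf.setFloat("hbase.regionserver.global.memstore.size.lower.limit", 0.95f);
        // Periodic flush interval in milliseconds (default 1 hour).
        conf.setLong("hbase.regionserver.optionalcacheflushinterval", 3600_000L);

        System.out.println("flush.size = " + conf.get("hbase.hregion.memstore.flush.size"));
    }
}
```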
5.2 StoreFile Compaction
Every MemStore flush produces a new HFile, and different versions (timestamps) and different types (Put/Delete) of the same field may end up spread across several HFiles, so a query has to traverse all of them. To reduce the number of HFiles and to clean up expired and deleted data, StoreFile Compaction is performed.
There are two kinds of Compaction: Minor Compaction and Major Compaction.
Minor Compaction merges several adjacent, smaller HFiles into one larger HFile, but does not clean up expired or deleted data.
Major Compaction merges all the HFiles of a Store into one large HFile and does clean up expired and deleted data.
5.3 Region Split
Each table starts with a single Region, and Regions split automatically as data is written. When a Region first splits, both child Regions stay on the current Region Server, but for load-balancing reasons the HMaster may later move one of them to another Region Server.
Region Split timing:
Prior to version 0.94:
When the total size of all the StoreFiles in one Store of a Region exceeds hbase.hregion.max.filesize (10 GB by default), the Region is split.
After version 0.94:
When the total size of all the StoreFiles in one Store of a Region exceeds min(R² × hbase.hregion.memstore.flush.size (128 MB), hbase.hregion.max.filesize), the Region is split, where R is the number of Regions of the current table on the current Region Server.
For example:
The first split threshold is 1² × 128 MB = 128 MB, and the split produces two Regions of roughly 64 MB each.
The second split threshold is 2² × 128 MB = 512 MB, and that split produces Regions of roughly 256 MB each.
This eventually yields a queue of Regions ranging from about 64 MB up to 10 GB, which causes data skew.
Solution: plan the Region key ranges in advance (pre-splitting), for example 0-1k, 1k-2k, 2k-3k, as sketched below.
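A sketch of pre-splitting with the Java Admin API, using placeholder split keys in the spirit of the 0-1k / 1k-2k / 2k-3k example:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            TableDescriptor table = TableDescriptorBuilder.newBuilder(TableName.valueOf("user"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                    .build();

            // Region boundaries planned up front:
            // (-inf,1000), [1000,2000), [2000,3000), [3000,+inf)
            byte[][] splitKeys = {
                    Bytes.toBytes("1000"),
                    Bytes.toBytes("2000"),
                    Bytes.toBytes("3000")
            };
            admin.createTable(table, splitKeys);
        }
    }
}
```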
It is also not recommended to use many column families. For example, with CF1, CF2 and CF3, if CF1 holds a lot of data while CF2 and CF3 hold very little, a Region split triggered by CF1 will also chop CF2 and CF3 into many tiny pieces, which is bad for system maintenance.
6 Common interview questions for HBase
6.1 Design Principles of RowKey in HBase
RowKey length principle
A RowKey is a binary stream of at most 64 KB; in practice it is usually 10-100 bytes, stored as a byte[], and generally designed with a fixed length. The shorter the better, because HFiles store data as key-value pairs, and a long key wastes space.
RowKey hashing principle
RowKeys should be designed so that data is spread as evenly as possible across the RegionServers (a salting sketch follows below).
RowKey uniqueness principle
A RowKey must be designed to be unique. RowKeys are stored sorted in dictionary order, so they can also be designed so that frequently read data ends up stored together.
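One common way to satisfy the hashing principle while keeping keys short and unique is to prepend a small hash-derived salt to the business key; a sketch (the salt width and key format are arbitrary choices):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SaltedRowKey {
    // Prepend a short, deterministic hash prefix so that consecutive business
    // keys (e.g. increasing order IDs) spread across Regions/RegionServers.
    static String buildRowKey(String businessKey) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(businessKey.getBytes(StandardCharsets.UTF_8));
        // Use the first byte of the digest as a 2-hex-char salt (256 buckets).
        String salt = String.format("%02x", digest[0]);
        return salt + "_" + businessKey;
    }

    public static void main(String[] args) throws Exception {
        for (String id : new String[]{"order_1001", "order_1002", "order_1003"}) {
            System.out.println(buildRowKey(id));
        }
    }
}
```

Note that salting keeps point lookups cheap (the salt can be recomputed from the business key) but makes plain range scans over the original key order harder.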
6.2 The Position of HBase in the Big Data Ecosystem
Simply put, HBase acts as the database of the big data stack. Any engine that can analyze HBase, such as MR, Hive or Spark, can sit on top of it. For example, you can associate Hive with HBase so that Hive's data is stored in HBase rather than in plain HDFS files; after the association, data added through Hive is visible in HBase, and data added in HBase is also visible in Hive.
6.3 HBase Optimization Methods
6.3.1 Reduce Adjustments
Several things in HBase adjust themselves dynamically, such as Regions (splits) and HFiles (compactions). These adjustments cost IO, and there are several ways to reduce them.
Region
Without pre-built partitions, Regions keep splitting as the data grows, which increases IO overhead. The solution is to pre-build partitions according to your RowKey design, reducing dynamic Region splits.
HFile
A MemStore flush generates an HFile, and HFiles are merged (compacted) when there are too many of them. To reduce this unnecessary IO overhead, estimate the project's data volume and configure an appropriate HFile (Region) size.
6.3.2 Reduce Starts and Stops
Relational databases use transactions partly to batch writes and reduce the overhead of repeatedly opening and closing resources; HBase likewise suffers from the cost of frequent "open and close" activity.
Disable automatic Compaction.
HBase's automatic Minor Compaction and Major Compaction carry a significant cost. To keep them from kicking in at uncontrolled moments, it is recommended to turn off automatic (major) compaction and run it manually during off-peak hours, as sketched below.
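A sketch of that approach: hbase.hregion.majorcompaction set to 0 disables the periodic trigger (a server-side setting, shown here only for illustration), and Admin.majorCompact kicks off a major compaction on demand during a quiet window. The table name is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ManualMajorCompaction {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Disabling periodic major compaction is normally done cluster-wide in
        // hbase-site.xml; a value of 0 turns the time-based trigger off.
        conf.setLong("hbase.hregion.majorcompaction", 0L);

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Kick off a major compaction explicitly during an off-peak window.
            admin.majorCompact(TableName.valueOf("user"));
        }
    }
}
```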
6.3.3 Reduce Data Volume
Turn on filtering to improve query speed
Enable BloomFilter. A BloomFilter is a column-family-level filter: when a StoreFile is generated, a MetaBlock is generated along with it and is used to filter data at query time.
Use Compression
Snappy or LZO compression is generally recommended (see the sketch below).
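A sketch that combines the two tips above, enabling a row-level Bloom filter and Snappy compression on a column family at table-creation time (names are placeholders; Snappy requires the native libraries on the cluster):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class FamilyTuning {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            ColumnFamilyDescriptor info = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("info"))
                    .setBloomFilterType(BloomType.ROW)                 // row-level Bloom filter
                    .setCompressionType(Compression.Algorithm.SNAPPY)  // Snappy compression for StoreFiles
                    .build();

            TableDescriptor table = TableDescriptorBuilder.newBuilder(TableName.valueOf("user"))
                    .setColumnFamily(info)
                    .build();
            admin.createTable(table);
        }
    }
}
```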
6.3.4 Rational Design
The design of RowKey and ColumnFamily in HBase tables is very important. Good design can improve performance and ensure the accuracy of data.
RowKey design
Hashing: a good hash design keeps identical or similar RowKeys together while spreading dissimilar RowKeys apart, which benefits queries.
Brevity: the RowKey is stored as part of every key in the HFile, so a RowKey made overly long for the sake of readability increases storage pressure.
Uniqueness: RowKeys must be clearly distinguishable.
Practicality: analyze the specific situation case by case.
The design of column families
Advantage: data in HBase is stored by column family, so querying one column family does not require scanning everything, only that family, which reduces IO.
Disadvantage: multiple column families mean the Region has multiple Stores, each with its own MemStore, and when one MemStore flushes, all MemStores of the same Region flush with it, increasing the cost of each flush.
6.4 The Difference Between HBase and Relational Databases
Data types: a traditional relational database has rich data types; HBase stores everything as strings (byte arrays).
Data operations: a relational database offers rich operations, including complex multi-table joins; HBase offers only simple CRUD.
Storage mode: row-oriented storage versus column-oriented storage.
Data indexes: a relational database supports complex, multiple indexes; HBase has only the RowKey index.
Data maintenance: a relational database overwrites the old value with the new one; HBase keeps multiple versions.
Scalability: a relational database is hard to scale out; HBase scales horizontally and dynamically.
6.5 HBase Bulk Import
Write data in batches through HBase API.
Use the Sqoop tool to import data into the HBase cluster in batches.
Bulk import using MapReduce.
Use HBase BulkLoad.
HBase imports data through Hive associations.
For big data imports, writing through the HBase API or plain MapReduce is very inefficient, because every write has to go through a RegionServer: the data is first written to the WAL and the MemStore, the MemStore is flushed to disk as HFiles once it reaches its threshold, too many HFiles trigger Compaction, and an oversized Region triggers a Split.
BulkLoad is suitable for an initial data load when HBase and Hadoop run in the same cluster. BulkLoad uses MapReduce to generate files directly in HFile format, after which the Region Servers move the HFiles into the corresponding Region directories (a sketch follows).
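A hedged skeleton of the BulkLoad approach: a MapReduce job whose mapper turns CSV lines into Puts, with HFileOutputFormat2.configureIncrementalLoad wiring in the reducer and partitioner so the output HFiles match the table's Region boundaries. The input format, column names and paths are assumptions for illustration; the final hand-off to the RegionServers is left to the completebulkload tool or the bulk-load API.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

    // Maps one CSV line "rowkey,name" to a Put keyed by the rowkey (assumes an existing 'info' family).
    public static class CsvToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            byte[] rowKey = Bytes.toBytes(fields[0]);
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(fields[1]));
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("user");
        boolean ok;

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName)) {

            Job job = Job.getInstance(conf, "hbase-bulkload-sketch");
            job.setJarByClass(BulkLoadSketch.class);
            job.setMapperClass(CsvToPutMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // CSV input on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFile output directory
            // Wires in the reducer, partitioner and output format so that the job
            // writes HFiles already partitioned to match the table's Regions.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            ok = job.waitForCompletion(true);
        }
        // After a successful run, hand the generated HFiles to the RegionServers,
        // e.g. with the completebulkload tool or the LoadIncrementalHFiles / BulkLoadHFiles API.
        System.exit(ok ? 0 : 1);
    }
}
```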
That concludes this study of the differences between HBase and relational databases. Hopefully it has cleared up some doubts; pairing the theory with hands-on practice is the best way to learn, so go and try it.