This article explains how to build HBase, a component of the Hadoop ecosystem. The content is simple and straightforward, so follow along to learn how it works and how to set it up.
1. The Origin of Distributed Table Systems
A distributed table system provides a table model: each table consists of many rows, each uniquely identified by a primary key, and each row contains many columns. The whole table is kept globally ordered by the system. Google's BigTable is the ancestor of distributed table systems. It adopts a two-layer architecture, using GFS as the underlying persistent storage layer. BigTable's external interface is rather limited, however, so Google later developed Megastore and Spanner, which provide richer interfaces and can also handle transactions.
Each row in a table is uniquely identified by a primary key (Row Key), and each row contains many columns (Column). A row and a column together identify a Cell, and each cell holds multiple versions of the data. Overall, BigTable is a distributed, multi-dimensional, sorted map:
(row:string, column:string, timestamp:int64) -> string
Multiple columns form a column family (Column Family), so a full column name consists of the column family plus a column qualifier. Column families are the unit of access control and must be defined in advance, while columns (qualifiers) can be added dynamically in any number.
(Figure: storage structure, logical view)
(Figure: storage structure, physical view)
A row's primary key is an arbitrary string of up to 64KB. Rows are sorted by primary key in lexicographic (dictionary) order, so keys that share a prefix end up stored next to each other; for URLs, BigTable stores the hostname reversed so that pages from the same domain cluster together.
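As a small illustration in the HBase shell (the table and row keys here are invented for this sketch), rows sharing a reversed-domain prefix can be read back as one contiguous range:

create 'pages', 'contents'
put 'pages', 'com.cnn.money/index.html', 'contents:html', '<html>...</html>'
put 'pages', 'com.cnn.www/index.html', 'contents:html', '<html>...</html>'
put 'pages', 'org.example/index.html', 'contents:html', '<html>...</html>'
# '/' is the ASCII character right after '.', so this range covers every com.cnn.* row
scan 'pages', {STARTROW => 'com.cnn.', STOPROW => 'com.cnn/'}

The scan returns the two com.cnn rows and skips org.example, because the sort order keeps the shared-prefix keys adjacent.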
2. HBase in Hadoop
HBase is an open-source implementation of BigTable, built on top of the Hadoop Distributed File System (HDFS). So what is the difference between HDFS and HBase?
HDFS is a distributed file system suited to storing large files. By its own documentation it is not a general-purpose file system and does not provide fast access to individual records within a file. HBase, on the other hand, is built on HDFS and provides fast lookups (and updates) of records in large tables, which sometimes causes conceptual confusion. Internally, HBase puts the data into indexed StoreFiles for high-speed querying; the store files themselves live on HDFS. PS: to put it bluntly, HBase is a collection of indexes over HDFS files.
Comparison of HBase and relational database storage structures (HBase is a kind of NoSQL, i.e. "not only SQL"): locating a value in HBase requires several coordinates: row key, column family, column qualifier, and version (timestamp). The comparison below is not a perfect fit, since HBase is better suited to storing semi-structured data.
Mysql:

Id   name       password   type
1    zhangsan   xxxxx      sb
2    lisi       324324     sb2
3    zhaowu     423423     db
Hbase:

Rowkey (primary key)   Info (column family)               Others (column family)
                       name                  password     type    age
rk0000999              zhangsan (version1)   xxxxxxxx     sb      10
                       zhangsan3 (version2)
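As a hedged sketch, the HBase row above could be created and read back in the HBase shell like this (note that getting two versions of Info:name back requires the column family's VERSIONS setting to be at least 2; older HBase releases defaulted to 3):

create 'users', 'Info', 'Others'
put 'users', 'rk0000999', 'Info:name', 'zhangsan'
put 'users', 'rk0000999', 'Info:name', 'zhangsan3'    # a second version of the same cell
put 'users', 'rk0000999', 'Info:password', 'xxxxxxxx'
put 'users', 'rk0000999', 'Others:type', 'sb'
put 'users', 'rk0000999', 'Others:age', '10'
get 'users', 'rk0000999', {COLUMN => 'Info:name', VERSIONS => 2}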
3. Building HBase in the Experimental Environment
Prerequisite: a distributed file system, such as HDFS.
3.1 Machine preparation and node allocation
There are three machines here. One runs the HMaster, and all three run an HRegionServer, so one machine carries two roles. You also need HDFS and ZooKeeper deployed. My HDFS likewise spans three nodes: one machine runs the NameNode, SecondaryNameNode, and a DataNode, while the other two run only a DataNode.
IP               Hostname        JVM Processes
192.168.237.201  Spark-0x64-001  HMaster, HRegionServer, QuorumPeerMain, DataNode, NameNode, SecondaryNameNode
192.168.237.202  Spark-0x64-002  HRegionServer, QuorumPeerMain, DataNode
192.168.237.203  Spark-0x64-003  HRegionServer, QuorumPeerMain, DataNode
3.2 Install HDFS in non-HA mode
The HDFS distributed cluster was already installed earlier (without the HA mechanism), so I will not repeat the steps here.
3.3 Install the ZooKeeper cluster
I won't repeat this either; it is covered in a previous blog post: http://my.oschina.net/codeWatching/blog/367309
3.4 Configure the HBase files
a. Modify the file hbase-env.sh, adding two settings:
export JAVA_HOME=/spark/app/jdk1.7.0_21/
export HBASE_MANAGES_ZK=false
b. Modify the file hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://Spark-0x64-001:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>Spark-0x64-001:60000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>Spark-0x64-001,Spark-0x64-002,Spark-0x64-003</value>
  </property>
</configuration>
c. Copy core-site.xml and hdfs-site.xml from the Hadoop configuration directory into HBase's conf directory.
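For example (the paths are assumptions: Hadoop 1.x keeps its configs in $HADOOP_HOME/conf, Hadoop 2.x in $HADOOP_HOME/etc/hadoop; adjust to your installation):

cp $HADOOP_HOME/conf/core-site.xml $HBASE_HOME/conf/
cp $HADOOP_HOME/conf/hdfs-site.xml $HBASE_HOME/conf/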
d. Edit the regionservers file:
Spark-0x64-001
Spark-0x64-002
Spark-0x64-003
3.5 Copy the configured HBase directory to the other machines with the scp -r command.
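For example, assuming HBase was unpacked under /spark/app (mirroring the JDK path used above; adjust to your layout):

scp -r /spark/app/hbase Spark-0x64-002:/spark/app/
scp -r /spark/app/hbase Spark-0x64-003:/spark/app/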
3.6 Configure the time server; HBase is sensitive to clock skew between nodes, so all machines must stay synchronized. See link:.
3.7 Start HBase

a. Start ZooKeeper on all three machines.
b. Start HDFS (already started above).
c. Start HBase on the master node: sh bin/start-hbase.sh
d. Optionally, start additional HMaster processes: hbase-daemon.sh start master
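Putting the sequence together, a minimal startup sketch (assuming the ZooKeeper, Hadoop, and HBase bin directories are on the PATH) looks like:

zkServer.sh start                # run on each of the three machines
start-dfs.sh                     # on the NameNode, if HDFS is not already running
start-hbase.sh                   # on the HBase master, Spark-0x64-001
hbase-daemon.sh start master     # optional: a backup HMaster on another node
jps                              # verify HMaster / HRegionServer / QuorumPeerMain are up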
3.8 Validate via the web page

Open http://192.168.237.201:60010 in a browser.
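As an optional command-line sanity check (port 60010 was the HMaster web UI port in 0.9x-era HBase; newer releases moved it to 16010):

# fetch the master status page; a non-empty response means the UI is serving
curl -s http://192.168.237.201:60010/master-status | head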
4. System Architecture Analysis of HBase

(Figure: technical architecture diagram of HBase at a glance)
Client
1. Contains the interfaces for accessing HBase; the client also maintains caches (such as region location information) to speed up access.

ZooKeeper
1. Ensures there is only one active Master in the cluster at any time.
2. Stores the addressing entry for all regions (the root region location).
3. Monitors the status of Region Servers in real time and notifies the Master when Region Servers come online or go offline.
4. Stores the HBase schema, including which tables exist and each table's column families.

Master
1. Assigns regions to Region Servers.
2. Responsible for load balancing across Region Servers.
3. Discovers failed Region Servers and reassigns their regions.
4. Garbage-collects unused files on GFS (HDFS, in HBase's case).
5. Processes schema update requests.

Region Server
1. Maintains the regions assigned to it by the Master and handles IO requests for those regions.
2. Splits regions that grow too large during operation.

Note that a client reading or writing data does not need the Master at all: addressing goes through ZooKeeper and the Region Servers, and reads and writes go directly to Region Servers. The Master only maintains metadata about tables and regions, so its load is very low.
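As a quick, hedged check of these roles on a running cluster, the HBase shell's status command lists the active master, the region servers, and the regions each one serves:

status 'detailed'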
5. How HBase finds data
How does the system find the region where a given row key (or row key range) is located? HBase uses a three-tier structure, similar to a B+ tree, to store region locations:

1. The first tier is a file stored in ZooKeeper (Chubby, in Google's case) that holds the location of the root region.
2. The second tier is the root region, which is the first region of the .META. table; it holds the locations of all other regions of the .META. table. Through the root region we can reach all of the .META. table's data.
3. The third tier is the .META. table itself, a special table that holds the region location information of all user tables in HBase.

PS: the root region is never split, which guarantees that at most three hops are needed to locate any region. Each row of the .META. table holds the location information of one region; the row key is an encoding of the table name plus the region's end row. To speed up access, all regions of the .META. table are kept in memory. Suppose one row of the .META. table occupies about 1KB of memory, and each region is capped at 128MB. Then the number of regions the three-tier structure can address is (128MB/1KB) * (128MB/1KB) = 2^17 * 2^17 = 2^34 regions. The client caches the location information it looks up, and the cache does not expire proactively. Therefore, if all of the client's cached entries turn out to be stale, it takes six network round trips to locate the correct region (three to discover that the cached entries are invalid, and three more to fetch the new locations).
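As a hedged illustration (znode names vary by HBase version; these assume the 0.9x-era layout used in this tutorial, where newer releases use /hbase/meta-region-server instead), you can peek at the first tier, the region-location entry kept in ZooKeeper, with the ZooKeeper CLI:

zkCli.sh -server Spark-0x64-001:2181
# inside the CLI:
ls /hbase                        # the base znode HBase creates
get /hbase/root-region-server    # location of the root region in 0.9x-era HBase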