Hive Storage Handlers and Hive-HBase Integration


This article is a translation of the official Hive documentation, with my own notes added where the original is hard to follow. The official English sources are:

1. https://cwiki.apache.org/confluence/display/Hive/StorageHandlers

2. https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

1. Introduction to Storage Handlers

With Hive storage handlers, Hive can be backed not only by HBase but also by Cassandra, JDBC data sources, MongoDB, and Google Spreadsheets.

Storage handlers build on the existing extensibility features of Hive and Hadoop:

Input formats

Output formats

Serialization/deserialization libraries (SerDes)

In addition to the extensibility points above, a storage handler must implement a new metadata hook interface, which lets Hive DDL statements define and manage metadata both in Hive's own metastore and in the other system's catalog (I understand this catalog to mean the other system's metadata directory).

Some terms:

The concepts of HIVE itself:

Managed table (internal table): the metadata is managed by Hive, and the data is stored in Hive's warehouse.

External table (external): the table definition comes from an external catalog, and the data is also stored in the external system.

The concept of hive storage processor:

Native tables: tables that Hive can manage and access directly, without a storage handler.

Non-native tables: tables that must be managed and accessed through a storage handler.

Crossing the managed/external distinction with the native/non-native distinction yields four kinds of tables:

Managed native: created with CREATE TABLE (no STORED BY clause).

External native: created with CREATE EXTERNAL TABLE, without a STORED BY clause.

Managed non-native: created with CREATE TABLE plus a STORED BY clause; Hive stores the table definition in its metastore but does not create any files itself (my understanding: after the table structure is defined, Hive does not create a corresponding data directory). Instead, the storage handler asks the system that stores the data to create a matching object structure.

External non-native: created with CREATE EXTERNAL TABLE plus a STORED BY clause; Hive registers the table definition in its metastore and calls the storage handler to check that it is consistent with the definition that already exists in the other system.
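As a rough sketch (table names are illustrative; HBaseStorageHandler is the real handler class discussed in the next section), the four forms can be declared like this:

```sql
-- Managed native: Hive owns both the metadata and the data files.
CREATE TABLE managed_native (key INT, value STRING);

-- External native: Hive owns only the definition; the data stays at the given path.
CREATE EXTERNAL TABLE external_native (key INT, value STRING)
LOCATION '/data/external_native';

-- Managed non-native: the storage handler creates the backing object in the other system.
CREATE TABLE managed_non_native (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val");

-- External non-native: the backing object must already exist in the other system.
CREATE EXTERNAL TABLE external_non_native (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "existing_hbase_table");
```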

The DDL for creating a table structure in Hive:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS]
  [
    [ROW FORMAT row_format] [STORED AS file_format]
    | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]
  [AS select_statement]

Concrete usage examples are given below.

2. Integration of Hive and HBase

The integration of Hive and HBase is implemented as a handler jar named hive-hbase-x.y.z.jar. The handler requires Hadoop 0.20 or later, and has only been tested against hadoop-0.20.x, hbase-0.92.0 and zookeeper-3.3.4. If your HBase version is not 0.92, you will need to recompile the storage handler against the version you are using.

To create an HBase table that Hive can manage, add a STORED BY clause to Hive's CREATE TABLE statement:

CREATE TABLE hbase_table_1 (key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");

The hbase.table.name property is optional; it specifies the name of the table in HBase, allowing the same table to have different names in Hive and HBase. In the statement above, the table is called hbase_table_1 in Hive and xyz in HBase. If it is not specified, the two names are the same.

When the command has executed, a new, empty table is visible in the HBase shell:

$ hbase shell
HBase Shell; enter 'help' for list of supported commands.
Version: 0.20.3, r902334, Mon Jan 25 13:13:08 PST 2010
hbase(main):001:0> list
xyz
1 row(s) in 0.0530 seconds
hbase(main):002:0> describe "xyz"
DESCRIPTION                                                             ENABLED
 {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', COMPRESSION => 'NONE', VE true
 RSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =>
 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0220 seconds
hbase(main):003:0> scan "xyz"
ROW                          COLUMN+CELL
0 row(s) in 0.0060 seconds

The output above contains no information about the column val, even though that column is named in the CREATE TABLE statement. This is because in HBase only column family names are part of table-level metadata; the columns within a family are defined only at row level.

The following statement loads data from Hive into HBase. The table pokes already exists in Hive and contains data.

INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;

Then verify in the HBase shell that the data has been loaded:

hbase(main):009:0> scan "xyz"
ROW                          COLUMN+CELL
 98                          column=cf1:val, timestamp=1267737987733, value=val_98
1 row(s) in 0.0110 seconds

Querying the table through Hive gives:

hive> select * from hbase_table_1;

Total MapReduce jobs = 1

Launching Job 1 out of 1

...

OK

98 val_98

Time taken: 4.582 seconds

Inserting large amounts of data may be slow due to WAL overhead; if you would like to disable this, make sure you have HIVE-1383 (as of Hive 0.6), and then issue this command before the INSERT:

set hive.hbase.wal.enabled=false;

Warning: disabling WAL may lead to data loss if an HBase failure occurs, so only use this if you have some other recovery strategy available.

If you want to give Hive access to an existing HBase table, use CREATE EXTERNAL TABLE, as follows:

CREATE EXTERNAL TABLE hbase_table_2 (key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
TBLPROPERTIES ("hbase.table.name" = "some_existing_table");

The hbase.columns.mapping property is required; it describes how Hive columns map to the HBase table's column families and columns.

Column mapping between Hive and HBase

Two SERDEPROPERTIES settings control the mapping of HBase columns to Hive:

1. hbase.columns.mapping

2. hbase.table.default.storage.type

The latter can be either string or binary; the default is string. It is available only as of Hive 0.9; earlier versions support only string.

The current column mapping is somewhat clumsy and restrictive to use:

1. For each Hive column, you must specify a corresponding entry in the parameter hbase.columns.mapping (for example, :key or a:b is one entry). Entries are separated by commas, so a table with n columns has n comma-separated entries in the value of hbase.columns.mapping. For example:

"hbase.columns.mapping" = ":key,a:b,a:c,d:e" means there are two column families, a and d; family a has two columns, b and c.

Note that spaces are not allowed in the value of hbase.columns.mapping.

2. Each mapping entry must be of the form :key or column-family-name:[column-name][#(binary|string)] (the part after # was added in Hive 0.9; earlier versions treated everything as string).

1) If no type is specified for a column or column family, the value of hbase.table.default.storage.type is used as its type.

2) Any prefix of a type name can be used to denote that type; for example, #b stands for #binary.

3) If a column is declared binary, the cells in the corresponding HBase column must also be stored in HBase's binary byte format.

3. There must be exactly one :key entry to map the row key (composite keys are not supported).

4. Before Hive 0.6, the first entry was implicitly treated as the key; from 0.6 on, the key must be specified explicitly with :key.

5. There is currently no way to access HBase's timestamp attribute; queries always return the latest version of the data (HBase cells are versioned by timestamp).

6. Because HBase column definitions contain no type information, values are converted to strings when stored; custom column data types are therefore not supported.

7. You do not have to map every HBase column family, but unmapped families cannot be read through the Hive table; multiple Hive tables can be mapped to the same HBase table.
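For instance, rule 7 means that two Hive tables can expose different slices of one HBase table. A sketch (the HBase table "events" and its families cf1 and cf2 are illustrative):

```sql
-- Maps only family cf1; family cf2 is invisible through this table.
CREATE EXTERNAL TABLE events_cf1 (key STRING, a STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:a")
TBLPROPERTIES ("hbase.table.name" = "events");

-- A second Hive table maps both families of the same HBase table.
CREATE EXTERNAL TABLE events_all (key STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:a,cf2:b")
TBLPROPERTIES ("hbase.table.name" = "events");
```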

The following sections give more detailed examples of the different column mappings.

Multiple columns and column families

The following example maps three Hive columns (besides the key) to two HBase column families, one of which contains two columns:

CREATE TABLE hbase_table_1 (key int, value1 string, value2 int, value3 int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,a:b,a:c,d:e"
);

INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, bar, foo+1, foo+2
FROM pokes WHERE foo=98 OR foo=100;

In the example above, the Hive table has three data columns besides the key (value1, value2, value3), and the HBase table has two column families, a and d, where a contains two columns (b and c). The mappings correspond left to right: a:b maps to value1, a:c to value2, and d:e to value3.

The following is the result viewed from HBase:

hbase(main):014:0> describe "hbase_table_1"
DESCRIPTION                                                             ENABLED
 {NAME => 'hbase_table_1', FAMILIES => [{NAME => 'a', COMPRESSION => 'N true
 ONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_M
 EMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'd', COMPRESSION =>
 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN
 _MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0170 seconds

hbase(main):015:0> scan "hbase_table_1"
ROW                          COLUMN+CELL
 100                         column=a:b, timestamp=1267740457648, value=val_100
 100                         column=a:c, timestamp=1267740457648, value=101
 100                         column=d:e, timestamp=1267740457648, value=102
 98                          column=a:b, timestamp=1267740457648, value=val_98
 98                          column=a:c, timestamp=1267740457648, value=99
 98                          column=d:e, timestamp=1267740457648, value=100
2 row(s) in 0.0240 seconds

The query result of the same table in Hive is:

hive> select * from hbase_table_1;

Total MapReduce jobs = 1

Launching Job 1 out of 1

...

OK

100 val_100 101 102

98 val_98 99 100

Time taken: 4.054 seconds

Mapping a Hive MAP to an HBase column family

The following example maps Hive's MAP data type to an HBase column family. Each row can contain a different set of columns; the column names correspond to the MAP keys and the cell values to the MAP values.

CREATE TABLE hbase_table_1 (value map<string,int>, row_key int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "cf:,:key"
);

INSERT OVERWRITE TABLE hbase_table_1 SELECT map(bar, foo), foo FROM pokes
WHERE foo=98 OR foo=100;

Here cf is the column family, and the part after the colon is empty; that empty column name maps the whole family to the Hive MAP column: the MAP keys become the column names within family cf, and the MAP values become the cell values. The HBase scan below shows the keys used as column names and the values as cell values.

hbase(main):012:0> scan "hbase_table_1"
ROW                          COLUMN+CELL
 100                         column=cf:val_100, timestamp=1267739509194, value=100
 98                          column=cf:val_98, timestamp=1267739509194, value=98
2 row(s) in 0.0080 seconds

cf is the column family name; val_100 is a key of the MAP in the Hive table, and 100 is the value stored under val_100 in the MAP.

The corresponding Hive query returns:

hive> select * from hbase_table_1;

Total MapReduce jobs = 1

Launching Job 1 out of 1

...

OK

{"val_100": 100} 100

{"val_98": 98} 98

Time taken: 3.808 seconds

Note: the keys of the MAP must be of type string, because they are used as HBase column names; the following definition therefore fails:

CREATE TABLE hbase_table_1 (key int, value map<int,int>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:"
);

FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe: hbase column family 'cf:' should be mapped to map<string,?> but is mapped to map<int,int>)

Note: conversely, whenever hbase.columns.mapping contains an entry such as "cf:" (a column family with an empty column name), the corresponding Hive column must be of MAP type; otherwise the table creation fails:

CREATE TABLE hbase_table_1 (key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:"
);

FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe: hbase column family 'cf:' should be mapped to map<string,?> but is mapped to string)

Examples of value types in hbase.columns.mapping

If no type is specified in an entry, such as cf:val, the type is taken from hbase.table.default.storage.type.

1. When hbase.table.default.storage.type is not set, the default is string; any binary columns must then be marked explicitly, as follows:

CREATE TABLE hbase_table_1 (key int, value string, foobar double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key#b,cf:val,cf:foo#b"
);

2. If hbase.table.default.storage.type is explicitly set to binary, any string columns must be marked instead, for example:

CREATE TABLE hbase_table_1 (key int, value string, foobar double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:val#s,cf:foo",
  "hbase.table.default.storage.type" = "binary"
);

The #s in cf:val#s means string; cf:foo has no type marker, so it takes the hbase.table.default.storage.type setting, which here is binary.

Add a timestamp

When Hive adds records to an HBase table, the timestamp defaults to the current time. To change this, set the optional SERDEPROPERTIES entry hbase.put.timestamp; a value of -1 selects the default policy, i.e. inserting with the current timestamp.

CREATE TABLE hbase_table_1 (key int, value string, foobar double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:val#s,cf:foo",
  "hbase.table.default.storage.type" = "binary",
  "hbase.put.timestamp" = "2013-03-17 09:04:30"
);

Uniqueness of primary key

HBase tables differ slightly from Hive tables: an HBase table has a row key that must be unique, while Hive has no such constraint. If the key is not unique, HBase keeps only one of the rows sharing that key, so queries against a native Hive table always return all matching rows, while the result through the HBase-backed table is indeterminate.

In Hive, the queries behave as follows:

CREATE TABLE pokes2 (foo INT, bar STRING);

INSERT OVERWRITE TABLE pokes2 SELECT * FROM pokes;

-- this will return 3
SELECT COUNT(1) FROM pokes WHERE foo=498;

-- this will also return 3
SELECT COUNT(1) FROM pokes2 WHERE foo=498;

When the same data is loaded into an HBase-backed table, the result differs:

CREATE TABLE pokes3 (foo INT, bar STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:bar"
);

INSERT OVERWRITE TABLE pokes3 SELECT * FROM pokes;

-- this will return 1 instead of 3
SELECT COUNT(1) FROM pokes3 WHERE foo=498;

Overwrite (OVERWRITE)

When INSERT OVERWRITE is executed, rows already in HBase are not deleted; however, existing rows whose keys match keys of the new rows are overwritten.
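A minimal sketch of this behavior, reusing the hbase_table_1 and pokes tables from earlier (the selected key values are illustrative):

```sql
-- First load: suppose this writes rows with keys 98 and 100 into HBase.
INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98 OR foo=100;

-- "Overwrite" with a different selection: the row with key 100 gets new cell
-- values and a row with key 200 is added, but the row with key 98 is NOT
-- deleted, unlike INSERT OVERWRITE on a native Hive table.
INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=100 OR foo=200;
```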
