This article explains in detail why you would choose Hive. It is intended to be practical and is shared here for reference; I hope you get something out of it after reading.
Why choose Hive?
Big-data computing and scalability based on Hadoop
Support for a SQL-like query language (see the sketch after this list)
Unified metadata management
Simple programming
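As a quick sketch of the SQL-like query language, assume a hypothetical log table wlan_log with columns msisdn and flux (these names are only for illustration). A familiar-looking aggregation query could be written as follows, and Hive would run it as MapReduce jobs over files in HDFS:
hive> SELECT msisdn, SUM(flux) AS total_flux
    > FROM wlan_log
    > GROUP BY msisdn
    > ORDER BY total_flux DESC
    > LIMIT 10;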
Installation of Hive
1. Hive plays the role of a data warehouse in the Hadoop ecosystem. It can both manage and query the data stored in Hadoop.
In essence, Hive is a SQL parsing engine: it converts SQL queries into MapReduce jobs and runs them.
Hive has a set of mapping tools that translate SQL into MapReduce jobs, and that map tables and fields in SQL to files (folders) in HDFS and to the columns within those files.
This set of mappings is called the metastore and is generally stored in Derby or MySQL.
Hive's default location in HDFS is /user/hive/warehouse, which is determined by the property hive.metastore.warehouse.dir in the configuration file hive-site.xml.
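A quick way to check which warehouse directory is configured is to print the property from the Hive CLI (a minimal sketch; the value shown below is just the default mentioned above):
hive> set hive.metastore.warehouse.dir;
hive.metastore.warehouse.dir=/user/hive/warehouse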
2. Installation of Hive
(1) Unpack the tarball, rename the directory, and set the environment variables.
(2) In the directory $HIVE_HOME/conf/, run mv hive-default.xml.template hive-site.xml to rename the file.
    In the directory $HIVE_HOME/conf/, run mv hive-env.sh.template hive-env.sh to rename the file.
(3) Modify Hadoop's configuration file hadoop-env.sh as follows:
    export HADOOP_CLASSPATH=.:$CLASSPATH:$HADOOP_CLASSPATH:$HADOOP_HOME/bin
(4) In the directory $HIVE_HOME/bin, modify the file hive-config.sh by adding the following:
    export JAVA_HOME=/usr/local/jdk
    export HIVE_HOME=/usr/local/hive
    export HADOOP_HOME=/usr/local/hadoop
3. Install MySQL
(1) Remove any MySQL-related packages already installed on Linux: rpm -e xxxxxxx --nodeps
    Run rpm -qa | grep mysql to check that the removal is complete.
(2) Run rpm -i mysql-server-* to install the MySQL server.
(3) Start the MySQL server: mysqld_safe &
(4) Run rpm -i mysql-client-* to install the MySQL client.
(5) Run mysql_secure_installation to set the root user's password.
4. Using MySQL as the Hive metastore
(1) Place the MySQL JDBC driver jar in Hive's lib directory.
(2) Modify the hive-site.xml file as follows:
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop0:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>admin</value>
</property>
The architecture of Hive consists of:
User interfaces, including CLI, JDBC/ODBC, and WebUI
Metadata storage, usually in a relational database such as MySQL or Derby
An interpreter, compiler, optimizer, and executor
Hadoop: HDFS is used for storage and MapReduce for computation
There are three main user interfaces: CLI, JDBC/ODBC, and WebUI.
CLI: the shell command line.
JDBC/ODBC: Hive's Java interface, similar to the way JDBC is used with a traditional database.
WebGUI: access Hive through a browser.
Hive stores its metadata in a database (the metastore); currently only MySQL and Derby are supported. The metadata includes table names, the columns and partitions of each table and their attributes, table properties (whether it is an external table, and so on), and the directory where each table's data is located.
The interpreter, compiler, and optimizer carry out lexical analysis, syntax analysis, compilation, optimization, and query-plan generation for a HiveQL statement. The generated query plan is stored in HDFS and subsequently executed by MapReduce calls.
Hive's data is stored in HDFS, and most queries are executed by MapReduce (queries such as select * from table, which simply read the data, do not launch MapReduce jobs).
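To make the last point concrete, here is a hedged sketch (the table name t is only a placeholder): a plain select * just reads the table's files from HDFS, while an aggregation is compiled into a MapReduce job:
select * from t;          -- no MapReduce job; Hive simply streams the files under the table's directory
select count(*) from t;   -- compiled into a MapReduce job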
Metastore of Hive
The metastore is the centralized repository for Hive metadata. By default it uses the embedded Derby database as its storage engine.
Drawback of the Derby engine: only one session can be open at a time.
Using MySQL as an external storage engine allows multiple users to access the metastore concurrently.
Shell of Hive
1. Hive command-line mode: run #/hive/bin/hive directly, or run # hive --service cli
2. Hive web-interface mode (port 9999):
# hive --service hwi &
Used to access Hive through a browser at
http://hadoop0:9999/hwi/
3. Hive remote-service mode (port 10000):
# hive --service hiveserver &
Hive and traditional Database

                        Hive                                   Traditional database
Query language          HiveQL                                 SQL
Data storage location   HDFS                                   Raw device or local FS
Data format             User defined                           Decided by the system
Data updates            Not supported                          Supported
Indexes                 Available in newer versions, but weak  Yes
Execution               MapReduce                              Executor
Execution latency       High                                   Low
Scalability             High                                   Low
Data scale              Large                                  Small
Data type of Hive
Basic data type
tinyint / smallint / int / bigint
float / double
boolean
string
Complex data type
array / map / struct
There is no date/datetime type (a table sketch using several of these types follows).
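A minimal sketch that puts several of these types into one table definition (the table and column names are made up for illustration; since there is no date/datetime type, the registration time is kept as a string):
CREATE TABLE user_profile (
  id       bigint,
  age      tinyint,
  score    double,
  active   boolean,
  name     string,
  tags     array<string>,
  props    map<string, string>,
  address  struct<city:string, zip:string>,
  reg_time string
);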
Data storage of Hive
The data storage of Hive is based on Hadoop HDFS
Hive does not have a special data storage format
The storage structure mainly includes: database, file, table and view.
By default Hive can load plain text files (TextFile) directly; SequenceFile is also supported.
When creating a table, specify the column and row delimiters of the data so that Hive can parse it.
Data Model of Hive-Internal Table
Similar in concept to Table in the database
Each Table has a corresponding directory in HDFS where its data is stored. For example, a table test has the HDFS path /warehouse/test, where warehouse is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml.
All Table data (excluding External Table data) is stored in this directory.
When you delete a table, both metadata and data are deleted
Data Model of Hive-Partition Table
A Partition corresponds roughly to a dense index on the partition column in a database.
In Hive, each Partition of a table corresponds to a subdirectory under the table's directory, and all of a Partition's data is stored in that directory.
For example, if the test table has the two partition columns date and city:
the partition date=20130201, city=bj corresponds to the HDFS subdirectory
/warehouse/test/date=20130201/city=bj
the partition date=20130202, city=sh corresponds to the HDFS subdirectory
/warehouse/test/date=20130202/city=sh
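A hedged sketch of how the test table above could be declared and how adding partitions creates those directories (the data column msisdn is assumed only for illustration):
create table test (msisdn string)
partitioned by (date string, city string)
row format delimited fields terminated by '\t';
alter table test add partition (date='20130201', city='bj');   -- creates /warehouse/test/date=20130201/city=bj
alter table test add partition (date='20130202', city='sh');   -- creates /warehouse/test/date=20130202/city=sh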
CREATE TABLE tmp_table                  # table name
(
  title         string,                 # field name and field type
  minimum_bid   double,
  quantity      bigint,
  have_invoice  bigint
) COMMENT 'comments: XXX'               # table comment
PARTITIONED BY (pt STRING)              # partition column (if your files are very large, a partitioned table lets you quickly filter out the data belonging to one partition)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'           # how fields are delimited
STORED AS SEQUENCEFILE;                 # how the data is stored; SEQUENCEFILE is a compressed file format that ships with Hadoop
Some related commands
SHOW TABLES;                  # list all tables
SHOW TABLES '*TMP*';          # fuzzy matching is supported
SHOW PARTITIONS tmp_table;    # list the table's partitions
DESCRIBE tmp_table;           # view the table structure
Shell of the partition table
Create a data file partition_table.dat
Create a table:
create table partition_table (rectime string, msisdn string) partitioned by (daytime string, city string) row format delimited fields terminated by '\t' stored as TEXTFILE;
Load data into a partition:
load data local inpath '/home/partition_table.dat' into table partition_table partition (daytime='2013-02-01', city='bj');
View data:
select * from partition_table;
select count(*) from partition_table;
Delete the table: drop table partition_table;
The data model of Hive-bucket table
Bucket tables hash the data and store them in different files.
Create the table:
create table bucket_table(id string) clustered by(id) into 4 buckets;
Load data:
set hive.enforce.bucketing = true;
insert into table bucket_table select name from stu;
insert overwrite table bucket_table select name from stu;
When data is loaded into a bucket table, the hash value of the bucketing column is taken modulo the number of buckets, and each row is placed in the corresponding file.
Sampling query:
select * from bucket_table tablesample(bucket 1 out of 4 on id);
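The sampling clause reads only some of the buckets instead of the whole table: with 4 buckets, "bucket 1 out of 4 on id" reads a single bucket (about a quarter of the rows). As a hedged sketch, widening the sample to half of the data would look like this:
select * from bucket_table tablesample(bucket 1 out of 2 on id);   -- reads 2 of the 4 buckets, roughly half the data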
Data Model of Hive-external Table
An external table points to data that already exists in HDFS, and it can also have Partitions.
Its metadata is organized in the same way as an internal table's, but the actual data is stored quite differently.
For an internal table, table creation and data loading are two steps of one process (both can be done in a single statement); during loading the actual data is moved into the data
warehouse directory, and later access to the data reads it directly from that directory. When the table is deleted, both its data and its metadata are deleted.
An external table involves only one process: loading the data and creating the table happen at the same time. The data is not moved into the data warehouse directory; only a link to the external data is established. When an external table is deleted, only that link is removed.
CREATE EXTERNAL TABLE page_view
(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination'
)
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION 'hdfs://centos:9000/user/data/staging/page_view';
Shell of the external table
Create a data file external_table.dat
Create a table:
hive> create external table external_table1 (key string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/home/external';
Create the directory /home/external in HDFS and put the data file there:
# hadoop fs -put /home/external_table.dat /home/external
Load data:
LOAD DATA INPATH '/home/external_table.dat' INTO TABLE external_table1;
View data:
select * from external_table1;
select count(*) from external_table1;
Delete the table:
drop table external_table1;
Import data
When data is loaded into a table, no transformation is applied to it; the Load operation simply copies or moves the data file to the location corresponding to the Hive table.
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2...)]
Import data from one Hive table into another existing Hive table:
INSERT OVERWRITE TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement FROM from_statement
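For example, a hedged sketch that overwrites one partition using the schema of the partition_table from the partition-table section (the source table src_table is assumed only for illustration):
INSERT OVERWRITE TABLE partition_table PARTITION (daytime='2013-02-01', city='bj')
SELECT rectime, msisdn FROM src_table;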
CTAS
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
(col_name data_type, ...)
AS SELECT...
Example: create table new_external_test as select * from external_table1;
Query
Query based on Partition
A general SELECT query performs a full table scan. For a partitioned table, however, the query can take advantage of partition pruning (input pruning), similar to a "partition index", and scan only the part of the table it actually needs. Hive's current implementation enables partition pruning only when the partition predicate appears in the WHERE clause closest to the FROM clause. For example, if the page_views table is partitioned by the date column, the following statement reads only the data in the partitions that fall within the given date range.
SELECT page_views.* FROM page_views WHERE page_views.date >= '2013-03-01' AND page_views.date <= '2013-03-31';
Prepare a table for the join examples below:
hive> create table acinfo (name string, acip string) row format delimited fields terminated by '\t' stored as TEXTFILE;
hive> load data local inpath '/home/acinfo/ac.dat' into table acinfo;
Inner join
select b.name, a.* from dim_ac a join acinfo b on (a.ac = b.acip) limit 10;
Left outer join
select b.name, a.* from dim_ac a left outer join acinfo b on a.ac = b.acip limit 10;
Java client
Start the Hive remote service: # hive --service hiveserver > /dev/null 2>/dev/null &
Java client code:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
Connection con = DriverManager.getConnection("jdbc:hive://192.168.1.102:10000/wlan_dw", "", "");
Statement stmt = con.createStatement();
String querySQL = "SELECT * FROM wlan_dw.dim_m order by flux desc limit 10";
ResultSet res = stmt.executeQuery(querySQL);
while (res.next()) {
    System.out.println(res.getString(1) + "\t" + res.getLong(2) + "\t" + res.getLong(3) + "\t" + res.getLong(4) + "\t" + res.getLong(5));
}
UDF
1. A UDF can be used directly in a SELECT statement to format the query result before the content is output.
2. When writing a UDF, pay attention to the following points:
a) A custom UDF needs to extend org.apache.hadoop.hive.ql.exec.UDF.
b) The evaluate method needs to be implemented; evaluate supports overloading.
3. Steps
a) Package the program and copy the jar to the target machine.
b) Enter the Hive client and add the jar: hive> add jar /run/jar/udf_test.jar;
c) Create a temporary function: hive> CREATE TEMPORARY FUNCTION add_example AS 'hive.udf.Add';
d) Query with HQL statements:
SELECT add_example(8, 9) FROM scores;
SELECT add_example(scores.math, scores.art) FROM scores;
SELECT add_example(6, 7, 8, 6.8) FROM scores;
e) Drop the temporary function: hive> DROP TEMPORARY FUNCTION add_example;
Note: a UDF can only implement one-in, one-out operations. To implement a many-in, one-out (aggregate) operation, you need to implement a UDAF instead.
This is the end of the article on "Why choose Hive". I hope the content above has been helpful and that you have learned something from it; if you found the article useful, please share it so more people can see it.