What does hive do?

This article explains what Hive does. It is easy to follow and well organized, covering Hive's architecture, how SQL executes inside it, its user interfaces, compound data types, DDL, internal and external tables, partitions, data loading, and user-defined functions. Hopefully it resolves your doubts.

What is Hive?

Hive is a data-warehouse framework built on top of Hadoop, used mainly for analysis and statistics. It is also a SQL parsing engine: it converts SQL statements into MapReduce (MR) jobs and runs them to produce results. Its syntax largely follows SQL-99, so it is friendly to anyone with database experience; instead of writing tedious MR programs, they can get the data they need with SQL statements. A table in Hive is a directory on HDFS, with the table name corresponding to the directory name and the data stored as files inside that directory. Because Hive stores its data on HDFS, it depends on Hadoop and cannot run without it. Hive does not impose a specific storage format, but it does have a storage structure: databases, tables, views, and so on.

The system architecture of Hive

First of all, three user interfaces are provided, the CLI, JDBC/ODBC, and the WebUI, so users can operate Hive in any of these three forms. When a user submits a SQL statement it must be executed, so Hive includes a compiler, an interpreter, an optimizer, and an executor. From these an execution plan, an MR job, is generated, which talks to the ResourceManager and the NameNode to run the program. Hive's MetaStore, the metadata storage, relies on a relational database. The default is the embedded Derby database, but Derby supports only one session at a time, so you can change the configuration to store the metadata in MySQL instead.
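For reference, pointing the MetaStore at MySQL comes down to a few hive-site.xml entries. A minimal sketch (the host, database name, and credentials below are placeholders, and the MySQL JDBC driver jar must be on Hive's classpath):

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_meta?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>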

The execution process of SQL in Hive

First, the user submits a SQL statement to Hive through one of the interfaces. The driver receives the statement and hands it to the compiler, which parses it, looks up the metadata involved in the MetaStore, generates an execution plan, and returns the plan to the driver. The driver then gives the plan to the execution engine. If the statement is DDL-like, the engine writes metadata to the MetaStore. If the statement involves computation, such as count(*), it is handed to the ResourceManager to run as an MR job; if it is a simple field query or the like, the engine only needs to go to the NameNode to fetch the required data. When the job finishes or the data is fetched, the results are returned to the execution engine, which passes them back to the user interface.

The CLI, the first of Hive's three user interfaces

This is the command-line mode: run hive directly, or run hive --service cli. That opens the Hive command line, where you can type show tables; or a create table statement to verify that everything works. Note that, as mentioned above, a Hive table is really an HDFS directory, and the path to that directory is set by the configuration entry hive.metastore.warehouse.dir. When you start hive you can add parameters for extra functionality. For example, hive -d name=xiaoming defines a variable name with the value xiaoming, which you can reference inside Hive as ${name}. If you just want to execute one statement without entering the Hive command line, use hive -e "show tables". If you also want to write the result to a file, use hive -e "show tables" > a.txt. Examples are as follows.

$> hive -e "" executes the HQL statement inside the double quotes without entering the Hive interface, saving the time of opening Hive; after execution you are still at the Linux prompt.

$> hive -e "" > aaa is the same as above, but the result of the execution is written into the file aaa.

$> hive -S -e "" > aaa is the same as above with silent mode added: fewer log messages, such as execution times, which saves some time if you run a large number of statements.

$> hive -f file, where file is the path of a file containing HQL statements, executes the statements in that file.

$> hive -i /home/my/hive-init.sql executes the file and then, instead of returning to the Linux prompt, leaves you inside the Hive interface.

hive> source file is similar to -f, but is executed from inside the Hive interface.

You can also run hive --hiveconf followed by the configuration entry you want to set, to apply that setting for the current session, as in

$> hive --hiveconf hive.cli.print.current.db=true

From the Hive command line, the set command does the same job as --hiveconf, e.g. set hive.cli.print.current.db=true;

If you find this configuration tiresome and want the settings applied automatically at startup, create a file named .hiverc (don't forget the leading dot) in your home directory, the directory cd takes you to; its contents are the set commands.
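For example, a minimal .hiverc might contain (both entries are standard Hive settings; which ones you include is up to you):

set hive.cli.print.current.db=true;
set hive.cli.print.header=true;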

The JDBC user interface of Hive

When using JDBC, first check whether your Hive is version 1 or 2; you can tell whether it ships hiveserver or hiveserver2 by looking in Hive's bin directory.

Before running the code, you need to start the service from the command line with hive --service hiveserver2. The code is as follows.

package com.shixin.hivejdbc;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class Hivetest {
    public static Connection getConnection() {
        Connection con = null;
        try {
            // For version 1 (hiveserver), use org.apache.hadoop.hive.jdbc.HiveDriver
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Note the "hive2" scheme here for hiveserver2
            con = DriverManager.getConnection(
                    "jdbc:hive2://115.28.138.100:10000/default", "hadoop", "hadoop");
        } catch (Exception e) {
            e.printStackTrace();
        }
        return con;
    }

    public static void main(String[] args) {
        Connection con = getConnection();
        Statement sta = null;
        if (con != null) {
            try {
                sta = con.createStatement();
                ResultSet rs = sta.executeQuery("show tables");
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }
}
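If you are on version 2, the beeline client that ships with Hive gives a quick way to check that hiveserver2 is reachable before debugging the Java side (same host, port, and user as in the code above):

$> beeline -u jdbc:hive2://115.28.138.100:10000/default -n hadoop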

Compound data types of Hive

The basic data types in Hive are roughly the same as in MySQL; the main addition is three compound data types: Array, Struct, and Map. The following example shows how to use them.

Build a table

create table t8(
id int,
info struct<name:string,age:int>,
subject array<int>,
scores map<string,float>)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

(fields terminated by sets the field delimiter; collection items terminated by sets the element delimiter inside the compound types; map keys terminated by sets the key-value separator for map entries.)

Import data

load data local inpath '/home/hadoop/a.txt' into table t8;

Where the data in a.txt (fields separated by tabs) is

1 sx,21 23,45,32,45 english:94.2,math:99

2 shuainan,22 43,21,32 english:95.2,math:96

View data

select * from t8;

t8.id t8.info t8.subject t8.scores

1 {"name": "sx", "age": 21} [23 english 45, 32 english 45] {"english": 94.2, "math": 99.0}

2 {"name": "shuainan", "age": 22} [43 english 21 english 32] {"english": 95.2, "math": 96.0}

To query an individual value of one of these types, access a struct field by name, an array element by zero-based index, and a map value by key:

select info from t8;

select info.sx from t8; (this fails: sx is a value, not a field of the struct)

select info.name from t8;

select subject from t8;

select subject[2] from t8;

select subject[3] from t8; (returns NULL for rows whose array has fewer than four elements)

select scores from t8;

select scores['english'] from t8;
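Hive's built-in collection functions also work on these types. For example (size and map_keys are standard built-ins; size returns the number of elements, map_keys the array of keys):

select size(subject), map_keys(scores) from t8;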

7. Some DDL statements in Hive

Database definition: create database ...; (note that you are in the default database when Hive starts)

Table definition: create table ...; besides this basic statement there is also create table t2 like t1;, which copies only the table structure of t1, not the data.

Column definition, which is really column modification:

hive> alter table t3 change column old_name new_name string comment '...' after column2;

alter table t3 add columns (gender int);

There is no statement to delete individual columns, but you can try replace columns. It is not recommended, because it amounts to redefining the whole table structure: if the table has two fields and you replace columns with only one, you are left with only that one field afterwards.
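A sketch of the behavior just described (after this statement t3's schema contains only the id column):

hive> alter table t3 replace columns (id int);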

8. External table and internal table

The table definitions above all create internal (managed) tables. So how is an external table defined?

create external table t11(id int, name string) row format delimited fields terminated by '\t';

Yes, just add external. Check the corresponding HDFS directory and a t11 directory has been added. Then load the data:

load data local inpath '/home/hadoop/aaa' into table t11;

Using select you can see the data, and looking at the t11 directory on HDFS there is now an aaa file inside. So far it all looks the same as an internal table. So what is the difference? Run drop table on both an internal and an external table: the internal table's directory on HDFS disappears, while the external table's directory and data remain. So deleting an external table removes only the metadata in the MetaStore, not the table data on HDFS (that is, under the warehouse directory).
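You can confirm which kind of table you are looking at with describe formatted; its Table Type line reads MANAGED_TABLE for an internal table and EXTERNAL_TABLE for an external one:

hive> describe formatted t11;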

The table definition can also look like this:

hive> create external table exter_table(
    > id int,
    > name string,
    > age int,
    > tel string)
    > location '/home/wyp/external';

That is, you add a location clause and specify the table's storage path. With or without external (i.e. internal or external table), Hive will create the specified folder at the specified path instead of using the table name as a directory under the warehouse as before. If you then import data with the load command, the data files are stored under the path you specified.

9. Partition table

A partition table classifies and groups the data, a little like group by, with each partition represented as a separate subdirectory. Its purpose is query optimization: if a query filters on the partition field, Hive only scans the matching directories, which speeds up the query.

Create a partition table

create table t12(id int, name string) partitioned by (date string) row format delimited fields terminated by '\t';

Add a partition

alter table t12 add partition (date='20160103');

alter table t12 add partition (date='20160102');

At this point, the date=20160102 and date=20160103 folders are added under the t12 directory.

Load data

load data local inpath '/home/hadoop/aaa' into table t12 partition (date='20160103');
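To check the result, show partitions lists the table's partitions, and a query that filters on the partition field only scans the matching directory:

hive> show partitions t12;
hive> select * from t12 where date='20160103';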

10. Views and indexes

Views and indexes can also be created in Hive, much as in MySQL. Create a view:

create view v1 as select t1.name from t1;

Create an index

create index t1_index on table t1(name) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild in table t1_index_table; (as specifies the index handler)

Rebuild the index

alter index t1_index on t1 rebuild; (the index must be rebuilt whenever the data changes)

Show index

show formatted index on t1;

Delete index

drop index if exists t1_index on t1;

11. The way data is loaded

Load data from a file

hive> LOAD DATA [LOCAL] INPATH '...' [OVERWRITE] INTO TABLE t2 [PARTITION (province='beijing')]; (without LOCAL the file is taken from HDFS; OVERWRITE means overwrite the existing data; add the PARTITION clause if the target is a partition table)

Load data by querying another table

hive> INSERT OVERWRITE TABLE t2 PARTITION (province='beijing') SELECT * FROM xxx WHERE xxx;

(similarly, if the target is a partition table you add the partition information; Hive stores the rows returned by the SELECT as files in the directory corresponding to t2, and the columns returned by your SELECT must match t2's columns)

hive> FROM t4
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...;

(a multi-insert version of the statement above: the source table t4 is scanned once and feeds several inserts)

There is also dynamic partition loading.

hive> INSERT OVERWRITE TABLE t3 PARTITION (province='bj', city)
    > SELECT t.province, t.city FROM temp t WHERE t.province='bj';

(the city partition value is taken from the SELECT; after execution, a city=... directory is created under province=bj for each distinct city value. Because province is given a static value, at least one static partition exists, so this statement is allowed even in the default strict mode.)

If your statement looks like this instead,

hive> INSERT OVERWRITE TABLE t3 PARTITION (province, city)

then dynamic partition support must be enabled first:

hive> set hive.exec.dynamic.partition=true;

hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> set hive.exec.max.dynamic.partitions.pernode=1000;
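With a fully dynamic spec, the dynamic partition columns must come last in the SELECT list, in the same order as the PARTITION clause. A sketch, assuming the source table temp has id and name columns:

hive> INSERT OVERWRITE TABLE t3 PARTITION (province, city)
    > SELECT t.id, t.name, t.province, t.city FROM temp t;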

12. Functions in Hive

Hive has built-in functions like MySQL's, such as max() and min(). You can also define your own functions: write the required function in Java, package it into a jar, then follow these steps.

a) Build the jar and copy it to the target machine.

b) Enter the Hive client and add the jar: hive> add jar /run/jar/udf_test.jar;

c) Create a temporary function: hive> create temporary function add_example as 'hive.udf.Add';

(add_example is the name you want the function to have; the quoted string is the fully qualified class name, e.g. com.shixin.hive.MyUp)

The UDF code is as follows (a small to-uppercase function):

package com.shixin.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUp extends UDF {
    public String evaluate(String word) {
        if (word == null || word.equals("")) {
            return null;
        } else {
            return word.toUpperCase();
        }
    }
}
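Registering and calling the function from the CLI might look like this (the jar path follows the earlier example, and the function name myup is arbitrary):

hive> add jar /run/jar/udf_test.jar;
hive> create temporary function myup as 'com.shixin.hive.MyUp';
hive> select myup(name) from t1;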

That is all of "What does hive do?". Thank you for reading! Hopefully the content shared here has helped you understand what Hive is and what it does.
