01. What is Hive?
1. Introduction to Hive
Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query capabilities.
Hive is a SQL parsing engine: it translates SQL statements into MapReduce jobs and runs them on Hadoop.
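For a quick illustration (not from the original walkthrough, and assuming a table like the t_order table created later in this article), a single SQL-like statement such as the following is compiled by Hive into a MapReduce job: the map phase emits (rongliang, 1) pairs and the reduce phase sums them.
select rongliang, count(*) as cnt
from t_order
group by rongliang;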
2. Hive architecture
User interfaces, including the CLI, JDBC/ODBC, and the WebUI
Metadata storage, usually in a relational database such as MySQL or Derby
Interpreter, compiler, optimizer, executor
Hadoop: use HDFS for storage and MapReduce for calculation
PS: Hive's metadata is not stored on HDFS but in a database (the metastore); currently only MySQL and Derby are supported. Hive metadata includes table names, the tables' columns and partitions and their properties, table attributes (whether a table is external, etc.), the directory where each table's data is located, and so on.
Metadata is data that describes the data; Hive's actual data is stored in Hadoop HDFS.
The data files themselves remain plain text; the metadata simply organizes them under a planned directory layout.
3. The relationship between Hive and Hadoop
Hive uses HDFS to store data and MapReduce to query data.
4. Hive installation and deployment
Hive is just a tool and does not require cluster configuration.
export HIVE_HOME=/usr/local/hive-2.0.1
export PATH=$PATH:$HIVE_HOME/bin
Configure MySQL as the metastore. If it is not configured, the embedded Derby database is used by default, which is inconvenient: a metastore_db directory is created in whatever directory the ./hive command is run from.
MySQL can be installed on one of the nodes.
5. Hive's thrift service
HiveServer2 can be installed on one node and published as a service; other nodes then connect to it with Beeline.
Startup mode (if on master):
Start in the foreground: bin/hiveserver2
Start in the background: nohup bin/hiveserver2 1>/var/log/hiveserver.log 2>/var/log/hiveserver.err &
Connection method:
Run hive/bin/beeline to enter the Beeline command-line interface
Enter the command to connect to hiveserver2:
beeline> !connect jdbc:hive2://master:10000
beeline> !connect jdbc:hive2://localhost:10000
(master is the hostname of the machine where hiveserver2 is running; the port defaults to 10000)
02. Basic operations of Hive
1. Create a database
hive> create database tabletest;
When you create a new database, a tabletest.db folder is created under /user/hive/warehouse/ in HDFS.
If you do not create a database and do not issue a use statement, tables go into the system default database; you can also switch to it explicitly with hive> use default;. Tables in the default database are created directly under /user/hive/warehouse/.
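A small sketch of how the choice of database affects the warehouse layout (assuming the tabletest database created above):
hive> show databases;   -- lists default, tabletest, ...
hive> use default;      -- tables created now go directly under /user/hive/warehouse/
hive> use tabletest;    -- tables created now go under /user/hive/warehouse/tabletest.db/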
2. Create a table
Syntax:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC | DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
Example:
create table t_order(id int, name string, rongliang string, price double)
row format delimited
fields terminated by '\t';
This creates the t_order table; in the MySQL metastore, the TBLS table gains a row for the table and the column information is recorded as well.
At the same time, a t_order folder is added under the tabletest.db folder in HDFS. All table data (except for external tables) is stored in this directory.
3. Import data
You can use HDFS to upload files directly to the t_order folder, or you can use Hive's load command.
load data local inpath '/home/hadoop/ip.txt' [OVERWRITE] into table tab_ext;
The effect is the same as uploading a local Linux file to HDFS. Note, however, that if the path after inpath is an HDFS path (no local keyword), the source file is moved (cut) into the target folder rather than copied, which can be an unpleasant surprise.
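A hedged sketch of the two forms of load (the HDFS path /input/ip.txt is hypothetical):
load data local inpath '/home/hadoop/ip.txt' into table tab_ext;   -- local file: copied into the table directory
load data inpath '/input/ip.txt' into table tab_ext;               -- HDFS file: moved (cut) into the table directory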
4. External table
The EXTERNAL keyword allows users to create an external table and specify a path to the actual data (LOCATION) while creating the table. When creating an internal table, Hive moves the data to the path pointed to by the data warehouse; if an external table is created, only the path where the data is located is recorded, and no change is made to the location of the data.
To avoid the problem of losing the source file, you can create an external table, and the data source can be anywhere.
CREATE EXTERNAL TABLE tab_ip_ext (id int, name string, ip STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/external/hive';
When you create the table, you specify the path of the HDFS files, so the source files can simply be uploaded into the /external/hive/ folder.
When the external table is deleted, only the metadata information is deleted, and the data stored in the HDFS is not deleted.
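For illustration only, a sketch of the difference when dropping the two kinds of tables (table names taken from this article):
drop table t_order;       -- internal table: both the metadata and the data under the warehouse directory are deleted
drop table tab_ip_ext;    -- external table: only the metadata is deleted; the files under /external/hive remain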
5. Partition tables
Purpose: when the data files are very large, a partition table lets queries quickly filter down to the data belonging to a given value of the partition field.
Suppose t_order has two partitions, part1 and part2.
In practice this just creates two more folders under the t_order folder in HDFS, one per partition.
create table t_order(id int, name string, rongliang string, price double)
partitioned by (part_flag string)
row format delimited
fields terminated by '\t';
Insert data:
load data local inpath '/home/hadoop/ip.txt' overwrite into table t_order partition (part_flag='part1');
The ip.txt file ends up in the .../t_order/part_flag=part1/ folder.
View the data in the partition table:
select * from t_order where part_flag='part1';
When querying, the partition field is also returned, but it is not a real column in the data files; it is a pseudo-column.
A partition in Hive is simply one more level of directories; the benefits are easier per-partition statistics, higher efficiency, and a smaller dataset to scan.
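A hedged sketch of working with t_order's partitions (the part2 partition here is purely illustrative):
alter table t_order add if not exists partition (part_flag='part2');   -- creates .../t_order/part_flag=part2/ in HDFS
show partitions t_order;
select count(*) from t_order where part_flag='part1';                  -- only the part1 directory is scanned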
Related commands:
SHOW TABLES; # View all tables
SHOW TABLES 'TMP'; # supports fuzzy queries
SHOW PARTITIONS TMP_TABLE; # check what partitions the table has
DESCRIBE TMP_TABLE; # View table structure
6. Bucketed tables
A bucket in Hive corresponds to a partition in MapReduce, whereas a partition in Hive simply splits the data into different folders.
1. Create a bucketed table
create table stu_buck(Sno int, Sname string, Sex string, Sage int, Sdept string)
clustered by (Sno)
sorted by (Sno DESC)
into 4 buckets
row format delimited
fields terminated by ',';
Meaning: the data is bucketed by the Sno field, each bucket is locally sorted by Sno, and there are 4 buckets. Creating the table does not reorganize existing data; it is the data being inserted that must follow the bucketing rule.
2. Insert data into the bucketed table
In general, load data is not suitable for a bucketed table, because loaded data is not automatically split into four files according to the bucketing rule. It is therefore common to insert into a bucketed table from a select query.
Set the variables: enable bucketing and set the number of reducers to the number of buckets.
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table stu_buck
select * from student cluster by (Sno);
insert overwrite table stu_buck
select * from student distribute by (Sno) sort by (Sno asc);
Here distribute by (Sno) sort by (Sno asc) can replace cluster by (Sno); the two are equivalent, since cluster by = distribute (bucket) + sort.
Distribute first, then sort locally. The difference is that distribute by is more flexible: it can distribute by one field and sort by another.
The subquery's output is split across four reducers, producing four files that feed the outer insert.
3. How bucketed tables work and what they are for
Principle:
Hive organizes buckets on a column: it hashes the column value and takes the remainder modulo the number of buckets to determine which bucket a record is stored in. (The principle is the same as MapReduce's getPartition method.)
Function:
(1) The biggest benefit is improving the efficiency of join operations.
The precondition is that both tables are bucketed on the join column and their bucket counts are equal or multiples of each other.
Think about this question:
select a.id, a.name, b.addr from a join b on a.id = b.id;
If tables a and b are already bucketed tables, with id as the bucketing field,
does this join still need to compute a Cartesian product over the full tables?
For a JOIN where both tables are bucketed on the join column, buckets holding the same column values can be joined directly against each other, which greatly reduces the amount of data involved in the JOIN.
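As a sketch under the assumptions above (a and b both bucketed on id, with equal or multiple bucket counts), a bucket map join can be requested so that only the corresponding buckets are joined:
set hive.optimize.bucketmapjoin = true;
select a.id, a.name, b.addr
from a join b on a.id = b.id;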
(2) Sampling is more efficient. When working with large datasets, being able to develop and refine a query against a small slice of the data is very convenient.
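For example, on the stu_buck table created earlier, a single bucket out of the four can be sampled with Hive's TABLESAMPLE clause (a minimal sketch):
select * from stu_buck tablesample(bucket 1 out of 4 on Sno);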
7. Insert statement
Hive's row-by-row insert is very slow.
But it can do batch inserts; in effect it just writes files into the table's folder.
create table tab_ip_like like tab_ip;
insert overwrite table tab_ip_like
select * from tab_ip;
This writes the query results as files under tab_ip_like.
8. Several ways to save the results of a select query
1. Save the query results into a new Hive table
create table t_tmp
as
select * from tweep;
2. Save the query results into an existing Hive table
insert into table t_tmp
select * from tweep;
3. Save the query results to a specified directory (local or HDFS)
insert overwrite local directory '/home/hadoop/test'
select * from tweep;
Write to HDFS:
insert overwrite directory '/aaa/test'
select * from tweep;
9. View, delete
Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]]
[LIMIT number]
Note:
CLUSTER BY field: distributes rows to reducers by this field; note that you need to set the number of reducers accordingly.
Order by sorts the input globally, so there is only one reducer, which results in a longer computing time when the input is large.
Sort by is not a global sort, it sorts the data before it enters the reducer. Therefore, if you sort with sort by and set mapred.reduce.tasks > 1, sort by only guarantees that the output of each reducer is ordered, not globally.
Distribute by (field) divides the data into different reducer according to the specified field, and the distribution algorithm is a hash hash.
Cluster by (field) sorts the field in addition to the function of Distribute by.
So, if the bucket and sort fields are the same, then cluster by = distribute by + sort by
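A small illustrative sketch of this equivalence, reusing the t_order table from earlier (the reducer count is an arbitrary example):
set mapreduce.job.reduces=3;
select * from t_order cluster by id;
-- equivalent to:
select * from t_order distribute by id sort by id;
-- distribute by is more flexible: distribute by one field, sort by another
select * from t_order distribute by id sort by price desc;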
select * from inner_table;
select count(*) from inner_table;
When you delete a table, both the metadata and the data are deleted.
drop table inner_table;
10. Joins in Hive
LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN, INNER JOIN, LEFT SEMI JOIN
Prepare data:
a.txt:
1,a
2,b
3,c
4,d
7,y
8,u
b.txt:
2,bb
3,cc
7,yy
9,pp
Build a table:
create table a(id int, name string) row format delimited fields terminated by ',';
create table b(id int, name string) row format delimited fields terminated by ',';
Import data:
load data local inpath '/home/hadoop/a.txt' into table a;
load data local inpath '/home/hadoop/b.txt' into table b;
1. inner join
select * from a inner join b on a.id=b.id;
+-------+---------+-------+---------+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
+-------+---------+-------+---------+
This gives the intersection of the two tables.
2. left join
select * from a left join b on a.id=b.id;
+-------+---------+-------+---------+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
+-------+---------+-------+---------+
All rows from the left table are kept; rows with no match show NULL on the right.
3. right join
select * from a right join b on a.id=b.id;
4. full outer join
select * from a full outer join b on a.id=b.id;
+-------+---------+-------+---------+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
| NULL  | NULL    | 9     | pp      |
+-------+---------+-------+---------+
Rows from both sides are shown, matched where possible.
5. left semi join
select * from a left semi join b on a.id = b.id;
+-------+---------+
| a.id  | a.name  |
+-------+---------+
| 2     | b       |
| 3     | c       |
| 7     | y       |
+-------+---------+
It is more efficient and returns only the left side, i.e. the rows of a that have a match in b.
11. Create a temporary table
Intermediate results can be stored.
CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip, country new_country
FROM tab_ip_ext
SORT BY new_id;
03. Custom functions and Transform of Hive
When the built-in functions provided by Hive cannot meet your business processing needs, you can consider using user-defined functions (UDF:user-defined function) at this time.
1. Custom function category
UDF acts on a single row of data, producing a row of data as output. (mathematical function, string function)
UDAF (user-defined aggregation function): receives multiple input data rows and produces one output data row. (count,max)
2.UDF development example
1. First develop a Java class that inherits UDF and overloads the evaluate method
package cn.itcast.bigdata.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {   // must be public
    // Hive calls evaluate() once per input row
    public String evaluate(String field) {
        String result = field.toLowerCase();
        return result;
    }
}
2. Upload it to the server in a jar package
3. Add the jar package to hive's classpath
hive> add JAR /home/hadoop/udf.jar;
4. Create a temporary function to associate with the developed Java class
hive> create temporary function tolowercase as 'cn.itcast.bigdata.udf.ToLowerCase';
5. You can then use the custom function tolowercase in HQL:
select id, tolowercase(name) from tweep;
3. Transform
The TRANSFORM keyword of Hive provides the ability to call self-written scripts in SQL
It is suitable for situations where you do not want to write UDF and implement functions that are not available in Hive.
1. Load the rating.json file into rat_json, an original table of hive.
/ / {"movie": "1721", "rate": "3", "timeStamp": "965440048", "uid": "5114"} create table rat_json (line string) row format delimited;load data local inpath'/ opt/rating.json' into table rat_json
2. Parse the json data into four fields and insert it into a new table t_rating (which has the four fields needed for the processed data).
create table t_rating(movieid string, rate int, timestring string, uid string)
row format delimited fields terminated by '\t';
insert overwrite table t_rating
select get_json_object(line,'$.movie') as movieid,
       get_json_object(line,'$.rate') as rate,
       get_json_object(line,'$.timeStamp') as timestring,
       get_json_object(line,'$.uid') as uid
from rat_json;
-- get_json_object is a built-in JSON function; the same thing could also be done with a custom UDF.
3. Use transform+Python to convert unixtime to weekday
(1) Edit a python script file first
Python Code:
vi weekday_mapper.py

import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])
(2) add the file to the classpath of hive:
hive> add FILE /home/hadoop/weekday_mapper.py;
hive> create TABLE u_data_new as
SELECT
  TRANSFORM (movieid, rate, timestring, uid)
  USING 'python weekday_mapper.py'
  AS (movieid, rate, weekday, uid)
FROM t_rating;

select distinct(weekday) from u_data_new limit 10;
The query's processed output is inserted into the new u_data_new table.