
Wednesday, 2019-2-20

Basic Syntax Summary of Hive (HQL)

-

Hive study 3: the three kinds of table-creation statements in Hive, explained in detail: https://blog.csdn.net/qq_36743482/article/details/78383964

There are three ways to create tables in Hive:

1. Direct creation

For example: create table table_name (col_name data_type);

2. Create-table-as-select (CTAS)

For example: build a table from an AS select statement; the subquery's result, data included, is stored in the new table. Generally used for intermediate tables.

3. LIKE creation

Creates a table with exactly the same structure, but no data. Commonly used for intermediate tables. All three styles are sketched below.

// for more information, see the link above.
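
A minimal sketch of the three styles side by side (hypothetical table names):

create table t1 (id int, name string);        -- 1. direct creation
create table t2 as select id, name from t1;   -- 2. CTAS: structure and data come from the query
create table t3 like t1;                      -- 3. LIKE: same structure as t1, no data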

Hive file formats (the four STORED AS table types): https://blog.csdn.net/hereiskxm/article/details/42171325

Hive file storage formats include the following:

1. TEXTFILE

2. SEQUENCEFILE -- sequence file

3. RCFILE

4. ORCFILE (introduced in 0.11)

Summary

TEXTFILE is the default format, used when no format is specified at table creation. When data is imported, the data file is copied directly to hdfs without processing.

Tables in SEQUENCEFILE, RCFILE, or ORCFILE format cannot load data directly from a local file. The data must first be loaded into a textfile-format table and then copied into the SequenceFile/RCFile/ORCFile table with insert ... select, as sketched below.

// for more information, see the link above.
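
A minimal sketch of that workflow, assuming hypothetical tables t_stage (textfile) and t_orc (orcfile) with matching columns:

create table t_stage (id int, name string) row format delimited fields terminated by ',' stored as textfile;
create table t_orc (id int, name string) stored as orc;
load data local inpath '/tmp/data.csv' into table t_stage;  -- local files can only be loaded into the textfile table
insert overwrite table t_orc select * from t_stage;         -- then copied into the orc table via insert ... select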

Summary

1. TEXTFILE: the default format, with no data compression, so disk overhead and data-parsing overhead are high. It can be combined with Gzip or Bzip2 (hive detects these automatically and decompresses during query execution), but hive cannot split the compressed data, so the data cannot be processed in parallel.

2. SEQUENCEFILE: a binary file format provided by the Hadoop API; easy to use, splittable, and compressible. SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD compression yields a low compression ratio, so BLOCK compression is generally recommended; the relevant settings are sketched below.
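
A hedged sketch of enabling BLOCK compression for SequenceFile output, using classic Hadoop property names (table names are hypothetical; verify the properties against your version):

set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;  -- NONE | RECORD | BLOCK
insert overwrite table my_seq_table select * from my_src_table;  -- written as BLOCK-compressed sequencefile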

3. RCFILE: a storage format that combines row and column storage. First, it splits the data into row groups, guaranteeing that a whole record stays on one block and avoiding reads across multiple blocks for a single record. Second, column storage within each block benefits compression and fast column access.

4. ORCFILE (Optimized Row Columnar format)

Summary:

Compared with TEXTFILE and SEQUENCEFILE, RCFILE, being column-stored, costs more to load data, but offers a better compression ratio and better query response. Since a data warehouse's workload is write-once, read-many, RCFILE has clear advantages over the other two formats overall.

set hive.cli.print.header=true;  -- print column headers in the hive cli

CREATE TABLE page_view (viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY (dt STRING, country STRING)  -- partition columns
ROW FORMAT DELIMITED                        -- one line in the file is one record in the table
FIELDS TERMINATED BY '\001'                 -- '\001' (Ctrl-A) is the field delimiter
STORED AS SEQUENCEFILE;                     -- stored as a sequence file

-- sequencefile
create table tab_ip_seq (id int, name string, ip string, country string)
row format delimited
fields terminated by ','
stored as sequencefile;

-- external table
CREATE EXTERNAL TABLE tab_ip_ext (id int, name string, ip STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/external/user';

-- load local data into a hive table (essentially uploads the file into hive's directory on hdfs)
load data local inpath '/home/hadoop/ip.txt' into table tab_ext;

-- load data from hdfs into a hive table (essentially moves the file into a hive-managed directory)
load data inpath 'hdfs://ns1/aa/bb/data.log' into table tab_user;

-- bulk-insert data via a select statement
insert overwrite table tab_ip_seq select * from tab_ext;

-- create & load
create table tab_ip (id int, name string, ip string, country string)
row format delimited
fields terminated by ','
stored as textfile;

-- CTAS: build the table structure from a select statement
CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip, country new_country FROM tab_ip_ext
SORT BY new_id;  -- sorted by new_id

-- CLUSTER: bucketed table
create table tab_ip_cluster (id int, name string, ip string, country string)
clustered by (id) into 3 buckets;  -- bucketed into 3 buckets by id

load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_cluster;
set hive.enforce.bucketing=true;  -- make hive enforce bucketing on insert
insert into table tab_ip_cluster select * from tab_ip;

select * from tab_ip_cluster tablesample (bucket 2 out of 3 on id);

-- PARTITION: partitioned table
create table tab_ip_part (id int, name string, ip string, country string)
partitioned by (year string)
row format delimited
fields terminated by ',';

load data local inpath '/home/hadoop/data.log' overwrite into table tab_ip_part
partition (year='1990');

load data local inpath '/home/hadoop/data2.log' overwrite into table tab_ip_part
partition (year='2000');

The second statement loads /home/hadoop/data2.log into the year='2000' partition of tab_ip_part.

select * from tab_ip_part;
select * from tab_ip_part where part_flag='part2';
select count(*) from tab_ip_part where part_flag='part2';
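
Note that tab_ip_part as defined above is partitioned by year, so a filter on that column lets hive prune partitions (a minimal sketch):

select count(*) from tab_ip_part where year='2000';  -- only the year=2000 partition is scanned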

-- alter commands
alter table tab_ip change id id_alter string;
ALTER TABLE tab_cts ADD PARTITION (partCol = 'dt') location '/external/hive/dt';
show partitions tab_ip_part;  -- list partitions

Detailed explanation of Hive bucketed tables, with creation examples: https://www.cnblogs.com/kouryoushine/p/7809299.html

The difference between Hive partitioned tables and bucketed tables: https://blog.csdn.net/jenrey/article/details/80588493

Hive basics: partitions, buckets, and Sort Merge Bucket Join: https://blog.csdn.net/wisgood/article/details/17186107

Partitioning and bucketing are advanced hive features that reward in-depth study (see xinghuan).
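
A hedged sketch of the settings commonly cited for sort-merge-bucket (SMB) joins over bucketed, sorted tables; property names vary across hive versions, so verify before relying on them:

set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;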

-- insert from select: bulk-insert data into another table via a select statement
create table tab_ip_like like tab_ip;
insert overwrite table tab_ip_like select * from tab_ip;

-- write to hdfs: write query results out to files
insert overwrite local directory '/home/hadoop/hivetemp/test.txt' select * from tab_ip_part where part_flag='part1';
insert overwrite directory '/hiveout.txt' select * from tab_ip_part where part_flag='part1';

-- cli shell: run hive hql statements from the shell
hive -S -e 'select country, count(*) from tab_ext' > /home/hadoop/hivetemp/e.txt

select * from tab_ext sort by id desc limit 5;
select a.ip, b.book from tab_ext a join tab_ip_book b on (a.name = b.name);

Using Hive complex types (array, map, struct): https://blog.csdn.net/u010670689/article/details/72885944

-- array
create table tab_array (a array<int>, b array<string>)  -- element types reconstructed; the angle brackets were lost in the original
row format delimited
fields terminated by '\t'
collection items terminated by ',';

select a[0] from tab_array;
select * from tab_array where array_contains(b, 'word');  -- literal reconstructed from garbled text

insert into table tab_array select array(0), array(name, ip) from tab_ext t;

Using arrays in Hive: https://blog.csdn.net/zhao897426182/article/details/78347960

-- map
create table tab_map (name string, info map<string,string>)  -- value type reconstructed; the angle brackets were lost in the original
row format delimited
fields terminated by '\t'
collection items terminated by ','  -- delimiter between map entries
map keys terminated by ':';         -- delimiter between each key and its value

load data local inpath '/home/hadoop/hivetemp/tab_map.txt' overwrite into table tab_map;
insert into table tab_map select name, map('name', name, 'ip', ip) from tab_ext;

-- struct
create table tab_struct (name string, info struct<age:int,tel:string,addr:string>)  -- field list reconstructed from the insert below
row format delimited
fields terminated by '\t'
collection items terminated by ',';

load data local inpath '/home/hadoop/hivetemp/tab_st.txt' overwrite into table tab_struct;
insert into table tab_struct select name, named_struct('age', id, 'tel', name, 'addr', country) from tab_ext;

-- operations staff can skip UDFs; learn them if you are interested
select if(id=1, 'first', 'no-first'), name from tab_ext;  -- string literals reconstructed

hive> add jar /home/hadoop/myudf.jar;
hive> CREATE TEMPORARY FUNCTION fanyi AS 'cn.itcast.hive.Fanyi';
select id, name, ip, fanyi(country) from tab_ip_ext;

What is a Hive UDF? https://blog.csdn.net/yqlakers/article/details/70211522

First of all, what is a UDF? UDF stands for user-defined function. Why does it exist? Sometimes the query you want to write cannot be expressed easily with Hive's built-in functions. By writing a UDF, you can plug your own processing code into Hive and use it in queries, effectively defining custom functions in HQL (Hive SQL). UDFs must be written in Java, since Hive itself is written in Java. So if you want to master technologies around the Hadoop distributed framework, proficiency in Java is a basic skill!

Author: YQlakers

Source: CSDN

Original: https://blog.csdn.net/yqlakers/article/details/70211522

Copyright notice: this article is the original article of the blogger, please attach a link to the blog article to reprint it!

Advanced HQL queries // hive query syntax

The function and usage of order by, sort by, distribute by, and cluster by in hive: https://blog.csdn.net/jthink_/article/details/38903775

MapReduce script

Joins (join)

Inner join

Outer join

Semi join

Map join

Subquery (sub query)

View (view)

The order by clause provided by Hive produces totally ordered final output. But because Hive runs on Hadoop, producing such a result forces Hadoop to use a single reducer for all the processing, and the side effect is reduced efficiency.
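
For instance, a total ordering over the student table introduced below (a minimal sketch):

select classNo, stuNo, score from student order by score;  -- one reducer; the whole result is globally sorted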

If you don't need total order in the final result, you can sort with the sort by clause instead; that only guarantees that the output of each reducer is ordered. If you also want certain rows to be processed by the same reducer, use the distribute by clause. For example:

Table student (classNo, stuNo, score) contains the following data:

C01 N0101 82

C01 N0102 59

C02 N0201 81

C01 N0103 65

C03 N0302 92

C02 N0202 82

C02 N0203 79

C03 N0301 56

C03 N0306 72

We want each class's scores output in order from low to high. Execute the following statement:

select classNo, stuNo, score from student distribute by classNo sort by score;

The output is as follows:

C02 N0203 79

C02 N0201 81

C02 N0202 82

C03 N0301 56

C03 N0306 72

C03 N0302 92

C01 N0102 59

C01 N0103 65

C01 N0101 82

We can see that the scores within each class are ordered: records with the same classNo are distributed to the same reducer, and sort by guarantees that each reducer's output is ordered.

Note:

To see the effect of distribute by in the example above, you must first configure enough reducers. There are three distinct classNo values, so the number of reducers must be at least 3. If fewer reducers are configured, several different classNo values will be distributed to the same reducer, and you won't get the expected output. The setup command is as follows:

set mapred.reduce.tasks = 3;

MapReduce script

If we need to call an external script, such as Python, from a query statement, we can use the transform, map, and reduce clauses.

For example, suppose we want to filter out all failing student records and output only passing students' scores.

Create a new Python script file, score_pass.py, with the following contents:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    (classNo, stuNo, score) = line.strip().split('\t')
    if int(score) >= 60:
        print "%s\t%s\t%s" % (classNo, stuNo, score)

Execute the following statement

add file /home/user/score_pass.py;
select transform (classNo, stuNo, score) using 'score_pass.py' as classNo, stuNo, score from student;

The output is as follows:

C01 N0101 82

C02 N0201 81

C01 N0103 65

C03 N0302 92

C02 N0202 82

C02 N0203 79

C03 N0306 72

Note:

1) In the Python script above, the input delimiter can only be a tab (\t), and the output delimiter must also be a tab. This is decided by hive itself and cannot be changed; do not try other delimiters, or an error will be reported. You also need to call strip to remove the newline at the end of each line (or simply use line.split() with no arguments instead).

2) before using the script, register the script file with the add file statement so that hive can distribute it to the Hadoop cluster.

3) transform passes the data to the Python script, and the as clause names the output columns.

Joins (join)

Programming Hadoop MapReduce directly is time-consuming; Hive greatly simplifies such operations.

Inner join

Similar to an inner join in SQL. Execute the following statement to query each student's number and teacher's name:

select a.stuNo, b.teacherName from student a join teacher b on a.classNo = b.classNo;

The output is as follows:

N0203 Sun

N0202 Sun

N0201 Sun

N0306 Wang

N0301 Wang

N0302 Wang

N0103 Zhang

N0102 Zhang

N0101 Zhang

Note:

Please refer to the previous article for the contents of the data file.

Do not use syntax like select xx from aa,bb where aa.f=bb.f, which is not supported by hive.

If you need to view the execution plan of hive, you can add explain before the statement, such as:

explain select a.stuNo, b.teacherName from student a join teacher b on a.classNo = b.classNo;

Outer join

Similar to traditional SQL, Hive provides left outer join, right outer join, and full outer join.
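
A minimal sketch reusing the student and teacher tables above (the left outer join keeps every student row, with a NULL teacherName where no class matches):

select a.stuNo, b.teacherName from student a left outer join teacher b on a.classNo = b.classNo;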

Semi join

Hive does not support in subqueries. You can use left semi join to achieve the same effect. Execute the following statement:

select * from teacher left semi join student on student.classNo = teacher.classNo;

The output is as follows:

C02 Sun

C03 Wang

C01 Zhang

As you can see, C04 Dong does not appear in the query results because C04 does not exist in the table student.

Note:

Fields of the right-hand table (student) may only appear in the on clause, not anywhere else, such as the select clause.

Map join

When one table is small enough to be loaded directly into memory, a map join can be used to improve efficiency.

For example:

select /*+ mapjoin(teacher) */ a.stuNo, b.teacherName from student a join teacher b on a.classNo = b.classNo;

The mapjoin hint above uses C-style comment syntax.

Map joins are also the better fit when the join uses non-equality conditions; the underlying reason requires a deeper understanding of how Hive and MapReduce work.

Subquery (sub query)

Running the following statement returns the highest of the per-class average scores.

select max(avgScore) as maScore from
(select classNo, avg(score) as avgScore from student group by classNo) a;

Output result:

80.66666666666667

The parenthesized part of the statement above is a subquery, aliased a. The subquery's result behaves like a table and can be queried further.

View (view)

Like views in a traditional database, a Hive view is only a definition; the view's data is not stored in the file system. Views are also read-only.

Run the following two commands:

create view avg_score as
select classNo, avg(score) as avgScore from student group by classNo;

select max(avgScore) as maScore from avg_score;

You can see that the output is the same as in the previous example.

Hive data types

- basic types

- compound types
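
A minimal illustration of both kinds in one table definition (hypothetical table and column names):

create table t_types (
    id      int,                              -- basic type
    score   double,                           -- basic type
    name    string,                           -- basic type
    tags    array<string>,                    -- compound type
    props   map<string,string>,               -- compound type
    contact struct<tel:string,addr:string>    -- compound type
);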
