Hive SQL Summary

2025-01-18 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Create a database

create database if not exists sopdm
comment 'this is test database'
location '/my/preferred/directory'
with dbproperties ('creator'='gxw','date'='2014-11-12');  -- database key-value attribute information

Show all tables

show tables;

Display a table's description information

desc [extended|formatted] tablename;

Show the create-table statement

show create table tablename;

Delete a table

drop table tablename;

Create a table from another table (copies the table structure only; the data is not copied)

create table test3 like test2;

Create a table from a query on another table

create table test4 as select name,addr from test5;

stored as textfile

The contents can be viewed directly.

stored as sequencefile

Must be viewed with hadoop fs -text.

stored as rcfile

View with hive --service rcfilecat path.

stored as inputformat 'class' (custom input format)

Loading a jar package:

In the shell window, run add jar path (the jar is scoped to that shell session); it is loaded into the distributed cache for use by each node. Alternatively, copy the jar directly to the lib directory under the Hive installation directory.
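The session-scoped loading described above can be sketched as follows; the jar path is an illustrative assumption, not a path from the original text:

```sql
-- Hypothetical path: load a custom SerDe jar for this shell session only.
add jar /home/data/lib/my-serde.jar;
-- Verify that it is now in the distributed cache:
list jar;
```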

SerDe (Hive uses a SerDe to read and write table rows)

Read and write order:

HDFS file --> InputFileFormat --> <key,value> --> Deserializer --> Row object (for use by Hive)

Row object --> Serializer --> <key,value> --> OutputFileFormat --> HDFS file

Hive ships with RegexSerDe, whose regular expression matches each row of data.

create table apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  refer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
) STORED AS TEXTFILE;

Partitioned table (each partition is a subdirectory of the table's directory)

create table tablename (name string) partitioned by (key type, ...);

create external table employees (
  name string,
  salary float,
  subordinates array<string>,
  deductions map<string,float>,
  address struct<street:string,city:string,state:string,zip:int>)
location '/home/data/result';

Enter list jar at the Hive command line

Shows which jar packages are in the current distributed cache (i.e. jars loaded with the add jar command).

Execute an hql file from the command line (similar to hive -f); often used to set initialization parameters:

source /home/data/hql/select_hql;

Configuration variables

set val='';

HQL uses the hive variable as:

${hiveconf:val}

select * from testtext where name = '${hiveconf:val}';

env shows the Linux environment variables, e.g. HOME=/root

HQL uses a Linux environment variable as:

select '${env:HOME}' from testtext;

1. Internal (managed) table data loading (overwrite and into cannot both be used; only one may appear)

(1) Load when creating the table:

create table newtable as select col1,col2 from oldtable;

(2) Specify the data location when creating the table (the table takes ownership of the data under that location, and the data is deleted when the managed table is dropped):

create table tablename () location '';

(3) Load local data:

load data local inpath 'localpath' [overwrite] into table tablename;

(4) Load HDFS data (the data is moved from its original location into the table's location):

load data inpath 'hdfspath' [overwrite] into table tablename;

The hive command line can run Linux shell commands when prefixed with !:

! ls /home/data

(5) Load data via a query:

insert into table test_m select name,addr from testtext where name = 'wer';

or

from testtext insert into table test_m select name,addr where name = 'wer';

or

select name,addr from testtext where name = 'wer' insert into table test_m;

2. External table data loading

(1) Specify the data location when creating the table:

create external table tablename () location '';

(2) Insert via query, the same as for internal tables.

(3) Use hadoop commands to copy data to the specified location (run from the hive shell with ! or from the Linux shell).

3. Partitioned table data loading

(1) Loading an internal partitioned table is similar to loading an internal table.

(2) Loading an external partitioned table is similar to loading an external table.

Note: the directory hierarchy of the data must match the table's partition layout, and the corresponding partition must be added to the table before the data can be found.

load data local inpath 'localpath' [overwrite] into table tablename partition (dt='20140905');

4. Data type correspondence

When loading data, fields whose types cannot be converted are returned as NULL by queries.

When inserting via a select query, fields whose types cannot be converted are inserted as NULL (saved in the file as \N).

When inserting via a select query, field names need not match; loading does not check them, the check happens at query time.
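A minimal sketch of the type-conversion behavior described above; the table, column, and file path are made-up examples:

```sql
-- Hypothetical: /tmp/ids.txt contains the line "abc", which cannot be parsed as int.
create table t_int (id int) row format delimited fields terminated by '\t';
load data local inpath '/tmp/ids.txt' into table t_int;
-- Loading does not check the type; the query returns NULL for the unconvertible value.
select id from t_int;
```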

Data export

Export to local; the default delimiter is ^A:

insert overwrite local directory '/home/data3'
row format delimited fields terminated by '\t'
select name,addr from testtext;

Export to HDFS; row format delimited fields terminated by '\t' is not supported, and the default delimiter is ^A:

insert overwrite directory '/home/data3'
select name,addr from testtext;

Table attribute operations

1. Rename a table

alter table table_name rename to new_table_name;

2. Rename a column

alter table tablename change column c1 c2 int comment 'xxx' after severity;

c1 is the old column name, c2 the new one; int is the new column's data type.

after severity places the column after the specified column; use first instead to put it in the first position.

3. Add columns (by default new columns are appended at the end)

alter table tablename add columns (c1 string comment 'xxxx', c2 string comment 'yyyy');

4. Modify tblproperties

alter table test set tblproperties ('comment'='xxxx');

5. Modify the delimiter (partitioned tables are special)

Method 1:

alter table city set serdeproperties ('field.delim'='\t'); (has no effect on a partitioned table's existing data; applies to new partitions)

Method 2 (makes existing partition data use the new delimiter as well):

alter table city partition (dt='20140908') set serdeproperties ('field.delim'='\t');

6. Modify location

alter table city [partition (...)] set location 'hdfs://master:9000/location';

7. Conversion between internal and external tables

alter table test set tblproperties ('EXTERNAL'='TRUE');  -- internal table to external table

alter table test set tblproperties ('EXTERNAL'='FALSE');  -- external table to internal table

Aggregation operations

1. count

count(*): a row is counted as long as not all of its fields are NULL; a row whose fields are all NULL is not counted.

count(1): every row adds 1, regardless of content.

count(col): adds 1 for each row where col is not NULL.
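The three count variants above can be contrasted on a hypothetical table t(col1 string, col2 string) containing a row in which both columns are NULL:

```sql
select count(*)    from t;  -- per the note above, a row whose fields are all NULL is not counted
select count(1)    from t;  -- counts every row regardless of content
select count(col1) from t;  -- counts only rows where col1 is not NULL
```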

2. sum

sum(values convertible to numbers) returns bigint.

When adding a constant to the result, cast it: sum(col) + cast(1 as bigint).

3. avg

avg(values convertible to numbers) returns double.

The where condition is executed on the map side.

group by executes on the reduce side, using the combination of grouping columns as the key.

having filters after the aggregation, also on the reduce side.

group by data-skew optimization:

set hive.groupby.skewindata=true; (runs as more than one job)

Join operation (an ordinary join does not support non-equi conditions)

Skew optimization parameter: set hive.optimize.skewjoin=true;

Sample:

select m.col as col, m.col2 as col2, n.col3 as col3
from
  (select col,col2
   from test
   where ...            -- executed on the map side
  ) m
[left outer | right outer | left semi] join
n                       -- the right table
on m.col = n.col
where ...               -- executed on the reduce side

LEFT SEMI JOIN is a more efficient implementation of the IN/EXISTS subquery.

Hive does not implement IN/EXISTS subqueries, so rewrite them with LEFT SEMI JOIN. The limitation of LEFT SEMI JOIN is that the right-hand table may appear only in filter conditions in the ON clause, not in the WHERE clause, the SELECT clause, or anywhere else.

SELECT a.key, a.value
FROM a
WHERE a.key in
 (SELECT b.key
  FROM b)

can be rewritten as:

SELECT a.key, a.val
FROM a LEFT SEMI JOIN b on (a.key = b.key)

order by: global sort, with only one reducer.

distribute by col: distributes rows to different reducers according to col.

Combined with sort by, this ensures each reducer's output is sorted.
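A sketch of distribute by combined with sort by; the table and column names are hypothetical:

```sql
-- Rows with the same dt go to the same reducer; within each reducer, output is sorted by uid.
select dt, uid
from logs
distribute by dt
sort by uid;
```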

union all (no deduplication) and union (deduplication)

Hive only supports union all, and the individual subqueries in the union must not have aliases.

select col from
 (select a as col from t1
  union all
  select b as col from t2
 ) tmp

Requirements:

1. The field names are the same.

2. The field types are the same.

3. The number of fields is the same.

4. The child tables cannot have aliases.

5. If you need to query from the merged result, the merged result must have an alias.

Hive creates an index table

hive> create index user_index on table user(id)
    > as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > with deferred rebuild
    > in table user_index_table;
hive> alter index user_index on user rebuild;
hive> select * from user_index_table limit 5;

If you use hadoop commands directly to copy or delete data stored by Hive, you must afterwards run add partition or alter to synchronize the metadata; otherwise operations such as drop table will query the metastore, and an error is reported when the metastore information is inconsistent with HDFS. This can be regarded as a Hive bug that has not yet been fixed, or understood as a sign that directly manipulating HDFS data is not recommended in the first place.

Currently, you can use the command:

MSCK REPAIR TABLE table_name;

This command adds the corresponding partitions for data directories that have not yet been registered, synchronizing the metadata.

Recover Partitions

Hive stores a list of partitions for each table in the metastore. If new partitions are added directly to HDFS, the metastore will not notice them unless you run

ALTER TABLE table_name ADD PARTITION ...

Alternatively, you can run

MSCK REPAIR TABLE table_name;

which is similar to the following command in the EMR version:

ALTER TABLE table_name RECOVER PARTITIONS;
