
What are the basic operations on databases, tables, fields and interactive queries in Hive


In this article the editor shares the basic operations on databases, tables, and fields in Hive, together with interactive queries. I hope you will gain something after reading it. Let's go through it together.

1. Command-line operations

(1) To print the query header, enable this setting:

set hive.cli.print.header=true;

(2) "--" introduces a comment, but the CLI does not parse comments. A file containing comments can only be executed this way:

hive -f script_name

(3) -e takes a quoted Hive statement or query, and -S suppresses the extra output:

hive -S -e "SELECT * FROM mytable LIMIT 3" > /tmp/myquery

(4) A query that scans all partitions can produce a huge MapReduce job if the table has many datasets and directories.

It is therefore recommended to use strict mode: on a partitioned table, the WHERE clause must then specify a partition filter.

hive> set hive.mapred.mode=strict;
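For example, with strict mode enabled, a query on the partitioned ext_trackflow table used later in this article must carry a partition predicate (a sketch):

-- Rejected in strict mode: no partition filter
-- SELECT * FROM ext_trackflow;

-- Accepted: the WHERE clause names the partition column
SELECT * FROM ext_trackflow WHERE statDate = '20140529' LIMIT 10;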

(5) Display the database currently in use:

set hive.cli.print.current.db=true;

(6) Set the Hive job priority:

set mapred.job.priority=VERY_HIGH | HIGH | NORMAL | LOW | VERY_LOW;

(VERY_LOW=1, LOW=2500, NORMAL=5000, HIGH=7500, VERY_HIGH=10000)

set mapred.job.map.capacity=M sets a maximum of M map tasks running at the same time

set mapred.job.reduce.capacity=N sets a maximum of N reduce tasks running at the same time

(7) The number of mappers in Hive is determined by the following parameters:

mapred.min.split.size, mapred.max.split.size, dfs.block.size

splitSize = Math.max(minSize, Math.min(maxSize, blockSize))

The number of map tasks is also related to the number of input files: with 2 input files, even if their total size is less than blockSize, 2 map tasks are generated.

mapred.reduce.tasks sets the number of reduce tasks.
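A worked example with assumed values: if mapred.min.split.size=1, mapred.max.split.size=256MB, and dfs.block.size=128MB, then splitSize = max(1, min(256MB, 128MB)) = 128MB, so a single 1GB input file is divided into 8 splits and runs 8 map tasks.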

2. Table operations

(1) View all partitions of a table:

SHOW PARTITIONS ext_trackflow;

Query a specific partition:

SHOW PARTITIONS ext_trackflow PARTITION (statDate='20140529');

(2) View the formatted, complete table structure:

desc formatted ext_trackflow;

DESCRIBE EXTENDED ext_trackflow;

(3) Delete a partition: both the metadata and the data of the partition are deleted; for an external table, only the metadata is deleted.

ALTER TABLE ext_trackflow DROP PARTITION (statDate='20140529');

(4) Check whether a table is external or internal:

DESCRIBE EXTENDED tablename;

(5) Copy a table structure:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';

Note: if you omit the keyword EXTERNAL, the result depends on whether employees is external or internal; if you add it, the new table is definitely EXTERNAL and requires a LOCATION.

(6) Load data into a partition of an internal table; Hive creates the directory and copies the data into the partition:

LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');

(7) Add data to a partition of an external table:

ALTER TABLE log_messages ADD IF NOT EXISTS PARTITION (year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';

Note: Hive does not check that the partition, directory, or data actually exist, which can lead to queries returning no results.
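Because nothing is validated, it can help to check that the partition directory actually exists from the Hive CLI before querying (a sketch using the built-in dfs command):

hive> dfs -ls hdfs://master_server/data/log_messages/2012/01/02;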

(8) Modify a table: you can alter a table at any time, but only the table's metadata is modified; the actual data is not affected.

For example, changing a partition's location does not delete the old data:

ALTER TABLE log_messages PARTITION (year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/01/02';

(9) Change table properties:

ALTER TABLE log_messages SET TBLPROPERTIES (

'notes' = 'The process id is no longer captured; this column is always NULL'

);

(10) Change storage properties:

ALTER TABLE log_messages
PARTITION (year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;

Note: if the table is partitioned, the PARTITION clause is required.

(11) Specify a new SerDe:

ALTER TABLE table_using_JSON_storage
SET SERDE 'com.example.JSONSerDe'
WITH SERDEPROPERTIES (
'prop1' = 'value1',
'prop2' = 'value2'
);

Note: SERDEPROPERTIES passes configuration to the SerDe; both the property names and the values are strings. It is a convenient way to tell a user-defined SerDe what model to apply.

Set properties on the current SerDe:

ALTER TABLE table_using_JSON_storage
SET SERDEPROPERTIES (
'prop3' = 'value3',
'prop4' = 'value4'
);

(12) Change the bucketing (clustering) properties:

ALTER TABLE stocks
CLUSTERED BY (exchange, symbol)
SORTED BY (symbol)
INTO 48 BUCKETS;

(13) More complex ALTER TABLE statements: ALTER TABLE ... TOUCH adds a hook for various operations:

ALTER TABLE log_messages TOUCH
PARTITION (year = 2012, month = 1, day = 1);

A typical scenario is to trigger it when the contents of a partition change:

hive -e 'ALTER TABLE log_messages TOUCH PARTITION (year = 2012, month = 1, day = 1);'

(14) ALTER TABLE ... ARCHIVE PARTITION packs a partition's files into a Hadoop archive file (HAR):

ALTER TABLE log_messages ARCHIVE
PARTITION (year = 2012, month = 1, day = 1); (can only be used on partitioned tables)
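The reverse operation, which unpacks the HAR file back into the original partition files, is UNARCHIVE (the standard counterpart, shown here as a sketch):

ALTER TABLE log_messages UNARCHIVE
PARTITION (year = 2012, month = 1, day = 1);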

(15) Protect a partition from being dropped or queried:

ALTER TABLE log_messages
PARTITION (year = 2012, month = 1, day = 1) ENABLE NO_DROP;

ALTER TABLE log_messages
PARTITION (year = 2012, month = 1, day = 1) ENABLE OFFLINE;

Note: DISABLE is the counterpart of ENABLE; neither can be applied to non-partitioned tables.
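The corresponding DISABLE statements lift the protections again (a sketch):

ALTER TABLE log_messages
PARTITION (year = 2012, month = 1, day = 1) DISABLE NO_DROP;

ALTER TABLE log_messages
PARTITION (year = 2012, month = 1, day = 1) DISABLE OFFLINE;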

(16) List tables matching a regular expression:

hive> SHOW TABLES '.*s';

(17) Convert between external and internal tables:

ALTER TABLE tablePartition SET TBLPROPERTIES ('EXTERNAL'='TRUE');  -- internal table to external table

ALTER TABLE tablePartition SET TBLPROPERTIES ('EXTERNAL'='FALSE'); -- external table to internal table

(18) Partitions and buckets:

Partition (partitioning: files are stored by directory, one directory per partition), for example:

CREATE EXTERNAL TABLE table1 (
column1 STRING,
column2 STRING,
column3 STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt=20090105);
ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt=20090102);
ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt=20081231);

Bucket (bucketing: rows are hashed on the specified column, one file per bucket), for example:

CREATE TABLE VT_NEW_DATA (
column1 STRING,
column2 STRING,
column3 STRING
)
CLUSTERED BY (column1) SORTED BY (column1) INTO 48 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS RCFILE;
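A bucketed table is normally populated through INSERT ... SELECT with bucketing enforcement switched on; a minimal sketch, assuming a hypothetical staging table src_data with matching columns:

set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE VT_NEW_DATA
SELECT column1, column2, column3 FROM src_data;

3. Column operations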

(1) Rename a column, change its position, type, and comment:

ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;

Renaming: CHANGE COLUMN takes the old column name, then the new name, then the type.

COMMENT is optional; include it to attach a comment.

AFTER moves the field to a new position (FIRST moves it to the front).

Only the metadata is modified; the source data is not changed.

(2) Add new columns:

ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id BIGINT COMMENT 'The current session id');

(3) Delete or replace columns: use with caution!

ALTER TABLE table_name ADD | REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

ADD appends new fields after all existing columns (but before the partition columns).

REPLACE replaces all fields in the table: REPLACE COLUMNS removes all existing columns and adds the new set of columns.

REPLACE COLUMNS can also be used to drop columns. For example, "ALTER TABLE test_change REPLACE COLUMNS (a int, b int);" removes column 'c' from test_change's schema. Note that this does not delete the underlying data; it only changes the schema.

(4) REGEX column specification

A SELECT statement can use regular expressions to choose columns. The following query selects all columns except ds and hr:

SELECT `(ds|hr)?+.+` FROM test;

4. View variables

hive> set;
...
hive> set -v;
... even more output! ...

'set' prints all variables in the hivevar, hiveconf, system, and env namespaces.

'set -v' additionally prints all the variables defined by Hadoop.

hive> set hivevar:foo=hello;
hive> set hivevar:foo;
hivevar:foo=hello

Using a variable:

hive> create table toss1 (i int, ${hivevar:foo} string);

5. A complete example of creating a database and table

-- Create a database
CREATE DATABASE ecdata WITH DBPROPERTIES ('creator' = 'June', 'date' = '2014-06-01');
-- or describe it with the COMMENT keyword

-- View the database description
DESCRIBE DATABASE ecdata;
DESCRIBE DATABASE EXTENDED ecdata;

-- Switch to the database
use ecdata;

-- Drop the table
drop table ext_trackflow;

-- Create the table
CREATE EXTERNAL TABLE IF NOT EXISTS ext_trackflow (
cookieId string COMMENT '05dvOVC6Il6INhYABV6LAGROBV',
cate1 string COMMENT '4',
area1 string COMMENT '102',
url string COMMENT 'http://cd.ooxx.com/jinshan-mingzhan-1020',
trackTime string COMMENT '2014-05-25 23:...',
trackURLMap map<string,string> COMMENT '{"area":"102","cate":"4"}'
)
PARTITIONED BY (statDate STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/DataWarehouse/ods/TrackFlowTable';

-- Add a partition
ALTER TABLE ext_trackflow ADD PARTITION (statDate='20140525') LOCATION '/DataWarehouse/ods/TrackFlowTable/20140525';

-- Create yesterday's partition from the shell
yesterday=`date -d '1 days ago' +'%Y%m%d'`
hive -e "use ecdata; ALTER TABLE ext_trackflow ADD PARTITION (statDate='$yesterday') LOCATION '/DataWarehouse/ods/TrackFlowTable/$yesterday';"

6. Examples of common statements

(1) Count PVs by page type:

select pageType, count(pageType) from ext_trackflow where statDate = '20140521' group by pageType;

Note: an ordinary SELECT query scans the entire table. Create the table with the PARTITIONED BY clause so queries can take advantage of partition pruning.

In the current Hive implementation, partition pruning is applied only if the partition predicate appears in the WHERE clause closest to the FROM clause.
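To illustrate the rule above (a sketch): the first query below is pruned to a single partition, while moving the predicate into an outer query may defeat pruning:

-- Pruned: the partition predicate sits in the WHERE clause closest to FROM
select pageType from ext_trackflow where statDate = '20140521';

-- May scan all partitions: the predicate sits in the outer query
select * from (select pageType, statDate from ext_trackflow) t where t.statDate = '20140521';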

(2) Two ways to export query results to the local file system:

INSERT OVERWRITE LOCAL DIRECTORY '/home/jun06/tmp/110.112.113.115' select area1, count(area1) from ext_trackflow where statDate = '20140521' group by area1 having count(area1) > 1000;

hive -e "use ecdata; select area1, count(area1) from ext_trackflow where statDate = '20140521' group by area1 having count(area1) > 1000;" > a.txt

(3) Querying and using the map data structure:

select trackURLMap, extField, unitParamMap, queryParamMap from ext_trackflow where statDate = '20140525' and size(unitParamMap) != 0 limit 10;
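Individual map values can also be pulled out with bracket syntax; for example, using the "area" key stored in trackURLMap (a sketch):

select trackURLMap['area'] from ext_trackflow where statDate = '20140525' limit 10;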

(4) The following query finds the five sales representatives with the largest sales records:

SET mapred.reduce.tasks = 1;
SELECT * FROM test SORT BY amount DESC LIMIT 5;

(5) Insert from the same source table into different tables and paths:

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 AND src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 AND src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

(6) Stream data directly through an external program in streaming mode:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

(7) Hive only supports equi-joins (equality joins), outer joins, and left semi joins. Hive does not support non-equi joins, because they are very difficult to convert into map/reduce jobs.

The LEFT, RIGHT, and FULL OUTER keywords handle rows that have no match in a join.

LEFT SEMI JOIN is a more efficient implementation of the IN/EXISTS subquery.

In a join, each map/reduce job works like this: the reducer caches the rows of every table in the join sequence except the last one, then streams the last table through to serialize the results to the file system.

In practice, the largest table should therefore be written last.
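If the largest table cannot be written last, Hive also accepts a STREAMTABLE hint that names the table to stream explicitly (a sketch, not from the original text):

SELECT /*+ STREAMTABLE(a) */ a.val, b.val FROM a JOIN b ON (a.key = b.key);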

(8) Key points to note when writing joins:

Only equi-joins are supported:

SELECT a.* FROM a JOIN b ON (a.id = b.id);

SELECT a.* FROM a JOIN b
ON (a.id = b.id AND a.department = b.department);

More than two tables can be joined, for example:

SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c ON (c.key = b.key2);

Note: if every join condition uses the same join key of the shared table, the joins are compiled into a single map/reduce job; the example above uses b.key1 and b.key2, so it needs two jobs.
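By contrast, the following variant joins on the same key of b throughout and therefore compiles to a single map/reduce job:

SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c ON (c.key = b.key1);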

(9) LEFT, RIGHT, and FULL OUTER

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key);

To limit the output of a join, write the filter conditions in the WHERE clause or in the JOIN clause.

The confusing case involves partitioned tables:

SELECT c.val, d.val FROM c LEFT OUTER JOIN d ON (c.key = d.key)
WHERE c.ds = '2010-07-07' AND d.ds = '2010-07-07';

If a row of c finds no match in d, all columns of d come back NULL, including ds. The WHERE predicate on d.ds then filters out exactly those rows, so the unmatched rows the LEFT OUTER JOIN was meant to preserve disappear; the WHERE clause effectively turns the outer join back into an inner join.

Solution:

SELECT c.val, d.val FROM c LEFT OUTER JOIN d
ON (c.key = d.key AND d.ds = '2010-07-07' AND c.ds = '2010-07-07');

(10) LEFT SEMI JOIN

The restriction of LEFT SEMI JOIN is that the right-hand table may only be referenced in the ON clause, never in the WHERE clause, the SELECT clause, or anywhere else.

SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM b);

can be rewritten as:

SELECT a.key, a.val
FROM a LEFT SEMI JOIN b ON (a.key = b.key);

(11) Habits to change when moving from SQL to HiveQL

① Hive does not support the traditional implicit (comma-separated) join syntax

An inner join of two tables in SQL can be written as:

select * from dual a, dual b where a.key = b.key;

In Hive it should be:

select * from dual a join dual b on a.key = b.key;

rather than the traditional format:

SELECT t1.a1 AS c1, t2.b1 AS c2 FROM t1, t2
WHERE t1.a2 = t2.b2;

② The semicolon character

A semicolon is the statement terminator in SQL, and the same holds in HiveQL, but HiveQL's semicolon recognition is less intelligent. For example:

select concat(key, concat(';', key)) from dual;

When parsing this statement, HiveQL reports:

FAILED: Parse Error: line 0:-1 mismatched input '<EOF>' expecting ) in function specification

The workaround is to escape the semicolon with its octal ASCII code, so the statement above should be written as:

select concat(key, concat('\073', key)) from dual;

After reading this article, I believe you now have some understanding of "what are the basic operations on databases, tables, fields and interactive queries in Hive". If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!
