In this article, the editor shares the basic operations on databases, tables, and fields in Hive, along with interactive queries. I hope you gain something after reading it; let's look at it together.
1. Command line operation
(1) To print query headers, enable the display setting:
set hive.cli.print.header=true;
(2) Comments start with "--", but the CLI does not parse comments; a file containing comments can only be executed like this:
hive -f script_name
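For example, a minimal sketch (the file name and its contents are hypothetical):
-- queries.hql: a comment like this is allowed inside a script file
SELECT * FROM mytable LIMIT 3;
Run it with: hive -f queries.hql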
(3) -e is followed by a quoted Hive statement or query, and -S (silent mode) removes the extra output:
hive -S -e "SELECT * FROM mytable LIMIT 3" > /tmp/myquery
(4) A query that scans all partitions will produce a huge MapReduce job if you have many datasets and directories.
It is therefore recommended to use strict mode: on a partitioned table, the WHERE clause must specify a partition.
hive> set hive.mapred.mode=strict;
(5) Display the database currently in use:
set hive.cli.print.current.db=true;
(6) Set the Hive job priority:
set mapred.job.priority=VERY_HIGH | HIGH | NORMAL | LOW | VERY_LOW
(VERY_LOW=1, LOW=2500, NORMAL=5000, HIGH=7500, VERY_HIGH=10000)
set mapred.job.map.capacity=M runs at most M map tasks simultaneously
set mapred.job.reduce.capacity=N runs at most N reduce tasks simultaneously
(7) The number of mappers in Hive is determined by the following parameters:
mapred.min.split.size, mapred.max.split.size, dfs.block.size
splitSize = Math.max(minSize, Math.min(maxSize, blockSize))
The number of maps is also related to the number of input files: with 2 input files, 2 maps are generated even if their total size is less than the block size.
mapred.reduce.tasks sets the number of reducers.
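As a worked example with illustrative values: if blockSize = 128 MB, minSize = 1 B, and maxSize = 256 MB, then splitSize = max(1 B, min(256 MB, 128 MB)) = 128 MB, so a single 1 GB input file yields 8 map tasks. In this sketch, lowering the maximum split size below the block size raises the mapper count:
set mapred.max.split.size=67108864;
(67108864 bytes = 64 MB, which halves the split size and doubles the number of maps for the same file)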
2. Table operation
(1) View all partitions of a table
SHOW PARTITIONS ext_trackflow
Query a specific partition
SHOW PARTITIONS ext_trackflow PARTITION (statDate='20140529')
(2) View the formatted complete table structure
DESC FORMATTED ext_trackflow
DESCRIBE EXTENDED ext_trackflow
(3) Drop a partition: for an internal (managed) table, both the partition's metadata and its data are deleted; for an external table, only the metadata is deleted.
ALTER TABLE ext_trackflow DROP PARTITION (statDate='20140529')
(4) Check whether a table is external or internal
DESCRIBE EXTENDED tablename
(5) Copy a table structure
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';
Note: if you omit the keyword EXTERNAL, the new table's type depends on whether employees is external or internal; if you include it, the table must be EXTERNAL and a LOCATION is required.
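For comparison, a minimal sketch of the managed-table variant (the name employees4 is hypothetical):
CREATE TABLE IF NOT EXISTS mydb.employees4
LIKE mydb.employees;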
(6) Load data into a partition of an internal table; Hive creates the partition directory and copies the data into it
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
(7) add data to a partition of the external table
ALTER TABLE log_messages ADD IF NOT EXISTS PARTITION (year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02'
Note: Hive does not verify that the partition directory or its data actually exists; if it does not, queries simply return no results.
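A quick sanity check, as a sketch assuming the location above, is to list the directory from the Hive CLI:
hive> dfs -ls hdfs://master_server/data/log_messages/2012/01/02;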
(8) Modify a table: you can alter a table at any time, but only the table's metadata is changed; the actual data is not affected.
For example, changing a partition's location, as below, does not delete the old data.
ALTER TABLE log_messages PARTITION (year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/01/02';
(9) change table properties
ALTER TABLE log_messages SET TBLPROPERTIES (
'notes' = 'The process id is no longer captured; this column is always NULL'
);
(10) change storage properties
ALTER TABLE log_messages
PARTITION (year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE
Note: if the table is partitioned, the PARTITION clause is required.
(11) specify a new SerDe
ALTER TABLE table_using_JSON_storage
SET SERDE 'com.example.JSONSerDe'
WITH SERDEPROPERTIES (
'prop1' = 'value1',
'prop2' = 'value2'
);
Note: SERDEPROPERTIES passes configuration to the SerDe; both property names and values are strings. It is a convenient way for users to tell their own SerDe which settings to apply.
Set properties on the current SerDe:
ALTER TABLE table_using_JSON_storage
SET SERDEPROPERTIES (
'prop3' = 'value3',
'prop4' = 'value4'
);
(12) change storage attributes
ALTER TABLE stocks
CLUSTERED BY (exchange, symbol)
SORTED BY (symbol)
INTO 48 BUCKETS
(13) More complex ALTER TABLE statements: ALTER TABLE ... TOUCH adds a hook for various operations
ALTER TABLE log_messages TOUCH
PARTITION (year = 2012, month = 1, day = 1)
A typical scenario is to trigger the hook when a partition's underlying files change:
hive -e 'ALTER TABLE log_messages TOUCH PARTITION (year = 2012, month = 1, day = 1);'
(14) ALTER TABLE ... ARCHIVE PARTITION packs a partition's files into a Hadoop archive (HAR) file
ALTER TABLE log_messages ARCHIVE
PARTITION (year = 2012, month = 1, day = 1); (can only be used in partitioned tables)
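The reverse operation is UNARCHIVE; a minimal sketch:
ALTER TABLE log_messages UNARCHIVE
PARTITION (year = 2012, month = 1, day = 1);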
(15) protect partitions from deletion and query
ALTER TABLE log_messages
PARTITION (year = 2012, month = 1, day = 1) ENABLE NO_DROP
ALTER TABLE log_messages
PARTITION (year = 2012, month = 1, day = 1) ENABLE OFFLINE
Note: DISABLE reverses ENABLE; neither operation can be applied to a non-partitioned table.
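For example, to allow the partition to be dropped again, a sketch mirroring the statements above:
ALTER TABLE log_messages
PARTITION (year = 2012, month = 1, day = 1) DISABLE NO_DROP;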
(16) display the table according to the regular condition (regular expression)
hive> SHOW TABLES '.*s';
(17) conversion between external table and internal table
ALTER TABLE tablePartition SET TBLPROPERTIES ('EXTERNAL'='TRUE');  -- internal table to external table
ALTER TABLE tablePartition SET TBLPROPERTIES ('EXTERNAL'='FALSE'); -- external table to internal table
(18) Partitioning and bucketing:
Partitioning (files are stored by directory, one directory per partition), for example:
CREATE EXTERNAL TABLE table1 (
  column1 STRING,
  column2 STRING,
  column3 STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt=20090105);
ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt=20090102);
ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt=20081231);
Bucketing (rows are hashed on the specified columns, one file per bucket), for example:
CREATE TABLE VT_NEW_DATA (
  column1 STRING,
  column2 STRING,
  column3 STRING
)
CLUSTERED BY (column1) SORTED BY (column1) INTO 48 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS RCFILE;
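Bucketing also enables efficient sampling; a minimal sketch against the table above:
SELECT * FROM VT_NEW_DATA TABLESAMPLE (BUCKET 1 OUT OF 48 ON column1);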
3. Column operations
(1) Rename a column and change its position, type, and comment
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity
To rename: give the old column name, then the new column name, then the type
COMMENT is optional; use it to add a comment
AFTER is used to change the field's position (here, hms moves after severity)
Only the metadata is modified; the source data is not changed
(2) add a new column
ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id LONG COMMENT 'The current session id');
(3) delete and replace columns: use them with caution!
ALTER TABLE table_name ADD | REPLACE COLUMNS (col_name data_type [COMMENT col_comment],...)
ADD appends new fields after all existing columns (but before the partition columns)
REPLACE replaces all fields in the table: it removes all existing columns and adds the new set of columns, so it can also be used to drop columns. For example:
"ALTER TABLE test_change REPLACE COLUMNS (a int, b int);" removes column 'c' from test_change's schema. Note that this does not delete the underlying data; it only changes the schema.
(4) REGEX Column Specification
The SELECT statement can use regular expressions to make column selections. The following statement queries all columns except ds and hr:
SELECT `(ds|hr)?+.+` FROM test
4. View variables
hive> set;
...
hive> set -v;
... even more output! ...
'set' outputs all variables in the hivevar, hiveconf, system, and env namespaces.
'set -v' additionally outputs all of the variables defined by Hadoop.
hive> set hivevar:foo=hello;
hive> set hivevar:foo;
hivevar:foo=hello
Use the variable:
hive> create table toss1 (i int, ${hivevar:foo} string);
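Variables can also be supplied when launching the CLI; a sketch using the --hivevar flag (the table name toss2 is hypothetical):
hive --hivevar foo=string -e 'create table toss2 (i int, ${hivevar:foo} string);'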
5. A complete example of creating a database and table
-- create the database
create database ecdata WITH DBPROPERTIES ('creator' = 'June', 'date' = '2014-06-01');
-- or use the COMMENT keyword
-- view the database description
DESCRIBE DATABASE ecdata;
DESCRIBE DATABASE EXTENDED ecdata;
-- switch database
use ecdata;
-- drop the table
drop table ext_trackflow;
-- create the table
create EXTERNAL table IF NOT EXISTS ext_trackflow (
  cookieId string COMMENT '05dv...',
  cate1 string COMMENT '4...',
  area1 string COMMENT '102...',
  url string COMMENT 'http://cd.ooxx.com/jinshan-mingzhan-1020',
  trackTime string COMMENT '2014-05-25 23:...',
  trackURLMap map<string,string> COMMENT '{"area": "102", "cate": "4..."}'
)
PARTITIONED BY (statDate STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/DataWarehouse/ods/TrackFlowTable';
-- add a partition
ALTER TABLE ext_trackflow ADD PARTITION (statDate='20140525') LOCATION '/DataWarehouse/ods/TrackFlowTable/20140525';
-- create yesterday's partition from a shell script
yesterday=`date -d '1 days ago' +'%Y%m%d'`
hive -e "use ecdata; ALTER TABLE ext_trackflow ADD PARTITION (statDate='$yesterday') LOCATION '/DataWarehouse/ods/TrackFlowTable/$yesterday';"
6. Examples of common statements
(1) Count PVs (page views) by page type
select pageType, count(pageType) from ext_trackflow where statDate = '20140521' group by pageType;
Note: a general SELECT query scans the entire table; because the table was created with PARTITIONED BY, queries can take advantage of partition pruning (input pruning).
In Hive's current implementation, partition pruning is enabled only if the partition predicate appears in the WHERE clause closest to the FROM clause.
(2) Two ways to export query results to the local file system
INSERT OVERWRITE LOCAL DIRECTORY '/home/jun06/tmp/110.112.113.115' select area1, count(area1) from ext_trackflow where statDate = '20140521' group by area1 having count(area1) > 1000;
hive -e "use ecdata; select area1, count(area1) from ext_trackflow where statDate = '20140521' group by area1 having count(area1) > 1000;" > a.txt
(3) Querying and using the map data structure
select trackURLMap, extField, unitParamMap, queryParamMap from ext_trackflow where statDate = '20140525' and size(unitParamMap) != 0 limit 10;
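Individual map values can be read with bracket syntax; a sketch assuming trackURLMap contains an 'area' key:
select trackURLMap['area'] from ext_trackflow where statDate = '20140525' limit 10;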
(4) The following query returns the five sales records with the largest amounts:
SET mapred.reduce.tasks = 1
SELECT * FROM test SORT BY amount DESC LIMIT 5
(5) insert data from the same table into different tables and paths:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
(6) Stream rows through an external command with TRANSFORM:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
(7) Hive supports only equality joins (equi-joins), outer joins, and left semi joins; it does not support other non-equality joins, because they are very difficult to convert into map/reduce tasks
The LEFT, RIGHT, and FULL OUTER keywords handle rows that have no match in the join (their columns are padded with NULL)
LEFT SEMI JOIN is a more efficient implementation of the IN/EXISTS subquery
In a join, each map/reduce task works as follows: the reducer caches the rows of every table in the join sequence except the last one, then streams the rows of the last table through to write the results to the file system
In practice, the largest table should be written at the end.
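If the largest table cannot be placed last, Hive also provides the STREAMTABLE hint to mark the table that should be streamed rather than cached; a minimal sketch:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val FROM a JOIN b ON (a.key = b.key);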
(8) Several key points to note when writing joins:
Only equality joins are supported
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b
ON (a.id = b.id AND a.department = b.department)
You can join more than 2 tables, for example
SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
Note: if multiple tables in a join share the same join key, the join is converted into a single map/reduce task
(9) LEFT, RIGHT, and FULL OUTER
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
If you want to limit the output of a join, write the filter condition in the WHERE clause, or in the JOIN clause
A confusing case arises with partitioned tables:
SELECT c.val, d.val FROM c LEFT OUTER JOIN d ON (c.key=d.key)
WHERE c.ds='2010-07-07' AND d.ds='2010-07-07'
If no matching record is found in d for a row of c, all of d's columns are NULL in the output, including d.ds; the WHERE condition on d.ds then filters out every such row. In other words, the WHERE clause discards exactly the rows the LEFT OUTER JOIN was meant to preserve, so the query behaves like an inner join
Solution:
SELECT c.val, d.val FROM c LEFT OUTER JOIN d
ON (c.key=d.key AND d.ds='2010-07-07' AND c.ds='2010-07-07')
(10) LEFT SEMI JOIN
The limitation of LEFT SEMI JOIN is that the table on the right in the JOIN clause can only set filtering conditions in the ON clause, not in the WHERE clause, SELECT clause, or anywhere else.
SELECT a.key, a.value
FROM a
WHERE a.key IN
(SELECT b.key
FROM b);
Can be rewritten as:
SELECT a.key, a.val
FROM a LEFT SEMI JOIN b on (a.key = b.key)
(11) Habits to change when moving from SQL to HiveQL
① Hive does not support the traditional implicit-join syntax
The inner join of two tables that in SQL can be written as:
select * from dual a, dual b where a.key = b.key;
should in Hive be written as:
select * from dual a join dual b on a.key = b.key;
rather than in the traditional format:
SELECT t1.a1 AS c1, t2.b1 AS c2 FROM t1, t2
WHERE t1.a2 = t2.b2
② The semicolon character
A semicolon terminates a statement in SQL, and likewise in HiveQL, but HiveQL's recognition of semicolons is less intelligent. For example:
select concat(key, concat(';', key)) from dual;
When parsing this statement, HiveQL reports:
FAILED: Parse Error: line 0:-1 mismatched input '<EOF>' expecting ) in function specification
The solution is to escape the semicolon using its octal ASCII code, so the statement should be written as:
select concat(key, concat('\073', key)) from dual;
After reading this article, I believe you now have an understanding of the basic operations on databases, tables, and fields and of interactive queries in Hive. Thank you for reading!