Commonly used HiveQL


1. Show the current database in front of the command-line prompt:

hive> set hive.cli.print.current.db=true;

2. Show column names in query results:

hive> set hive.cli.print.header=true;

3. Execute a shell command:

hive> !ls;

4. Execute a dfs command:

hive> dfs -ls;

5. In a Hive script, -- marks a comment; comments cannot be used on the command line.
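A minimal script sketch (saved to a file and run with hive -f; the table test is the sample table used throughout this article):

-- count all rows in test
select count(*) from test;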

6. Data type conversion:

hive> select cast(s as int) from test;

7. Show the details of a table:

hive> describe formatted tablename;
hive> describe extended tablename;

8. Set Hive to "strict" mode, in which a query against a partitioned table is rejected unless its WHERE clause filters on a partition column:

hive> set hive.mapred.mode=strict;  -- or nonstrict
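A sketch of the effect, with a hypothetical table logs partitioned by a column ds:

hive> set hive.mapred.mode=strict;
hive> select * from logs;                          -- rejected: no partition filter in WHERE
hive> select * from logs where ds = '2018-01-01';  -- allowed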

9. How to write a dynamic-partition insert (the partition columns are supplied last in the select list):

hive> insert overwrite table table1 partition (a, b) select ..., a, b from table2;
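Dynamic partitioning must be switched on first; a minimal sketch of the usual settings:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;  -- allow every partition column to be dynamic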

10. rand(): returns a random value in the range [0, 1).
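For instance (reusing the sample table test):

hive> select rand() from test limit 1;   -- a new value on every call
hive> select rand(7) from test limit 1;  -- seeded, for repeatable results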

11. See how Hive converts a query into MapReduce tasks:

hive> explain select sum(a) from test;

hive> explain extended select sum(a) from test;

12. Set the log level to DEBUG and send log output to the console:

hive --hiveconf hive.root.logger=DEBUG,console

13. Discover and describe functions:

hive> show functions;

hive> describe function concat;

hive> describe function extended concat;

14. Standard functions: UDF (one-to-one)

Aggregate functions: UDAF (many-to-one)

Table-generating functions: UDTF (one-to-many)
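A quick illustration of each kind using built-in functions (column a and table test as in the examples above):

hive> select concat(a, 'x') from test;           -- UDF: one row in, one row out
hive> select sum(a) from test;                   -- UDAF: many rows in, one row out
hive> select explode(array(1, 2, 3)) from test;  -- UDTF: one row in, many rows out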

15. Macros (sometimes more convenient than functions):

hive> create temporary macro sigmoid (x double) 1.0 / (1.0 + exp(-x));

hive> select sigmoid(2) from src limit 1;

16. About users, groups, and roles in Hive

hive (default)> set hive.security.authorization.enabled;
hive.security.authorization.enabled=false

hive (default)> set system:user.name;
system:user.name=hadoop

hive (default)> show grant user hadoop;
OK
default                      hadoop  USER  ALL  false  1522203964000  hive
test                         hadoop  USER  ALL  false  1522205200000  hive
zb_dwa                       hadoop  USER  ALL  false  1522203974000  hive
zb_dwd                       hadoop  USER  ALL  false  1522203972000  hive
zb_src                       hadoop  USER  ALL  false  1522203972000  hive
zb_src_test                  hadoop  USER  ALL  false  1522203972000  hive
zb_dwa       t_code          hadoop  USER  ALL  true   1524211395000  hadoop
zb_dwa       t_code_source   hadoop  USER  ALL  true   1524204335000  hadoop
Time taken: 2.214 seconds, Fetched: 18 row(s)

hive (default)> show grant user hadoop on database default;
OK
default                      hadoop  USER  ALL  false  1522203964000  hive
Time taken: 0.026 seconds, Fetched: 1 row(s)

17. When can Hive avoid MapReduce?

hive> set hive.exec.mode.local.auto=true;

Some Hive queries do not need MapReduce at all; this is so-called local mode. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory corresponding to employees and write the formatted content to the console. Similarly, when the only filter conditions in the WHERE clause are on partition columns, no MapReduce job is needed.
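A sketch of both cases, assuming (hypothetically) that employees is partitioned by a column country:

hive> set hive.exec.mode.local.auto=true;
hive> select * from employees;                       -- plain fetch: no MapReduce job
hive> select * from employees where country = 'US';  -- filter on a partition column only: still no job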

18. JOIN optimization

□ When a JOIN involves three or more tables, only one MapReduce job is generated if every ON clause uses the same join key (see the sketch after this list).

□ Hive also assumes that the last table in the query is the largest one. While joining, it tries to cache the other tables and then streams the last table through the computation. The user should therefore order the tables in the query so that their sizes increase from left to right.

□ Hive also provides a hint mechanism to tell the optimizer explicitly which table is the large (streamed) one, as follows:

hive> SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
      FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
      WHERE s.symbol = 'AAPL';
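Returning to the first point above, a sketch with hypothetical tables a, b, and c joined on the same key, which compiles to a single MapReduce job:

hive> SELECT a.ymd, b.price_close, c.dividend
      FROM a JOIN b ON a.ymd = b.ymd
             JOIN c ON a.ymd = c.ymd;  -- both ON clauses use ymd, so one job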

19. Map-side JOIN

If all but one of the tables are large, the single small table can be held entirely in memory while the largest table streams through the mappers. Hive can then perform the join on the map side (a "map-side JOIN"): it matches each row against the in-memory copy of the small table, omitting the reduce step that a regular join requires. Even on modest amounts of data this optimization is noticeably faster than a regular join; it not only removes the reduce step but can also reduce the work done in the map step.

In versions prior to Hive v0.7, using this optimization required adding a hint to the query to trigger it:

hive> SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
      FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
      WHERE s.symbol = 'AAPL';

Since Hive v0.7 this hint has been deprecated, although adding it is still valid. Instead, set the property hive.auto.convert.join to true (it defaults to false). You can also configure the size threshold below which a table counts as small enough for this optimization, via the property hive.mapjoin.smalltable.filesize (default 25000000 bytes).

hive> set hive.auto.convert.join=true;

hive> set hive.mapjoin.smalltable.filesize=25000000;

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
      FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
      WHERE s.symbol = 'AAPL';

Hive does not support this optimization for right outer joins or full outer joins.

If the data in all tables is bucketed, this optimization can in certain cases be applied even when the tables are large. In short, each table's data must be bucketed by the keys in the ON clause, and the bucket count of one table must be a multiple of the other's. When these conditions are met, Hive can join bucket against bucket during the map phase. This optimization is likewise not enabled by default; the parameter hive.optimize.bucketmapjoin must be set to true (default false).

hive> set hive.optimize.bucketmapjoin=true;
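A sketch with hypothetical bucketed tables; note that 96 is a multiple of 32:

hive> create table big_t (ymd string, v string) clustered by (ymd) into 96 buckets;
hive> create table small_t (ymd string, v string) clustered by (ymd) into 32 buckets;
hive> set hive.optimize.bucketmapjoin=true;
hive> select b.ymd, b.v, s.v from big_t b join small_t s on b.ymd = s.ymd;  -- joined bucket against bucket in the map phase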

20. ORDER BY and SORT BY

ORDER BY in Hive performs a global sort of the results, meaning all of the data passes through a single reducer. For large datasets this can take unacceptably long. If the property hive.mapred.mode is set to strict, Hive requires such statements to carry a LIMIT clause; by default the property is nonstrict.

SORT BY only sorts the data within each reducer, i.e. it performs a local sort. This guarantees that each reducer's output is ordered (though not globally), which can make a subsequent global sort more efficient.

If more than one reducer is used, the overall output order can vary greatly from run to run: each reducer's output is only locally sorted, so the ranges produced by different reducers may overlap.
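A contrast sketch, reusing the stocks table from the JOIN examples above:

hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s ORDER BY s.ymd;  -- global sort, one reducer
hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s SORT BY s.ymd;   -- local sort within each reducer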

21. DISTRIBUTE BY with SORT BY

DISTRIBUTE BY controls how map output is divided among the reducers (rows are routed to reducers according to the specified columns). All data in a MapReduce job is passed around as key-value pairs, so when Hive converts a query into a MapReduce job it must use this feature internally.

By default, the MapReduce framework hashes the keys emitted by the maps and uses the hash values to distribute the key-value pairs evenly across the reducers. This means that when we use SORT BY alone, the outputs of different reducers overlap significantly in range, even though each reducer's output is sorted.

DISTRIBUTE BY resembles GROUP BY in that it controls which reducer receives which rows, while SORT BY controls how the data is sorted within each reducer.

Note that Hive requires the DISTRIBUTE BY clause to appear before the SORT BY clause.
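For example, routing all rows of a symbol to the same reducer and sorting there (stocks as above):

hive> SELECT s.ymd, s.symbol, s.price_close
      FROM stocks s
      DISTRIBUTE BY s.symbol
      SORT BY s.symbol ASC, s.ymd ASC;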

22. CLUSTER BY

CLUSTER BY combines the functions of DISTRIBUTE BY and SORT BY.

Using a DISTRIBUTE BY ... SORT BY statement, or its shorthand CLUSTER BY, deprives SORT BY of its parallelism, but it produces output files whose data is globally sorted.
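The shorthand form, equivalent to DISTRIBUTE BY s.symbol SORT BY s.symbol:

hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s CLUSTER BY s.symbol;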

23. ROW_NUMBER(): take the TOP 1 of each group after sorting:

hive> SELECT * FROM (
        SELECT m.qq_fwbzh, m.xxzjbh,
               row_number() OVER (PARTITION BY m.qq_fwbzh ORDER BY m.xxrksj DESC) AS flag
        FROM zb_src.zy_sh_qqdjxx m
      ) t WHERE t.flag = 1;

24. Substitution (regexp_replace) in Hive:

select regexp_replace(sfzbh, "'", ""), regexp_replace(glkh, "'", "") from DWD_Z_SG_BG_MGJCRY limit 10;

25. Splitting a column into rows in Hive: lateral view explode(split(column, 'delimiter')). The lateral view must be given an alias.

Example: select gmsfhm, gddh2 from zb_dwa.DWA_R_JB_RYDHHMK t lateral view explode(split(t.gddh, ',')) a as gddh2 where gmsfhm = '152301198209100568';

152632196712060315    13088573907,13034744906

is converted into

152632196712060315    13088573907
152632196712060315    13034744906

26. Hive type conversion in a join condition:

cast(c.code as int) = cast(a.mz_dm as int)

27. Collapsing rows into a delimited column in Hive (the reverse of item 25)

Original data in sys_region (id, name):

1  a
1  b
2  c
2  d

select id, concat_ws(',', collect_set(name))
from sys_region
group by id;

Result:

1  a,b
2  c,d

28. Modify a table comment:

alter table DWA_R_GJ_GLKYGPXX set tblproperties ('comment' = 'Highway passenger ticketing information integration table');

29. Some Hive optimizations

set hive.groupby.skewindata=true;        -- automatically load-balance when group-by data is skewed
set hive.exec.compress.output=true;      -- whether Hive's final output is compressed
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;  -- output compression codec
set mapred.output.compression.type=BLOCK;  -- compression type; the default RECORD compresses individual records, BLOCK compresses in blocks
set mapreduce.map.memory.mb=2049;        -- memory per map task
set mapreduce.reduce.memory.mb=2049;     -- memory per reduce task
set hive.exec.parallel=true;             -- whether different jobs of the same SQL statement may run concurrently; default false
set hive.exec.parallel.thread.number=4;  -- maximum number of jobs that may run concurrently for one SQL statement; default 8
set mapred.max.split.size=256000000;     -- maximum file size handled by each map, in bytes
set mapred.min.split.size.per.node=100000000;  -- minimum split size that can be processed on a node
set mapred.min.split.size.per.rack=100000000;  -- minimum split size that can be processed in a rack
set hive.merge.mapfiles=true;            -- merge small files at the end of a map-only task
set hive.merge.mapredfiles=true;         -- merge small files at the end of a map-reduce task
set hive.merge.size.per.task=128000000;  -- target size of merged files
set hive.merge.smallfiles.avgsize=100000000;  -- if the average output file is smaller than this, start a separate map-reduce job to merge files
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- Hive's input InputFormat
set hive.hadoop.supports.splittable.combineinputformat=true;  -- whether splittable CombineInputFormat is supported
set mapred.reduce.tasks=10;              -- number of reduce tasks
set mapred.compress.map.output=false;    -- Hadoop parameter: whether map output is compressed
set mapred.output.compress=true;         -- Hadoop parameter: whether reduce output is compressed

30. Query the current timestamp:

hive> select from_unixtime(unix_timestamp()) from test;

31. Some regular expressions

select gmsfhm from DWD_R_JG_ZDRKXX where gmsfhm not rlike "^[0-9]{15}$" and gmsfhm not rlike "^[0-9]{17}[0-9Xx]{1}$";

select * from dwd_r_jg_zdrkxx where lxdh rlike "^\\+86[0-9]{11}$";

select * from dwd_r_jg_zdrkxx where lxdh rlike "^[0-9]{4}-[0-9]{7}$";

select * from dwd_r_jg_zdrkxx where lxdh rlike "^\\+86[0-9]{11}$" or lxdh rlike "^[0-9]{4}-[0-9]{7}$";

Get the current time: select from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss');

Match names (Chinese characters with an optional separator dot): select xm from dwd_r_jg_zdrkxx where xm not rlike "^[\\u4e00-\\u9fa5]+[.]?[\\u4e00-\\u9fa5]+$";

select * from dwd_r_jg_zdrkxx where xm is null;

32. Create a table in Hive, specifying the field delimiter

hive (zb_dim)> create table code_zylb (code string, name string) row format delimited fields terminated by '\t';
OK
Time taken: 0.131 seconds

hive (zb_dim)> load data local inpath '/home/hadoop/code_zylb.txt' into table code_zylb;
Loading data to table zb_dim.code_zylb
Table zb_dim.code_zylb stats: [numFiles=1, totalSize=10765]
OK
Time taken: 0.426 seconds

33. Add a column:

alter table civil_aviation_in_port add columns (sfzh string comment 'ID number');

34. Start hiveserver2 in DEBUG mode and bind it to a specified host:

nohup hive --service hiveserver2 --hiveconf hive.root.logger=DEBUG,console --hiveconf hive.server2.thrift.bind.host=hadoop02 &
