1. Display the current database in the command-line prompt:
hive> set hive.cli.print.current.db=true;
2. Display column names in query results:
hive> set hive.cli.print.header=true;
3. Execute a shell command:
hive> ! ls;
4. Execute a dfs command:
hive> dfs -ls;
5. In Hive scripts, -- marks a comment; comments cannot be used on the command line.
6. Data type conversion:
hive> cast(s as int)
7. Display the details of a table:
hive> describe formatted/extended tablename;
8. Set Hive to "strict" mode: a query against a partitioned table is rejected unless its WHERE clause filters on a partition column.
hive> set hive.mapred.mode=strict/nonstrict;
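For example, with a hypothetical table logs partitioned by dt, strict mode rejects the first query below and accepts the second:
hive> select * from logs;                        (rejected: no partition filter)
hive> select * from logs where dt='2018-01-01';  (accepted)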
9. How to write a dynamic partition insert:
hive> insert overwrite table table1 partition (a, b) select ..., c, d from test2;
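A fuller sketch, assuming table1 is partitioned by columns a and b and test2 supplies matching columns (names here are illustrative); dynamic partitioning must first be enabled:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table table1 partition (a, b) select c, d, a, b from test2;
The partition columns a and b must come last in the SELECT list; Hive creates the partitions from their values at run time.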
10. rand(): returns a random value in the range [0, 1).
11. See how Hive converts a query into MapReduce tasks:
hive> explain select sum(a) from test;
hive> explain extended select sum(a) from test;
12. Log at DEBUG level and output to the console:
hive --hiveconf hive.root.logger=DEBUG,console
13. Discover and describe functions:
hive> show functions;
hive> describe function concat;
hive> describe function extended concat;
14. Standard functions: UDF (one to one)
Aggregate functions: UDAF (many to one)
Table-generating functions: UDTF (one to many)
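One built-in function of each kind, as an illustration (src as in the macro example below):
hive> select concat('a', 'b') from src limit 1;               (UDF: one to one)
hive> select count(*) from src;                               (UDAF: many to one)
hive> select explode(array(1, 2, 3)) as x from src limit 3;   (UDTF: one to many)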
15. Macros (sometimes more convenient than functions):
hive> create temporary macro sigmoid(x double) 1.0 / (1.0 + exp(-x));
hive> select sigmoid(2) from src limit 1;
16. About users, groups, and roles in Hive:
hive (default)> set hive.security.authorization.enabled;
hive.security.authorization.enabled=false
hive (default)> set system:user.name;
system:user.name=hadoop
hive (default)> show grant user hadoop;
OK
default                     hadoop  USER  ALL  false  1522203964000  hive
test                        hadoop  USER  ALL  false  1522205200000  hive
zb_dwa                      hadoop  USER  ALL  false  1522203974000  hive
zb_dwd                      hadoop  USER  ALL  false  1522203972000  hive
zb_src                      hadoop  USER  ALL  false  1522203972000  hive
zb_src_test                 hadoop  USER  ALL  false  1522203972000  hive
zb_dwa  t_code              hadoop  USER  ALL  true   1524211395000  hadoop
zb_dwa  t_code_source       hadoop  USER  ALL  true   1524204335000  hadoop
Time taken: 2.214 seconds, Fetched: 18 row(s)
hive (default)> show grant user hadoop on database default;
OK
default                     hadoop  USER  ALL  false  1522203964000  hive
Time taken: 0.026 seconds, Fetched: 1 row(s)
17. When can Hive avoid MapReduce?
hive> set hive.exec.mode.local.auto=true;
Some queries in Hive do not need MapReduce at all; this is the so-called local mode. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory corresponding to employees and output the formatted content to the console. Similarly, when the only filter conditions in the WHERE clause are on partition columns, no MapReduce step is needed.
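For instance, both of the following can be answered by reading the table's files directly, assuming employees is partitioned by a column country (a hypothetical layout):
hive> select * from employees;
hive> select * from employees where country='US' limit 100;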
18. JOIN optimization
□ When three or more tables are JOINed, only one MapReduce job is generated if every ON clause uses the same join key.
□ Hive also assumes that the last table in the query is the largest one. While joining each row, it tries to cache the other tables and then streams the last table through for the computation. The user should therefore arrange the tables so that their sizes increase from left to right in the query.
□ Hive also provides a hint mechanism to tell the optimizer explicitly which table is the big one, as follows:
hive> SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol WHERE s.symbol = 'AAPL';
19. Map-side JOIN
If all but one of the tables are small, the small tables can be held entirely in memory while the largest table streams through the mappers. Hive can then perform the whole join on the map side (hence "map-side JOIN"): it matches each row against the small tables held in memory, omitting the reduce phase that a regular join requires. Even on small datasets this optimization is noticeably faster than a regular join; it not only removes the reduce phase but sometimes also reduces the work done in the map phase.
In versions prior to Hive v0.7, you had to add a hint to the query to trigger this optimization:
hive> SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol WHERE s.symbol = 'AAPL';
This hint has been deprecated since Hive v0.7, although adding it is still valid. Instead, set the property hive.auto.convert.join to true (the default is false). You can also configure the size threshold below which a table is considered small via the property hive.mapjoin.smalltable.filesize, whose default is 25000000 bytes.
hive> set hive.auto.convert.join=true;
hive> set hive.mapjoin.smalltable.filesize=25000000;
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol WHERE s.symbol = 'AAPL';
Hive does not support this optimization for right outer joins or full outer joins.
If the data in all tables is bucketed, this optimization can also be applied to large tables in certain cases. In short, the data in each table must be bucketed by the keys used in the ON clause, and the bucket count of one table must be a multiple of the bucket count of the other. When these conditions are met, Hive can join bucket against bucket during the map phase. This optimization is likewise not enabled by default; the parameter hive.optimize.bucketmapjoin must be set to true (the default is false).
hive> set hive.optimize.bucketmapjoin=true;
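A sketch of the bucketing precondition, using hypothetical bucketed versions of the earlier tables; both are clustered by the join key, and one bucket count is a multiple of the other:
hive> create table stocks_b (ymd string, symbol string, price_close float) clustered by (symbol) into 96 buckets;
hive> create table dividends_b (ymd string, symbol string, dividend float) clustered by (symbol) into 32 buckets;
hive> set hive.optimize.bucketmapjoin=true;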
20. ORDER BY and SORT BY
ORDER BY in Hive performs a global sort of the results, which means all of the data is pushed through a single reducer. For large datasets this can take an unacceptably long time. If the property hive.mapred.mode is set to strict, Hive requires such statements to include a LIMIT clause. By default this property is nonstrict.
SORT BY sorts the data only within each reducer, i.e. it performs a local sort. This guarantees that the output of each reducer is ordered (but not globally ordered), which can improve the efficiency of a subsequent global sort.
If more than one reducer is used, the overall output order will differ markedly from a global sort: each reducer's output is only locally sorted, so the outputs of different reducers will typically overlap.
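Side by side, on the stocks table from the join examples above:
hive> select s.ymd, s.symbol, s.price_close from stocks s order by s.ymd asc, s.symbol desc;   (global sort, single reducer)
hive> select s.ymd, s.symbol, s.price_close from stocks s sort by s.ymd asc, s.symbol desc;    (local sort within each reducer)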
21. DISTRIBUTE BY with SORT BY
DISTRIBUTE BY controls how map output is divided among reducers (rows are routed to reducers according to the specified columns). All data in a MapReduce job is passed around as key-value pairs, so when Hive translates a query into a MapReduce job it necessarily uses this facility internally.
By default, the MapReduce framework hashes the map output keys and distributes the key-value pairs evenly across the reducers according to those hash values. So when we use SORT BY alone, the output ranges of different reducers overlap substantially, even though each reducer's own output is sorted.
DISTRIBUTE BY is similar to GROUP BY in that it controls which rows each reducer receives, while SORT BY controls how the data is sorted within a reducer.
Note that Hive requires the DISTRIBUTE BY clause to come before the SORT BY clause.
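For example, to guarantee that all rows for a given symbol land in the same reducer and are sorted there:
hive> select s.ymd, s.symbol, s.price_close from stocks s distribute by s.symbol sort by s.symbol asc, s.ymd asc;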
22. CLUSTER BY
CLUSTER BY combines the functionality of DISTRIBUTE BY with that of SORT BY.
Using DISTRIBUTE BY ... SORT BY, or its shorthand CLUSTER BY, takes parallelism away from SORT BY, but in exchange the output data is globally sorted.
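When the DISTRIBUTE BY and SORT BY columns are the same, the query from the previous tip can be shortened:
hive> select s.ymd, s.symbol, s.price_close from stocks s cluster by s.symbol;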
23. ROW_NUMBER() to take the TOP 1 of each group:
hive> SELECT * FROM (SELECT m.qq_fwbzh, m.xxzjbh, row_number() OVER (PARTITION BY m.qq_fwbzh ORDER BY m.xxrksj DESC) as flag FROM zb_src.zy_sh_qqdjxx m) t WHERE t.flag=1;
24. Substitution in Hive:
select regexp_replace(sfzbh, "'", ""), regexp_replace(glkh, " ", "") from DWD_Z_SG_BG_MGJCRY limit 10;
25. Splitting one column into multiple rows in Hive. Usage: lateral view explode(split(column, 'delimiter')); the exploded column must be given an alias.
Example: select gmsfhm, gddh2 from zb_dwa.DWA_R_JB_RYDHHMK t lateral view explode(split(t.gddh, ',')) a as gddh2 where gmsfhm='152301198209100568';
This converts
152632196712060315    13088573907,13034744906
into
152632196712060315    13088573907
152632196712060315    13034744906
26. Hive type conversion:
cast(c.code as int) = cast(a.mz_dm as int)
27. Aggregating rows into a single column in Hive (the reverse of tip 25):
Original data in sys_region (id, name):
1 a
1 b
2 c
2 d
select id, concat_ws(',', collect_set(name))
from sys_region
group by id;
Result:
1 a,b
2 c,d
28. Modify a table comment:
alter table DWA_R_GJ_GLKYGPXX set tblproperties ('comment' = 'Highway passenger ticketing information integration table');
29. Some Hive optimizations:
set hive.groupby.skewindata=true;  -- automatically load-balance when data is skewed
set hive.exec.compress.output=true;  -- whether Hive's final output is compressed
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;  -- output compression codec
set mapred.output.compression.type=BLOCK;  -- compression type; the default RECORD compresses individual records, BLOCK compresses blocks
set mapreduce.map.memory.mb=2049;  -- memory per map task
set mapreduce.reduce.memory.mb=2049;  -- memory per reduce task
set hive.exec.parallel=true;  -- whether different jobs of the same SQL may run at the same time; default false
set hive.exec.parallel.thread.number=4;  -- maximum number of jobs of the same SQL that may run simultaneously; default 8
set mapred.max.split.size=256000000;  -- maximum file size handled by each map, in bytes
set mapred.min.split.size.per.node=100000000;  -- minimum file size that can be processed on a node
set mapred.min.split.size.per.rack=100000000;  -- minimum file size that can be processed in a rack
set hive.merge.mapfiles=true;  -- merge small files at the end of a map-only task
set hive.merge.mapredfiles=true;  -- merge small files at the end of a map-reduce task
set hive.merge.size.per.task=128000000;  -- target size of merged files
set hive.merge.smallfiles.avgsize=100000000;  -- when the average output file size is below this value, start a separate map-reduce task to merge files
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- Hive input InputFormat
set hive.hadoop.supports.splittable.combineinputformat=true;  -- whether splittable CombineInputFormat is supported
set mapred.reduce.tasks=10;  -- number of reduce tasks
set mapred.compress.map.output=false;  -- Hadoop parameter: whether map output is compressed
set mapred.output.compress=true;  -- Hadoop parameter: whether reduce output is compressed
30. Query the current timestamp:
hive> select from_unixtime(unix_timestamp()) from test;
31. Some regular expressions:
select gmsfhm from DWD_R_JG_ZDRKXX where gmsfhm not rlike "^[0-9]{15}$" and gmsfhm not rlike "^[0-9]{17}[0-9Xx]{1}$";
select * from dwd_r_jg_zdrkxx where lxdh rlike "^\\+86[0-9]{11}$";
select * from dwd_r_jg_zdrkxx where lxdh rlike "^[0-9]{4}-[0-9]{7}$";
select * from dwd_r_jg_zdrkxx where lxdh rlike "^\\+86[0-9]{11}$" or lxdh rlike "^[0-9]{4}-[0-9]{7}$";
Get the current time: select from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss');
Match names: select xm from dwd_r_jg_zdrkxx where xm not rlike "^[\\u4e00-\\u9fa5]+[.]?[\\u4e00-\\u9fa5]+$";
select * from dwd_r_jg_zdrkxx where xm is null;
32. Create a table with a specified delimiter:
hive (zb_dim)> create table code_zylb (code string, name string) ROW FORMAT delimited fields terminated by '\t';
OK
Time taken: 0.131 seconds
hive (zb_dim)> load data local inpath '/home/hadoop/code_zylb.txt' into table code_zylb;
Loading data to table zb_dim.code_zylb
Table zb_dim.code_zylb stats: [numFiles=1, totalSize=10765]
OK
Time taken: 0.426 seconds
33. Add a column:
alter table civil_aviation_in_port add columns (sfzh string comment 'ID number');
34. Start hiveserver2 in DEBUG mode and bind it to a specific host:
nohup hive --service hiveserver2 --hiveconf hive.root.logger=DEBUG,console --hiveconf hive.server2.thrift.bind.host=hadoop02 &