Good Programmer big data learning route: Hive operations

2025-04-22 Update, from SLTechnology News&Howtos

This installment of the Good Programmer big data learning route covers working with Hive.

Hive properties can be set in three places:
1. On the client (valid only for the current session)
2. In Java code (valid for the current connection)
3. In the configuration file (valid for all sessions)

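Priority decreases in that order: a client setting overrides one made in Java code, which in turn overrides the configuration file. The CLI can only set properties that are not needed at Hive startup; startup properties such as log properties and metadata connection properties cannot be changed there.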
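The precedence rule can be pictured as a layered lookup. The Python sketch below is only a model of the idea, not Hive's implementation, and the property values in it are hypothetical.

```python
from collections import ChainMap

# Hypothetical property values, for illustration only.
config_file = {"hive.exec.parallel": "false", "hive.exec.mode.local.auto": "false"}
java_code = {"hive.exec.parallel": "true"}
client_session = {"hive.exec.mode.local.auto": "true"}

# ChainMap searches its maps from left to right, so the leftmost layer wins.
effective = ChainMap(client_session, java_code, config_file)


def lookup(name):
    """Return the effective value of a property, or None if no layer sets it."""
    return effective.get(name)


print(lookup("hive.exec.parallel"))         # set in Java code, overrides the file
print(lookup("hive.exec.mode.local.auto"))  # set on the client, overrides the file
```

A property that only the configuration file defines would fall through both upper layers and be read from the file.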

List all properties:
hive> set;

List the current values including the Hadoop properties:
hive> set -v;

Search for a property by a fragment of its name:
hive -S -e "set" | grep current
hive -S -e "set" | grep index

Hive has four variable namespaces: system, env, hivevar, and hiveconf.

system: system-level variables (JVM, Hadoop, etc.); readable and writable.
hive> set system:min.limit=3;
hive> set system:min.limit;
system:min.limit=3

env: environment variables (such as HADOOP_HOME); read-only.
hive> set env:PWD;
env:PWD=/usr/local/hive-1.2.1

hivevar: custom temporary variables (readable and writable).
hive> set hivevar:min.limit=3;
hive> set hivevar:min.limit;
hivevar:min.limit=3
hive> set hivevar:min.limit=2;
hive> set hivevar:min.limit;
hivevar:min.limit=2

hiveconf: custom temporary property variables (readable and writable).
hive> set hiveconf:max.limit=10;
hive> set hiveconf:max.limit;
hiveconf:max.limit=10
hive> set hiveconf:max.limit=6;
hive> set hiveconf:max.limit;
hiveconf:max.limit=6

Ways to run Hive:
1. The interactive client (ad hoc statistics, development)
2. hive -S -e "hql statement" (suitable for a single HQL query)
3. hive -S -f <hql file> (runs a script file of HQL statements)

Without parameters:
hive -S -e "use qf1603; select * from user1;"
hive -S -f /home/su.hql

With parameters (Hive versions before 0.9 do not support running -f with parameters). The queries are single-quoted so that the shell passes ${...} through to Hive unchanged:
hive --hivevar min_limit=3 --hivevar t_n=user1 -e 'use qf1603; select * from ${hivevar:t_n} limit ${hivevar:min_limit};'
hive --hiveconf min_lit=3 -e 'use qf1603; select * from user1 limit ${hiveconf:min_lit};'
hive -S --hiveconf t_n=user1 --hivevar min_limit=3 -f ./su.hql
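When such calls are issued from a script, it can help to assemble the argument list programmatically. The sketch below is one possible helper; the name build_hive_command is hypothetical, and it only uses the CLI flags shown above (-S, -e, -f, --hivevar, --hiveconf). It builds the command but does not run it.

```python
import shlex


def build_hive_command(hivevars=None, hiveconfs=None, query=None, script=None, silent=True):
    """Build the argument list for a hive CLI call; pass exactly one of query/script."""
    if (query is None) == (script is None):
        raise ValueError("pass exactly one of query or script")
    args = ["hive"]
    if silent:
        args.append("-S")
    for name, value in (hiveconfs or {}).items():
        args += ["--hiveconf", f"{name}={value}"]
    for name, value in (hivevars or {}).items():
        args += ["--hivevar", f"{name}={value}"]
    args += ["-e", query] if query is not None else ["-f", script]
    return args


cmd = build_hive_command(hiveconfs={"t_n": "user1"}, hivevars={"min_limit": 3}, script="./su.hql")
print(shlex.join(cmd))
```

Passing the list form to subprocess.run (rather than a single shell string) would keep the shell from touching any ${hivevar:...} references inside a query.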

Comments in Hive start with two hyphens:

-- comment content
insert overwrite local directory '/home/out/05'
select * from user1 limit 3;

# 3. Hive optimization

Optimization happens at three levels:
1. Environment optimization (number of Linux file handles, application memory allocation, load, etc.)
2. Optimization of application configuration properties
3. Code optimization (rewriting the HQL where possible)

1. Learn to read explain

explain: displays the execution plan of an HQL query.
explain extended: displays the execution plan together with the query's abstract syntax tree (the output of the parser).

explain select * from user1;
explain extended select * from user1;

An HQL statement consists of one or more stages. A stage corresponds to a MapReduce job, or to an operation such as a fetch, a map join, or a limit. Stages run in dependency order, and stages that do not depend on each other can run in parallel.
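To make the dependency rule concrete, the following Python sketch groups stages into "waves" that could run side by side. It is a simplified model, not Hive's scheduler; the function name and the stage plan are hypothetical.

```python
def parallel_waves(dependencies):
    """Group stages into waves; each stage depends only on stages in earlier waves."""
    remaining = {stage: set(deps) for stage, deps in dependencies.items()}
    done, waves = set(), []
    while remaining:
        # A stage is ready once everything it depends on has finished.
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("unresolvable stage dependencies")
        waves.append(ready)
        done.update(ready)
        for stage in ready:
            del remaining[stage]
    return waves


# Hypothetical plan: Stage-1 and Stage-2 are independent, Stage-3 needs both,
# and Stage-0 (for example a final fetch) needs Stage-3.
plan = {
    "Stage-1": [],
    "Stage-2": [],
    "Stage-3": ["Stage-1", "Stage-2"],
    "Stage-0": ["Stage-3"],
}
print(parallel_waves(plan))
```

In this plan only the first wave contains more than one stage, so only Stage-1 and Stage-2 could benefit from parallel execution.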

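2. Optimization of limit

With hive.limit.optimize.enable=true, a simple limit query can sample part of the input instead of reading all of it. The related properties, shown with their default values:

hive.limit.row.max.size=100000
hive.limit.optimize.limit.file=10
hive.limit.optimize.enable=false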

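3. Optimization of join

Always let the small table drive the large table (the small result set drives the large result set). A hint can control which table is streamed: /*+ STREAMTABLE(alias) */ names the table that Hive streams through the join while buffering the others, so it should name the large table rather than the small one.

Where the business logic allows, prefer map-side joins, controlled by hive.auto.convert.join and the small-table (smalltable) size setting.

Avoid Cartesian products in join queries; if one is unavoidable, filter it with on or where.

Hive joins currently support only equality conditions (= combined with and); other join conditions are not supported.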

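4. Use Hive local mode (the job runs in a single JVM)

With hive.exec.mode.local.auto=true, Hive decides automatically whether a small job can run locally, based on its input size and number of input files. The properties, shown with their default values:

hive.exec.mode.local.auto=false
hive.exec.mode.local.auto.inputbytes.max=134217728
hive.exec.mode.local.auto.input.files.max=4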
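The decision these three properties drive can be sketched roughly as follows. This is an approximation for illustration, not Hive's source code: the function name is hypothetical, the reducer condition (at most one reducer) is an assumption taken from Hive's documented behavior rather than from the text above, and the exact boundary handling may differ between versions.

```python
def would_run_locally(input_bytes, input_files, reducers,
                      auto=True, max_bytes=134217728, max_files=4):
    """Approximate check for whether automatic local mode would apply to a job."""
    return (
        auto                            # hive.exec.mode.local.auto
        and input_bytes <= max_bytes    # hive.exec.mode.local.auto.inputbytes.max
        and input_files <= max_files    # hive.exec.mode.local.auto.input.files.max
        and reducers <= 1               # assumption: local mode needs 0 or 1 reducer
    )


MB = 1024 * 1024
print(would_run_locally(10 * MB, 2, 1))    # small input, few files
print(would_run_locally(200 * MB, 2, 1))   # input larger than 128 MB
print(would_run_locally(10 * MB, 5, 1))    # too many input files
```

Only the first call satisfies every condition; the other two exceed one of the thresholds and would be submitted to the cluster as usual.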

5. Hive parallel execution (stages that do not depend on each other can run in parallel)

hive.exec.parallel=false
hive.exec.parallel.thread.number=8

6. Strict mode

Hive's strict mode (hive.mapred.mode=strict) blocks three kinds of queries:
1. Queries on a partitioned table that do not filter on a partition
2. Queries with order by but no limit
3. Join queries with neither an on condition nor a where condition (Cartesian products)

7. Set the number of mappers and reducers

With too many mappers, task startup takes longer than the work itself; with too few, resources are not fully used. The same holds for reducers: too many and startup overhead dominates, too few and resources sit idle.

Number of mappers, set manually:

set mapred.map.tasks=2;

Adjusting the block size appropriately changes the number of input splits, and with it the number of mappers.

Reduce the number of mappers by merging small files:

set mapred.max.split.size=256000000; -- about 256 MB
set mapred.min.split.size.per.node=1;
set mapred.min.split.size.per.rack=1;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
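The effect of combining small files can be illustrated with a greedy packing sketch. This is a deliberately simplified model: the real combine input format also takes node and rack locality into account and subdivides large files, which the hypothetical combine_splits function below ignores. The file sizes are made up.

```python
def combine_splits(file_sizes, max_split_size=256_000_000):
    """Greedily pack files into splits whose total size stays within max_split_size."""
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        # Close the current split when the next file would push it over the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


small_files = [10_000_000] * 100   # hypothetical: one hundred 10 MB files
splits = combine_splits(small_files)
print(len(small_files), "files ->", len(splits), "splits")
```

Without combining, each of these files would typically get its own mapper; packed this way, the same data is covered by a handful of splits.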

Number of reducers (usually set manually; the default of -1 lets Hive estimate the number):

set mapreduce.job.reduces=-1;
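When the value is left at -1, Hive estimates the reducer count from the input size. The sketch below shows the usual form of that estimate. It relies on two properties not mentioned in this article, hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max, and the default values used here are assumptions that vary between Hive versions.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=256_000_000,
                      max_reducers=1009, requested=-1):
    """Approximate reducer count: an explicit setting wins, otherwise size-based."""
    if requested > 0:
        # mapreduce.job.reduces was set manually.
        return requested
    # Assumed defaults for hive.exec.reducers.bytes.per.reducer and
    # hive.exec.reducers.max; check the values of your own installation.
    estimated = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, estimated))


print(estimate_reducers(1_000_000_000))               # about 1 GB of input
print(estimate_reducers(1_000_000_000, requested=8))  # manual setting wins
```

Lowering the bytes-per-reducer value raises the estimate, which is one way to get more reducers without hard-coding a number.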

8. JVM reuse in Hive

mapreduce.job.jvm.numtasks=1 (the number of tasks run in one JVM)

set mapred.job.reuse.jvm.num.tasks=8; -- older name of the same property

9. Data skew

Data skew means that the values of a column are unevenly distributed. Common causes:
1. The source data itself is skewed
2. The way the HQL statement is written
3. join, which causes skew very easily
4. count(distinct col)
5. group by, which also causes skew easily

Solutions:
1. If the data itself is skewed, check whether the skewed data can be split off directly (identify the skewed records).
2. Compute the skewed data separately, then union all the result with the result for the normal data.
3. Attach a random number to the skewed keys before the join, so that the work is spread evenly across tasks.
4. Try rewriting the HQL statement without changing the requirement.
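The random-number approach in solution 3 is often called salting. The Python sketch below simulates its effect on how rows are distributed over reducers. It is a toy model under stated assumptions: the partitioner is a stand-in for a hash function, the salt is assigned round-robin instead of randomly to keep the output reproducible, and the keys and counts are made up.

```python
from collections import Counter

NUM_REDUCERS = 4
SALT_BUCKETS = 8


def reducer_for(key):
    """Deterministic stand-in for a hash partitioner."""
    return sum(key.encode("utf-8")) % NUM_REDUCERS


def load_per_reducer(keys):
    """Count how many rows each reducer would receive."""
    return Counter(reducer_for(k) for k in keys)


# Hypothetical data: one hot key with 1000 rows plus 100 ordinary keys.
keys = ["hot"] * 1000 + [f"k{i}" for i in range(100)]
before = load_per_reducer(keys)

# Salt only the hot key, so its rows spread over SALT_BUCKETS distinct keys.
salted_keys = [f"{k}_{i % SALT_BUCKETS}" if k == "hot" else k
               for i, k in enumerate(keys)]
after = load_per_reducer(salted_keys)

print("max rows on one reducer before:", max(before.values()))
print("max rows on one reducer after: ", max(after.values()))
```

In a real join, the other table's matching rows would have to be replicated once per salt value so that every salted key still finds its partner, and the salt is stripped again after the join.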

Several properties that help with skew:

hive.map.aggr=true
hive.groupby.skewindata=false
hive.optimize.skewjoin=false

10. Controlling the number of jobs

Keep the types of the fields compared in a join's on clause as consistent as possible. A simple HQL statement usually generates one job; join, limit, and group by can each generate an additional job.

For example, a lookup written with an in subquery and limit:

select
u.uid,
u.uname
from user1 u
where u.uid in (select l.uid from login l where l.uid=1 limit 1);

The same kind of lookup written as a join:

select
u.uid,
u.uname
from user1 u
join login l
on u.uid = l.uid
where l.uid = 1;

Partitioning, bucketing, and indexing are themselves forms of Hive optimization.
