
Introduction to Analysis of Hive (3)


8 Hive Shell operation

8.1 introduction to scripts under Hive bin

8.2 Hive Shell basic Operation

1. Hive command line

hive [-hiveconf x=y]* [<-i filename>]* [<-f filename> | <-e query-string>] [-S]

-i initialize HQL from a file

-e execute the specified HQL from the command line

-f execute HQL script

-v outputs the executed HQL statement to the console

-p <port> connect to HiveServer on the specified port

-hiveconf x=y use this to set hive/hadoop configuration variables

Hive command line example

Executes the specified sql statement from the command line

$HIVE_HOME/bin/hive -e 'select a.col from tab1 a'

Executes the specified sql statement with the specified hive environment variable

$HIVE_HOME/bin/hive -e 'select a.col from tab1 a' -hiveconf hive.exec.scratchdir=/home/my/hive_scratch -hiveconf mapred.reduce.tasks=32

Executes the specified sql statement in silent mode and exports the execution result to the specified file:

$HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt

Execute sql files in non-interactive mode

$HIVE_HOME/bin/hive -f /home/my/hive-script.sql

Execute the initialization sql file before entering interactive mode

$HIVE_HOME/bin/hive -i /home/my/hive-init.sql

Hive Interactive Shell Command

When $HIVE_HOME/bin/hive is run without the -e or -f option, Hive enters interactive mode

End each command with a semicolon (;)
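
A minimal interactive session might look like this (the query is only an illustration, reusing the tab1 table from the examples above):

$HIVE_HOME/bin/hive
hive> show tables;
hive> select a.col from tab1 a;
hive> quit;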

8.3 Logs

Hive uses Log4J to process logs

We can set the log level of Hive with the following command

$HIVE_HOME/bin/hive -hiveconf hive.root.logger=INFO,console

hive.root.logger accepts levels such as INFO and DEBUG.

8.4 Resources

Hive add Resources

Hive can dynamically add resources, such as files

Normally, we add files when interacting with Hive

It is actually controlled by Hadoop's Distributed Cache.

Examples

ADD { FILE[S] | JAR[S] | ARCHIVE[S] } <filepath> [<filepath>]*

LIST { FILE[S] | JAR[S] | ARCHIVE[S] } [<filepath> ..]

DELETE { FILE[S] | JAR[S] | ARCHIVE[S] } [<filepath> ..]
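
For example, in an interactive session (the file path here is only an illustration):

hive> ADD FILE /tmp/my_script.py;
hive> LIST FILES;
hive> DELETE FILE /tmp/my_script.py;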

9 Hive optimization

9.1 Features of the Hadoop computing framework

1. What is data skew

Data skew means that the data is unevenly distributed, so a large amount of data ends up concentrated at a single point, creating a data hotspot.

2. Characteristics of Hadoop framework

Hadoop is not afraid of large amounts of data; it is afraid of data skew.

Jobs made up of a large number of sub-jobs are relatively inefficient. For example, even if a table with only a few hundred rows is joined and summarized multiple times, more than a dozen jobs may be generated and take a long time to run, because MapReduce jobs take a long time to initialize.

UDAFs such as sum, count, max, and min are not affected by data skew, because Hadoop's map-side aggregation and merging make skew a non-issue for them.

count(distinct) is inefficient when the amount of data is large, because count(distinct) groups by the GROUP BY fields and sorts by the DISTINCT field, and that distribution is usually very skewed.

9.2 Common means of optimization

Solve the problem of data skew

Reduce the number of jobs

Setting a reasonable number of map and reduce tasks can effectively improve performance.

Understanding the data distribution and resolving the data skew problem yourself is a good choice.

When the amount of data is large, use count (distinct) cautiously.

Merging small files is an effective way to improve scheduling efficiency.

Keep the whole picture in mind when optimizing; the optimum of a single job is not as good as the optimum of the whole workload.

9.3 Optimization of data types in Hive-- Optimization principle

Partition according to certain rules (for example, by date). Through partitioning, specifying partitions when querying will greatly reduce the scanning of useless data and make it very convenient for data cleaning.

Set buckets reasonably. For some big-data joins, a map join can run out of memory. With a Bucket Map Join, only one bucket of the small table has to fit in memory at a time, so tables that cannot fit entirely in memory can still be joined this way. This requires that the join condition use the bucketing key, and requires the following setting

set hive.optimize.bucketmapjoin = true
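
A minimal sketch of a Bucket Map Join, assuming two tables bucketed on the join key (table and column names are hypothetical):

CREATE TABLE big_t (id INT, uid INT) CLUSTERED BY (uid) INTO 32 BUCKETS;
CREATE TABLE small_t (uid INT, name STRING) CLUSTERED BY (uid) INTO 32 BUCKETS;
set hive.optimize.bucketmapjoin = true;
SELECT /*+ MAPJOIN(s) */ b.id, s.name FROM big_t b JOIN small_t s ON (b.uid = s.uid);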

9.4 operational optimization of Hive

Full sort

How to make Cartesian product

How to determine the number of map

How to determine the number of reducer

Merge MapReduce operation

Bucket and sampling

Partition

JOIN

Group By

Merge small files

1. Full sorting

Hive's sorting keyword is SORT BY, which is deliberately different from ORDER BY in traditional databases to emphasize the difference between the two: SORT BY only sorts within a single reducer (a single machine).
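
As an illustration, assuming a table log with an id column, ORDER BY forces a single globally ordered output through one reducer, while DISTRIBUTE BY plus SORT BY only orders rows within each reducer:

SELECT * FROM log ORDER BY id;
SELECT * FROM log DISTRIBUTE BY id SORT BY id;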

2. How to make Cartesian product

Cartesian products are not allowed in HQL statements when Hive is set to strict mode (hive.mapred.mode=strict)

MapJoin is the solution.

MapJoin, as its name implies, completes the Join operation on the Map side. This requires that one or more tables of the Join operation be fully read into memory

MapJoin is used by adding the hint /*+ MAPJOIN(tablelist) */ after the SELECT keyword of the query/subquery, prompting the optimizer to convert the join into a MapJoin (currently Hive's optimizer cannot do this automatically)

Here tablelist can be one table or a comma-separated list of tables. The tables in tablelist are read into memory, so the small table should be listed here

When taking a Cartesian product between a large table and a small table, the way to avoid the Cartesian product is to add a join key. The principle is simple: extend the small table with a join_key column and replicate its rows once per key value, and extend the large table with a join_key column filled with random numbers in the same range
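
A rough sketch of this trick, with hypothetical table and column names and 10 replicas of the small table:

SELECT /*+ MAPJOIN(s) */ b.id, s.name
FROM (SELECT t.*, cast(rand() * 10 AS int) AS join_key FROM big_table t) b
JOIN (SELECT t.*, k AS join_key
      FROM small_table t
      LATERAL VIEW explode(array(0,1,2,3,4,5,6,7,8,9)) tmp AS k) s
ON (b.join_key = s.join_key);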

3. Control the number of maps in Hive

Typically, a job generates one or more map tasks based on the input directory

The main determining factors are the total number of input files, the input file sizes, and the file block size configured for the cluster (currently 128 MB; it can be viewed in Hive with the set dfs.block.size; command, and cannot be customized from within Hive).

Is it true that the more maps, the better?

The answer is no. If a task has many small files (far smaller than the 128 MB block size), each small file is treated as a block and processed by its own map task, and a map task's startup and initialization time is much longer than its logical processing time, which wastes a lot of resources. Moreover, the number of maps that can run simultaneously is limited.

Is it then enough to ensure that each map handles a file block close to 128 MB, so you can rest easy?

The answer is not necessarily. For example, a 127 MB file would normally be processed by one map, but this file may have only one or two small fields and tens of millions of records.

If the map processing logic is complex, handling it with a single map task will certainly be time-consuming.

To solve the above two problems, we need two approaches: reducing the number of maps and increasing the number of maps.


For example:

A) Assume there is a file a in the input directory with a size of 780 MB. Hadoop will split file a into seven blocks (six 128 MB blocks and one 12 MB block), producing seven maps

B) Assume there are three files in the input directory with sizes of 10 MB, 20 MB, and 130 MB. Hadoop will split them into four blocks (10 MB, 20 MB, 128 MB, and 2 MB), producing four maps.

That is, if a file is larger than the block size (128 MB), it will be split; if it is smaller than the block size, the whole file is treated as one block
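
The usual knobs are shown below; the values are only examples. Enabling CombineHiveInputFormat packs small files together and so reduces the number of maps, while lowering mapred.max.split.size increases the number of maps:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;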

4. How to determine the number of reducers

In the Hadoop MapReduce program, the setting of the number of reducer greatly affects the execution efficiency.

If the number of reducers is not specified, Hive estimates it based on the following two settings:

Parameter 1: hive.exec.reducers.bytes.per.reducer (default is 1 GB)

Parameter 2: hive.exec.reducers.max (default is 999)

Formula for calculating reducer number

N = min(parameter 2, total input data size / parameter 1)

Based on Hadoop experience, parameter 2 can be set to 0.95 * (number of TaskTrackers in the cluster)
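
For example, either parameter can be adjusted per session, or the reducer count can be fixed directly (the values below are illustrative):

set hive.exec.reducers.bytes.per.reducer=500000000;
set mapred.reduce.tasks=15;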

More reducers are not always better.

Like map, starting and initializing reduce takes time and resources

In addition, there will be as many output files as there are reducers. If many small files are generated and these small files become the input of the next task, the problem of too many small files arises.

When is there only one reduce?

In many cases you will find that no matter how much data the task processes, and no matter whether you set parameters to adjust the number of reducers, the task always has only one reduce task.

In fact, besides the case where the amount of data is less than the value of the hive.exec.reducers.bytes.per.reducer parameter, a task has only one reduce task for the following reasons:

A) aggregation without GROUP BY, such as select count(1) from tab

B) Order by is used

5. Merge MapReduce operations

Multi-group by

Multi-group by is a very good feature of Hive, which makes it very convenient to use intermediate results in Hive.

FROM log

INSERT OVERWRITE TABLE test1 SELECT log.id GROUP BY log.id

INSERT OVERWRITE TABLE test2 SELECT log.name GROUP BY log.name;

The above query uses the Multi-group by feature to group the data twice in a row with different GROUP BY keys. This feature saves one MapReduce job.

6. Bucket and Sampling

Bucketing hashes the data into a specified number of buckets, using the value of the specified column as the key, so that efficient sampling can be supported.

Sampling over the full data set is naturally inefficient, because it still has to access all the data. If a table has already been bucketed on a column, sampling can read only the bucket with the specified ordinal out of all the buckets, which reduces the amount of data accessed.

The following example samples the third of the 32 buckets in test:

SELECT * FROM test TABLESAMPLE(BUCKET 3 OUT OF 32)
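
For such sampling to be efficient, the table has to be bucketed when it is created, for example (the schema is only an illustration):

CREATE TABLE test (id INT, name STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
SELECT * FROM test TABLESAMPLE(BUCKET 3 OUT OF 32 ON id);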

7. JOIN principle

There is a rule when using query statements with Join operations: tables / subqueries with fewer entries should be placed to the left of the Join operator

The reason is that during the Reduce phase of the Join operation, the contents of the table to the left of the Join operator are loaded into memory, and placing the table with fewer entries on the left can effectively reduce the chance of OOM errors.
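
For instance, with a hypothetical small dimension table and a large fact table:

SELECT s.key, b.value FROM small_table s JOIN big_table b ON (s.key = b.key);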

8. Map Join

The Join operation is completed in the Map phase, and Reduce is no longer required, as long as the required data can be accessed during the Map process.

For example:

INSERT OVERWRITE TABLE phone_traffic SELECT /*+ MAPJOIN(phone_location) */ l.phone, p.location, l.traffic FROM phone_location p JOIN log l ON (p.phone = l.phone)

The relevant parameters are:

hive.join.emit.interval = 1000 (how many rows in the right-most join operand Hive should buffer before emitting the join result)

hive.mapjoin.size.key = 10000

hive.mapjoin.cache.numrows = 10000

9. Group By

Partial aggregation on the Map side

Not all aggregation operations need to be completed on the Reduce side. Many aggregation operations can be partially aggregated on the Map side, and finally the final result can be obtained on the Reduce side.

Based on Hash

Parameters include:

hive.map.aggr = true: whether to aggregate on the Map side. Default is true.

hive.groupby.mapaggr.checkinterval = 100000: the number of rows for which aggregation is performed on the Map side

Load balancing when data is skewed

hive.groupby.skewindata = false

When this option is set to true, the generated query plan has two MR jobs. In the first MR job, the map output is randomly distributed to the reducers; each reducer does a partial aggregation and outputs the result. As a result, rows with the same Group By Key may be distributed to different reducers, which achieves load balancing. The second MR job then distributes the preprocessed results to the reducers according to the Group By Key (this ensures that the same Group By Key ends up in the same reducer) and completes the final aggregation.
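
A typical combination of these settings, using a hypothetical id column of the log table mentioned earlier:

set hive.map.aggr=true;
set hive.groupby.skewindata=true;
SELECT id, count(1) FROM log GROUP BY id;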

10. Merge small files

Too many files will put pressure on HDFS and affect processing efficiency, which can be eliminated by merging the result files of Map and Reduce:

hive.merge.mapfiles = true: whether to merge Map output files. Default is true.

hive.merge.mapredfiles = false: whether to merge Reduce output files. Default is false.

hive.merge.size.per.task = 256*1000*1000: the size of the merged files
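
Put together, a session might enable the merge as follows (the size value is only an example):

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;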
