8 Hive Shell operation
8.1 Introduction to the scripts under Hive bin
8.2 Hive Shell basic operations
1. Hive command line
hive [-hiveconf x=y]* [<-i filename>]* [<-f filename> | <-e query-string>] [-S]
-i  initialize HQL from a file
-e  execute the specified HQL from the command line
-f  execute an HQL script file
-v  output the executed HQL statements to the console
-p <port>  connect to Hive Server on the given port
-hiveconf x=y  use this to set Hive/Hadoop configuration variables
Hive command line examples
Execute the specified SQL statement from the command line:
$HIVE_HOME/bin/hive -e 'select a.col from tab1 a'
Execute the specified SQL statement with specific Hive configuration variables:
$HIVE_HOME/bin/hive -e 'select a.col from tab1 a' -hiveconf hive.exec.scratchdir=/home/my/hive_scratch -hiveconf mapred.reduce.tasks=32
Execute the specified SQL statement in silent mode and dump the result to a file:
$HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt
Execute an SQL file in non-interactive mode:
$HIVE_HOME/bin/hive -f /home/my/hive-script.sql
Run an initialization SQL file before entering interactive mode:
$HIVE_HOME/bin/hive -i /home/my/hive-init.sql
Hive interactive shell commands
When $HIVE_HOME/bin/hive is run without the -e or -f option, Hive enters interactive mode.
End each command with a semicolon (;).
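A minimal interactive session, assuming the tab1 table from the earlier examples exists, might look like this:
hive> show tables;
hive> select a.col from tab1 a limit 10;
hive> quit;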
8.3 Logging
Hive uses Log4j for logging.
We can set Hive's log level with the following command:
$HIVE_HOME/bin/hive -hiveconf hive.root.logger=INFO,console
hive.root.logger accepts levels such as INFO, DEBUG, and so on.
8.4 Resources
Adding resources to Hive
Hive can dynamically add resources, such as files, to a session.
Normally, we add files while interacting with Hive.
Under the hood, this is handled by Hadoop's Distributed Cache.
Examples
ADD {FILE[S] | JAR[S] | ARCHIVE[S]} <filepath> [<filepath>]*
LIST {FILE[S] | JAR[S] | ARCHIVE[S]} [<filepath> ..]
DELETE {FILE[S] | JAR[S] | ARCHIVE[S]} [<filepath> ..]
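A short session sketch; the jar and script paths here (/tmp/my_udfs.jar, /tmp/clean.py) are hypothetical:
-- ship a jar to every task via the Distributed Cache
ADD JAR /tmp/my_udfs.jar;
-- make a script available to the tasks as well
ADD FILE /tmp/clean.py;
-- show the jars added in this session
LIST JARS;
-- remove the jar again
DELETE JAR /tmp/my_udfs.jar;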
9 Hive optimization
9.1 Features of the Hadoop computing framework
1. What is data skew
Data skew occurs when data is unevenly distributed, so that a large amount of it is concentrated on a single point, creating a data hotspot.
2. Characteristics of the Hadoop framework
It is not afraid of large data volumes, but it is afraid of data skew.
Queries that spawn many jobs are relatively inefficient. For example, even joining and summarizing a table with only a few hundred rows several times can generate a dozen or more jobs, which takes a long time, because initializing a MapReduce job is itself slow.
UDAFs such as sum, count, max and min are not afraid of data skew: Hadoop's map-side aggregation and merging makes skew a non-issue for them.
count(distinct) is inefficient on large data sets, because it groups by the GROUP BY fields and deduplicates by the DISTINCT field, and this distribution is usually very skewed; a common workaround is to rewrite it as a GROUP BY subquery, as sketched below.
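A hedged sketch of that rewrite, assuming a hypothetical log table with a uid column:
-- skew-prone form: a single reducer ends up deduplicating every uid
SELECT COUNT(DISTINCT uid) FROM log;
-- rewritten form: the GROUP BY spreads deduplication across reducers,
-- and the outer count runs over the already-deduplicated rows
SELECT COUNT(*) FROM (SELECT uid FROM log GROUP BY uid) t;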
9.2 Common means of optimization
Address data skew.
Reduce the number of jobs.
Setting a reasonable number of map and reduce tasks can effectively improve performance.
Understanding the data distribution and handling data skew yourself is often the best choice.
Use count(distinct) with caution when the data volume is large.
Merging small files is an effective way to improve scheduling efficiency.
Optimize with the whole workload in mind; the optimum of a single job is not as good as the optimum of the whole.
9.3 Optimization of data types in Hive -- optimization principles
Partition tables according to some rule (for example, by date). With partitions, specifying a partition in a query greatly reduces scanning of useless data and makes data cleanup much more convenient.
Set up buckets sensibly. In some large-data joins, a plain map join can run out of memory. With a bucket map join, only one bucket of the small table needs to be held in memory at a time, so tables that cannot fit entirely in memory can still be map-joined. This requires that the tables be joined on their bucketing key, and that the following is set:
set hive.optimize.bucketmapjoin = true;
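A sketch under assumed table names (log and users are hypothetical), combining a date partition with bucketing on the join key:
-- bucket both tables on the join key; partition the big one by date
CREATE TABLE users (uid BIGINT, name STRING)
CLUSTERED BY (uid) INTO 32 BUCKETS;
CREATE TABLE log (uid BIGINT, url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (uid) INTO 32 BUCKETS;
SET hive.optimize.bucketmapjoin = true;
-- only the matching bucket of users has to fit in memory at a time,
-- and the dt predicate prunes the partitions that are scanned
SELECT /*+ MAPJOIN(u) */ l.uid, l.url, u.name
FROM log l JOIN users u ON l.uid = u.uid
WHERE l.dt = '2014-01-01';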
9.4 Operational optimization of Hive
Full sort
How to make Cartesian product
How to determine the number of map
How to determine the number of reducer
Merge MapReduce operation
Bucket and sampling
Partition
JOIN
Group By
Merge small files
1. Full sorting
Hive's sorting keyword is SORT BY, deliberately distinct from ORDER BY in traditional databases to emphasize the difference between the two: SORT BY only sorts within a single reducer, so it does not by itself produce a globally sorted result.
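A sketch, assuming a hypothetical sales table with cust_id and amount columns:
-- global total order: Hive funnels everything through a single reducer, which is expensive
SELECT * FROM sales ORDER BY amount;
-- per-reducer order: scales out, but each reducer's output is only locally sorted
SELECT * FROM sales DISTRIBUTE BY cust_id SORT BY amount;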
2. How to make Cartesian product
When Hive is in strict mode (hive.mapred.mode=strict), Cartesian products are not allowed in HQL statements.
MapJoin is the solution.
MapJoin, as the name implies, performs the join on the Map side. This requires that one or more of the joined tables be small enough to be read entirely into memory.
To use MapJoin, add /*+ MAPJOIN(tablelist) */ right after the SELECT keyword of the query or subquery to prompt the optimizer to convert the join into a MapJoin (Hive's optimizer cannot do this conversion automatically here).
tablelist can be a single table or a comma-separated list of tables; the tables in tablelist are read into memory, so the small tables should be listed there.
When taking the Cartesian product of a large table and a small table, the trick to avoid an actual Cartesian product is to manufacture a join key. The idea is simple: extend the small table with a join-key column and replicate its rows several times, each copy with a different key value; extend the large table with a join-key column filled with random numbers in the same range. The two tables can then be joined on this artificial key.
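A minimal sketch of the MapJoin route, assuming hypothetical tables big and small where small easily fits in memory:
-- the hint asks Hive to hold small in memory and stream big through the mappers,
-- so the cross product is produced without a reduce phase
SELECT /*+ MAPJOIN(s) */ b.id, s.tag
FROM big b JOIN small s;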
3. How to control the number of map tasks in Hive
Typically, a job generates one or more map tasks from its input directory.
The main determining factors are the total number of input files, the input file sizes, and the block size configured for the cluster (128 MB here, which can be viewed in Hive with the set dfs.block.size; command; this parameter cannot be changed from inside Hive).
Is it true that the more maps, the better?
No. If a task has many small files (far smaller than the 128 MB block size), each small file is treated as a block and handled by its own map task, and starting and initializing a map task takes far longer than its actual processing, which wastes a great deal of resources. Moreover, the number of map tasks that can run at the same time is limited.
Is it enough to make sure every map handles a block of close to 128 MB, and then rest easy?
Not necessarily. For example, a 127 MB file would normally be handled by a single map, but if that file has only one or two small fields and tens of millions of records,
and the map logic is complex, doing it all with one map task will certainly be slow.
To address the two problems above, we need two approaches: reducing the number of maps and increasing the number of maps.
For example:
a) Suppose the input directory contains a single file a of size 780 MB; Hadoop splits it into seven blocks (six 128 MB blocks and one 12 MB block), producing seven maps.
b) Suppose the input directory contains three files of sizes 10 MB, 20 MB and 130 MB; Hadoop splits them into four blocks (10 MB, 20 MB, 128 MB and 2 MB), producing four maps.
That is, a file larger than the block size (128 MB) is split, while a file smaller than the block size is treated as a single block.
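A sketch of the two knobs; the parameter names below are from classic MapReduce-era Hive and are an assumption here, since exact names vary between versions:
-- fewer maps: combine many small files into larger input splits
SET hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size = 256000000;
-- more maps: shrink the maximum split size so each map task gets less data
SET mapred.max.split.size = 64000000;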
4. How to determine the number of reducers
In a Hadoop MapReduce program, the reducer count greatly affects execution efficiency.
If the number of reducers is not specified, Hive estimates one from the following two settings:
Parameter 1: hive.exec.reducers.bytes.per.reducer (default 1 GB)
Parameter 2: hive.exec.reducers.max (default 999)
Formula for the number of reducers:
N = min(parameter 2, total input size / parameter 1); for example, 9 GB of input with the defaults gives N = min(999, 9) = 9 reducers.
Based on Hadoop experience, parameter 2 can be set to 0.95 * (the number of TaskTrackers in the cluster).
More reducers is not always better.
Like maps, starting and initializing reducers takes time and resources.
Also, a job produces as many output files as it has reducers. If it generates many small files and those files become the input of the next task, the too-many-small-files problem appears again.
When is there only one reducer?
In many cases you will find that, no matter how large the data is and no matter how you tune the parameters, the job always runs a single reduce task.
Apart from the case where the data volume is smaller than the value of the hive.exec.reducers.bytes.per.reducer parameter, the reasons are:
a) the query aggregates without a GROUP BY (for example, SELECT COUNT(1) FROM t);
b) ORDER BY is used.
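A sketch of controlling this in a session, using the two parameters above plus mapred.reduce.tasks (the explicit override already used in the command-line example in section 8.2):
-- let Hive derive N = min(hive.exec.reducers.max, input bytes / bytes-per-reducer)
SET hive.exec.reducers.bytes.per.reducer = 500000000;
SET hive.exec.reducers.max = 100;
-- or pin the reducer count explicitly
SET mapred.reduce.tasks = 32;
-- note: an ORDER BY, or an aggregate without GROUP BY, still collapses to one reducer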
5. Merge MapReduce operations
Multi-group by
Multi-group by is a very useful Hive feature: it makes it convenient to reuse an intermediate result in Hive.
FROM log
  INSERT OVERWRITE TABLE test1 SELECT log.id GROUP BY log.id
  INSERT OVERWRITE TABLE test2 SELECT log.name GROUP BY log.name;
The query above uses the multi-group-by feature to GROUP BY the data twice in a row with different GROUP BY keys, which saves one MapReduce job.
6. Bucket and Sampling
Bucketing hashes the data into a specified number of buckets, using the value of a specified column as the hash key, so that efficient sampling can be supported.
Plain sampling has to read all of the data, so it is naturally inefficient. If a table has already been bucketed on a column, sampling can read only the bucket with a given ordinal out of all the buckets, which greatly reduces the amount of data accessed.
The following example samples the third of the 32 buckets of test:
SELECT * FROM test TABLESAMPLE(BUCKET 3 OUT OF 32);
7. JOIN principle
There is a rule for queries with join operations: the table or subquery with fewer rows should be placed on the left of the join operator.
The reason is that, in the Reduce phase of the join, the rows of the tables on the left of the join operator are buffered in memory, so putting the table with fewer rows on the left effectively reduces the chance of out-of-memory (OOM) errors.
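A sketch reusing the hypothetical users (small) and log (large) tables from earlier: users goes on the left so only its rows are buffered, while log is streamed as the right-most operand.
SELECT u.name, COUNT(1)
FROM users u JOIN log l ON u.uid = l.uid
GROUP BY u.name;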
8. Map Join
The join is completed in the Map phase; no Reduce phase is needed, as long as the required data can be accessed during the Map process.
For example:
INSERT OVERWRITE TABLE phone_traffic
SELECT /*+ MAPJOIN(phone_location) */ l.phone, p.location, l.traffic
FROM phone_location p JOIN log l ON (p.phone = l.phone)
The relevant parameters are:
hive.join.emit.interval = 1000 (how many rows in the right-most join operand Hive should buffer before emitting the join result)
hive.mapjoin.size.key = 10000
hive.mapjoin.cache.numrows = 10000
9. Group By
Map-side partial aggregation
Not all aggregation operations need to be done on the Reduce side; many can be partially aggregated on the Map side, with the final result computed on the Reduce side.
It is hash-based.
The parameters include:
hive.map.aggr = true: whether to aggregate on the Map side; the default is true.
hive.groupby.mapaggr.checkinterval = 100000: the number of rows at which map-side aggregation is performed.
Load balancing when the data is skewed
hive.groupby.skewindata = false
When this option is set to true, the generated query plan has two MR jobs. In the first job, the map output is distributed randomly to the reducers; each reducer does a partial aggregation and emits its result, so rows with the same Group By key may land on different reducers, which balances the load. The second job then distributes the pre-aggregated results to reducers by the Group By key (guaranteeing that identical Group By keys reach the same reducer) and completes the final aggregation.
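A sketch of turning these on for a skew-prone aggregation, again over the hypothetical log table:
SET hive.map.aggr = true;
SET hive.groupby.mapaggr.checkinterval = 100000;
SET hive.groupby.skewindata = true;
-- a handful of very hot uids is first spread over random reducers,
-- then re-aggregated by uid in a second job
SELECT uid, COUNT(1) FROM log GROUP BY uid;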
10. Merge small files
Too many small files put pressure on HDFS and hurt processing efficiency. This can be mitigated by merging the output files of the Map and Reduce stages:
hive.merge.mapfiles = true: whether to merge the output files of map-only jobs; the default is true.
hive.merge.mapredfiles = false: whether to merge the output files of the Reduce stage; the default is false.
hive.merge.size.per.task = 256*1000*1000: the target size of the merged files.
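A sketch of enabling the merge before an insert that is known to produce many small result files; daily_summary is a hypothetical target table, and the size value matches the default quoted above:
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.size.per.task = 256000000;
-- the job then appends a merge step that rewrites the many small reduce outputs
-- into files of roughly 256 MB each
INSERT OVERWRITE TABLE daily_summary SELECT dt, COUNT(1) FROM log GROUP BY dt;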