
How to optimize join in Hive

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains in detail how to optimize a join in Hive. The approach is quite practical, so it is shared here for reference.

1. Requirement

I have simplified the requirement. It is very simple: join two tables and compute the daily PV (page views) for the specified cities. Written in traditional RDBMS SQL, it looks like this:

SELECT t.statdate, c.cname, count(t.cookieid)
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON (t.area1 = c.cname OR t.area2 = c.cname OR t.area3 = c.cname)
WHERE t.statdate >= '20140818' and t.statdate ...

[Sections 2 to 6: the rest of this statement and the successive rewrites of the query are missing from this copy. Only the repeated t.statdate filter on '20140818' and the file name a1.txt survive.]

The final optimization results: the statement in 2 produced no result after three hours... The statement in 5 is about 8 times faster than the one in 4, the one in 6 is about 2 times faster than the one in 5, and the final version finishes in about 10 minutes.

7. Final question:

When the statement in 6 is executed, you will notice that it scans the source file three times. Hive itself optimizes UNION ALL: when several UNION ALL subqueries read the same table, the source file is scanned only once. So why is it scanned once for each of the three subqueries here?

A possible explanation is that each UNION ALL subquery here contains a join, which prevents Hive's UNION ALL execution plan optimization from being applied.
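For orientation, the kind of statement being discussed has roughly the following shape. This is only an illustrative sketch and not the author's statement from section 6: the split into one equi-join per area column is an assumption drawn from the OR condition in section 1, and the upper bound on statdate is left out because it is not known. Prefixing the query with EXPLAIN prints the plan, where the number of TableScan operators on ecdata.ext_trackflow can be checked.

```sql
-- Illustrative sketch only (assumed shape, not the original statement).
-- Caveat: a row whose area1, area2 and area3 match the same city more than once
-- is counted once per match here, unlike the single OR condition in section 1.
EXPLAIN
SELECT u.statdate, u.cname, count(u.cookieid) AS pv
FROM (
  SELECT t.statdate, c.cname, t.cookieid
  FROM ecdata.ext_trackflow t
  JOIN tmpdb.city c ON (t.area1 = c.cname)
  WHERE t.statdate >= '20140818'
  UNION ALL
  SELECT t.statdate, c.cname, t.cookieid
  FROM ecdata.ext_trackflow t
  JOIN tmpdb.city c ON (t.area2 = c.cname)
  WHERE t.statdate >= '20140818'
  UNION ALL
  SELECT t.statdate, c.cname, t.cookieid
  FROM ecdata.ext_trackflow t
  JOIN tmpdb.city c ON (t.area3 = c.cname)
  WHERE t.statdate >= '20140818'
) u
GROUP BY u.statdate, u.cname;
```

If the plan lists three separate scans of ecdata.ext_trackflow, that matches the behavior described above. Whether Hive merges them may depend on the Hive version and optimizer settings.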

8. About the Cartesian product in Hive (full Cartesian product)

A Cartesian product is generated when a JOIN query has no ON join key and the join condition appears only in the WHERE clause.

Hive, at least in older versions, does not support the comma-separated form of a Cartesian product, so you cannot write select T1.*, T2.* from table1, table2. When you really do need a Cartesian product, you can use the following syntax to achieve the same effect:

select T1.*, T2.* from table1 T1 join table2 T2 where 1=1;

Note that this syntax cannot be used in Hive's strict mode, because it produces a Cartesian product, which that mode forbids. To use it, first run set hive.mapred.mode=nonstrict; to switch to nonstrict mode. Alternatively, replace the WHERE clause with an ON join condition, which turns the query into an ordinary join on that key:

select T1.*, T2.* from table1 T1 join table2 T2 on T1.id=T2.id;
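As a minimal sketch of the nonstrict route, the session below first relaxes the mode and then requests the Cartesian product explicitly. It assumes a Hive version that accepts the CROSS JOIN keyword (to my knowledge 0.10 and later); table1 and table2 are the placeholder tables used above.

```sql
-- Allow Cartesian products for this session (strict mode rejects them).
set hive.mapred.mode=nonstrict;

-- Explicit Cartesian product: every row of table1 paired with every row of table2.
-- CROSS JOIN availability is an assumption about the Hive version in use.
SELECT T1.*, T2.*
FROM table1 T1
CROSS JOIN table2 T2;

-- Sanity check: this count should equal rows(table1) * rows(table2).
SELECT count(*)
FROM table1 T1
CROSS JOIN table2 T2;
```

Because the result size is the product of the two row counts, this is usually only reasonable when at least one of the tables is small.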

9. About Strict Mode

Strict mode in Hive prevents users from inadvertently running queries that may have undesirable effects. Setting hive.mapred.mode to strict blocks three types of queries:

1) A query on a partitioned table is rejected if its WHERE clause does not filter on a partition. In other words, full scans of partitioned tables are not allowed. The reason for this restriction is that partitioned tables usually hold very large datasets that may grow rapidly, so a full scan of such a table consumes a lot of resources. The query must name partitions in the WHERE filter to run.

2) A HiveQL query with ORDER BY but no LIMIT clause is rejected. ORDER BY is a global sort that sends all query results through a single reducer, so sorting a large dataset can lead to unpredictable execution times. The query must include a LIMIT to run.

3) Queries that generate a Cartesian product are rejected. When a JOIN query has no ON join key and the join condition appears only in the WHERE clause, a Cartesian product is generated; such a query needs to be rewritten as a JOIN ... ON statement.
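The three restrictions above can be sketched with one accepted query for each case. The table logs, its partition column dt, and the column cookieid are hypothetical names used only for illustration; table1 and table2 are the placeholders from section 8.

```sql
-- Turn strict mode on for this session.
set hive.mapred.mode=strict;

-- 1) Partitioned table: accepted because the WHERE clause filters on the partition column.
--    Assumes a hypothetical table logs partitioned by dt.
SELECT count(*)
FROM logs
WHERE dt = '20140818';

-- 2) ORDER BY: accepted because a LIMIT bounds the globally sorted result.
SELECT cookieid
FROM logs
WHERE dt = '20140818'
ORDER BY cookieid
LIMIT 100;

-- 3) Join: accepted because the join key is in ON rather than in WHERE.
SELECT T1.*, T2.*
FROM table1 T1
JOIN table2 T2 ON (T1.id = T2.id);
```

Removing the dt filter from the first query, the LIMIT from the second, or the ON clause from the third should cause strict mode to reject the query at compile time.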
