Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Hive sql optimized data skew

2025-01-15 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/03 Report--

This script runs slowly, mainly due to the skew of data on the reduce side. Knowing that the dw.fct_traffic_navpage_path_ Detl table is used to collect user click data, then in the end,

There must be very few clicks on shopping carts and placing orders, so there is a huge amount of data with empty ordr_code field and NULL cart_prod_id field in this table, as shown below:

Select ordr_code,count (*) as a from dw.fct_traffic_navpage_path_detl where ds = '2015-05-10' group by ordr_code having a > 10000

151722135

Select cart_prod_id,count (*) as a fromdw.fct_traffic_navpage_path_detl where ds = '2015-05-10' groupby cart_prod_id having a > 10000

NULL 127233335

For the sentence SQL,BI of create table tmp_lifan_trfc_tpa as, add the following configuration

Set hive.mapjoin.smalltable.filesize = 120000000; / / because the maximum size of the dw.univ_parnt_tranx_comb_ table does not exceed 120MB. If you want to use hive.auto.convert.join.noconditionaltask.size for hive on tez, tez will generate BROADCAST.

Sethive.auto.convert.join=true

At the same time, modify the following SQL statement:

From dw.fct_traffic_navpage_path_detl t

Left outer join dw.univ_parnt_tranx_comb_detl o / / using mapjoin to solve data skew

On t.ordr_code = o.parnt_ordr_code

And t.cart_prod_id = o.comb_prod_id

And o.ds = '2015-05-10'

Left outer join bic.cust_first_ordr_tranx f

On case when o.end_user_id is null then cast (rand (9) * 100as bigint) else o.end_user_id end = f.end_user_id / / join is followed by a random number to avoid tilting. Red is the modified part.

And f.first_ordr_date_id = '2015-05-10'

Where t.ds = '2015-05-10'

After running, the SQL can be completed in a controllable time.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report