This article explains how to use SQL window functions for business data analysis. The content is straightforward and easy to follow; work through the examples below step by step.
Data preparation
The analysis in this article uses a single order table, orders, and all queries are run in Hive. The data are prepared as follows:
-- create a table
CREATE TABLE orders (
    order_id    int,
    customer_id string,
    city        string,
    add_time    string,
    amount      decimal(10, 2)
);
-- prepare data
INSERT INTO orders VALUES
(1, "A", "Shanghai", "2020-01-01 00:00:00.000000", 200),
(2, "B", "Shanghai", "2020-01-05 00:00:00.000000", 250),
(3, "C", "Beijing", "2020-01-12 00:00:00.000000", 200),
(4, "A", "Shanghai", "2020-02-04 00:00:00.000000", 400),
(5, "D", "Shanghai", "2020-02-05 00:00:00.000000", 250),
(5, "D", "Shanghai", "2020-02-05 12:00:00.000000", 300),
(6, "C", "Beijing", "2020-02-19 00:00:00.000000", 300),
(7, "A", "Shanghai", "2020-03-01 00:00:00.000000", 150),
(8, "E", "Beijing", "2020-03-05 00:00:00.000000", 500),
(9, "F", "Shanghai", "2020-03-09 00:00:00.000000", 250),
(10, "B", "Shanghai", "2020-03-21 00:00:00.000000", 600);
Requirement 1: revenue growth
On the business side, month-over-month revenue growth is calculated as 100 * (M1 - M0) / M0, where M1 is the revenue of the given month and M0 is the revenue of the previous month. So, technically, we need to compute each month's revenue and then associate it with the previous month's revenue in order to perform this calculation. The SQL is as follows:
WITH
monthly_revenue as (
    SELECT
        trunc(add_time, 'MM') as month,
        sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
, prev_month_revenue as (
    SELECT
        month,
        revenue,
        lag(revenue) over (order by month) as prev_month_revenue  -- revenue of the previous month
    FROM monthly_revenue
)
SELECT
    month,
    revenue,
    prev_month_revenue,
    round(100.0 * (revenue - prev_month_revenue) / prev_month_revenue, 1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1;
Result output
month       revenue  prev_month_revenue  revenue_growth
2020-01-01  650      NULL                NULL
2020-02-01  1250     650                 92.3
2020-03-01  1500     1250                20
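As a quick check of the formula against the output above: February's growth is 100 * (1250 - 650) / 650 ≈ 92.3, and March's is 100 * (1500 - 1250) / 1250 = 20.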
We can also look at each city's month-over-month revenue growth by additionally grouping by city:
WITH
monthly_revenue as (
    SELECT
        trunc(add_time, 'MM') as month,
        city,
        sum(amount) as revenue
    FROM orders
    GROUP BY 1, 2
)
, prev_month_revenue as (
    SELECT
        month,
        city,
        revenue,
        lag(revenue) over (partition by city order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT
    month,
    city,
    revenue,
    round(100.0 * (revenue - prev_month_revenue) / prev_month_revenue, 1) as revenue_growth
FROM prev_month_revenue
ORDER BY 2, 1;
Result output
month       city      revenue  revenue_growth
2020-01-01  Shanghai  450      NULL
2020-02-01  Shanghai  950      111.1
2020-03-01  Shanghai  1000     5.3
2020-01-01  Beijing   200      NULL
2020-02-01  Beijing   300      50
2020-03-01  Beijing   500      66.7
Requirement 2: cumulative summation
A cumulative sum (running total) is the sum of the current row and all preceding rows, as in the following SQL:
WITH
monthly_revenue as (
    SELECT
        trunc(add_time, 'MM') as month,
        sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
SELECT
    month,
    revenue,
    sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total
FROM monthly_revenue
ORDER BY 1;
Result output
month       revenue  running_total
2020-01-01  650      650
2020-02-01  1250     1900
2020-03-01  1500     3400
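A side note, not from the original article: when an ORDER BY is present and no frame is given, Hive (like standard SQL) defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the explicit ROWS frame above can usually be dropped. Since each month appears only once here, the shorter form below should give the same running total (it assumes the same monthly_revenue CTE as above):
SELECT
    month,
    revenue,
    sum(revenue) over (order by month) as running_total  -- default frame: unbounded preceding to current row
FROM monthly_revenue
ORDER BY 1;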
We can also combine several window functions in a single query, as in the following SQL:
SELECT
    order_id,
    customer_id,
    city,
    add_time,
    amount,
    sum(amount) over () as amount_total,  -- sum over all rows
    sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum,  -- cumulative sum
    sum(amount) over (partition by customer_id order by add_time rows between unbounded preceding and current row) as running_sum_by_customer,
    avg(amount) over (order by add_time rows between 5 preceding and current row) as trailing_avg  -- rolling average
FROM orders
ORDER BY 1;
Result output:
order_id  customer_id  city      add_time             amount  amount_total  running_sum  running_sum_by_customer  trailing_avg
1         A            Shanghai  2020-01-01 00:00:00  200     3400          200          200                      200
2         B            Shanghai  2020-01-05 00:00:00  250     3400          450          250                      225
3         C            Beijing   2020-01-12 00:00:00  200     3400          650          200                      216.666667
4         A            Shanghai  2020-02-04 00:00:00  400     3400          1050         600                      262.5
5         D            Shanghai  2020-02-05 00:00:00  250     3400          1300         250                      260
5         D            Shanghai  2020-02-05 12:00:00  300     3400          1600         550                      266.666667
6         C            Beijing   2020-02-19 00:00:00  300     3400          1900         500                      283.333333
7         A            Shanghai  2020-03-01 00:00:00  150     3400          2050         750                      266.666667
8         E            Beijing   2020-03-05 00:00:00  500     3400          2550         500                      316.666667
9         F            Shanghai  2020-03-09 00:00:00  250     3400          2800         250                      291.666667
10        B            Shanghai  2020-03-21 00:00:00  600     3400          3400         850                      350
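To spell out the trailing average above: rows between 5 preceding and current row averages the current order together with up to five earlier orders (by add_time), i.e. a window of at most six rows. A different window is just a different frame; as a minimal sketch (not from the original article), a three-order moving average would be:
avg(amount) over (order by add_time rows between 2 preceding and current row) as moving_avg_3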
Requirement 3: dealing with duplicate data
As can be seen from the data above, order 5 appears twice: (5, "D", "Shanghai", "2020-02-05 00:00:00.000000", 250) and (5, "D", "Shanghai", "2020-02-05 12:00:00.000000", 300). The duplicates obviously need to be cleaned, keeping only the latest record per order.
We first rank the rows within each order_id group, then keep the latest one. The SQL is as follows:
SELECT *
FROM (
    SELECT
        *,
        row_number() over (partition by order_id order by add_time desc) as rank
    FROM orders
) t
WHERE rank = 1;
Result output:
t.order_id  t.customer_id  t.city    t.add_time           t.amount  t.rank
1           A              Shanghai  2020-01-01 00:00:00  200       1
2           B              Shanghai  2020-01-05 00:00:00  250       1
3           C              Beijing   2020-01-12 00:00:00  200       1
4           A              Shanghai  2020-02-04 00:00:00  400       1
5           D              Shanghai  2020-02-05 12:00:00  300       1
6           C              Beijing   2020-02-19 00:00:00  300       1
7           A              Shanghai  2020-03-01 00:00:00  150       1
8           E              Beijing   2020-03-05 00:00:00  500       1
9           F              Shanghai  2020-03-09 00:00:00  250       1
10          B              Shanghai  2020-03-21 00:00:00  600       1
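One caveat worth noting (an addition, not from the original article): row_number() breaks ties arbitrarily, so if two rows for the same order_id shared the same add_time, which one survives would be nondeterministic. Adding a secondary sort key as a tiebreaker makes the result stable, for example:
row_number() over (partition by order_id order by add_time desc, amount desc) as rank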
After this cleaning step, the duplicate row is gone. Recomputing Requirement 1 on the cleaned data, the correct SQL is:
WITH
orders_cleaned as (
    SELECT *
    FROM (
        SELECT
            *,
            row_number() over (partition by order_id order by add_time desc) as rank
        FROM orders
    ) t
    WHERE rank = 1
)
, monthly_revenue as (
    SELECT
        trunc(add_time, 'MM') as month,
        sum(amount) as revenue
    FROM orders_cleaned
    GROUP BY 1
)
, prev_month_revenue as (
    SELECT
        month,
        revenue,
        lag(revenue) over (order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT
    month,
    revenue,
    round(100.0 * (revenue - prev_month_revenue) / prev_month_revenue, 1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1;
Result output:
month       revenue  revenue_growth
2020-01-01  650      NULL
2020-02-01  1000     53.8
2020-03-01  1500     50
Create a view of the cleaned data for later use
CREATE VIEW orders_cleaned AS
SELECT
    order_id,
    customer_id,
    city,
    add_time,
    amount
FROM (
    SELECT
        *,
        row_number() over (partition by order_id order by add_time desc) as rank
    FROM orders
) t
WHERE rank = 1;
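As a quick sanity check on the view (a sketch, not part of the original article): the cleaned data should contain 10 rows, since the 11 inserted rows include one duplicated order_id.
SELECT count(*) FROM orders_cleaned;  -- expect 10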
Requirement 4: top N per group
Computing the top N per group is one of the most common use cases for SQL window functions. The following SQL returns the top 2 orders by amount for each month:
WITH orders_ranked as (
    SELECT
        trunc(add_time, 'MM') as month,
        *,
        row_number() over (partition by trunc(add_time, 'MM') order by amount desc, add_time) as rank
    FROM orders_cleaned
)
SELECT
    month,
    order_id,
    customer_id,
    city,
    add_time,
    amount
FROM orders_ranked
WHERE rank <= 2;
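A small follow-up, not in the original article: row_number() always numbers rows 1, 2, 3, ... within each partition, so exactly two orders come back per month even when amounts tie (the add_time tiebreak above decides their order). If tied amounts should share a position instead, rank() or dense_rank() can be swapped in:
row_number() over (partition by trunc(add_time, 'MM') order by amount desc, add_time)  -- always 1, 2, 3, ...
rank()       over (partition by trunc(add_time, 'MM') order by amount desc)            -- ties share a rank, gaps follow
dense_rank() over (partition by trunc(add_time, 'MM') order by amount desc)            -- ties share a rank, no gaps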