This article explains how to use SQL window functions for business data analysis. The content is straightforward and easy to follow; work through the examples below step by step.
Data preparation
The analysis in this article uses a single order table, orders, and all queries are run in Hive. The data are prepared as follows:
-- create a table
CREATE TABLE orders (
    order_id    int,
    customer_id string,
    city        string,
    add_time    string,
    amount      decimal(10, 2)
);
-- prepare data
INSERT INTO orders VALUES
(1, "A", "Shanghai", "2020-01-01 00:00:00.000000", 200),
(2, "B", "Shanghai", "2020-01-05 00:00:00.000000", 250),
(3, "C", "Beijing", "2020-01-12 00:00:00.000000", 200),
(4, "A", "Shanghai", "2020-02-04 00:00:00.000000", 400),
(5, "D", "Shanghai", "2020-02-05 00:00:00.000000", 250),
(5, "D", "Shanghai", "2020-02-05 12:00:00.000000", 300),
(6, "C", "Beijing", "2020-02-19 00:00:00.000000", 300),
(7, "A", "Shanghai", "2020-03-01 00:00:00.000000", 150),
(8, "E", "Beijing", "2020-03-05 00:00:00.000000", 500),
(9, "F", "Shanghai", "2020-03-09 00:00:00.000000", 250),
(10, "B", "Shanghai", "2020-03-21 00:00:00.000000", 600);
Requirement 1: revenue growth
On the business side, month-over-month revenue growth is calculated as 100 * (M1 - M0) / M0, where M1 is the revenue of the given month and M0 is the revenue of the previous month. So, technically, we need to compute each month's revenue and then associate it with the previous month's revenue in order to perform this calculation. The SQL is as follows:
WITH
monthly_revenue as (
    SELECT
        trunc(add_time, 'MM') as month,
        sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
, prev_month_revenue as (
    SELECT
        month,
        revenue,
        lag(revenue) over (order by month) as prev_month_revenue  -- revenue of the previous month
    FROM monthly_revenue
)
SELECT
    month,
    revenue,
    prev_month_revenue,
    round(100.0 * (revenue - prev_month_revenue) / prev_month_revenue, 1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1;
Result output
month       revenue  prev_month_revenue  revenue_growth
2020-01-01  650      NULL                NULL
2020-02-01  1250     650                 92.3
2020-03-01  1500     1250                20
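As a quick check of the formula against the output above: February's growth is 100 * (1250 - 650) / 650 ≈ 92.3, and March's is 100 * (1500 - 1250) / 1250 = 20.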
We can also look at each city's month-over-month revenue growth by additionally grouping by city:
WITH
monthly_revenue as (
    SELECT
        trunc(add_time, 'MM') as month,
        city,
        sum(amount) as revenue
    FROM orders
    GROUP BY 1, 2
)
, prev_month_revenue as (
    SELECT
        month,
        city,
        revenue,
        lag(revenue) over (partition by city order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT
    month,
    city,
    revenue,
    round(100.0 * (revenue - prev_month_revenue) / prev_month_revenue, 1) as revenue_growth
FROM prev_month_revenue
ORDER BY 2, 1;
Result output
month       city      revenue  revenue_growth
2020-01-01  Shanghai  450      NULL
2020-02-01  Shanghai  950      111.1
2020-03-01  Shanghai  1000     5.3
2020-01-01  Beijing   200      NULL
2020-02-01  Beijing   300      50
2020-03-01  Beijing   500      66.7
Requirement 2: cumulative summation
A cumulative sum (running total) is the sum of the current row and all preceding rows, as in the following SQL:
WITH
monthly_revenue as (
    SELECT
        trunc(add_time, 'MM') as month,
        sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
SELECT
    month,
    revenue,
    sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total
FROM monthly_revenue
ORDER BY 1;
Result output
month       revenue  running_total
2020-01-01  650      650
2020-02-01  1250     1900
2020-03-01  1500     3400
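A side note, not from the original article: when an ORDER BY is present and no frame is given, Hive (like standard SQL) defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the explicit ROWS frame above can usually be dropped. Since each month appears only once here, the shorter form below should give the same running total (it assumes the same monthly_revenue CTE as above):
SELECT
    month,
    revenue,
    sum(revenue) over (order by month) as running_total  -- default frame: unbounded preceding to current row
FROM monthly_revenue
ORDER BY 1;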
We can also combine several window functions in a single query, as in the following SQL:
SELECT
    order_id,
    customer_id,
    city,
    add_time,
    amount,
    sum(amount) over () as amount_total,  -- sum over all rows
    sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum,  -- cumulative sum
    sum(amount) over (partition by customer_id order by add_time rows between unbounded preceding and current row) as running_sum_by_customer,
    avg(amount) over (order by add_time rows between 5 preceding and current row) as trailing_avg  -- rolling average
FROM orders
ORDER BY 1;
Result output:
order_id  customer_id  city      add_time             amount  amount_total  running_sum  running_sum_by_customer  trailing_avg
1         A            Shanghai  2020-01-01 00:00:00  200     3400          200          200                      200
2         B            Shanghai  2020-01-05 00:00:00  250     3400          450          250                      225
3         C            Beijing   2020-01-12 00:00:00  200     3400          650          200                      216.666667
4         A            Shanghai  2020-02-04 00:00:00  400     3400          1050         600                      262.5
5         D            Shanghai  2020-02-05 00:00:00  250     3400          1300         250                      260
5         D            Shanghai  2020-02-05 12:00:00  300     3400          1600         550                      266.666667
6         C            Beijing   2020-02-19 00:00:00  300     3400          1900         500                      283.333333
7         A            Shanghai  2020-03-01 00:00:00  150     3400          2050         750                      266.666667
8         E            Beijing   2020-03-05 00:00:00  500     3400          2550         500                      316.666667
9         F            Shanghai  2020-03-09 00:00:00  250     3400          2800         250                      291.666667
10        B            Shanghai  2020-03-21 00:00:00  600     3400          3400         850                      350
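To spell out the trailing average above: rows between 5 preceding and current row averages the current order together with up to five earlier orders (by add_time), i.e. a window of at most six rows. A different window is just a different frame; as a minimal sketch (not from the original article), a three-order moving average would be:
avg(amount) over (order by add_time rows between 2 preceding and current row) as moving_avg_3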
Requirement 3: dealing with duplicate data
As can be seen from the data above, order 5 appears twice: (5, "D", "Shanghai", "2020-02-05 00:00:00.000000", 250) and (5, "D", "Shanghai", "2020-02-05 12:00:00.000000", 300). The duplicates obviously need to be cleaned, keeping only the latest record per order.
We first rank the rows within each order_id group, then keep the latest one. The SQL is as follows:
SELECT *
FROM (
    SELECT
        *,
        row_number() over (partition by order_id order by add_time desc) as rank
    FROM orders
) t
WHERE rank = 1;
Result output:
t.order_id  t.customer_id  t.city    t.add_time           t.amount  t.rank
1           A              Shanghai  2020-01-01 00:00:00  200       1
2           B              Shanghai  2020-01-05 00:00:00  250       1
3           C              Beijing   2020-01-12 00:00:00  200       1
4           A              Shanghai  2020-02-04 00:00:00  400       1
5           D              Shanghai  2020-02-05 12:00:00  300       1
6           C              Beijing   2020-02-19 00:00:00  300       1
7           A              Shanghai  2020-03-01 00:00:00  150       1
8           E              Beijing   2020-03-05 00:00:00  500       1
9           F              Shanghai  2020-03-09 00:00:00  250       1
10          B              Shanghai  2020-03-21 00:00:00  600       1
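One caveat worth noting (an addition, not from the original article): row_number() breaks ties arbitrarily, so if two rows for the same order_id shared the same add_time, which one survives would be nondeterministic. Adding a secondary sort key as a tiebreaker makes the result stable, for example:
row_number() over (partition by order_id order by add_time desc, amount desc) as rank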
After this cleaning step, the duplicate row is gone. Recomputing Requirement 1 on the cleaned data, the correct SQL is:
WITH
orders_cleaned as (
    SELECT *
    FROM (
        SELECT
            *,
            row_number() over (partition by order_id order by add_time desc) as rank
        FROM orders
    ) t
    WHERE rank = 1
)
, monthly_revenue as (
    SELECT
        trunc(add_time, 'MM') as month,
        sum(amount) as revenue
    FROM orders_cleaned
    GROUP BY 1
)
, prev_month_revenue as (
    SELECT
        month,
        revenue,
        lag(revenue) over (order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT
    month,
    revenue,
    round(100.0 * (revenue - prev_month_revenue) / prev_month_revenue, 1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1;
Result output:
month       revenue  revenue_growth
2020-01-01  650      NULL
2020-02-01  1000     53.8
2020-03-01  1500     50
Create a view of the cleaned data for later use
CREATE VIEW orders_cleaned AS
SELECT
    order_id,
    customer_id,
    city,
    add_time,
    amount
FROM (
    SELECT
        *,
        row_number() over (partition by order_id order by add_time desc) as rank
    FROM orders
) t
WHERE rank = 1;
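As a quick sanity check on the view (a sketch, not part of the original article): the cleaned data should contain 10 rows, since the 11 inserted rows include one duplicated order_id.
SELECT count(*) FROM orders_cleaned;  -- expect 10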
Requirement 4: top N per group
Computing the top N per group is one of the most common use cases for SQL window functions. The following SQL returns the top 2 orders by amount for each month:
WITH orders_ranked as (
    SELECT
        trunc(add_time, 'MM') as month,
        *,
        row_number() over (partition by trunc(add_time, 'MM') order by amount desc, add_time) as rank
    FROM orders_cleaned
)
SELECT
    month,
    order_id,
    customer_id,
    city,
    add_time,
    amount
FROM orders_ranked
WHERE rank <= 2;
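A small follow-up, not in the original article: row_number() always numbers rows 1, 2, 3, ... within each partition, so exactly two orders come back per month even when amounts tie (the add_time tiebreak above decides their order). If tied amounts should share a position instead, rank() or dense_rank() can be swapped in:
row_number() over (partition by trunc(add_time, 'MM') order by amount desc, add_time)  -- always 1, 2, 3, ...
rank()       over (partition by trunc(add_time, 'MM') order by amount desc)            -- ties share a rank, gaps follow
dense_rank() over (partition by trunc(add_time, 'MM') order by amount desc)            -- ties share a rank, no gaps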