How to Improve the Speed of Retention Calculation in ClickHouse

Updated 2025-03-29. From: SLTechnology News&Howtos


This article explains in detail how to speed up retention calculations in ClickHouse and is shared here for reference. We hope it gives you a working understanding of the topic.

User retention analysis is an essential feature of any big data analytics platform. Companies generally use the retention rate to measure how active their users are, and it directly reflects the value that a product's features deliver. Because the retention rate is one of the most important indicators of user quality, computing its various forms is a basic data analysis skill. The practical examples below show how to do it.

1. Preparation

Be familiar with the common ways of calculating retention rates, and know that ClickHouse provides the retention(cond1, cond2, ...) function for calculating retention.
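To make the behavior of retention(cond1, cond2, ...) concrete, here is a minimal plain-Python model of it (not ClickHouse code). It assumes the documented semantics: the first element is 1 if any row in the group satisfies cond1, and each later element is 1 only if cond1 is met and some row also satisfies that element's condition. The name retention_flags and the sample dates are illustrative only.

```python
from datetime import date


def retention_flags(rows, *conds):
    """Plain-Python model of ClickHouse's retention(cond1, cond2, ...).

    rows  : the rows of one group (here, one user's login dates)
    conds : predicates evaluated per row
    Returns a list of 0/1 flags, one per condition.
    """
    # A condition is "met" if at least one row in the group satisfies it.
    met = [any(cond(row) for row in rows) for cond in conds]
    # Every flag after the first also requires the first condition to be met.
    return [int(met[0])] + [int(met[0] and m) for m in met[1:]]


# Hypothetical logins for a single user.
logins = [date(2020, 8, 1), date(2020, 8, 2), date(2020, 8, 14)]

flags = retention_flags(
    logins,
    lambda d: d == date(2020, 8, 1),  # active on the base day
    lambda d: d == date(2020, 8, 2),  # logged in 1 day later
    lambda d: d == date(2020, 8, 3),  # logged in 2 days later
)
print(flags)  # [1, 1, 0]
```

A user who never satisfies the first condition gets all zeros, which is why summing the flags over users later yields retained-user counts directly.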

Table creation: the user login event table, login_event

CREATE TABLE login_event -- user login events
(
    `accountId` String COMMENT 'account ID: unique user ID',
    `ds` Date COMMENT 'date: user login date'
)
ENGINE = MergeTree
PARTITION BY accountId
ORDER BY accountId

Data import: insert the user login data for August

-- Insert data
INSERT INTO login_event VALUES
    ('10001', toDate('2020-08-01')), ('10001', toDate('2020-08-08')), ('10001', toDate('2020-08-09')),
    ('10001', toDate('2020-08-10')), ('10001', toDate('2020-08-12')), ('10001', toDate('2020-08-13')),
    ('10001', toDate('2020-08-14')), ('10001', toDate('2020-08-15')), ('10001', toDate('2020-08-16')),
    ('10001', toDate('2020-08-17')), ('10001', toDate('2020-08-18')), ('10001', toDate('2020-08-20')),
    ('10001', toDate('2020-08-22')), ('10001', toDate('2020-08-23')), ('10001', toDate('2020-08-24')),
    ('10002', toDate('2020-08-01')), ('10002', toDate('2020-08-06')), ('10002', toDate('2020-08-11')),
    ('10002', toDate('2020-08-12')), ('10002', toDate('2020-08-13')), ('10002', toDate('2020-08-15')),
    ('10002', toDate('2020-08-20')), ('10002', toDate('2020-08-22')), ('10002', toDate('2020-08-23')),
    ('10002', toDate('2020-08-24')), ('10002', toDate('2020-08-30')),
    ('10003', toDate('2020-08-01')), ('10003', toDate('2020-08-05')), ('10003', toDate('2020-08-08')),
    ('10003', toDate('2020-08-09')), ('10003', toDate('2020-08-10')), ('10003', toDate('2020-08-11')),
    ('10003', toDate('2020-08-13')), ('10003', toDate('2020-08-15')), ('10003', toDate('2020-08-16')),
    ('10003', toDate('2020-08-18')), ('10003', toDate('2020-08-20')), ('10003', toDate('2020-08-21')),
    ('10003', toDate('2020-08-22')), ('10003', toDate('2020-08-24')), ('10003', toDate('2020-08-25')),
    ('10003', toDate('2020-08-26')), ('10003', toDate('2020-08-27')), ('10003', toDate('2020-08-28')),
    ('10003', toDate('2020-08-29')), ('10003', toDate('2020-08-30')),
    ('10004', toDate('2020-08-01')), ('10004', toDate('2020-08-02')), ('10004', toDate('2020-08-03')),
    ('10004', toDate('2020-08-04')), ('10004', toDate('2020-08-05')), ('10004', toDate('2020-08-08')),
    ('10004', toDate('2020-08-09')), ('10004', toDate('2020-08-10')), ('10004', toDate('2020-08-11')),
    ('10004', toDate('2020-08-14')), ('10004', toDate('2020-08-15')), ('10004', toDate('2020-08-16')),
    ('10004', toDate('2020-08-17')), ('10004', toDate('2020-08-19')), ('10004', toDate('2020-08-20')),
    ('10004', toDate('2020-08-21')), ('10004', toDate('2020-08-22')), ('10004', toDate('2020-08-23')),
    ('10004', toDate('2020-08-24')), ('10004', toDate('2020-08-27')), ('10004', toDate('2020-08-30'))

2. Problem analysis

Calculate the next-day, 3-day, 7-day, 14-day and 30-day retention of the users who were active on a given day. We split the problem into three steps:

Find the users who were active on the given day.

Find whether those users logged in 1, 2, 6, 13 and 29 days later (that is, on the 2nd, 3rd, 7th, 14th and 30th day, counting the given day as day 1).

Count the users who logged in on each of those days and divide by the number of active users to get the N-day retention rate.
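The three steps above can be sketched in plain Python before turning to SQL. This is only an illustration of the logic: the events list is a small hypothetical data set, not the table contents used in this article, and the variable names are made up for the example.

```python
from datetime import date, timedelta

# Hypothetical (account_id, login_date) events; not the article's data set.
events = [
    ("u1", date(2020, 8, 1)), ("u1", date(2020, 8, 2)),
    ("u2", date(2020, 8, 1)), ("u2", date(2020, 8, 7)),
    ("u3", date(2020, 8, 2)),
]

base_day = date(2020, 8, 1)
offsets = [1, 2, 6, 13, 29]  # days after base_day

# Step 1: users active on the base day.
active = {acc for acc, ds in events if ds == base_day}

# Step 2: for each offset, which of those users logged in on base_day + offset.
retained = {
    n: {acc for acc, ds in events
        if acc in active and ds == base_day + timedelta(days=n)}
    for n in offsets
}

# Step 3: retention rate = retained users / active users.
rates = {n: len(users) / len(active) for n, users in retained.items()}

print(len(active))  # 2
print(rates)        # {1: 0.5, 2: 0.0, 6: 0.5, 13: 0.0, 29: 0.0}
```

Note that u3 logs in on 2020-08-02 but is not counted, because only users who were active on the base day are eligible to be retained.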

Solution 1:

-- Count the users active on 2020-08-01 who logged in again 1, 2, 6, 13 and 29 days later,
-- and compute the retention rates
SELECT
    ds,
    count(accountIdD0) AS activeAccountNum,
    count(accountIdD1) / count(accountIdD0) AS `next-day retention`,
    count(accountIdD3) / count(accountIdD0) AS `3-day retention`,
    count(accountIdD7) / count(accountIdD0) AS `7-day retention`,
    count(accountIdD14) / count(accountIdD0) AS `14-day retention`,
    count(accountIdD30) / count(accountIdD0) AS `30-day retention`
FROM
(
    -- Use LEFT JOIN to find which 2020-08-01 users logged in 1, 2, 6, 13 and 29 days later
    SELECT DISTINCT
        a.ds AS ds,
        a.accountIdD0 AS accountIdD0,
        IF(b.accountId = '', NULL, b.accountId) AS accountIdD1,
        IF(c.accountId = '', NULL, c.accountId) AS accountIdD3,
        IF(d.accountId = '', NULL, d.accountId) AS accountIdD7,
        IF(e.accountId = '', NULL, e.accountId) AS accountIdD14,
        IF(f.accountId = '', NULL, f.accountId) AS accountIdD30
    FROM
    (
        -- Find the users active on 2020-08-01
        SELECT DISTINCT
            ds,
            accountId AS accountIdD0
        FROM login_event
        WHERE ds = '2020-08-01'
        ORDER BY ds ASC
    ) AS a
    LEFT JOIN login_event AS b ON (b.ds = addDays(a.ds, 1)) AND (a.accountIdD0 = b.accountId)
    LEFT JOIN login_event AS c ON (c.ds = addDays(a.ds, 2)) AND (a.accountIdD0 = c.accountId)
    LEFT JOIN login_event AS d ON (d.ds = addDays(a.ds, 6)) AND (a.accountIdD0 = d.accountId)
    LEFT JOIN login_event AS e ON (e.ds = addDays(a.ds, 13)) AND (a.accountIdD0 = e.accountId)
    LEFT JOIN login_event AS f ON (f.ds = addDays(a.ds, 29)) AND (a.accountIdD0 = f.accountId)
) AS temp
GROUP BY ds

Result:

┌─────────ds─┬─activeAccountNum─┬─next-day retention─┬─3-day retention─┬─7-day retention─┬─14-day retention─┬─30-day retention─┐
│ 2020-08-01 │                4 │               0.25 │            0.25 │               0 │              0.5 │             0.75 │
└────────────┴──────────────────┴────────────────────┴─────────────────┴─────────────────┴──────────────────┴──────────────────┘

1 rows in set. Elapsed: 0.022 sec.
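The IF(x.accountId = '', NULL, x.accountId) expressions in Solution 1 matter because, with ClickHouse's default setting join_use_nulls = 0, a LEFT JOIN fills unmatched String columns with an empty string rather than NULL, and count(column) would then count those empty strings as retained users. The following is a minimal Python sketch of that effect; left_join_default and count_non_null are hypothetical helpers written for this illustration, and the day-1 list is sample input.

```python
def left_join_default(left_ids, right_ids, default=""):
    """Model a LEFT JOIN on account id where unmatched rows get a default
    value instead of NULL (ClickHouse's behavior when join_use_nulls = 0)."""
    right = set(right_ids)
    return [(acc, acc if acc in right else default) for acc in left_ids]


def count_non_null(values):
    """Model SQL count(column): counts every value that is not NULL (None)."""
    return sum(1 for v in values if v is not None)


day0 = ["10001", "10002", "10003", "10004"]  # active on the base day
day1 = ["10004"]                             # logged in one day later

joined = left_join_default(day0, day1)
raw = [d1 for _, d1 in joined]
cleaned = [None if d1 == "" else d1 for d1 in raw]  # IF(x = '', NULL, x)

print(count_non_null(raw))      # 4: empty strings are counted as retained
print(count_non_null(cleaned))  # 1: only the real match is counted
```

Without the conversion, every retention rate would come out as 1 regardless of the data.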

Solution 2:

-- Check which users active on 2020-08-01 logged in again 1, 2, 6, 13 and 29 days later,
-- and compute the retention rates
SELECT DISTINCT
    b.ds AS ds,
    ifnull(countDistinct(if(a.ds = b.ds, a.accountId, NULL)), 0) AS activeAccountNum,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 1), b.accountId, NULL)) / activeAccountNum, 0) AS `next-day retention`,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 2), b.accountId, NULL)) / activeAccountNum, 0) AS `3-day retention`,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 6), b.accountId, NULL)) / activeAccountNum, 0) AS `7-day retention`,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 13), b.accountId, NULL)) / activeAccountNum, 0) AS `14-day retention`,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 29), b.accountId, NULL)) / activeAccountNum, 0) AS `30-day retention`
FROM
-- Use INNER JOIN to find the logins of the 2020-08-01 active users over the following 1-30 days
-- (the exact form of the upper bound is an assumption; the source listing only shows the start date)
(
    SELECT
        ds,
        accountId
    FROM login_event
    WHERE (ds >= '2020-08-01') AND (ds <= addDays(toDate('2020-08-01'), 29))
) AS a
INNER JOIN
-- Find the users active on 2020-08-01
(
    SELECT DISTINCT
        accountId,
        ds
    FROM login_event
    WHERE ds = '2020-08-01'
) AS b ON a.accountId = b.accountId
GROUP BY ds

Result:

┌─────────ds─┬─activeAccountNum─┬─next-day retention─┬─3-day retention─┬─7-day retention─┬─14-day retention─┬─30-day retention─┐
│ 2020-08-01 │                4 │               0.25 │            0.25 │               0 │              0.5 │             0.75 │
└────────────┴──────────────────┴────────────────────┴─────────────────┴─────────────────┴──────────────────┴──────────────────┘

1 rows in set. Elapsed: 0.019 sec.

Solution 3:

-- Use the array index, SUM(r[index]), to count the users active on 2020-08-01 who logged in
-- again 1, 2, 6, 13 and 29 days later, and compute the retention rates
SELECT
    toDate('2020-08-01') AS ds,
    SUM(r[1]) AS activeAccountNum,
    SUM(r[2]) / SUM(r[1]) AS `next-day retention`,
    SUM(r[3]) / SUM(r[1]) AS `3-day retention`,
    SUM(r[4]) / SUM(r[1]) AS `7-day retention`,
    SUM(r[5]) / SUM(r[1]) AS `14-day retention`,
    SUM(r[6]) / SUM(r[1]) AS `30-day retention`
FROM
-- For each 2020-08-01 active user, find the login status 1, 2, 6, 13 and 29 days later:
-- 1 = logged in, 0 = not logged in
(
    WITH toDate('2020-08-01') AS tt
    SELECT
        accountId,
        retention(
            toDate(ds) = tt,
            toDate(subtractDays(ds, 1)) = tt,
            toDate(subtractDays(ds, 2)) = tt,
            toDate(subtractDays(ds, 6)) = tt,
            toDate(subtractDays(ds, 13)) = tt,
            toDate(subtractDays(ds, 29)) = tt
        ) AS r
    -- Find the login data of the 2020-08-01 active users over the following 1-30 days
    FROM login_event
    -- The source listing is cut off after the lower bound; the upper bound, the GROUP BY
    -- and the closing parenthesis below are assumed from the comment above.
    WHERE (ds >= '2020-08-01') AND (ds <= addDays(tt, 29))
    GROUP BY accountId
)
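The outer query of Solution 3 only sums the per-user arrays position by position. As a rough illustration of that aggregation, here is a short Python sketch; the per_user arrays are hypothetical, and note that ClickHouse arrays are 1-based (r[1]) while Python lists are 0-based (r[0]).

```python
# Hypothetical per-user arrays as returned by retention(...): the first element
# is the base-day flag, the rest are the flags for the later days.
per_user = {
    "u1": [1, 1, 0],
    "u2": [1, 0, 0],
    "u3": [0, 0, 0],  # not active on the base day, so all flags are 0
    "u4": [1, 1, 1],
    "u5": [1, 0, 0],
}

# SUM(r[i]) over all users, one total per array position.
totals = [sum(col) for col in zip(*per_user.values())]

active = totals[0]                        # SUM(r[1]) in ClickHouse
rates = [t / active for t in totals[1:]]  # SUM(r[i]) / SUM(r[1])

print(totals)  # [4, 2, 1]
print(rates)   # [0.5, 0.25]
```

Because each user contributes a single small array, this approach needs one pass over the table and no self-joins, unlike the JOIN-based Solutions 1 and 2.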
