
How does AnalyticDB support large-scale and low-cost real-time analysis of data banks?

2025-03-26 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article explains in detail how AnalyticDB supports ultra-large-scale, low-cost real-time analysis for Data Bank. The content is shared as a reference, and I hope you will have a solid understanding of the topic after reading it.

Preface

Data Bank is a commercial data product for brand-side consumer operations. Because its core analysis capability must support free analysis along any dimension over massive data, with strict response-time requirements, we use AnalyticDB at scale as the underlying analysis engine. With low cost and excellent performance, it supports millions of OLAP queries per day from tens of thousands of brands during major promotions.

At present, Data Bank stores about ten trillion rows of data in AnalyticDB, occupying about 1.6 PB of storage space, with an average query response time under 5 seconds.

Introduction to the Data Bank business

As a commercial data product for consumer operations, Data Bank provides many data capabilities, such as link-flow analysis, crowd selection, and crowd portraits.

Link flow analysis

AIPL is an indicator unique to Data Bank that measures the relationship between a brand and its consumers (AIPL abbreviates four stages: A for awareness, I for interest, P for purchase, L for loyalty). Link-flow analysis captures how a brand's consumer AIPL relationships changed between any two days (for example, the figure below shows how a brand in a certain category changed from last year's Singles' Day to this year's Singles' Day; the data is not real).

In this scenario, users can choose any two dates within roughly the last 540 days, plus the brand and category dimensions, so the space of possible user inputs is on the order of 100 trillion.
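The scale of that input space can be sanity-checked with simple arithmetic. The brand and category counts below are illustrative assumptions, not figures from the article:

```python
# Back-of-envelope count of distinct user inputs for link-flow analysis.
# Brand/category counts are assumed for illustration only.
days = 540
date_pairs = days * (days - 1)   # ordered (from_day, to_day) pairs: 291,060

brands = 300_000                 # assumed number of brands
categories = 1_200               # assumed number of second-level categories

combinations = date_pairs * brands * categories
print(f"{combinations:.1e}")     # on the order of 1e14, i.e. the 100-trillion level
```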

Crowd portrait

Crowd portraits are a core capability of the consumer-operations product. Besides profiling specific crowds that users have saved, Data Bank can also profile the crowds in a link flow to help brands analyze why consumer relationships changed (as shown in the figure below: a portrait of consumers who were in the purchase state on last year's Singles' Day but in the churned state on this year's; the data is not real).

In this scenario, Data Bank provides users with more than 200 tags, most of which are industry-specific. Each crowd portrait involves only some of the tags, so pre-computing all tags for every crowd would waste a great deal of resources.

Crowd selection count / crowd selection

Crowd selection is a core capability of the consumer-operations product. While most consumer products restrict users to tag data, Data Bank's crowd selection (minute-level) lets users combine many kinds of data, such as tags and touchpoints (which can be understood as consumer behaviors, such as buying, searching, or watching live streams). Users can also instantly see the number of consumers matching the selection conditions (second-level).

In this scenario, all kinds of selection conditions can be freely combined through intersection, union, and difference, and some conditions, such as purchase amount, are values filled in by the user and cannot be enumerated.
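As a sketch, the free combination of selection conditions is set algebra over consumer-ID sets; the condition names and sample IDs below are invented for illustration:

```python
# Crowd selection as set algebra over consumer IDs (toy data).
bought_brand = {1, 2, 3, 5, 8}    # touchpoint: purchased from the brand
watched_live = {2, 3, 5, 13}      # touchpoint: watched a live stream
high_spenders = {3, 5, 8, 21}     # tag: purchase amount above a user-filled threshold

# (bought AND watched) OR high spenders, minus an exclusion list
excluded = {21}
audience = ((bought_brand & watched_live) | high_spenders) - excluded
print(sorted(audience))           # [2, 3, 5, 8]
```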

Why Data Bank chose AnalyticDB

For ordinary analysis workloads with no response-time requirement, offline computing (Hadoop/Hive/MaxCompute) can meet almost all data-analysis needs. From the perspective of online user interaction, however, high-frequency functions have strict response-time requirements.

For example, user decision-making requires comparing many crowd portraits, and the choice of the next crowd depends on the result of the previous portrait. If offline computing were used, it would not only greatly prolong the user's decision time but also break the continuity of the user's analytical thinking, seriously hurting the user experience.

There are generally two ways to solve the response time problem:

Pre-computation: the indicators for every combination of the user's optional dimensions are computed offline in advance; at analysis time, the system simply looks the results up in the database.

OLAP online computation: lightly aggregated data (retaining all user-selectable dimensions) is stored in an MPP engine, and indicators are computed in real time from the conditions the user submits.

Each approach has trade-offs. Pre-computation must contend with dimension explosion, offline jobs that cannot finish within the time window, and wasted resources when requirements change and pre-computed results go unused. OLAP, on the other hand, can compute freely along any dimension without pre-computation, but the storage cost, capacity, and compute performance of the MPP engine must be considered.
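The two approaches can be contrasted in a toy sketch (all data invented): pre-computation is a lookup into materialized answers, while OLAP keeps detail rows and aggregates at query time:

```python
# Toy contrast of pre-computation vs. OLAP-style on-demand aggregation.
rows = [  # (brand, day, consumers) -- invented detail data
    ("A", 1, 100), ("A", 2, 120), ("B", 1, 80), ("B", 2, 90),
]

# 1) Pre-computation: materialize every (brand, day) answer up front.
precomputed = {(b, d): c for b, d, c in rows}

# 2) OLAP-style: keep detail rows and aggregate when the query arrives.
def query(brand):
    return sum(c for b, _, c in rows if b == brand)

print(precomputed[("A", 2)])   # 120  (dictionary lookup)
print(query("A"))              # 220  (computed at query time)
```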

Generally speaking, a consumer-facing data product has strict response-time requirements and is not well suited to pre-computation; at the same time, because of the huge data volume (tens of trillions of rows, PB-scale), overall cost is also an important consideration.

OLAP engine selection

Data Bank's OLAP workload poses several major challenges:

Data volume: the raw data volume is very large, with 1.6 PB and 22 trillion records of historical data in total, including tables above the ten-trillion-row level, all stored in AnalyticDB.

Write performance: 60 billion new rows (about 10 TB in total) are written per day, and the baseline import must complete within the two hours from 7:00 to 9:00 every morning, requiring a write TPS of at least 10 million.

Complex query performance: query types are complex, mostly large interactive analyses; a 2-trillion-row table must be filtered and then joined with 8 other tables, with results returned within 10 seconds.

Export performance: fast result-set export. The business needs to export selected crowds for further analysis, so the computed results in AnalyticDB must be exported to MaxCompute easily and efficiently, supporting result-export tasks above the 20-million-row level per minute.

Cost: given PB-scale data, storing everything on SSD would be too expensive, so we wanted tiered hot/cold storage that delivers online performance at a price close to offline storage.

Stability: the workload is complex; in real online scenarios the challenges above occur simultaneously, and the system must run smoothly under such a mixed workload.

After technical research and testing, AnalyticDB was selected as the basic platform for Data Bank's analysis business, for the following reasons:

1. Hot/cold data tiering

AnalyticDB's enterprise-grade hot/cold data separation greatly improves the price-performance of storage. At table granularity, you can choose hot tables (stored on ESSD), cold tables (stored on OSS), or warm tables (mixed mode, partly on ESSD and partly on OSS). Customers can specify the policy freely according to business needs and change it at any time; with one copy of storage and one set of syntax, federated queries are easy. Most of our tables are cold tables, and AnalyticDB provides an SSD cache to accelerate cold-table access, reducing cost while preserving performance. Data Bank stores tens of trillions of rows in AnalyticDB, occupying about 2 PB of storage; AnalyticDB has become Data Bank's core warehouse storage and is expected to keep growing.

2. High-throughput real-time writes

The underlying layer of AnalyticDB adopts a distributed parallel architecture to achieve extremely high write/import throughput; massive data can be written in real time at tens of millions or even billions of TPS. For offline aggregated tables, AnalyticDB also provides a pass-through batch-load path that loads data directly for online queries with high throughput.
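As a rough illustration of why batching matters for import throughput, here is a sketch using sqlite3 as a stand-in (table and column names are invented; AnalyticDB's actual batch-load path works differently and in parallel across nodes):

```python
# Batched writes, sketched with sqlite3 standing in for AnalyticDB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aipl (customer_id INTEGER, brand_id INTEGER, aipl_status INTEGER)"
)

# Invented rows: 10,000 consumers spread over 20 brands and 4 AIPL states.
rows = [(i, i % 20, i % 4) for i in range(10_000)]

# Sending many rows per round trip is what makes the import fast;
# real systems additionally parallelize across partitions/nodes.
conn.executemany("INSERT INTO aipl VALUES (?, ?, ?)", rows)
conn.commit()

count, = conn.execute("SELECT COUNT(*) FROM aipl").fetchone()
print(count)  # 10000
```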

3. Strong real-time computing with high concurrency and low latency

All three business query scenarios have strict response-time requirements. Most queries are filtered aggregations and joins over trillions of rows. With cold-data caching, preheating, and other techniques, AnalyticDB keeps the average response time of these queries under 10 seconds.

Data model and table design

Data Bank mainly stores four types of data in AnalyticDB:

1. AIPL data, that is, brand-consumer relationships

2. Tag data, that is, consumer attributes

3. Touchpoint data, that is, consumer behaviors

4. Crowd data, that is, the crowds that Data Bank users have saved in Data Bank

Because the analysis object in every scenario is the consumer ID, most tables use the consumer ID as the distribution key, which minimizes data shuffling (redistribution) during queries. The following sections describe the table designs for AIPL and tag data; the touchpoint and crowd table designs are similar and are not repeated here.

1. AIPL data

AIPL data is partitioned by day, but because of the volume generated each day (more than 50 billion rows), importing still takes a long time even with AnalyticDB's excellent batch-import performance. Since AnalyticDB does not support multi-level partitioning, we split the AIPL table into 20 sub-tables along the brand dimension, which improves import performance on the one hand and query performance on the other.

Besides AIPL analysis at the brand dimension, Data Bank can also drill down to the second-level category dimension. There are two ways to support the second-level category:

1. Add the second-level category dimension to the original AIPL table.

2. Add a second set of AIPL tables that carry the second-level category dimension.

The first scheme saves storage and only needs one set of tables to be imported; the second scheme has better query performance.

After verification, we chose the second scheme. Queries without a second-level category account for a large share of Data Bank traffic, and under the first scheme such queries would require a GROUP BY on consumer ID, which consumes more memory during execution, supports lower concurrency, and performs worse. Thanks to AnalyticDB's low storage cost, the second scheme did not significantly increase cost.

Meanwhile, because brand ID is a mandatory dimension in every query and AIPL status is used frequently, setting brand ID and AIPL status as clustered columns at table creation effectively reduces query IO and improves query performance.

The structures of the two sets of tables are as follows (schematic):

-- AIPL table without the secondary category dimension
CREATE TABLE `aipl_[001-020]` (
  `customer_id` bigint,
  `brand_id` bigint,
  `aipl_status` int,
  `day` bigint
)
DISTRIBUTE BY HASH (`customer_id`)
PARTITION BY VALUE (`day`)
CLUSTERED BY (`brand_id`, `aipl_status`);

-- AIPL table with the secondary category dimension
CREATE TABLE `aipl_cate_[001-020]` (
  `customer_id` bigint,
  `brand_id` bigint,
  `cate_id` bigint,
  `aipl_status` int,
  `day` bigint
)
DISTRIBUTE BY HASH (`customer_id`)
PARTITION BY VALUE (`day`)
CLUSTERED BY (`brand_id`, `cate_id`, `aipl_status`);
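To make the link-flow idea concrete, the following sketch self-joins two daily snapshots of an AIPL-style table to count state transitions between two dates. It uses sqlite3 as a stand-in for AnalyticDB, and the brand ID, dates, and rows are invented:

```python
# Link-flow sketch: count consumers moving between AIPL states across two days.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aipl_001 (customer_id INT, brand_id INT, aipl_status INT, day INT)"
)
conn.executemany(
    "INSERT INTO aipl_001 VALUES (?, ?, ?, ?)",
    [
        (1, 7, 1, 20221111), (1, 7, 3, 20231111),  # consumer 1: A -> P
        (2, 7, 1, 20221111), (2, 7, 3, 20231111),  # consumer 2: A -> P
        (3, 7, 3, 20221111),                        # consumer 3: churned
    ],
)

sql = """
SELECT t1.aipl_status AS from_state,
       COALESCE(t2.aipl_status, 0) AS to_state,   -- 0 = no longer in AIPL
       COUNT(*) AS consumers
FROM aipl_001 t1
LEFT JOIN aipl_001 t2
  ON t2.customer_id = t1.customer_id AND t2.day = 20231111
WHERE t1.day = 20221111 AND t1.brand_id = 7
GROUP BY 1, 2
"""
for row in conn.execute(sql):
    print(row)  # (1, 3, 2) and (3, 0, 1)
```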

2. Tag data

In general, because of multi-value tags (for example, a consumer can have multiple interests), a tag table would be designed as a key-value structure, as shown below:

CREATE TABLE `tag` (
  `customer_id` bigint,
  `tag_key` int,
  `tag_value` int
)
DISTRIBUTE BY HASH (`customer_id`);

However, Data Bank lets users choose multiple tags and combine them with intersection, union, and difference during crowd selection, and with the kv structure, intersecting the consumer-ID sets of multiple tag values in memory performs poorly.

Using AnalyticDB's multi-value column types (multivalue or JSON columns), Data Bank's tag table has the following structure (schematic):

CREATE TABLE `tag` (
  `customer_id` bigint,
  `tag1` int,
  `tag2` int,
  `tag3` multivalue,
  ...
)
DISTRIBUTE BY HASH (`customer_id`);

The intersection, union, and difference of multiple tags then translate into AND/OR predicates on the tag table. However, with more than 200 tags, a single wide table not only imports slowly but also suffers serious data bloat from the blank values that must be filled in (if a customer_id has no value for a tag, a sentinel value is written). So, similar to the AIPL split, the tag table is also split into more than a dozen tables by topic.
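A small sketch of why the wide layout helps: a multi-tag intersection becomes AND/OR predicates in one scan of one table, instead of a self-join on a kv table. sqlite3 stands in for AnalyticDB, and the tag names, codes, and data are invented:

```python
# Multi-tag intersection as predicates over a wide tag table (toy data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag (customer_id INT, gender INT, age_band INT)")
conn.executemany(
    "INSERT INTO tag VALUES (?, ?, ?)",
    [(1, 1, 3), (2, 2, 3), (3, 1, 5), (4, 1, 3)],
)

# "gender = 1 AND age_band = 3" is one predicate over one table,
# instead of intersecting two ID sets produced by a kv self-join.
ids = [r[0] for r in conn.execute(
    "SELECT customer_id FROM tag WHERE gender = 1 AND age_band = 3")]
print(ids)  # [1, 4]
```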

Accelerating crowd selection

Data Bank materializes the crowds users select (that is, it saves the consumer-ID lists) for subsequent operations. Because a crowd selection can involve intersections, unions, and differences across dozens of subqueries, selection takes a long time and intermediate results can be very large. Data Bank therefore splits one selection into multiple query fragments and sends them to ADB for execution; the fragment results (consumer-ID lists) are then intersected, unioned, and differenced in ETL to complete the crowd selection.

The whole crowd-selection process makes full use of ADB's indexing and its mixed online/offline workload capability, which not only speeds up selection but also improves overall resource utilization, especially for queries with highly selective filters (such as selections involving AIPL). Meanwhile, ADB's cloud-native elasticity easily absorbs the peak pressure of the Singles' Day period.

Business value

Overall, the business value of AnalyticDB for Data Bank is mainly reflected in the following aspects:

1. High-performance OLAP engine: with up to 22 trillion rows occupying 1.6 PB of storage, Data Bank achieves an average query response of 3-5 seconds, enabling second-level OLAP in the product and giving users a flexible, efficient analysis tool.

2. Cost reduction: with AnalyticDB's hot/cold tiered storage plus pay-as-you-go pricing, data costs dropped greatly while performance was preserved, about 46% lower than the previous generation.

3. Elasticity: Data Bank's mixed crowd-selection mode is built on AnalyticDB, and during major promotions AnalyticDB's cloud-native elasticity allows resources to be scaled out quickly to handle peaks.

This concludes the discussion of how AnalyticDB supports ultra-large-scale, low-cost real-time analysis for Data Bank. I hope the above content is helpful. If you found the article useful, feel free to share it so more people can see it.
