Introduction to the Apache Doris data model

This article introduces the Apache Doris data model: the basic concepts of rows and columns, the three data models Doris provides (Duplicate, Aggregate, and Unique), worked examples of each, and suggestions for choosing between them.
Basic concepts
In Doris, data is logically described in the form of tables (Table). A table consists of rows (Row) and columns (Column). A Row is a single record of user data; a Column describes one field within a record.
Columns can be divided into two main categories: Key and Value. From a business perspective, Key and Value correspond to dimension columns and metric columns, respectively.
The data model of Doris falls into three main categories:
Duplicate: the detail model
Aggregate: the aggregation model
Unique: the unique primary key model
Let's introduce each of them.
Duplicate model
The Duplicate (detail) model is the data model Doris uses by default. It does no processing on imported data: the data in the table is exactly the raw data the user imported. For example, consider a table for storing log data with the following schema:
| ColumnName | Type | SortKey | Comment |
| --- | --- | --- | --- |
| timestamp | DATETIME | Yes | log time |
| type | INT | Yes | log type |
| error_code | INT | Yes | error code |
| error_msg | VARCHAR(1024) | No | error details |
| op_id | BIGINT | No | id of the person responsible |
| op_time | DATETIME | No | processing time |
The corresponding CREATE TABLE statement is as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `timestamp` DATETIME NOT NULL COMMENT "log time",
    `type` INT NOT NULL COMMENT "log type",
    `error_code` INT COMMENT "error code",
    `error_msg` VARCHAR(1024) COMMENT "error details",
    `op_id` BIGINT COMMENT "id of the person responsible",
    `op_time` DATETIME COMMENT "processing time"
)
DUPLICATE KEY(`timestamp`, `type`)
... /* omit Partition and Distribution information */;
The DUPLICATE KEY specified in the CREATE TABLE statement only indicates which columns the underlying data is sorted by. (A more apt name would be "Sorted Column"; the name "DUPLICATE KEY" simply makes the data model in use explicit. For more on sorted columns, see the relevant documentation.) When choosing the DUPLICATE KEY, we recommend selecting the first 2-4 columns as appropriate.
This data model is suitable for storing raw data that has neither aggregation requirements nor a primary key uniqueness constraint. At the same time, users can build aggregated views on top of this model through the materialized view feature (a sketch follows), so it is a recommended data model.
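As a hedged illustration only: on the log table above, a pre-aggregated view could be created roughly as follows. The view name error_cnt_per_type and the chosen aggregation are assumptions for this sketch, not part of the original example.

-- Sketch: a materialized view counting errors per log type,
-- built on top of the Duplicate table example_db.example_tbl.
CREATE MATERIALIZED VIEW error_cnt_per_type AS
SELECT `type`, COUNT(`error_code`)
FROM example_db.example_tbl
GROUP BY `type`;

Queries that group by `type` and count `error_code` can then be answered from the view instead of scanning the raw detail rows.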
Aggregate model
The Aggregate model requires users to explicitly divide columns into Key columns and Value columns when creating a table. The model then automatically aggregates rows with the same Key on their Value columns.
We use practical examples to illustrate what the aggregation model is and how to use it correctly.
Example 1: aggregation on import
Suppose the business has the following data table schema:
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | | user id |
| date | DATE | | date the data was imported |
| city | VARCHAR(20) | | city the user is in |
| age | SMALLINT | | user age |
| sex | TINYINT | | user gender |
| last_visit_date | DATETIME | REPLACE | user's last visit time |
| cost | BIGINT | SUM | total user consumption |
| max_dwell_time | INT | MAX | user's maximum dwell time |
| min_dwell_time | INT | MIN | user's minimum dwell time |
Converted to a CREATE TABLE statement, this is as follows (Partition and Distribution information omitted):
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "date the data was imported",
    `city` VARCHAR(20) COMMENT "city the user is in",
    `age` SMALLINT COMMENT "user age",
    `sex` TINYINT COMMENT "user gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "user's last visit time",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total user consumption",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "user's maximum dwell time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user's minimum dwell time"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
... /* omit Partition and Distribution information */;
As you can see, this is a typical fact table of user information and access behavior. In a normal star schema, user information and access behavior are usually stored in a dimension table and a fact table, respectively. Here, to explain Doris's data model more conveniently, we store both kinds of information in a single table.
The columns in the table are divided into Key (dimension columns) and Value (metric columns) according to whether an AggregationType is set. Columns without an AggregationType, such as user_id, date, and age, are Key columns; columns with an AggregationType set are Value columns.
When we import data, rows with identical Key columns are aggregated into a single row, and each Value column is aggregated according to its AggregationType. There are currently four aggregation methods:
SUM: summation; the Values of multiple rows are accumulated.
REPLACE: replacement; the Value in the next batch of data replaces the Value in previously imported rows.
MAX: keep the maximum value.
MIN: keep the minimum value.
Suppose we have the following imported data (raw data):
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Let's assume this is a table recording user visits to a certain product page. Taking the first row of data as an example, it reads as follows:
| Data | Description |
| --- | --- |
| 10000 | user id; uniquely identifies each user |
| 2017-10-01 | date the data was imported |
| Beijing | city the user is in |
| 20 | user age |
| 0 | user gender (0 male, 1 female) |
| 2017-10-01 06:00:00 | time the user visited this page, accurate to the second |
| 20 | consumption generated by this visit |
| 10 | dwell time of this visit on the page |
| 10 | dwell time of this visit on the page (redundant) |
Then when the data is correctly imported into Doris, the final storage in Doris is as follows:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
As you can see, user 10000 has only one row of aggregated data left. The data of other users is consistent with the original data. Let's first explain the aggregated data of user 10000:
The first five columns are unchanged; the changes start with column 6, last_visit_date:
2017-10-01 07:00:00: because the aggregation type of the last_visit_date column is REPLACE, 2017-10-01 07:00:00 has replaced 2017-10-01 06:00:00 and been saved.
Note: for data within the same import batch, the replacement order of REPLACE aggregation is not guaranteed. In this example, the value finally saved could also have been 2017-10-01 06:00:00. For data from different import batches, however, it is guaranteed that data from the later batch replaces that from the earlier batch.
35: because the aggregate type of the cost column is SUM, 35 is accumulated from 20 + 15.
10: because the aggregation type of the max_dwell_time column is MAX, take the maximum values of 10 and 2, and get 10.
2: because the aggregate type of the min_dwell_time column is MIN, take the minimum values of 10 and 2 to get 2.
After aggregation, Doris ultimately stores only the aggregated data. In other words, the detail data is lost, and users can no longer query the pre-aggregation detail data; the sketch below makes this concrete.
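A hedged sketch of what this means in practice; the query itself is illustrative and not from the original example:

-- Sketch: after import, querying user 10000 returns only the aggregated row.
-- The two original detail rows (cost 20 and cost 15) can no longer be retrieved.
SELECT * FROM example_db.example_tbl WHERE `user_id` = 10000;
-- Expected single row:
-- 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2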
Example 2: keeping detail data
Following Example 1, we modify the table structure as follows:
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | | user id |
| date | DATE | | date the data was imported |
| timestamp | DATETIME | | time the data was imported, accurate to the second |
| city | VARCHAR(20) | | city the user is in |
| age | SMALLINT | | user age |
| sex | TINYINT | | user gender |
| last_visit_date | DATETIME | REPLACE | user's last visit time |
| cost | BIGINT | SUM | total user consumption |
| max_dwell_time | INT | MAX | user's maximum dwell time |
| min_dwell_time | INT | MIN | user's minimum dwell time |
That is, a timestamp column has been added, recording the time the data was imported, accurate to the second. A sketch of the modified CREATE TABLE statement follows.
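A minimal sketch of the modified statement, inferred from the schema table above (the original article does not show this statement):

CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "date the data was imported",
    `timestamp` DATETIME NOT NULL COMMENT "time the data was imported, accurate to the second",
    `city` VARCHAR(20) COMMENT "city the user is in",
    `age` SMALLINT COMMENT "user age",
    `sex` TINYINT COMMENT "user gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "user's last visit time",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total user consumption",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "user's maximum dwell time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user's minimum dwell time"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
... /* omit Partition and Distribution information */;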
The imported data is as follows:
| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Then when the data is correctly imported into Doris, the final storage in Doris is as follows:
| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
We can see that the stored data is exactly the same as the imported data, with no aggregation at all. This is because, with the timestamp column added, the Keys of the rows in this batch are never exactly identical. That is, as long as the Keys of the imported rows are not exactly identical, Doris keeps the complete detail data even under the aggregation model.
Example 3: aggregating imported data with existing data
Continuing from Example 1, suppose the table already contains the following data:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Let's import a new batch of data:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
Then when the data is correctly imported into Doris, the final storage in Doris is as follows:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
You can see that the existing data for user 10004 has been aggregated with the newly imported data, and the data for user 10005 has been added.
There are three stages of data aggregation in Doris:
The ETL phase of each batch of data import. This phase aggregates within the imported data of each batch.
The phase in which the underlying BE performs data Compaction. At this stage, BE further aggregates the data of different batches that have been imported.
Data query phase. In the data query, the data involved in the query will be aggregated accordingly.
The degree to which data has been aggregated may vary over time. For example, a freshly imported batch may not yet have been aggregated with pre-existing data. But users can only ever query fully aggregated results; that is, the varying degrees of aggregation are transparent to user queries. Users should always assume the data exists at its final degree of aggregation and should not rely on aggregation not yet having taken place. (See the "Limitations of the aggregation model" section for more details.)
Unique model
In some multi-dimensional analysis scenarios, users care more about how to guarantee the uniqueness of the Key, that is, how to obtain a Primary Key uniqueness constraint. For this, Doris introduces the Unique data model. This model is essentially a special case of the aggregation model and a simplified way of describing a table structure. Let's look at an example.
| ColumnName | Type | IsKey | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | Yes | user id |
| username | VARCHAR(50) | Yes | user nickname |
| city | VARCHAR(20) | No | city the user is in |
| age | SMALLINT | No | user age |
| sex | TINYINT | No | user gender |
| phone | LARGEINT | No | user phone |
| address | VARCHAR(500) | No | user address |
| register_time | DATETIME | No | user registration time |
This is a typical table of basic user information. This kind of data has no aggregation requirement; we only need to guarantee primary key uniqueness (the primary key here being user_id + username). The CREATE TABLE statement is as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` BIGINT NOT NULL COMMENT "user id",
    `username` VARCHAR(50) NOT NULL COMMENT "user nickname",
    `city` VARCHAR(20) COMMENT "city the user is in",
    `age` SMALLINT COMMENT "user age",
    `sex` TINYINT COMMENT "user gender",
    `phone` LARGEINT COMMENT "user phone",
    `address` VARCHAR(500) COMMENT "user address",
    `register_time` DATETIME COMMENT "user registration time"
)
UNIQUE KEY(`user_id`, `username`)
... /* omit Partition and Distribution information */;
This table structure is exactly the same as the following table structure described using the aggregation model:
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | | user id |
| username | VARCHAR(50) | | user nickname |
| city | VARCHAR(20) | REPLACE | city the user is in |
| age | SMALLINT | REPLACE | user age |
| sex | TINYINT | REPLACE | user gender |
| phone | LARGEINT | REPLACE | user phone |
| address | VARCHAR(500) | REPLACE | user address |
| register_time | DATETIME | REPLACE | user registration time |
And the corresponding CREATE TABLE statement:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` BIGINT NOT NULL COMMENT "user id",
    `username` VARCHAR(50) NOT NULL COMMENT "user nickname",
    `city` VARCHAR(20) REPLACE COMMENT "city the user is in",
    `age` SMALLINT REPLACE COMMENT "user age",
    `sex` TINYINT REPLACE COMMENT "user gender",
    `phone` LARGEINT REPLACE COMMENT "user phone",
    `address` VARCHAR(500) REPLACE COMMENT "user address",
    `register_time` DATETIME REPLACE COMMENT "user registration time"
)
AGGREGATE KEY(`user_id`, `username`)
... /* omit Partition and Distribution information */;
That is, the Unique model can be implemented entirely with the REPLACE method of the aggregation model; the internal implementation and data storage are exactly the same. A short sketch of the resulting replace-on-import behavior follows.
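A hedged sketch only; the INSERT statements and values are illustrative and not from the original article:

-- Two imports with the same key (user_id, username); each INSERT is its own batch.
INSERT INTO example_db.example_tbl (`user_id`, `username`, `city`)
VALUES (10000, "alice", "Beijing");
INSERT INTO example_db.example_tbl (`user_id`, `username`, `city`)
VALUES (10000, "alice", "Shanghai");

-- Because the non-key columns behave as REPLACE, the later batch wins:
SELECT `city` FROM example_db.example_tbl
WHERE `user_id` = 10000 AND `username` = "alice";
-- Expected result: Shanghai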
Limitations of the aggregation model
The aggregation model (including the Unique model) uses pre-computation to reduce the amount of data that must be computed at query time, which speeds up queries. However, the model has limitations in use.
In the aggregation model, what the model presents is the finally aggregated data. In other words, any data that has not yet been aggregated (for example, data from two different import batches) must be presented in a way that guarantees consistency. Let's take an example.
Suppose the table structure is as follows:
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | | user id |
| date | DATE | | date the data was imported |
| cost | BIGINT | SUM | total user consumption |
Suppose there are two batches of data in the storage engine that have been imported:
Batch 1
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Batch 2
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
As you can see, the data belonging to user 10001 in the two import batches has not yet been aggregated. However, to guarantee consistency, the user can only ever query the following final aggregated data:
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
To achieve this, we add an aggregation operator to the query engine, guaranteeing the external consistency of the data, as the sketch below illustrates.
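A hedged sketch of what this guarantees in practice; the query is illustrative, using the document's placeholder table name:

-- Whether or not compaction has merged the two batches on disk,
-- this query always returns the fully aggregated value.
SELECT cost FROM table
WHERE user_id = 10001 AND date = '2017-11-20';
-- Expected result: 51 (50 from batch 1 + 1 from batch 2)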
In addition, on aggregate columns (Value), care is needed about semantics when executing aggregate queries that are inconsistent with the column's aggregation type. For example, executing the following query on the example above:
SELECT MIN(cost) FROM table;
The result is 5, not 1, because MIN is applied to the aggregated data: the 1 from batch 2 has already been summed into the 51 for (10001, 2017-11-20) before MIN is evaluated.
At the same time, this consistency guarantee can greatly reduce the efficiency of some queries.
Let's take the most basic count(*) query as an example:
SELECT COUNT(*) FROM table;
In other databases, such queries return results quickly. In their implementations, the count can be obtained at query time with very little overhead, for example by "counting rows on import and saving the count statistics", or by "scanning only a single column of data to obtain the count". But in Doris's aggregation model, this kind of query is very expensive.
Let's take the data just now as an example:
Batch 1
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Batch 2
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
Because the final aggregate result is:
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
Therefore, the correct result of select count(*) from table; should be 4. But if we scan only the user_id column and apply aggregation at query time, the result is 3 (10001, 10002, 10003); if we do not apply aggregation at query time, the result is 5 (five rows across the two batches). Both results are wrong.
To get the correct result, we must read both the user_id and date columns and aggregate at query time to return the correct result of 4. That is, for a count(*) query, Doris must scan all AGGREGATE KEY columns (here, user_id and date) and aggregate them to obtain a semantically correct result. When there are many aggregate key columns, a count(*) query has to scan a large amount of data.
Therefore, when the business needs frequent count(*) queries, we recommend simulating count(*) by adding a column whose value is the constant 1 and whose aggregation type is SUM. For the table in the example just above, we modify it as follows:
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | | user id |
| date | DATE | | date the data was imported |
| cost | BIGINT | SUM | total user consumption |
| count | BIGINT | SUM | used for counting |
We add a count column and always import 1 as its value. Then the result of select count(*) from table; is equivalent to that of select sum(count) from table;, and the latter is far more efficient than the former. However, this approach has a limitation of its own: users must guarantee that rows with identical AGGREGATE KEY columns are never imported more than once. Otherwise, select sum(count) from table; expresses only the number of rows originally imported, not the semantics of select count(*) from table;.
Another way is to set the aggregation type of the count column to REPLACE, still with a constant value of 1. Then select sum(count) from table; and select count(*) from table; produce consistent results, and there is no restriction on importing duplicate rows. A sketch of both variants follows.
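A minimal sketch of the two variants, inferred from the schema table above; the statement itself is not shown in the original article:

-- Variant 1: count aggregated with SUM. Correct only if rows with
-- identical AGGREGATE KEY columns are never imported twice.
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` BIGINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "date the data was imported",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total user consumption",
    `count` BIGINT SUM DEFAULT "1" COMMENT "used for counting; always import 1"
)
AGGREGATE KEY(`user_id`, `date`)
... /* omit Partition and Distribution information */;

-- Variant 2: declare the count column instead as
--     `count` BIGINT REPLACE DEFAULT "1" COMMENT "used for counting"
-- which also tolerates repeated imports of the same rows.

-- Either way, count(*) is then simulated by:
SELECT SUM(`count`) FROM example_db.example_tbl;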
Duplicate model
The Duplicate model has none of these aggregation-model limitations. Because the model involves no aggregation semantics, a count(*) query can obtain a semantically correct result by scanning any single column.
Suggestions for the selection of data models
Because the data model is determined when the table is created and cannot be modified afterwards, choosing an appropriate data model is very important.
Through pre-aggregation, the Aggregate model can greatly reduce the amount of data scanned and the amount of computation in aggregate queries, so it is well suited to report query scenarios with fixed patterns. But the model is unfriendly to count(*) queries. Also, because the aggregation method on each Value column is fixed, semantic correctness must be kept in mind when performing other kinds of aggregate queries on those columns.
The Unique model guarantees primary key uniqueness for scenarios that require a unique primary key constraint. However, it cannot exploit the query advantages of pre-aggregation such as ROLLUP (because it is essentially REPLACE, there is no aggregation method such as SUM).
The Duplicate model is suitable for ad-hoc queries over arbitrary dimensions. Although it likewise cannot take advantage of pre-aggregation, it is not constrained by the aggregation model and can exploit the advantages of columnar storage (reading only the relevant columns, with no need to read all Key columns).
This concludes the introduction to the Apache Doris data model. Pairing theory with practice is the best way to learn, so go and try it out!