What are the design specifications of MaxCompute table 07/02 Update SLTechnology News&Howtos

What are the design specifications of MaxCompute table

2025-07-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)06/01 Report--

MaxCompute table design specifications, I believe that many inexperienced people do not know what to do, so this paper summarizes the causes of the problem and solutions, through this article I hope you can solve this problem.

Restrictions for the table

Table (Table) Design specifications the main objectives of table design

Selection of Partition Field and ordinary Field

Partition fields are defined according to

The definition of the number of zones is based on

Recommendations for number of partitions and amount of data

Standardize the data life cycle by data tiering

According to the change of data and the preservation of historical specification data

The influence of table design

Table design steps

Table data storage specification

Data Import Channel and Table Design

Correspondence between Partition Design and logical Storage

Basic rules for table and partition design

Zoning design

Restrictions for the table

Cdn.com/a774991fbfcd827098aeed14ae60230b787b1fcd.png ">

Table (Table) Design specifications the main objectives of table design

Reduce storage costs. Appropriate table design can reduce redundant storage in data hierarchical design and reduce the amount of data in intermediate tables. At the same time, the correct life cycle management of table data can directly reduce the amount of stored data and reduce the storage cost.

Reduce the cost of calculation. Standardize the design of the table so that in the subsequent process of query and calculation of the table data, we can optimize the reading of the data according to these specifications, reduce the redundant reading, writing and calculation in the calculation process, improve the computing performance and reduce the cost.

Reduce the complexity of maintenance. Standardized hierarchical table design can directly reflect the characteristics of the business. For example, by optimizing the data acquisition mode in the data channel and standardizing the design of the table, we can reduce the problem of small files in the distributed system, but also reduce the complexity of the number of tables and partition maintenance.

The influence of table design

Affected operations: table creation / input data / table update / table deletion / table management. Import data scenarios (distinguish between real-time data collection and offline batch data writing):

Import is query and calculation.

Multiple imports, regular query and calculation.

After import, an intermediate table is generated for calculation.

Note:

Reasonable table design and data integration cycle management can reduce the cost of data during storage. -MaxCompute is preferred as a batch data integration library and calculated by business logic, such as by partition.

Query and calculation immediately after import, you need to consider the amount of data imported each time to reduce the flow of small data import.

Unreasonable data import and storage (small files) will affect the overall storage performance, computing performance, operation and maintenance stability.

Table design steps

Determine the project space, according to the business process planning table type, which data level it belongs to.

Definition table description, permission definition and Owner definition.

Define partition table or non-partition table according to the amount of data and the characteristics of data integration.

Define fields, or partition fields

Table creation / table conversion

Identify the relevant factors of the import data scenario (including bulk data writing / streaming data writing / bar data insertion).

Define the lifecycle of table and partition data.

Note:

After the table is created, you can modify the table schema according to business changes, such as setting the life cycle, RangeClustering.

Special attention needs to be paid to distinguishing data scenarios during the design phase (bulk data writing / streaming data writing / periodic bar data insertion).

Rational use of non-partitioned tables and partitioned tables. Log tables, fact tables, original collection tables, etc., are recommended to use partition tables, partitioned according to time.

Pay attention to the constraints of various tables and partitions.

Table data storage specification data life cycle by data layering specification

Source table ODS layer: all the data synchronized from the business system every day is retained, and the life cycle definition is permanently saved. In case downstream data is corrupted, it can be recovered from ODS. If ODS synchronizes the full scale every day, it can be compressed and stored by zipper.

Data warehouse (basic) layer: keep at least one complete copy of the total data (there is no need to redundant multiple copies like ODS). For performance reasons, you can consider dismantling tables or partitioning.

Data Mart layer: keep it for 1 "3 years" as needed. The data of the data Mart is easy to generate, so there is no need to retain historical data of such a long time.

According to the change of data and the preservation of historical specification data

Will change how the data is stored:

Customer attributes, product attributes change every day, record the historical changes of these attributes, in order to facilitate the tracing of the value of a certain point in time.

The field of the redundant dimension table in the fact table, that is, bind the various dimension attribute values of "at the time of the event" to the event. It is more convenient for users to use data without associating multiple tables and use it in the data application layer.

Record the changes in the dimension table in the form of a zipper table or a daily snapshot. It is more convenient for data processors, the data structure is flexible, the expansion side is convenient, easy to manage, and the data consistency is better. Used in the data base layer.

Data Import Channel and Table Design

Channel type:

Datahub, plan the relationship between write partitions and write traffic, and achieve 64m commit at once.

Data integration or DataX, plan to write the frequency of table partitions, so that 64m commit at a time to avoid commit empty directory. DTS, plan the relationship between the written table stock partition and the incremental partition, and set the commit frequency.

Console (Run SQL or Tunnel upload) to avoid inserting or uploading files with high frequency and small amount of data.

For the insert into of SDK Run Sql, when uploading tables or partitions, you need to pay attention to sorting out small files after inserting them into a partition, so as to avoid inserting a partition or non-partition table multiple times. Merge is required after insertion.

Note:

The only channel for MaxCompute to import data is Tunnel SDK or Insert into that executes SQL to avoid streaming insertion.

Each of the above channels has its own logic for streaming data writing, batch data writing, periodic scheduling writing.

When writing a table or partition in a data channel, it is necessary to control the amount of data written at a time to a reasonable value such as 64m.

Correspondence between Partition Design and logical Storage

As shown in the figure above, the table has a total of m first-level partitions, each first-level partition stores the second-level partition in time, and each second-level partition stores all the columns. Considerations for designing partitions:

Limit the number of partitions.

Avoid only a small amount of data in each partition.

Query and calculate according to partition conditions.

Avoid multiple data writes in each partition.

Basic rules for table and partition design

All table and field names should use a unified naming convention.

Be able to distinguish the business types of the table.

It is necessary to be able to distinguish whether the table is a "fact table" or a "dimension table", a "log table", and a "limit storage table" (to be released).

Be able to distinguish the entity information of the table.

Fields with the same business meaning in different tables need to define a unified data type:

Avoid unnecessary type conversion.

General rules for zoning design and use:

New partitions are supported, but new partition fields are not supported.

The number of partitions supported by a single table is 60,000.

For a multi-level partitioned table, if you want to add a new partition, you must specify all the partition values.

Changing the partition column name is not supported, only the corresponding value of the partition column can be modified. Modify one or more partition values of a multi-level partition, and the partition value of each level of the multi-level partition must be written.

Selection of Partition Design Partition Field and ordinary Field

The role of the partition field:

Facilitate the management of data.

Divide the scope of data scanning.

When you create a table, you can set normal fields and partition fields. In most cases, the ordinary field can be understood as the data of the data file, while the partition field can be understood as the directory of the file system. The storage space of the table is occupied by ordinary fields. Although the partition column does not store data directly, it is like the directory in the file system, which is convenient for data management. at the same time, if a specific partition is specified in the calculation, only the corresponding partition is queried in the calculation process, so as to reduce the calculation input. The number of partition columns in a partition table cannot exceed 6 levels, which can also be understood to mean that the number of directory layers in which the underlying data is stored cannot exceed 6 layers. By setting the appropriate life cycle of the partition table, some data can be managed according to the fine granularity of the partition.

Note:

The corresponding fields can be set to partition fields in terms of the scope of data management and the commonly used range of data scanning.

Set to normal fields for fields that do not have rules or whose number of types is greater than 10000 and are not often used as query conditions.

Partition fields are defined according to

Sort by priority:

The selection of district column should fully consider the time factor and avoid updating the stock partition as far as possible.

If there are multiple fact tables (excluding dimension tables) for join, the columns of the query condition where range are used as partitioned columns. Select

Select the column contained in group by or distinct as the partition column.

Select columns with evenly distributed values, and do not select columns with sloped partitions as partition columns.

Commonly used SQL contains the equivalent of a column or in query criteria, and select that column as the partition column.

For example:

Select... From table where id=123 and.; the number of partitions is defined according to.

Time zoning: can be partitioned by day or by month. If partitioned by hour, the average number of secondary partitions should not be more than 8.

Regional zoning: provinces, cities and counties are divided, and multi-level zoning is considered. 23 provinces, 5 autonomous regions, 4 municipalities directly under the Central Government, 2 special administrative districts; 50 regions (prefectures and leagues); 661 cities, including 4 municipalities directly under the Central Government; 283 prefecture-level cities; 374county-level cities; 1636 counties (autonomous counties, banners, autonomous banners, special zones and forest areas), zoning at the finest granularity at the county level should no longer be divided on an hourly basis.

The data under a single partition is recommended to submit 64m data once. If it is a multi-level partition, make sure that the data of the secondary partition under each finest partition is in accordance with this rule.

The number of partitions in a single table (including subordinate partitions) cannot exceed 60,000.

Recommendations for number of partitions and amount of data

It is an advantage of partitioning to use partition clipping when calculating.

It is recommended that the amount of data in a single partition is not too large. For example, there can be 10,000 data in a single partition, but 50,000 partitions have been built.

The tilt of partition data should be avoided as far as possible. The difference in the amount of data between different partitions of a single table has been checked by more than 1 million.

The number of partitions should be planned reasonably when designing partitions. Fine-grained partitions will affect the performance of SQL when scanning across partitions.

In the case of a large amount of data in a single partition, MaxCompute will do sharding processing when performing tasks, which will not affect the advantage of partition clipping.

When there are a large number of pieces in a single partition, it will affect the number of MaxComputeInstance, resulting in a waste of resources and the impact of SQL performance.

Multi-level partitioning is adopted, first by date, then by transaction type.

When you open a table, a transaction type is divided into a separate table, and each table is partitioned by date.

Dimension tables are not partitioned.

After reading the above, have you mastered the methods of the MaxCompute watch design specification? If you want to learn more skills or want to know more about it, you are welcome to follow the industry information channel, thank you for reading!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.