This article introduces the design methods of MaxCompute tables. It summarizes simple, practical guidelines intended to help answer common questions about how MaxCompute tables should be designed.
Best practices for MaxCompute table design
Operations that produce large numbers of small files
Small files in MaxCompute tables affect both storage and computing performance, so let's first look at which operations produce large numbers of small files, and avoid them when designing tables.
When uploading data with the MaxCompute Tunnel SDK, one file is generated for each commit. If each file is very small (for example, only a few KB) and uploads are frequent (for example, every 5 seconds), 720 small files are generated per hour and 17,280 per day.
When using the MaxCompute Tunnel SDK, creating a session and committing directly without uploading any data results in a large number of empty directories (which the server treats as small files).
When uploading with the Tunnel command of the MaxCompute console client, splitting a large local file into pieces that are too small results in too many files that are each too small after upload.
When data is archived through DataHub, each DataHub shard commits to MaxCompute under one of two conditions: the accumulated data reaches 64 MB, or 5 minutes have elapsed; each commit forms a file. So if a large number of shards are opened (for example, 20), and each shard receives far less than 64 MB within 5 minutes (say a few hundred KB), a large number of small files are produced: 24 × 12 × 20 = 5,760 small files per day.
When data is inserted incrementally (insert into) into MaxCompute tables (or table partitions) through data development tools such as DataWorks, each insert into generates a file. If each insert into writes 10 records and 10,000 records are inserted per day, 1,000 small files are generated.
When synchronizing data from a database such as RDS to MaxCompute through Aliyun DTS, DTS creates a full table and an incremental table. Because only a few records are inserted into the incremental table at a time, each commit produces only a little data, which causes small-file problems in the incremental table. For example, if synchronization runs every 5 minutes with 10 records each time and the daily increment is 10,000 records, 1,000 small files are generated. In this scenario, the full table and the incremental table need to be merged after data synchronization completes.
When there are too many source data collection clients and the source data is written directly into one partition through Tunnel, every commit from every collection client produces an independent file under that partition, resulting in a large number of small files.
When SLS triggers Function Compute at a sustained high frequency to write files into MaxCompute, small files are continuously streamed into MaxCompute.
Divide project space according to data
A project space (Project) is the top-level object in MaxCompute. Resources are allocated, isolated, and managed per project space, which provides multi-tenancy.
If multiple applications need to share data, it is recommended to use the same project space.
If the data required by different applications is unrelated, it is recommended to use separate project spaces. Tables and partitions can still be shared across project spaces through Package authorization, as sketched below.
Best practices for Dimension Table Design:
In general, tables that describe attributes are designed as dimension tables. A dimension table can be joined with tables in any table group; it requires no partition information when created, but the amount of data in a single table is limited. Pay attention to the following points when designing and using dimension tables:
A single dimension table should generally not exceed 10 million rows.
Dimension table data should not be updated in large volumes.
MapJoin can be used when joining a dimension table with other tables, as in the sketch below.
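A minimal MapJoin sketch, assuming a hypothetical large fact table fact_orders and a small dimension table dim_city; the hint asks MaxCompute to broadcast the small table to the workers instead of shuffling the large one:

select /*+ mapjoin(d) */
       f.order_id, f.amount, d.city_name
from fact_orders f
join dim_city d on f.city_id = d.city_id;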
Zipper table design: an application of extreme storage
The extreme storage feature has not yet been released; this section mainly presents the design idea. Background of zipper table design on MaxCompute: in the data model design process of a data warehouse, we often encounter requirements such as the following:
The amount of data is relatively large, and some fields in the table are updated over time, such as a user's address, a product description, an order status, or a mobile phone number.
Historical snapshot information at a certain point in time or over a period of time needs to be viewed (for example, checking the status of an order at some point in the past, or checking how many times a user was updated during a certain period).
The proportion and frequency of changes are not very large. For example, out of 10 million members in total, about 100,000 are added or changed each day. If a full copy of the table is kept every day, a large amount of unchanged information is stored repeatedly, which is a great waste of storage.
Consider using extreme storage: MaxCompute provides the ability to convert ordinary tables into extreme storage tables. An example of the extreme storage operations is as follows:
Create the source table.
create table src_tbl (key0 STRING, key1 STRING, col0 STRING, col1 STRING, col2 STRING) partitioned by (datestamp STRING, pt0 STRING);
Import data.
Turn src_tbl into a table of extreme storage.
set odps.exstore.primarykey=key0,key1;
[set odps.exstore.ignorekey=col0;]
EXSTORE exstore_tbl PARTITION (datestamp='20140801');
EXSTORE exstore_tbl PARTITION (datestamp='20140802');
Design of the collection source table
Data collection modes: streaming data writes, batch data writes, and record-by-record inserts by periodic scheduling.
When the data volume is large, make sure the data of each business unit is separated into its own partitions and tables; when the data volume is small, optimize the collection frequency instead.
Streaming data writes.
For streaming writes there are usually many collection channels, and the channels should be distinguished clearly. When a single channel writes a large amount of data, partition the table by time.
When a collection channel writes only a small amount of data, a non-partitioned table can be used, with the terminal type and collection time designed as ordinary columns.
When writing data through DataHub, plan the number of shards reasonably, and avoid the situation where too many shards leave each collection channel with very little traffic.
Batch data writes. For batch writes, pay attention to the write cycle, in particular record-by-record inserts triggered by periodic scheduling.
Avoid periodic record-by-record inserts. If they are unavoidable, create a partitioned table and insert into a new partition to reduce the impact on existing partitions, as in the sketch below.
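A sketch of the recommended batch pattern, with hypothetical table names (ods_collect_log as the target, ods_collect_staging as the accumulated batch): the whole batch is written into a fresh daily partition in a single job, so existing partitions are not touched:

insert overwrite table ods_collect_log partition (ds='20140801')
select device_id, event_time, payload
from ods_collect_staging
where ds = '20140801';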
Design of the log table
A log table is essentially a fact flow table and does not involve updating records: each collection appends one or more log entries, which are stored together. The main points to note in log table design are as follows:
Consider whether the logs need to be deduplicated.
Consider whether you need to extend dimension attributes.
When deciding whether to extend the log table with dimension attribute fields from an associated dimension table, consider two points: how frequently the business uses them, and whether the join would delay data output.
Choose carefully whether to denormalize dimension attributes into the log table.
Consider distinguishing terminal types.
Log tables are large, and business analysis usually breaks statistics down by PC and app. Since PC and app collection are two separate systems, the usual practice is to design a separate DWD detail table per terminal type.
If there are many terminal types but the data volume is small (for example, a terminal's data is under 1 TB but is collected frequently), consider not partitioning by terminal and instead storing the terminal information as an ordinary column.
Note:
For the partition design of the log table, partition by log collection time; aggregate the data before loading, and write and commit one batch at a time (typically 64 MB).
Log data rarely updates existing partitions; a small amount of data can be added with insert, but the number of such inserts should generally be limited.
If there are many modifications to write, use insert overwrite to avoid small-file problems.
Set reasonable partitions for the log table, and configure archiving for cold data that has not been accessed for a long time, as in the sketch below.
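A sketch of a log table along these lines, with hypothetical names and a 365-day lifecycle; the ALTER ... ARCHIVE statement at the end shows how a cold partition could be archived (assuming the archive feature is available in the project):

create table ods_app_log (
    device_id STRING,
    terminal  STRING,  -- e.g. 'PC' or 'APP', kept as an ordinary column
    content   STRING
)
partitioned by (ds STRING)
lifecycle 365;

-- archive a cold partition that is rarely accessed
alter table ods_app_log partition (ds='20140801') archive;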
Design of the interaction detail table
A periodic snapshot table stores a daily full snapshot of all collected (favorited) records.
Problem: a large number of records accumulate over time. Generating a snapshot every day requires merging the day's data with the previous day's full table, which consumes a lot of resources; and computing the number of new collections in the last day also requires scanning the full table. How can resource consumption be reduced?
The recommended scheme is to build a transactional fact table plus a periodic snapshot table that stores the currently valid collection records, so as to meet the statistical analysis needs of different businesses.
Note:
The most important thing when designing interaction detail tables is to distinguish stock data from incremental data. Data for a new partition can be written as incremental data.
Modifications and insertions to old partition data should be minimized.
When choosing between appending data and overwriting the full table or partition, prefer insert overwrite over insert into; a sketch follows.
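A sketch of how today's snapshot partition could be derived from yesterday's snapshot plus today's increment using insert overwrite; the table names dws_collect_snapshot and ods_collect_delta and the columns are hypothetical:

insert overwrite table dws_collect_snapshot partition (ds='20140802')
select s.user_id, s.item_id, s.status
from dws_collect_snapshot s
left outer join ods_collect_delta d
  on s.user_id = d.user_id and s.item_id = d.item_id and d.ds = '20140802'
where s.ds = '20140801' and d.user_id is null
union all
select user_id, item_id, status
from ods_collect_delta
where ds = '20140802';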
Update and delete operations of MaxCompute table data
Examples of how the delete/update/merge SQL supported by relational databases can be implemented on MaxCompute are as follows:
Table preparation
-- daily full table
table1 (key1 STRING, key2 STRING, col1 STRING, col2 STRING);
-- today's incremental table
table2 (key1 STRING, key2 STRING, col1 STRING, col2 STRING);
-- today's incremental table (deleted records)
table3 (key1 STRING, key2 STRING, col1 STRING, col2 STRING);
Update (apply the values of the records in table2 to table1)
insert overwrite table table1
select t1.key1
      ,t1.key2
      ,case when t2.key1 is not null then t2.col1 else t1.col1 end as col1
      ,case when t2.key1 is not null then t2.col2 else t1.col2 end as col2
from table1 t1
left outer join table2 t2 on t1.key1 = t2.key1 and t1.key2 = t2.key2;
Delete (remove from table1 the records that exist in table2)
insert overwrite table table1
select t1.key1, t1.key2, t1.col1, t1.col2
from table1 t1
left outer join table2 t2 on t1.key1 = t2.key1 and t1.key2 = t2.key2
where t2.key1 is null;
Merge (no del)
insert overwrite table table1
select * from (
    -- first exclude from the previous day's full table the records that also exist in today's
    -- increment; what remains are the records that were not updated today
    select t1.key1, t1.key2, t1.col1, t1.col2
    from table1 t1
    left outer join table2 t2 on t1.key1 = t2.key1 and t1.key2 = t2.key2
    where t2.key1 is null
    union all
    -- then merge in today's increment, which gives today's full data
    select t2.key1, t2.key2, t2.col1, t2.col2
    from table2 t2
) tt;
Merge (with del)
insert overwrite table table1
select * from (
    -- first exclude from the previous day's full table the records that also exist in today's
    -- increment, then exclude the records deleted today; what remains are the records that
    -- were not updated today
    select t1.key1, t1.key2, t1.col1, t1.col2
    from table1 t1
    left outer join table2 t2 on t1.key1 = t2.key1 and t1.key2 = t2.key2
    left outer join table3 t3 on t1.key1 = t3.key1 and t1.key2 = t3.key2
    where t2.key1 is null and t3.key1 is null
    union all
    -- then merge in today's increment, which gives today's full data
    select t2.key1, t2.key2, t2.col1, t2.col2
    from table2 t2
) tt;
Table creation design example
Scenario: weather information collection.
Basic information: the data includes place names, attribute information such as area and base population, and weather information.
The attribute data changes little; the weather information is collected by many terminals, and its data volume is large.
Weather information changes frequently, and with a stable number of terminals the collection traffic is basically stable.
Table Design Guide:
It is recommended to split the data into a basic attribute table and a weather log table, separating data that changes little from data that changes a lot.
Because the data volume is huge, the weather log table can be partitioned by region, with a secondary partition by time (for example, by day). This avoids having weather changes in one place or one time period touch partitions that are not involved.
Use DataHub on the collection side to aggregate the data, choose an appropriate number of shards based on the stable traffic, and write the data to the weather log table in batches rather than with insert into. A table-creation sketch follows.
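A table-creation sketch for this scenario, with hypothetical table and column names: a small attribute table plus a weather log table with region as the first-level partition, day as the second-level partition, and a lifecycle:

-- attribute (dimension) table: small, changes rarely
create table dim_location (
    location_id   STRING,
    location_name STRING,
    area          DOUBLE,
    population    BIGINT
);

-- weather log table: two-level partitions plus a lifecycle
create table ods_weather_log (
    location_id  STRING,
    terminal_id  STRING,
    temperature  DOUBLE,
    humidity     DOUBLE,
    collect_time DATETIME
)
partitioned by (region STRING, ds STRING)
lifecycle 365;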
Special features
Lifecycle of MaxCompute tables
MaxCompute tables and partitions support data lifecycle management. If a table (partition) has not changed for the specified period, counted from its last update time, MaxCompute reclaims it automatically. This specified period is the lifecycle, and it is set at the table level.
create table test_lifecycle (key STRING) lifecycle 100;
alter table test_lifecycle set lifecycle 50;
MaxCompute decides whether to reclaim a non-partitioned table, or a partition of a partitioned table, based on its LastDataModifiedTime and its lifecycle setting. MaxCompute SQL provides a touch operation that sets a partition's LastDataModifiedTime to the current time; MaxCompute then considers the table or partition data to have changed, and the lifecycle countdown starts over.
ALTER TABLE table_name TOUCH PARTITION (partition_col='partition_col_value', ...);
Note:
Plan table lifecycles reasonably and set the lifecycle when the table is created; this can effectively reduce storage pressure.
Any change to the table data, including the merging of small files, resets the point in time from which lifecycle reclamation is judged. A usage sketch follows.
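A usage sketch combining the two statements above, with hypothetical names: a partitioned table whose partitions are reclaimed 90 days after their last modification, and a touch that restarts the clock for one partition that should be kept longer:

create table ods_visit_log (uid STRING, url STRING)
partitioned by (ds STRING)
lifecycle 90;

-- refresh LastDataModifiedTime so this partition starts a new lifecycle period
alter table ods_visit_log touch partition (ds='20140801');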
Avoid full table scanning
Table design:
Create partitioned tables, or design the columns around the conditions used for scanning.
Partition data tables reasonably, and turn commonly used query conditions into column names.
Apply hash clustering to columns that are commonly used as query conditions.
Data calculation:
Add partition filter conditions, reduce the number of partitions scanned, or split out a small intermediate table and scan that table's historical partitions to reduce the amount of data scanned.
Store the intermediate results of full-table scans in an intermediate table.
Scanning a whole year of partitions every day consumes a lot of compute. It is recommended to build an intermediate table that is aggregated once per day and then scan a year of that intermediate table's partitions; the amount of data scanned is greatly reduced. A sketch follows.
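A sketch of this pattern with hypothetical names: the detail table ods_pay_detail is summarized once per day into dws_pay_daily, and the yearly analysis then scans only the small intermediate table, with a partition filter so only the needed partitions are read:

-- daily aggregation into the intermediate table
insert overwrite table dws_pay_daily partition (ds='20140801')
select user_id, sum(amount) as pay_amount
from ods_pay_detail
where ds = '20140801'
group by user_id;

-- yearly analysis scans the much smaller intermediate table
select user_id, sum(pay_amount) as pay_amount_1y
from dws_pay_daily
where ds between '20140101' and '20141231'
group by user_id;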
Avoid small files
Small files generated during the reduce phase of computation: simply insert overwrite the source table (or partition), or write the result to a new table and then delete the source table.
For small files generated during Tunnel data collection, the recommendations are:
When calling the Tunnel SDK, commit once the buffer reaches 64 MB.
When using the console, avoid uploading small files frequently; accumulate data and upload it in one batch once it is large. If importing into a partitioned table, set a lifecycle on the partitions so that expired, unused data is cleaned up automatically.
As in the first approach, insert overwrite the source table (or partition).
Use the ALTER merge mode to merge small files with a console command (see the sketch after this list).
It is recommended to set a lifecycle when creating temporary tables, so that expired tables are garbage-collected automatically. Applying for too many DataHub shards also causes small-file problems. The strategy for choosing the number of DataHub shards:
The default throughput of a single shard is 1 MB/s; allocate the actual number of shards based on this (a few extra can be added on top).
The synchronization logic to ODPS (MaxCompute) gives each shard a separate task, which commits every 5 minutes or every 64 MB. The default of 5 minutes is so that the data shows up in MaxCompute as soon as possible. If partitions are created on an hourly basis, each shard produces 12 files per hour.
If the data volume is small at that rate but there are many shards, there will be many small files in MaxCompute (number of shards × 12 per hour).
Do not allocate too many shards; allocate them as needed.
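For the ALTER merge mode mentioned above, a minimal sketch with a hypothetical table and partition; my understanding is that MaxCompute exposes this as a MERGE SMALLFILES statement:

alter table ods_app_log partition (ds='20140801') merge smallfiles;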
Converting to a Hash Clustering table
Advantages of hash clustering tables: optimized bucket pruning, optimized aggregation, and optimized storage. When creating a table, use CLUSTERED BY to specify the hash key; MaxCompute hashes the specified columns and distributes the records into buckets according to the hash values.
Principles for choosing the hash key:
Choose columns with few duplicate key values.
SORTED BY is used to specify how fields are sorted within a bucket.
How to convert to a hash clustering table:
ALTER TABLE table_name [CLUSTERED BY (col_name [, col_name, ...]) [SORTED BY (col_name [ASC | DESC] [, col_name [ASC | DESC], ...])] INTO number_of_buckets BUCKETS];
The ALTER TABLE statement applies to existing tables; after the clustering attributes are added, newly created partitions are stored with hash clustering. After creating a hash clustering table, use insert overwrite to convert the data from the original source table.
Note that the Hash Clustering table has the following restrictions:
Insert into is not supported; data can only be added with insert overwrite.
Uploading directly into a range clustering table with Tunnel is not supported, because data uploaded through Tunnel is unordered. A creation and loading sketch follows.
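A sketch of creating and loading a hash clustering table under these restrictions, with hypothetical names (t_orders as the source, t_orders_hc as the clustered copy, user_id as the hash key, 1024 buckets chosen arbitrarily):

create table t_orders_hc (
    order_id STRING,
    user_id  STRING,
    amount   DOUBLE
)
partitioned by (ds STRING)
clustered by (user_id) sorted by (user_id) into 1024 buckets;

-- load with insert overwrite, since insert into and Tunnel upload are not supported
insert overwrite table t_orders_hc partition (ds='20140801')
select order_id, user_id, amount
from t_orders
where ds = '20140801';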
This concludes the discussion of MaxCompute table design methods. Combining theory with practice is the best way to learn, so give these techniques a try.