
MaxCompute Tunnel: Technical Principles and Development Practice


This article introduces the technical principles of MaxCompute Tunnel and the development practices built on it. The content is fairly detailed; interested readers may find it a useful reference.

I. The Technical Principles of MaxCompute Tunnel

The diagram above shows the architecture. Externally, MaxCompute provides a unified SDK, which is integrated into all externally facing services. On the server side, the services can be roughly divided into an API layer and an execution layer. The API layer contains two clusters: the Frontend cluster handles control-flow access, and the Tunnel cluster handles data. The execution layer is divided into a control cluster and compute clusters: the control cluster is responsible for resource management, metadata management, and permission management, while the compute clusters are responsible for the actual computation and storage.

As you can see, Tunnel is the component of the API layer responsible for uploading and downloading data. It is designed this way because MaxCompute is a big data system: running a SQL statement on MaxCompute actually sends a control instruction, and the target scenario is queries over very large data volumes, such as billions of rows. Synchronizing data at that scale cannot go through row-by-row insert into, as it would in a traditional database such as MySQL: insert into passes through the control cluster, which wastes its resources and runs into various restrictions, and writing one row at a time is very inefficient. The control flow and the data flow were therefore designed as separate paths.

The Tunnel cluster backs the Tunnel API exposed at the SDK layer and lets users access data in a structured way. Tunnel is also the only data interface exposed externally: it checks the format and permissions of the data users write and enforces data security. At the same time, it guarantees that data written through Tunnel can be read by SQL, so you need not worry, for example, that SQL cannot read the written data, or that the written data differs from the values SQL reads.

In addition, Tunnel accesses the storage layer directly. The underlying storage of MaxCompute is a distributed file system, and Tunnel reads and writes that file system directly, which guarantees performance. In other words, Tunnel can ideally sustain about 10 MB/s of throughput per concurrent connection, and overall throughput can be scaled out horizontally by adding concurrency.

II. The Rich Ecosystem of MaxCompute Tunnel

MaxCompute has a very rich ecosystem. Before writing any code, it is recommended to look at what tools or services already exist, and to prefer mature services over building your own.

The official SDKs are the Java SDK and the Python SDK.

In addition, three official tools are provided. The MaxCompute client is a command-line tool that, for data synchronization, lets users upload a local file to MaxCompute or download a table to a local file. MaxCompute Studio is an IntelliJ IDEA plug-in that also supports file uploads and downloads. The MMA 2.0 migration tool is a recently launched tool that helps users migrate data from an existing big data system to MaxCompute. All of these tools are built on the SDK and transfer data through it.

Beyond tools, MaxCompute is also integrated with third-party services, such as the data channel services on the cloud: SLS (Alibaba Cloud's log service) and DataHub (the data channel service) both support delivery to MaxCompute natively, and Kafka has an official plug-in as well.

For stream computing, Blink and Spark also have MaxCompute synchronization plug-ins. For data synchronization services, DataWorks data synchronization, both real-time and offline, supports synchronizing to MaxCompute.

To sum up, if you need data synchronization, first check whether existing services meet the requirement. If they do not and you want to develop your own, the following sections cover what the SDK can do and some caveats in using it.

III. A Brief Introduction to Tunnel Functionality

The figure above is a table of Tunnel's overall functionality. There are now two sets of APIs: the batch data channel and the streaming data channel.

The batch data channel targets scenarios with very high per-connection throughput: transferring a large amount of data one batch at a time. QPS and concurrency cannot be particularly high, but the throughput of a single concurrent connection can be very large, and the API is optimized accordingly.

The streaming data channel is a newer service, built because most upstream services now feed data in as streams: a single connection may not carry much traffic, but the data arrives in relatively small fragments. Using the batch data channel in this situation runs into many restrictions, the most obvious being the small-files problem: writing highly fragmented data through the batch channel produces a large number of small files, which makes SQL queries very slow and Tunnel downloads very slow. For this scenario the platform provides the streaming data channel service, which can write data piece by piece, even one row at a time, with no need to worry about small files or concurrency: streaming concurrency is unlimited, whereas batch concurrency is limited.

As the table shows, the following resources can be accessed through Tunnel: regular tables, Hash Clustered tables, Range Clustered tables, and Transactional tables, and finally query results, which can be downloaded. Regular tables support both upload and download. Hash Clustered and Range Clustered tables are not suitable for Tunnel writes, because their data must be sorted when stored, and the Tunnel cluster, which is much smaller than the compute clusters, does not have the capacity to sort it. The classic pattern for such tables is therefore to write a regular table first and then run an insert overwrite in SQL to generate the Hash Clustered or Range Clustered table, as sketched below.
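That staging pattern can be driven entirely from the Java SDK. Below is a minimal sketch, assuming hypothetical project and table names, that runs the insert overwrite step with SQLTask after the regular table has been filled through Tunnel:

    import com.aliyun.odps.Instance;
    import com.aliyun.odps.Odps;
    import com.aliyun.odps.OdpsException;
    import com.aliyun.odps.account.AliyunAccount;
    import com.aliyun.odps.task.SQLTask;

    public class ClusteredTableLoad {
        public static void main(String[] args) throws OdpsException {
            Odps odps = new Odps(new AliyunAccount("<accessId>", "<accessKey>"));
            odps.setEndpoint("<odps endpoint>");
            odps.setDefaultProject("my_project"); // hypothetical project

            // Step 1 (not shown): upload into my_staging_table via Tunnel.
            // Step 2: let the compute cluster sort the data into the clustered table.
            Instance i = SQLTask.run(odps,
                "INSERT OVERWRITE TABLE my_hash_clustered_table " // hypothetical tables
                + "SELECT * FROM my_staging_table;");
            i.waitForSuccess(); // block until the SQL job finishes
        }
    }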

Streaming upload has been upgraded architecturally and now has an asynchronous processing mechanism that post-processes user-written data in the background, so Hash Clustered tables will be supported later.

As for Transactional tables: MaxCompute's regular tables do not support update or delete, but the system recently added this syntax to SQL, so users can update and delete with transactional semantics. The batch upload API now supports Transactional tables, but only append, that is, insert into; update is not possible through Tunnel's API. Streaming support is being planned and may eventually include update; batch will probably never add update, but it can already append to Transactional tables.

Query results: when you run a SQL statement, the odpscmd client or DataWorks limits the displayed results to 10,000 rows. Downloading query results through Tunnel has no such row limit, so the complete result set can be downloaded locally.

All in all, with the SDK you can implement every function listed in the table.

IV. How to Use the SDK

1) Basic configuration

Whatever you develop, whether batch upload, download, or streaming upload, the configuration is the same. First you create an ODPS object and a TableTunnel object. The ODPS object is the same one you would create to run SQL with the SDK; TableTunnel is the entry class of Tunnel, from which all Tunnel functionality is initiated.

Now the specific configuration items; the left side of the figure lists the key ones. Access ID and Access Key are the account credentials with which Alibaba Cloud identifies an account.

ODPS Endpoint is the entrance to the service. There are now 21 regions on the public cloud, including the finance cloud and the government cloud: 7 in China and 14 overseas. Each region's endpoint is different, so find the region you purchased and enter its endpoint correctly.

Tunnel Endpoint is optional. If it is left empty, the system automatically routes to the Tunnel endpoint corresponding to the ODPS endpoint you filled in. The network environment on the public cloud is relatively complex, split into public-network domain names and private-network domain names, with private ones further split into the classic network and VPC. There can be scenarios in which the routed endpoint is not reachable from your network, so the platform provides this setting to let users enter a reachable Tunnel endpoint, which then takes priority over automatic routing. In 99% of cases, however, it is not needed.

The Default Project parameter is used frequently. MaxCompute's permission management is rich; for example, if you have multiple projects on the public cloud and want to control the flow of data across projects, you can configure this parameter. The Default Project set on the ODPS object can be understood as the originating project, while the project passed to the session-creation calls below is the project whose data is accessed. If the two differ, the system checks whether the user may access the target project's data; if you run SQL, resource usage is also accounted against the originating project. If you have only one project, simply fill in the same value for both.

Generally speaking, Access ID, Access Key, and ODPS Endpoint are required; Tunnel Endpoint is optional; and with a single project, just fill in the same Default Project everywhere.
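As a minimal sketch of this configuration with the Java SDK (all credential, endpoint, and project values below are placeholders):

    import com.aliyun.odps.Odps;
    import com.aliyun.odps.account.Account;
    import com.aliyun.odps.account.AliyunAccount;
    import com.aliyun.odps.tunnel.TableTunnel;

    public class TunnelConfig {
        public static TableTunnel newTunnel() {
            // Access ID / Access Key identify the Alibaba Cloud account.
            Account account = new AliyunAccount("<accessId>", "<accessKey>");

            Odps odps = new Odps(account);
            // ODPS Endpoint: use the one for the region you purchased.
            odps.setEndpoint("<odps endpoint of your region>");
            // Default Project: the originating project for permission checks and billing.
            odps.setDefaultProject("my_project"); // hypothetical project name

            // TableTunnel is the entry class from which all Tunnel functionality starts.
            TableTunnel tunnel = new TableTunnel(odps);
            // Optional: set only if the automatically routed Tunnel endpoint is unreachable.
            // tunnel.setEndpoint("<tunnel endpoint>");
            return tunnel;
        }
    }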

2) The specific upload APIs

Next are the specific upload interfaces, starting with batch upload.

[batch upload]

As the figure shows, the flow of batch upload is: create an upload session (line 31 in the figure), open a writer, write the data with the writer, close it, and then commit the upload session.

An upload session can be understood as a session object, similar in concept to a transaction. The upload is in units of the upload session: the data becomes visible only once the upload session commits successfully. Multiple writers can be opened within one upload session and can upload concurrently, but writers are stateful and must each be assigned a distinct block ID to avoid overwriting one another. The upload session is stateful as well: before commit, nothing is visible; once commit succeeds, the session is over and no more writers can be opened on it. The writer works by sending an HTTP request to the server when it is opened and then holding the connection open; as data is written it is streamed to the server in real time, into a temporary directory. Given this mechanism, if the writer or its close fails, the long connection is broken, so the writer and close interfaces cannot be retried: if any phase of a writer fails, the block has to be rewritten.

Besides the plain commit, MaxCompute also lets users verify the correctness of the data: for example, if you opened five writers, you can pass those five block IDs as a checklist at commit time; if the server side is inconsistent with the checklist, commit reports an error. A sketch of the whole flow follows.
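A minimal batch-upload sketch with the Java SDK; the project, table, and column names are hypothetical, and error handling is elided:

    import com.aliyun.odps.data.Record;
    import com.aliyun.odps.data.RecordWriter;
    import com.aliyun.odps.tunnel.TableTunnel;

    public class BatchUpload {
        public static void upload(TableTunnel tunnel) throws Exception {
            // One session per batch; data becomes visible only after commit succeeds.
            TableTunnel.UploadSession session =
                tunnel.createUploadSession("my_project", "my_table"); // hypothetical names

            long blockId = 0; // must be unique per writer within the session
            RecordWriter writer = session.openRecordWriter(blockId);
            Record record = session.newRecord();
            record.setString("name", "tunnel"); // hypothetical columns
            record.setBigint("value", 1L);
            writer.write(record);
            writer.close(); // not retryable: if writing fails, rewrite the whole block

            // Passing the block ID list asks the server to verify it as a checklist.
            session.commit(new Long[] { blockId });
        }
    }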

To sum up, the basic functional points are:

Batch upload supports concurrency, but it is stateful.

The data is not visible until commit is successful.

Both insert overwrite and insert into semantics are supported.

Insert overwrite means that at commit time the data of an upload session can directly overwrite an entire partition or table, similar to SQL's insert overwrite; a sketch follows.
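As a hedged sketch, recent versions of the Java SDK express the overwrite semantics as a flag when the session is created; treat the exact overload as an assumption to verify against your SDK version:

    // A successful commit of this session replaces the table/partition contents.
    TableTunnel.UploadSession session =
        tunnel.createUploadSession("my_project", "my_table", /* overwrite */ true);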

There are also some restrictions on using this feature.

First, an upload session cannot exceed 20,000 blocks.

Second, reusing a block ID overwrites the data previously written under that ID.

Third, an upload session expires after 24 hours. Writer data is written into a temporary directory, and temporary data has a recycling cycle: data written by a writer may be recycled after 24 hours, which bounds the lifetime of the upload session.

Fourth, if a writer is opened but no data is written, it occupies an idle connection, and the server will close the connection directly.

[streaming upload]

Next, the streaming upload interface. As mentioned earlier, streaming upload simplifies the API and removes the concurrency and time restrictions.

As the figure shows, the entry point is createStreamUploadSession, and the data writer changes from a writer to a RecordPack. A pack is essentially an in-memory buffer: you append rows with pack.append(record) and, for example, once the buffer is large enough or holds enough rows, you flush it (lines 42 to 44 in the figure). The pack writes to memory, not to the network, so unlike the writer, flush can be retried, because the data is still in memory. The pack also has no state, so there is no block ID or the like to manage. And because data becomes visible as soon as flush succeeds, the session has no commit step. If you are building a distributed service, this is therefore much simpler than batch upload: there are few restrictions, you just need enough local memory.
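A minimal streaming-upload sketch, assuming the builder-style entry point of recent Java SDK versions (project, table, and column names are hypothetical):

    import com.aliyun.odps.data.Record;
    import com.aliyun.odps.tunnel.TableTunnel;

    public class StreamUpload {
        public static void upload(TableTunnel tunnel) throws Exception {
            // No commit step: data becomes visible as soon as flush succeeds.
            TableTunnel.StreamUploadSession session =
                tunnel.buildStreamUploadSession("my_project", "my_table").build();

            // A pack is an in-memory buffer: stateless, no block ID to manage.
            TableTunnel.StreamRecordPack pack = session.newRecordPack();
            for (int i = 0; i < 1000; i++) {
                Record record = session.newRecord();
                record.setBigint("value", (long) i); // hypothetical column
                pack.append(record); // writes to local memory, not the network
            }
            // flush sends the buffer to the server; it can safely be retried,
            // and the same pack can be reused afterwards.
            pack.flush();
        }
    }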

The system also supports memory reuse: a pack can be reused after flush, and the system keeps the buffer that was just filled to avoid generating GC pressure. Streaming upload only supports insert into; there is no other interface in the API, so insert overwrite semantics are not supported for now. In addition, the streaming service supports asynchronous data processing: besides guaranteeing that data written through streaming is readable, the server can distinguish newly written data from existing data and post-process the new data asynchronously, for example zorder-by sorting and compaction.

Zorder-by sorting is a data-organization feature: MaxCompute can reorganize the data according to certain rules so that queries become very efficient. Compaction means the backend can rewrite data, reorganizing very fragmented data into storage-efficient data files; on top of that it can also do sorting and other processing, and more functions will be added later.

There are also some restrictions on streaming upload. First, while a stream is writing, the system locks the table: other write operations, such as insert into and insert overwrite, will fail, and you must stop the streaming before they can run normally. Second, DDL has some latency: if you drop or rename the table, a few writes may still succeed for up to 60 seconds after the drop. If you have such a scenario, stop the streaming before dropping or renaming.

[batch download]

Next, the batch download interface.

As the figure shows, TableTunnel creates an object called a downloadSession. From it you can get the record count, the total number of rows of a partition or table, and then open a reader, mirroring batch upload: reader corresponds to writer, and downloadSession to uploadSession. openRecordReader addresses data by record range: with 1,000 rows, for example, ten concurrent readers can download 100 rows each. Download supports column pruning, that is, downloading only some of the columns. To download query results, change the entry class from TableTunnel to InstanceTunnel, and pass (line 53 in the figure) not a project and table but an instance ID. A sketch of both cases follows.
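A minimal download sketch for both cases, with hypothetical project, table, and instance values:

    import com.aliyun.odps.Odps;
    import com.aliyun.odps.data.Record;
    import com.aliyun.odps.tunnel.InstanceTunnel;
    import com.aliyun.odps.tunnel.TableTunnel;
    import com.aliyun.odps.tunnel.io.TunnelRecordReader;

    public class BatchDownload {
        public static void download(Odps odps, TableTunnel tunnel) throws Exception {
            // Table download: session -> record count -> reader over a row range.
            TableTunnel.DownloadSession session =
                tunnel.createDownloadSession("my_project", "my_table"); // hypothetical
            long count = session.getRecordCount(); // total rows in the table/partition
            // Read rows [0, count); split the range across readers to parallelize.
            TunnelRecordReader reader = session.openRecordReader(0, count);
            Record record;
            while ((record = reader.read()) != null) {
                // process the record...
            }
            reader.close();

            // Query-result download: same pattern, keyed by an instance ID instead.
            InstanceTunnel instanceTunnel = new InstanceTunnel(odps);
            InstanceTunnel.DownloadSession resultSession =
                instanceTunnel.createDownloadSession("my_project", "<instanceId>");
            TunnelRecordReader resultReader =
                resultSession.openRecordReader(0, resultSession.getRecordCount());
            resultReader.close();
        }
    }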

The usage restrictions are similar to batch upload: a downloadSession is limited to 24 hours, because it also relies on temporary files. Likewise, idle connections time out after 120 seconds, there is a project-level concurrency throttle, and fragmented files degrade performance.

V. Best Practices

If concurrency is high, the batch API is not recommended, because the concurrency limit is project-wide: once the upload or download quota is exhausted, batch uploads for the entire project will fail.

With the batch interface, it is recommended to reduce concurrency and make full use of the roughly 10 MB/s throughput of each concurrent connection. Streaming is, for architectural reasons, not restricted by concurrency. High QPS is likewise not recommended for batch upload: because of the small-files problem, do not write data at very high QPS with the batch API. If neither QPS nor concurrency is high, all three methods can be used without much restriction.

A few other scenarios: Transactional tables are now supported by batch upload, and streaming upload will follow. Streaming upload currently does not support insert overwrite, and it may never be added, because that scenario is clearly batch semantics.

That concludes this share on the technical principles and development practice of MaxCompute Tunnel; hopefully the content above is of some help.
