This article mainly introduces the common questions about Tunnel. Many people run into these questions in day-to-day work, so the editor has consulted a range of materials and put together simple, practical answers. We hope it helps you resolve your own doubts about Tunnel. Now, follow along and let's get started!
Basic introduction and application scenarios
Tunnel is the offline batch data channel service of MaxCompute; it provides bulk upload and download of offline data.
It is intended only for scenarios where each batch is at least 64 MB of data. For small-batch streaming data scenarios, use the DataHub real-time data channel instead for better performance and experience.
SDK upload best practices

import java.io.IOException;
import java.util.Date;
import com.aliyun.odps.Column;
import com.aliyun.odps.Odps;
import com.aliyun.odps.PartitionSpec;
import com.aliyun.odps.TableSchema;
import com.aliyun.odps.account.Account;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;
import com.aliyun.odps.tunnel.TunnelException;
import com.aliyun.odps.tunnel.TableTunnel.UploadSession;

public class UploadSample {
    private static String accessId = "";
    private static String accessKey = "";
    private static String odpsUrl = "http://service.odps.aliyun.com/api";
    private static String project = "";
    private static String table = "";
    private static String partition = "";

    public static void main(String[] args) {
        // Preparation
        Account account = new AliyunAccount(accessId, accessKey);
        Odps odps = new Odps(account);
        odps.setEndpoint(odpsUrl);
        odps.setDefaultProject(project);
        TableTunnel tunnel = new TableTunnel(odps);
        try {
            // Determine the partition to write to
            PartitionSpec partitionSpec = new PartitionSpec(partition);
            // Create a session on the server. It is valid for 24 hours, and within
            // those 24 hours it can upload a total of 20000 blocks. Creating a
            // session takes seconds, consumes server resources, and creates
            // temporary directories -- it is a heavy operation, so it is strongly
            // recommended to upload all data for the same partition through one
            // session whenever possible.
            UploadSession uploadSession = tunnel.createUploadSession(project, table, partitionSpec);
            System.out.println("Session Status is:" + uploadSession.getStatus().toString());
            TableSchema schema = uploadSession.getSchema();
            // Once the data is ready, open a Writer and write one block. Each
            // block can be uploaded successfully only once and must not be
            // re-uploaded; a successful close() means the block's upload is
            // complete, and a failed block can be uploaded again. At most 20000
            // block IDs (0-19999) are allowed in one session; beyond that, commit
            // the session and create a new one, and so on.
            // Writing too little data into a single block creates many small
            // files and seriously hurts compute performance. It is strongly
            // recommended to write more than 64 MB per block (up to 100 GB of
            // data may go into the same block).
            // The total can be estimated from the average record size and the
            // record count: 64 MB < average record size * record count < 100 GB.
            // The server caps maxBlockID at 20000. You may use a fixed number of
            // blocks per session (say 100) to fit your business, but the more
            // blocks per session the better, because creating a session is a very
            // heavy operation: creating one and then uploading only a little data
            // not only causes small files and empty directories but also severely
            // hurts overall upload performance (session creation takes seconds,
            // while the actual upload may take only tens of milliseconds).
            int maxBlockID = 20000;
            for (int blockId = 0; blockId < maxBlockID; blockId++) {
                // Prepare at least 64 MB of data before writing, for example by
                // reading several files or querying a database.
                try {
                    // Open a Writer on this block. After the writer is created,
                    // if no 4 KB of data is written within any 2 consecutive
                    // minutes, the connection times out and is closed, so prepare
                    // the data in memory before creating the writer.
                    RecordWriter recordWriter = uploadSession.openRecordWriter(blockId);
                    // Convert all the source data into Tunnel Record format and write it
                    int recordNumber = 1000000;
                    for (int index = 0; index < recordNumber; index++) {
                        // Convert the index-th source record into an ODPS record
                        Record record = uploadSession.newRecord();
                        for (int i = 0; i < schema.getColumns().size(); i++) {
                            Column column = schema.getColumn(i);
                            switch (column.getType()) {
                                case BIGINT:
                                    record.setBigint(i, 1L);
                                    break;
                                case BOOLEAN:
                                    record.setBoolean(i, true);
                                    break;
                                case DATETIME:
                                    record.setDatetime(i, new Date());
                                    break;
                                case DOUBLE:
                                    record.setDouble(i, 0.0);
                                    break;
                                case STRING:
                                    record.setString(i, "sample");
                                    break;
                                default:
                                    throw new RuntimeException("Unknown column type: " + column.getType());
                            }
                        }
                        // Write this record to the server. A network transfer
                        // happens for every 4 KB written; if there is no network
                        // transfer for 120 s the server closes the connection,
                        // the Writer becomes unusable, and the block must be
                        // written again.
                        recordWriter.write(record);
                    }
                    // A successful close() means this block was uploaded, but
                    // until the whole session is committed the data sits in an
                    // ODPS temporary directory and is not visible.
                    recordWriter.close();
                } catch (TunnelException e) {
                    // Retrying a bounded number of times is recommended
                    e.printStackTrace();
                    System.out.println("write failed:" + e.getMessage());
                } catch (IOException e) {
                    // Retrying a bounded number of times is recommended
                    e.printStackTrace();
                    System.out.println("write failed:" + e.getMessage());
                }
            }
            // Commit all blocks. uploadSession.getBlockList() lets you specify
            // which blocks to commit. Only after a successful commit is the data
            // formally written into the ODPS partition; on commit failure,
            // retrying up to 10 times is recommended.
            for (int retry = 0; retry < 10; ++retry) {
                try {
                    // A second-level operation that formally commits the data
                    uploadSession.commit(uploadSession.getBlockList());
                    break;
                } catch (TunnelException e) {
                    System.out.println("uploadSession commit failed:" + e.getMessage());
                } catch (IOException e) {
                    System.out.println("uploadSession commit failed:" + e.getMessage());
                }
            }
            System.out.println("upload success!");
        } catch (TunnelException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Constructor example:
PartitionSpec(String spec): constructs the object from a string.
Parameter: spec: a partition definition string, for example pt='1',ds='2'.
The program should therefore be configured as: private static String partition = "pt='XXX',ds='XXX'";

Common questions

What is MaxCompute Tunnel?
Tunnel is MaxCompute's data channel; through it users upload data to or download data from MaxCompute. Currently Tunnel supports uploading and downloading table data only (views are not supported).

Can a BlockId be reused?
BlockIds must not repeat within the same UploadSession. That is, for a given UploadSession, once a RecordWriter has been opened with a blockId, a batch of data written, close() called, and the commit completed successfully, that blockId cannot be reused to open another RecordWriter and write more data. At most 20000 blocks are allowed per session, i.e. 0-19999.

Is there a limit on block size?
A single block may hold at most 100 GB, and more than 64 MB per block is strongly recommended. Each block corresponds to one file; files smaller than 64 MB are called small files, and too many small files degrade performance. The newer BufferedWriter makes uploading simpler and avoids problems such as small files: see Tunnel-SDK-BufferedWriter.

Can a Session be shared? Does it have a lifecycle?
Each Session has a 24-hour lifecycle on the server and can be used at any time within 24 hours of creation, including shared across processes and threads, as long as no blockId is ever reused. A distributed upload can follow these steps: create the Session -> estimate the data volume -> assign blocks (for example, thread 1 uses 0-100, thread 2 uses 100-200) -> prepare the data -> upload the data -> commit once all blocks have been written successfully.
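To share one Session across processes (the "assign blocks" step above), the second process should not call createUploadSession again. Here is a minimal sketch, assuming the odps-sdk overload TableTunnel.getUploadSession(project, table, partitionSpec, sessionId) for reattaching to an existing session by ID; the block range is illustrative, and tunnel, project, table and partitionSpec are the variables from the example above:

// Coordinator: create the session once and distribute its ID to the workers.
UploadSession created = tunnel.createUploadSession(project, table, partitionSpec);
String sessionId = created.getId();  // hand this to each worker out of band

// Worker (possibly another process): reattach to the same session by ID.
UploadSession shared = tunnel.getUploadSession(project, table, partitionSpec, sessionId);
for (int blockId = 100; blockId < 200; blockId++) {  // this worker's assigned range
    RecordWriter writer = shared.openRecordWriter(blockId);
    // ... write at least 64 MB of records, as in the example above ...
    writer.close();  // block uploaded; still invisible until the commit
}

// Coordinator, after every worker has reported success:
created.commit(created.getBlockList());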
Does a Session consume system resources if it is created but never used?
Each Session creates two file directories on the server at creation time. Creating many Sessions without using them inflates the number of temporary directories, and a large backlog of them burdens the system. Be sure to avoid this pattern and share and reuse Sessions as much as possible.
What should I do about Write/Read timeouts or an IOException?
When uploading, every 8 KB the Writer sends triggers a network action. If there is no network action for 120 seconds, the server actively closes the connection; the Writer then becomes unusable, so open a new Writer and write the block again.
We recommend uploading through the [Tunnel-SDK-BufferedWriter] interface: it hides the blockId details from the user, buffers data internally, and retries automatically on failure (see the sketch after this answer).
Downloading has a similar mechanism on the Reader side: if there is no network I/O for an extended period, the connection is closed. Keep the read loop running continuously and do not intersperse calls to other systems between reads.
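For illustration, here is a minimal sketch of the BufferedWriter path mentioned above, assuming the odps-sdk methods UploadSession.openBufferedWriter() and the no-argument UploadSession.commit(); the variables are reused from the upload example and the single STRING column is illustrative:

UploadSession session = tunnel.createUploadSession(project, table, partitionSpec);
// openBufferedWriter hides blockId management: records are buffered internally,
// flushed in large chunks, and failed flushes are retried automatically.
RecordWriter writer = session.openBufferedWriter();
for (int i = 0; i < 1000000; i++) {
    Record record = session.newRecord();
    record.setString(0, "sample");  // assumes column 0 is a STRING column
    writer.write(record);
}
writer.close();
session.commit();  // no block list needed in the buffered path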
Which languages does the MaxCompute Tunnel SDK currently support?
MaxCompute Tunnel currently provides a Java SDK.
Does MaxCompute Tunnel support multiple clients uploading to the same table at the same time?
Yes, it does.
Is MaxCompute Tunnel suitable for batch upload or streaming upload?
MaxCompute Tunnel is for batch upload and is not suitable for streaming upload. For streaming, use the [DataHub high-speed streaming data channel], which writes with millisecond-level latency.
Must the partition already exist when MaxCompute Tunnel uploads data?
Yes. Tunnel does not create partitions automatically.
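Create the partition before opening the upload session. A short sketch, assuming the odps-sdk methods Table.hasPartition and Table.createPartition (variables reused from the upload example; the partition values are placeholders):

com.aliyun.odps.Table t = odps.tables().get(table);
PartitionSpec spec = new PartitionSpec("pt='XXX',ds='XXX'");
if (!t.hasPartition(spec)) {
    t.createPartition(spec);  // equivalent to ALTER TABLE ... ADD PARTITION (...)
}
UploadSession session = tunnel.createUploadSession(project, table, spec);  // now safe to upload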
What is the relationship between Dship and MaxCompute Tunnel?
Dship is a tool that uploads and downloads data through MaxCompute Tunnel.
Does a Tunnel upload append to or overwrite the existing data?
It appends.
What is the Tunnel routing function?
The routing function means the Tunnel SDK can obtain the Tunnel endpoint by querying MaxCompute, so the SDK works properly with only the MaxCompute endpoint configured.
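Concretely, this means a client normally configures no tunnel endpoint at all. A sketch of both modes, assuming the odps-sdk method TableTunnel.setEndpoint for the explicit override (the override URL is a placeholder):

Account account = new AliyunAccount(accessId, accessKey);
Odps odps = new Odps(account);
odps.setEndpoint("http://service.odps.aliyun.com/api");  // only the MaxCompute endpoint
odps.setDefaultProject(project);
TableTunnel tunnel = new TableTunnel(odps);
// With no tunnel endpoint set, the SDK asks MaxCompute for the Tunnel endpoint
// automatically -- this is the routing function.
// To bypass routing, pin the endpoint explicitly (placeholder URL):
// tunnel.setEndpoint("http://dt.your-region.example.com");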
When uploading data with MaxCompute Tunnel, how much data per block is appropriate?
There is no absolute optimal answer; you must weigh the network conditions, latency requirements, how the data will be used, small-file pressure on the cluster, and other factors. As a rule of thumb, if the data volume is large and uploaded continuously, keep each block in the 64 MB to 256 MB range.
For a batch job that uploads once a day, a block of about 1 GB is reasonable (see the sketch below for the record-count arithmetic).
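To translate those sizes into record counts, here is a back-of-envelope sketch applying the 64 MB < average record size * record count < 100 GB rule from the code comments above (the 1 KB average record size is illustrative):

long avgRecordBytes = 1024L;                      // illustrative measured average
long minBlockBytes = 64L * 1024 * 1024;           // 64 MB lower bound per block
long maxBlockBytes = 100L * 1024 * 1024 * 1024;   // 100 GB upper bound per block
long minRecords = minBlockBytes / avgRecordBytes; // 65,536 records
long maxRecords = maxBlockBytes / avgRecordBytes; // 104,857,600 records
System.out.println("records per block: " + minRecords + " to " + maxRecords);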
Downloading with MaxCompute Tunnel always times out
This is usually an endpoint error; check the Endpoint configuration. A simple way to judge is to check network connectivity with telnet or a similar tool.
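A programmatic equivalent of the telnet check is a plain TCP connect; this self-contained sketch uses java.net.Socket, with the host and port as placeholders for your configured endpoint:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class EndpointCheck {
    public static void main(String[] args) {
        String host = "service.odps.aliyun.com";  // host part of your endpoint
        int port = 80;                            // use 443 for an https endpoint
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5000);  // 5 s timeout
            System.out.println("endpoint reachable");
        } catch (IOException e) {
            System.out.println("endpoint unreachable: " + e.getMessage());
        }
    }
}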
Downloading via MaxCompute Tunnel throws the exception: You have NO privilege 'odps:Select' on {acs:odps:*:projects/XXX/tables/XXX}. project 'XXX' is protected
The project has data protection enabled. Moving data from one project to another requires the project owner to perform the operation.
Tunnel upload throws the exception: ErrorCode=FlowExceeded, ErrorMessage=Your flow quota is exceeded.
Tunnel limits the concurrency of requests. By default, the concurrent quota for uploads and downloads is 2000, and each request occupies one quota unit from start to finish. If you hit this error, there are several suggested remedies:
1. Sleep and retry (see the sketch after this list).
2. Increase the project's Tunnel concurrency quota; contact the administrator to assess the traffic pressure.
3. Report it to the project owner to investigate who is occupying a large share of the concurrency quota and bring it under control.
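Here is a sketch of remedy 1, sleep-and-retry with exponential backoff around session creation. It assumes TunnelException exposes the error code via getErrorCode() and that the code matches the FlowExceeded string from the error message:

static UploadSession createWithRetry(TableTunnel tunnel, String project,
                                     String table, PartitionSpec spec) throws TunnelException {
    long backoffMillis = 1000;
    for (int attempt = 0; attempt < 5; attempt++) {
        try {
            return tunnel.createUploadSession(project, table, spec);
        } catch (TunnelException e) {
            // Give up on unrelated errors or once the retry budget is spent
            if (!"FlowExceeded".equals(e.getErrorCode()) || attempt == 4) {
                throw e;
            }
            try {
                Thread.sleep(backoffMillis);  // wait for quota units to free up
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw e;
            }
            backoffMillis *= 2;  // exponential backoff: 1 s, 2 s, 4 s, 8 s
        }
    }
    throw new IllegalStateException("unreachable");
}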
This concludes our look at the common questions about Tunnel. We hope it has cleared up your doubts; theory works best when paired with practice, so go and try it out! If you want to keep learning, stay tuned to this site, where the editor will keep working to bring you more practical articles!