2025-02-28 Update From: SLTechnology News & Howtos
Shulou (shulou.com) 11/24 report --
Thanks to CTOnews.com readers Mr. Air and Xiao Zhanche for the tip! CTOnews.com reported on April 14 that, according to Tencent, Tencent Cloud has released a new-generation HCC (High-Performance Computing Cluster), built on its latest-generation self-developed Xingxinghai servers and equipped with NVIDIA H800 Tensor Core GPUs.
According to Tencent, the cluster is built on a self-developed network and storage architecture that delivers 3.2 Tbps of ultra-high interconnect bandwidth, TB/s-level storage throughput, and tens of millions of IOPS. In measured results, the new-generation cluster's computing performance is three times that of the previous generation.
Last October, Tencent completed training of its first trillion-parameter AI model, the Hunyuan NLP large model. On the same dataset, training time fell from 50 days to 11 days; on the new-generation cluster, it would drop further to an estimated 4 days.
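The speedups implied by those reported training times can be checked with simple arithmetic; the day counts below are taken directly from the article, and the helper function is just an illustration.

```python
def speedup(old_days: float, new_days: float) -> float:
    """Throughput multiple implied by a reduction in training wall time."""
    return old_days / new_days

# 50 -> 11 days is roughly a 4.5x speedup over the original hardware;
# 11 -> 4 days is a further 2.75x, consistent with the claimed ~3x
# generation-over-generation gain.
print(round(speedup(50, 11), 2))  # 4.55
print(round(speedup(11, 4), 2))   # 2.75
```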
At the compute level, single-server performance is the foundation of cluster computing power. A single GPU in Tencent Cloud's new-generation cluster delivers up to 1,979 TFLOPS, depending on precision.
For large-model scenarios, the self-developed Xingxinghai server uses an ultra-high-density 6U design, raising rack density roughly 30% above what the industry typically supports. Following a parallel-computing philosophy, CPU and GPU nodes are co-designed as an integrated unit to push single-node computing performance higher.
At the network level, compute nodes exchange enormous volumes of data. As the cluster scales up, communication performance directly affects training efficiency, so the network and the compute nodes must be made to cooperate as closely as possible.
Tencent's self-developed Xingmai high-performance computing network claims the industry's highest RDMA communication bandwidth at 3.2 Tbps. In measured results, with the same number of GPUs, a cluster on the 3.2 Tbps Xingmai network delivers 20% more overall computing power than one on a 1.6 Tbps network.
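Why interconnect bandwidth translates into training throughput can be sketched with the standard bandwidth-cost model for a ring all-reduce (the collective used to average gradients in data-parallel training). The formula is textbook, not from the article; the payload and GPU count below are illustrative assumptions.

```python
def ring_allreduce_seconds(payload_bytes: float, n_gpus: int, link_gbps: float) -> float:
    """Bandwidth term of a ring all-reduce: each GPU sends and receives
    2*(N-1)/N times the payload over its link. Latency terms are ignored."""
    link_bytes_per_s = link_gbps * 1e9 / 8
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes / link_bytes_per_s

# Illustrative: fp16 gradients for a 1-trillion-parameter model (~2 TB)
# reduced across 1024 GPUs, on 1.6 Tbps vs 3.2 Tbps links.
payload = 1e12 * 2
t_16 = ring_allreduce_seconds(payload, 1024, 1600)
t_32 = ring_allreduce_seconds(payload, 1024, 3200)
print(t_32 / t_16)  # 0.5 -- doubling link bandwidth halves the bandwidth term
```

In practice the end-to-end gain is smaller than the bandwidth term alone suggests (compute overlaps with communication, and latency terms do not shrink), which is consistent with the reported 20% overall improvement rather than a full 2x.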
At the same time, TCCL, Tencent's self-developed high-performance collective communication library, is integrated into customized solutions. Compared with open-source collective communication libraries, it improves load performance for large-model training by 40% and eliminates training interruptions caused by various network faults.
At the storage level, during large-model training many compute nodes read the same batch of datasets simultaneously, so data-loading time must be kept as short as possible to avoid leaving compute nodes idle.
Tencent Cloud's self-developed storage architecture offers TB/s-level throughput and tens of millions of IOPS, supporting storage needs across scenarios. The COS + GooseFS object-storage solution and the CFS Turbo high-performance file-storage solution fully meet the high-performance, high-throughput, mass-storage requirements of large-model scenarios.
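The principle behind keeping compute nodes fed is to prefetch the next batch while the current one is being processed, so storage I/O hides behind computation. A minimal, generic sketch follows; the loader and queue here are illustrative stand-ins, not Tencent Cloud (COS/GooseFS/CFS Turbo) APIs.

```python
import queue
import threading

def loader(batches, q):
    """Producer: stand-in for a worker reading batches from remote storage."""
    for b in batches:
        q.put(b)          # blocks when the prefetch buffer is full
    q.put(None)           # sentinel: no more data

def train(batches):
    """Consumer: overlaps 'training' on one batch with loading of the next."""
    q = queue.Queue(maxsize=2)   # bounded prefetch buffer
    threading.Thread(target=loader, args=(batches, q), daemon=True).start()
    seen = []
    while (b := q.get()) is not None:
        seen.append(b)           # stand-in for one training step
    return seen

print(train([0, 1, 2, 3]))  # [0, 1, 2, 3]
```

The bounded queue is the key design choice: it lets I/O run ahead of compute by a fixed margin without unbounded memory growth, which is why raw storage throughput and IOPS set the ceiling on how well this overlap works at scale.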
In addition, the new-generation cluster integrates Tencent Cloud's self-developed TACO training-acceleration engine, which applies extensive system-level optimizations across network protocols, communication strategies, AI frameworks, and model compilation, substantially reducing training, tuning, and compute costs.
AngelPTM, the training framework behind Tencent's Hunyuan model, is also available as a service through Tencent Cloud TACO, helping enterprises accelerate large-model adoption.
Through the large-model capabilities and toolkits of the Tencent Cloud TI platform, enterprises can fine-tune models on their own industry data, improving production efficiency and quickly building and deploying AI applications.
Relying on the native governance capabilities of its distributed cloud, Tencent Cloud's computing platform provides 16 EFLOPS of floating-point computing power.
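A back-of-envelope check puts the two headline numbers in the article side by side: at the quoted per-card peak, how many GPUs would it take to reach the platform's aggregate figure? Pure arithmetic on the article's numbers; the real fleet mixes many GPU models and precisions, so this is only an order-of-magnitude illustration.

```python
per_gpu_tflops = 1979              # quoted per-card peak (precision-dependent)
total_eflops = 16                  # quoted platform aggregate

# 1 EFLOPS = 1e6 TFLOPS
gpus_needed = total_eflops * 1e6 / per_gpu_tflops
print(round(gpus_needed))  # 8085 -- order of 8,000 top-end cards
```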
© 2024 shulou.com SLNews company. All rights reserved.