2025-03-01 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 11/24 Report --
Today, the reporter learned that Tencent Cloud's Xingmai Network won the "Benchmark Application Award" at the 2023 AI Network Innovation Conference. Sponsored by the China Communications Society, the award recognizes outstanding AI network industry applications with strong commercial value, a high level of service and significant application benefits.
Xingmai Network is a high-performance network developed by Tencent Cloud specifically for large models. Built on Tencent Cloud's new-generation computing cluster HCC, it can support very large computing clusters of up to 100,000 GPU cards. Xingmai Network also offers 3.2T of communication bandwidth, which Tencent Cloud describes as the highest in the industry, reduces latency to 10us-40us, and reduces the packet loss rate to zero. The result is a 10-fold improvement in communication performance, a 40% increase in GPU utilization, and a 30% to 60% saving in model training cost.
Wang Yachen, vice president of Tencent Cloud, introduced Xingmai Network in a presentation at the conference. Xingmai Network is built on high-speed Ethernet technology and provides the high-performance interconnect foundation for Tencent's Hunyuan large model.
Wang Yachen pointed out that the continued development of large AI models has placed new demands on network transmission and stability, and that traditional network architectures are increasingly unable to meet the needs of large model training.
For a large model with hundreds of billions or trillions of parameters, communication can account for up to 50% of training time, and traditional low-speed networks fall far short of the bandwidth required. At the same time, traditional network protocols are prone to congestion, high latency and packet loss, and a packet loss rate of only 0.1% may cost 50% of computing power, a serious waste of computing resources. High bandwidth, high utilization and lossless transmission are the core challenges networks face in the era of large AI models.
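To put the 50% communication share in perspective, a rough back-of-envelope estimate can be made with Amdahl's law. The sketch below is illustrative only: it assumes communication and computation do not overlap, and the function name and the 10x input are chosen for this example rather than taken from Tencent Cloud measurements.

```python
def overall_speedup(comm_fraction: float, comm_speedup: float) -> float:
    """Amdahl's-law estimate of end-to-end training speedup.

    comm_fraction: share of step time spent on communication (0.0 to 1.0).
    comm_speedup:  factor by which communication alone gets faster.
    Assumption: compute and communication run strictly one after the other.
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    if comm_speedup <= 0.0:
        raise ValueError("comm_speedup must be positive")
    compute_part = 1.0 - comm_fraction          # unchanged by a faster network
    comm_part = comm_fraction / comm_speedup    # shrinks with a faster network
    return 1.0 / (compute_part + comm_part)


if __name__ == "__main__":
    # Hypothetical inputs: 50% of time in communication, communication 10x faster.
    print(round(overall_speedup(0.5, 10.0), 2))
```

Under these assumptions, even a 10-fold communication improvement yields less than a 2x end-to-end gain, because the compute half of each step is unchanged. Real training jobs overlap communication with computation, so actual gains will differ.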
Drawing on its full-stack in-house R&D capabilities, Tencent Cloud has upgraded both hardware and software across switches, communication protocols, communication libraries and operating systems, and took the lead in launching Xingmai Network, a high-performance network built specifically for large models.
In terms of hardware, Xingmai Network is based on Tencent's network R&D platform and uses fully self-developed equipment to build the interconnect foundation, with automated deployment and configuration.
In terms of software, the TiTa network protocol developed by Tencent Cloud uses advanced congestion control and management techniques to monitor and adjust for network congestion in real time. It meets the communication needs of a large number of server nodes, keeps data exchange smooth and latency low, achieves zero packet loss under high load, and delivers cluster communication efficiency of more than 90%.
In addition, Tencent Cloud has designed a high-performance collective communication library, TCCL, for Xingmai Network. Integrated into a customized solution, it lets the system perceive network quality at microsecond granularity. Combined with a dynamic scheduling mechanism that allocates communication channels sensibly, it avoids training interruptions caused by network problems and reduces communication latency by 40%.
Network availability also determines the computing stability of the whole cluster. To keep Xingmai Network highly available, Tencent Cloud has developed an end-to-end, full-stack network operations system. Using multi-dimensional end-to-end network monitoring and an intelligent fault-location system, it automatically isolates and analyzes network problems, cutting overall troubleshooting time from days to minutes. At the same time, the overall deployment time of the large model training system has been reduced from 19 days to 4.5 days, with basic configuration guaranteed to be 100% accurate.
The 2023 AI Network Innovation Conference was held under the guidance of the China Communications Society, sponsored by the Society's Information and Communication Network Technology Committee and the Jiangsu Future Network Innovation Research Institute, and co-sponsored by SDNLAB. Representatives of operators, Internet companies, equipment manufacturers, universities and scientific research institutions were invited to exchange views on topics such as network interconnect architecture for AI, AI network equipment, high-performance network transmission technology, and network scheduling and resource allocation. The conference aims to build China's first communication platform dedicated to AI networking, strengthen the network infrastructure that supports the steady development of the AI industry, and drive network development through technological innovation and upgrading.