2025-01-18 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)11/24 Report--
From content generation and game development to automation assistants and robot control, AIGC is playing a role in more and more fields and is gradually penetrating into industry. For example, in digital government scenarios, combining AIGC with digital human technology can provide personalized government services and consultation, improve interaction between government and citizens, and raise the quality and efficiency of government services. At present, AIGC is concentrated in three application scenarios: large model training, MaaS (Model as a Service), and AIGC inference. Of these, large model training is the main arena of competition among emerging enterprises.
"Don't play with large models without a good network." Building a large-scale training cluster requires not only basic components such as GPU servers and network cards, but also a solution to the problem of network construction. The network is critical to releasing the computing power of a large model cluster and to keeping it running reliably. How to build a network that meets the computing requirements of large model clusters is one of the keys to advancing AIGC.
The "three super" network requirements of AIGC large model training
Large model training involves three kinds of traffic patterns: tensor parallelism, pipeline parallelism, and data parallelism. The well-known GPT-3 was trained on 128 A100 servers with a total of 1024 A100 cards; in such a cluster, a single server node needs four 100G network channels, and other large models such as GPT-4 and GPT-5 will place even higher demands on the network. Tide Network believes that the network requirements of large model training can be summarized as a "three super" network: super-large scale, ultra-high bandwidth, and ultra-high reliability. A network with these properties runs stably and reliably and provides strong support for large model training.
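The cluster figures quoted above can be checked with a few lines of arithmetic. The following is only a sketch: the function name `cluster_summary` and its parameter names are made up for illustration, and the value of 8 GPUs per server is inferred from 1024 cards divided by 128 servers.

```python
def cluster_summary(servers: int, gpus_per_server: int,
                    links_per_server: int, link_gbps: int) -> dict:
    """Derive basic scale and bandwidth figures for a training cluster."""
    node_bandwidth = links_per_server * link_gbps
    return {
        "total_gpus": servers * gpus_per_server,
        "node_bandwidth_gbps": node_bandwidth,
        "total_access_links": servers * links_per_server,
        "aggregate_access_bandwidth_gbps": servers * node_bandwidth,
    }


# Figures from the text: 128 servers, four 100G channels per server.
# 8 GPUs per server is an inferred value (1024 cards / 128 servers).
summary = cluster_summary(servers=128, gpus_per_server=8,
                          links_per_server=4, link_gbps=100)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under these assumptions the network must terminate 512 access links at 100G each, which gives a sense of the scale and bandwidth that the "three super" requirements refer to.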
Meeting the challenge of the "three super" network means focusing on how to build a networking scheme that satisfies the needs of large-scale training. In terms of architecture, current AIGC networks generally adopt a fat-tree topology, which offers high bandwidth, low latency, and good scalability. In terms of protocols, the industry mainstream is built on two lossless network technologies, InfiniBand (IB) and RoCE, both of which can meet the high-bandwidth, low-latency requirements of large-scale training. IB offers very low latency, while RoCE is superior in openness, cost performance, and maintainability.
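To make the scalability of a fat tree concrete, the sketch below uses the standard sizing rule for a non-blocking fat tree built from identical k-port switches: an n-tier tree supports at most 2 * (k/2)^n endpoints. The function names and the 64-port switch size are illustrative assumptions, not details from the article.

```python
def fat_tree_capacity(ports: int, tiers: int) -> int:
    """Maximum endpoints of a non-blocking fat tree of identical switches.

    Every switch below the top tier splits its ports evenly between
    downlinks and uplinks, which gives 2 * (ports / 2) ** tiers endpoints.
    """
    if ports < 2 or ports % 2 != 0:
        raise ValueError("ports must be an even number >= 2")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    return 2 * (ports // 2) ** tiers


def min_tiers(endpoints: int, ports: int) -> int:
    """Smallest number of tiers whose capacity covers the endpoints."""
    tiers = 1
    while fat_tree_capacity(ports, tiers) < endpoints:
        tiers += 1
    return tiers


# Hypothetical example: 512 access links (128 servers x 4 channels)
# connected with 64-port switches.
print(fat_tree_capacity(64, 2))   # two-tier (leaf-spine) capacity
print(min_tiers(512, 64))         # tiers needed for 512 endpoints
```

With these assumed numbers a two-tier leaf-spine fabric is already sufficient; a third tier only becomes necessary once the endpoint count exceeds the two-tier capacity.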
Following the convergence trend, Tide Network builds an intelligent lossless network solution based on RoCE
Tide Network, as a leader in cloud-edge collaborative intelligent networking, pays close attention to market developments and has launched an intelligent lossless network solution based on RoCE to help build the "three super" network for AIGC. It has the following advantages:
The first is multi-protocol, multi-scenario convergence. A large-scale cluster contains many scenarios, such as general computing clusters, AI / HPC clusters, and storage. The traditional approach is to deploy multiple networks and protocols, such as Ethernet, IB, and FC, which are incompatible with one another and greatly increase the difficulty of management and maintenance. The RoCE-based intelligent lossless network solution can serve general computing, AI / HPC, storage, and other scenarios, converging Ethernet, IB, and FC into a single network. Moving from maintaining multiple networks to maintaining one greatly reduces overall construction and maintenance costs.
The second is intelligent elasticity and dynamic adjustment. Large-scale cluster training requires the whole cluster to be deployed and delivered quickly, saving training time and minimizing downtime and other failures. In Tide Network's RoCE-based intelligent lossless network solution, the digital network engine IDE automates deployment of the cluster network so that services can go online quickly. It also monitors the load and health of devices and links in real time, including CRC packet errors, port bandwidth utilization, queue buffer usage, CNP packets, and Pause (backpressure) frames, in order to locate faults quickly, analyze them intelligently, and trace the network on a per-service basis. In addition, it provides a standard northbound API that can connect to the upper-layer computing platform, linking computing and network so that the cluster's computing power is better released.
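The kind of link health check described above can be sketched as a simple threshold comparison over the listed indicators. This is not the API of Tide Network's product: the `PortSample` fields, the `LIMITS` values, and the port name are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortSample:
    """One polling interval of counters for a switch port (hypothetical)."""
    port: str
    crc_errors: int          # CRC error packets seen in the interval
    bandwidth_pct: float     # port bandwidth utilization, percent
    queue_buffer_pct: float  # queue buffer occupancy, percent
    cnp_packets: int         # congestion notification packets (RoCE)
    pause_frames: int        # Pause (backpressure) frames


# Illustrative limits only; real values depend on the deployment.
LIMITS = {
    "crc_errors": 0,
    "bandwidth_pct": 90.0,
    "queue_buffer_pct": 80.0,
    "cnp_packets": 1000,
    "pause_frames": 100,
}


def find_alerts(sample: PortSample) -> List[str]:
    """Return one message per metric that exceeds its limit."""
    alerts = []
    for metric, limit in LIMITS.items():
        value = getattr(sample, metric)
        if value > limit:
            alerts.append(f"{sample.port}: {metric}={value} exceeds {limit}")
    return alerts


sample = PortSample(port="leaf1/eth1", crc_errors=3, bandwidth_pct=95.0,
                    queue_buffer_pct=40.0, cnp_packets=12, pause_frames=0)
for message in find_alerts(sample):
    print(message)
```

A real system would collect these counters through telemetry and correlate them across devices; the sketch only shows how the indicators named in the text might feed a fault-location rule.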
Tide Network's RoCE-based intelligent lossless network solution has already been deployed in a customer project in the education and scientific research sector. It fully meets the high-bandwidth, low-latency connectivity requirements of general computing clusters, GPU-accelerated clusters, heterogeneous computing clusters, distributed storage clusters, all-flash storage clusters, and other scenarios, helping the customer build an overall network architecture suited to the development of AIGC.
In the future, Tide Network will continue to strengthen its RoCE-based intelligent lossless network products and solutions, while studying UEC (Ultra Ethernet Consortium) based networking in depth and developing products that support UEC to help customers succeed.