2025-02-14 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 11/24 Report --
Almost every industry is talking about large models, every industry giant is training one, and artificial intelligence has entered an era dominated by large models.
To occupy the high ground of large-model applications, data and computing power are indispensable cornerstones. Computing power has been discussed so much that Nvidia's market capitalization roughly quadrupled in 2023. Data should not be underestimated either: beyond its explosive growth, basic operations such as reading, writing, and transmitting data are running into more and more new challenges.
01 A hurdle that must be cleared to "squeeze computing power dry"
In many people's perception, training a large model is a costly business. GPT-4's training cost is rumored to be as high as $1 billion; it takes a "unicorn"-scale upfront investment to unleash the "magic" of emergent abilities.
More specifically, in the cost structure of large-model training, hardware investment covers compute, networking, and storage, with compute-related hardware accounting for 80%. After all, an 80 GB A100 sells abroad for as much as $15,000, and a model with hundreds of billions of parameters often requires tens of thousands of A100 GPUs. Yet in actual training, average GPU utilization is below 50%. The limiting factors include frequent tuning of model parameters, long recovery periods after training interruptions, slow data loading, and so on.
Put bluntly, every minute of idle compute is money burned. If compute utilization can be raised further, the training cost of large models falls accordingly. And one hurdle that must be cleared on the way to higher compute utilization is data read/write performance.
During training, a large model first reads a piece of data, trains on it once the read completes, and reads the next piece while training. If the next piece has not finished loading when the current step ends, the accelerator sits idle for a while. On top of that come the training interruptions caused by network fluctuations and compute failures: at a checkpoint restart, training rolls back to the previous saved state, which also leaves compute waiting.
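The overlap of reading and training described above can be sketched as a simple double-buffered loader. This is only an illustration of the idea; `load_batch` and `train_step` are hypothetical stand-ins for real storage I/O and accelerator compute:

```python
import threading
from queue import Queue

def load_batch(i):
    # Stand-in for reading one shard of training data from storage.
    return list(range(i * 4, i * 4 + 4))

def train_step(batch):
    # Stand-in for one optimization step on the accelerator.
    return sum(batch)

def prefetch_train(num_batches, depth=2):
    """Read batch i+1 in the background while batch i is being trained on."""
    q = Queue(maxsize=depth)

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))   # blocks when the buffer is full
        q.put(None)                # sentinel: no more data

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (batch := q.get()) is not None:
        results.append(train_step(batch))  # compute overlaps with the next read
    return results

print(prefetch_train(3))  # -> [6, 22, 38]
```

If `load_batch` is slower than `train_step`, the queue runs empty and the consumer blocks: exactly the compute-idle gap the article describes.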
Worse, training data today usually exists as small files such as images and documents, which means data must be read and written frequently during training, with fast random access. Moreover, the raw dataset for a large model often runs to tens of TB, while the small-file loading speed of current file systems is under 100 MB/s, which quietly caps the efficiency of the whole system.
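A back-of-envelope calculation shows why ~100 MB/s of small-file throughput is a bottleneck at this scale (the 30 TB figure is illustrative; the text only says "tens of TB"):

```python
# Time to scan a hypothetical 30 TB dataset once at ~100 MB/s.
dataset_bytes = 30 * 10**12      # 30 TB (illustrative size)
load_rate = 100 * 10**6          # ~100 MB/s small-file loading speed
seconds = dataset_bytes / load_rate
print(seconds / 3600)            # roughly 83 hours for a single pass
```

And training typically makes many passes over the data, so slow loading compounds across the whole run.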
From first principles, the root cause of low compute utilization in large-model training is the huge number of small files, which traditional storage systems cannot process efficiently, resulting in slow loading. To get maximum efficiency out of large-model training and cut unnecessary waste, the effort must go into data, and specifically into innovation in data storage performance.
Huawei has worked on high-performance NAS storage for many years, and its OceanStor Dorado all-flash NAS delivers industry-leading performance, especially in massive small-file scenarios.
At the openEuler Developer Conference 2023, Huawei also launched the NFS+ protocol together with openEuler, aimed squarely at the performance of client access to OceanStor Dorado NAS, attempting to shorten the waits in large-model training and squeeze out as much compute value as possible by introducing an external high-performance parallel file storage system.
02 The "dragon-slaying" feats of Huawei's NFS+ protocol
Before unveiling Huawei's NFS+ protocol, it is worth reviewing the history of the NFS protocol. A distributed file system protocol developed by Sun in 1984, NFS has existed for nearly 40 years and is widely used in finance, EDA simulation, telecom billing, bill imaging, and other industries.
With the passage of time, however, the veteran NFS has exposed some shortcomings. Traditional NFS binds a single mount point to one server IP address; if that network port or link fails, the mount point can become inaccessible. The client cannot sense a failure at the other end and must rely on the application layer to remount the file system manually, and active-active links cannot switch over automatically. The performance of a single mount point is also capped by a single physical link, creating bottlenecks for important workloads.
About two years ago, Huawei began developing the NFS+ protocol, focused on curing these shortcomings of traditional NFS, and ultimately delivered an answer built around high reliability and high availability:
First, reliability. Traditional NFS offers only one path between client and server. NFS+ allows multiple IPs to serve a single NFS mount point, in effect building multiple paths between client and server, which neatly solves the reliability problem for which traditional NFS is criticized.
Second, multi-link aggregation. With only one road between client and server, any accident causes traffic congestion. NFS+ uses a routing algorithm to distribute a single mount point's IO evenly across multiple links, keeping data flowing smoothly between server and client.
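The idea of spreading one mount point's IO across several links can be illustrated with a minimal round-robin scheduler. This is a sketch only: the actual NFS+ routing algorithm is not public, and the link names here are hypothetical:

```python
from itertools import cycle
from collections import Counter

class LinkBalancer:
    """Spread IO requests from one mount point over multiple network links."""
    def __init__(self, links):
        self._links = cycle(links)

    def dispatch(self, request):
        # Round-robin: each request goes out on the next link in turn,
        # so no single physical link caps the mount point's throughput.
        return next(self._links)

balancer = LinkBalancer(["eth0", "eth1", "eth2"])
sent = Counter(balancer.dispatch(f"io-{i}") for i in range(9))
print(sent)  # each of the three links carries 3 of the 9 requests
```

A production implementation would weight links by load and latency rather than pure rotation, but the effect is the same: no single link becomes the bottleneck.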
Third, cache acceleration. When training large models, metadata needs to be cached on the compute node. Traditional NFS is conservative here, with a relatively short cache expiration time. NFS+ enlarges the cache and improves the invalidation mechanism, letting more metadata stay on the host side for longer to meet the low-latency demands of large-model training.
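The effect of a longer metadata cache lifetime can be shown with a simple time-to-live (TTL) cache. This is a generic sketch, not NFS+'s actual mechanism; the paths and sizes are hypothetical:

```python
import time

class MetadataCache:
    """Client-side metadata cache with a configurable time-to-live (TTL)."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}          # path -> (metadata, expiry timestamp)

    def put(self, path, metadata, now=None):
        now = time.monotonic() if now is None else now
        self._entries[path] = (metadata, now + self.ttl)

    def get(self, path, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(path)
        if entry is None or now >= entry[1]:
            return None             # miss or expired: must re-fetch from the server
        return entry[0]

# A short TTL (the conservative traditional-NFS stance) expires quickly,
# forcing repeated server round trips; a longer TTL keeps metadata on the host.
short = MetadataCache(ttl_seconds=1)
short.put("/data/img_001.jpg", {"size": 4096}, now=0.0)
print(short.get("/data/img_001.jpg", now=0.5))  # {'size': 4096}
print(short.get("/data/img_001.jpg", now=2.0))  # None (expired)
```

The trade-off is staleness: the longer metadata lives on the host, the stronger the invalidation mechanism must be, which is exactly the part NFS+ claims to improve.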
Fourth, data view synchronization. As noted earlier, large-model training needs fast random access. NFS+ synchronizes the data view across the cluster, so that when training needs to read data held on a particular node, data can be placed at and accessed from the corresponding node efficiently, finding the optimal access path.
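One common way to let every client agree on where data lives without a central lookup is deterministic hash-based placement. The sketch below is a generic illustration of that idea, not NFS+'s actual scheme; node names and paths are hypothetical:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical storage nodes

def owner(path, nodes=NODES):
    """Deterministically map a file path to the node that holds it.

    Because the mapping is a pure function of the path, every client
    sharing the same view computes the same answer locally and can go
    straight to the right node without a metadata round trip.
    """
    digest = hashlib.md5(path.encode()).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]

# All clients agree on the placement without asking a central server.
print(owner("/train/shard-0017.bin"))
```

Real systems refine this with consistent hashing so that adding or removing a node relocates only a small fraction of the data.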
To sum up, NFS+ adopts the design of a high-performance parallel file storage system and adds targeted optimizations for mixed large- and small-file scenarios, such as multi-link aggregation, cache acceleration, and data view synchronization. Together these raise the read/write performance of massive small files and deliver "fast reads and writes, less waiting" during large-model training, reducing idle compute time.
A set of client test data confirms that the NFS+ route is the right one: compared with traditional file storage, random-read performance on small training-sample IO improves by more than 4x, and the bandwidth of CheckPoint large-file striping plus multipath transfer improves 4-6x, enough to meet the stringent requirements of large-model training.
03 Data storage enters the "large model era"
To some extent, the data-storage performance demands generated by large-model training are only one facet of the accelerated evolution of file storage systems.
Even today, the demands on file storage keep being refreshed and file-system innovation keeps happening, just as the direction of evolution reflected in large-model training needs shows.
Consider that a single Nvidia training node can process 20,000 images per second and needs 80,000 IOPS. A typical configuration pairs a 100-billion-parameter model with a thousand GPU cards, demanding an extremely high frequency of reads and writes of huge numbers of small files per unit time.
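Those figures translate into a quick back-of-envelope check. The per-node numbers come from the text above; the 8-GPUs-per-node figure used to extrapolate across the "thousand-card" cluster is an assumption:

```python
# Per-node figures quoted in the text.
images_per_sec_per_node = 20_000
iops_per_node = 80_000
print(iops_per_node / images_per_sec_per_node)   # 4.0 IO operations per image

# Hypothetical cluster-wide extrapolation (8 GPUs per node is an assumption).
total_gpus = 1_000          # "thousand-card" configuration
gpus_per_node = 8
nodes = total_gpus // gpus_per_node              # 125 nodes
print(nodes * iops_per_node)                     # 10,000,000 aggregate IOPS
```

Ten million small-file IOPS in aggregate is far beyond what a metadata-centralized file system comfortably sustains, which is the pressure the article points to.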
This may also be why Huawei and openEuler jointly announced the NFS+ protocol: the sudden acceleration of demand for file-system innovation is bound to trigger an "arms race" among leading technology companies around data storage, and Huawei is undoubtedly among the front-runners.
For anyone with even a passing understanding of the file-storage market, though, Huawei's self-developed NFS+ protocol carries another layer of meaning.
On the one hand, the MDS schemes of parallel file systems such as Lustre, GPFS, and BeeGFS separate metadata access from file-data access and still face bottlenecks in performance and reliability. With NFS+, metadata is no longer concentrated on a single node but distributed across all nodes in the cluster, enabling multiple host-side connections and removing the underlying bottleneck for high-frequency small-file processing in large-model scenarios.
On the other hand, from most users' point of view, NFS+ is well compatible with existing habits. The operations and maintenance mechanisms and knowledge built around traditional NFS remain valid, and the file-system switch is smoother. Without modifying the operating system's data plane, NAS storage access performance and reliability can be improved by 6x and 3x respectively, embracing the large-model training wave at very low cost.
It is undeniable that large models are shifting from foreground "heat" to a coordinated drive across the whole industry chain, in which data storage is a key link.
Under this trend, the industry's attention will shift from merely building models to building them more efficiently and faster. Metrics such as the collection and loading performance of massive small files and the utilization of compute resources will draw the attention of more and more enterprises, and a file-storage revolution that simplifies complexity is bound to follow.