

Uncovering the sky-high cost behind ChatGPT: tens of thousands of Nvidia A100s, costing Microsoft hundreds of millions of dollars

2025-01-15 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 11/24 Report --

Behind ChatGPT is Microsoft's super-expensive supercomputer, which cost hundreds of millions of dollars and uses tens of thousands of chips.

ChatGPT could not have become the world's top model without the enormous computing power behind it.

Data show that training ChatGPT consumed about 3640 PF-days of compute (that is, at one quadrillion floating-point operations per second, the calculation would take 3640 days).
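As a quick sanity check of that figure, here is the arithmetic as a small Python sketch (the conversion is standard: 1 PF-day means 10^15 floating-point operations per second sustained for one day):

```python
# Converting the quoted 3640 PF-days into total floating-point operations.
# 1 PF-day = 10**15 FLOP/s sustained for one day (86,400 seconds).
PFLOPS = 10**15
SECONDS_PER_DAY = 86_400

total_flops = 3640 * PFLOPS * SECONDS_PER_DAY
print(f"{total_flops:.3e} FLOPs")  # ~3.145e+23 total operations
```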

So how did the Microsoft supercomputer built for OpenAI come into being?

On Monday, Microsoft published a series of posts on its official blog revealing details of the super-expensive supercomputer and a massive upgrade to Azure, including thousands of Nvidia's most powerful H100 graphics cards and faster InfiniBand network interconnect technology.

Building on this, Microsoft also announced the new ND H100 v5 virtual machine, with the following specifications (a quick arithmetic check of the networking figures follows the list):

8 NVIDIA H100 Tensor Core GPUs interconnected through next-generation NVSwitch and NVLink 4.0

400 Gb/s of NVIDIA Quantum-2 CX7 InfiniBand per GPU, with 3.2 Tb/s of non-blocking fat-tree networking per virtual machine

NVSwitch and NVLink 4.0 with 3.6 TB/s of bisectional bandwidth among the 8 local GPUs in each virtual machine

4th-generation Intel Xeon Scalable processors

PCIe Gen5 host-to-GPU interconnect with 64 GB/s of bandwidth per GPU

16-channel 4800 MHz DDR5 DIMMs
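Here is that arithmetic check of the networking figures above, a sketch using only the numbers from the list:

```python
# Checking that the per-GPU InfiniBand figure adds up to the quoted
# per-VM total: 8 GPUs x 400 Gb/s = 3.2 Tb/s of non-blocking bandwidth.
gpus_per_vm = 8
infiniband_gbps_per_gpu = 400

total_tbps = gpus_per_vm * infiniband_gbps_per_gpu / 1000
print(f"{total_tbps} Tb/s per VM")  # 3.2 Tb/s, matching the spec
```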

About five years ago, OpenAI pitched a bold idea to Microsoft: build an artificial intelligence system that could forever change the way people interact with computers.

At the time, no one could have imagined it would mean an AI that could create any picture a human describes in plain language, or chatbots that people could use to write poems, lyrics, papers, emails, and menus.

To build this system, OpenAI needed a lot of computing power, the kind that could genuinely support very large-scale workloads.

But the question was: could Microsoft do it?

After all, no existing hardware could meet OpenAI's needs, and it was impossible to know whether building such a huge supercomputer inside the Azure cloud service would bring the whole system down.

So Microsoft began a difficult period of trial and error.

Nidhi Chappell, head of high-performance computing and artificial intelligence at Microsoft Azure (left), and Phil Waymouth, senior director of strategic partnerships at Microsoft (right)

To build a supercomputer supporting the OpenAI project, Microsoft spent hundreds of millions of dollars connecting tens of thousands of Nvidia A100 chips to the Azure cloud computing platform and revamping its server racks.

Moreover, to tailor this supercomputing platform to OpenAI, Microsoft was extremely dedicated, keeping a close watch on OpenAI's requirements so as to stay abreast of its most critical needs when training AI.

What did such a huge project cost? Scott Guthrie, Microsoft's executive vice president in charge of cloud computing and artificial intelligence, would not disclose the exact amount, but said it was "probably more than several hundred million dollars."

Phil Waymouth, the Microsoft executive in charge of strategic partnerships, pointed out that the scale of cloud computing infrastructure required to train OpenAI's models was unprecedented in the industry.

The networked GPU clusters grew exponentially in size, beyond anything anyone in the industry had tried to build.

Microsoft was determined to work with OpenAI because it firmly believed this unprecedented scale of infrastructure would change history, creating new AI and a new programming platform that would let it offer customers products and services that truly serve their interests.

Now it seems those hundreds of millions of dollars were clearly not wasted: the bet paid off.

On this supercomputer, the models OpenAI could train became more and more powerful, unlocking the astonishing capabilities of AI tools. Thus ChatGPT, which has all but kicked off humanity's fourth industrial revolution, was born.

A very satisfied Microsoft poured another $10 billion into OpenAI in early January.

It can be said that Microsoft's ambition to push the boundaries of AI has paid off. Behind it lies the transformation of AI from laboratory research to industrialization.

At present, Microsoft's office software empire has begun to take shape.

The ChatGPT-powered Bing can help us search for holiday schedules; the chatbot in Viva Sales can help marketers write emails; GitHub Copilot can help developers write code; and the Azure OpenAI Service gives us access to OpenAI's large language models with Azure's enterprise-grade features.
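For illustration, here is a minimal sketch of calling the Azure OpenAI Service from Python, assuming the pre-1.0 openai package; the endpoint, key, and deployment name are placeholders, not values from the article:

```python
# Hypothetical sketch: asking an Azure OpenAI deployment to draft an email.
# The resource endpoint, API key, and deployment name are placeholders.
import openai

openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"
openai.api_version = "2023-05-15"
openai.api_key = "<your-azure-openai-key>"

response = openai.ChatCompletion.create(
    engine="my-gpt-deployment",  # the model deployment name in Azure
    messages=[{"role": "user", "content": "Draft a short sales follow-up email."}],
)
print(response.choices[0].message["content"])
```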

In fact, last November Microsoft announced that it would work with Nvidia to build "one of the most powerful AI supercomputers in the world" to handle the huge computing load needed to train and scale AI.

The supercomputer is based on Microsoft's Azure cloud infrastructure and uses tens of thousands of Nvidia H100 and A100 Tensor Core GPUs, along with Nvidia's Quantum-2 InfiniBand networking platform.

The supercomputer can be used to research and accelerate generative AI models such as DALL-E and Stable Diffusion, Nvidia said in a statement.

As AI researchers began using more powerful GPUs to handle more complex AI workloads, they saw greater potential in AI models that grasp nuance well and can handle many different language tasks at once.

Simply put, the larger the model, the more data it is fed, and the longer it trains, the more accurate it becomes.
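That claim can be made concrete with a common rule of thumb from the scaling-law literature (an approximation, not from the article): training compute is roughly 6 FLOPs per parameter per training token:

```python
def training_flops(params: float, tokens: float) -> float:
    """Rough estimate: training takes ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

# Example: a 175-billion-parameter model trained on 300 billion tokens.
flops = training_flops(175e9, 300e9)
print(f"{flops:.2e} FLOPs")                      # ~3.15e+23
print(f"{flops / (1e15 * 86_400):.0f} PF-days")  # ~3646, close to the 3640 quoted earlier
```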

But these larger models quickly push against the boundaries of existing computing resources, and Microsoft knew what the supercomputer OpenAI needed would have to look like, and how big it would have to be.

This is obviously not something you can get by simply buying a bunch of GPUs and wiring them together.

Nidhi Chappell, head of high-performance computing and artificial intelligence at Microsoft Azure, said: "We need to train larger models for longer, which means not only do you need the largest infrastructure, you also have to make it run reliably over a long period of time."

Alistair Speirs, director of global infrastructure at Azure, says Microsoft had to make sure it could cool all these machines and chips, for example by using outside air in cooler climates and high-tech evaporative coolers in hot ones.

In addition, because all the machines start up at the same time, Microsoft had to think about where to place them and where the power would come from. It is like what happens when you switch on the microwave, toaster, and vacuum cleaner in the kitchen all at once, only the data center version.

So where does the key to these breakthroughs in large-scale AI training lie?

The challenge is building, operating, and maintaining tens of thousands of co-located GPUs interconnected over a high-throughput, low-latency InfiniBand network.

This scale went far beyond anything GPU and network equipment vendors had ever tested; it was completely uncharted territory. No one knew whether the hardware would break down at that scale.

Chappell explained that in training an LLM, the large-scale computation involved is typically partitioned across thousands of GPUs in a cluster.

In a phase called allreduce, the GPUs exchange information about the work they have done. This exchange is accelerated over the InfiniBand network so that it finishes before the next block of computation begins.
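A minimal PyTorch sketch of that allreduce phase follows; the article does not say which framework was used, so this is illustrative only (launch across 8 local GPUs with, for example, torchrun --nproc_per_node=8 allreduce_demo.py):

```python
# Minimal sketch of the allreduce phase in data-parallel training, using
# PyTorch's distributed package (the framework choice is an assumption).
import torch
import torch.distributed as dist

def main():
    # NCCL is the usual backend on GPU clusters; it rides on NVLink
    # within a node and InfiniBand between nodes.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Stand-in for a gradient tensor produced by this GPU's backward pass.
    grad = torch.randn(1024, 1024, device="cuda")

    # Every GPU contributes its tensor and receives the sum; averaging
    # gives all ranks identical "gradients" before the next compute block.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```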

Chappell says that because this work spans thousands of GPUs, achieving optimal performance requires not only reliable infrastructure but also a large number of system-level optimizations, refined over many generations of hardware.

This system-level optimization includes software that makes efficient use of the GPUs and the network equipment.

Over the past few years, Microsoft has developed techniques that make it possible to train models with trillions of parameters while reducing the resources and time needed to train them and serve them in production.
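The article does not name this software, but Microsoft's open-source DeepSpeed library is a well-known example of the kind of system-level optimization described; a minimal, hypothetical initialization might look like this:

```python
import torch
import deepspeed

# Stand-in model; in practice this would have billions of parameters.
model = torch.nn.Linear(4096, 4096)

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},           # halve memory with mixed precision
    "zero_optimization": {"stage": 2},   # shard optimizer state and gradients
}

# DeepSpeed wraps the model and optimizer with its distributed engine.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```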

Waymouth noted that Microsoft and its partners have also been gradually growing the GPU clusters and building out the InfiniBand network, probing how far they could push the data center infrastructure needed to keep the clusters running, including cooling systems, uninterruptible power supplies, and backup generators.

Eric Boyd, vice president of Microsoft's AI Platform, said this supercomputing power, optimized for large language model training and the next wave of AI innovation, is now available directly through Azure cloud services.

Microsoft has accumulated a great deal of experience working with OpenAI, and when other partners come looking for the same infrastructure, Microsoft can provide it as well.

Today, Microsoft's Azure data centers span more than 60 regions around the world.

New virtual machine: ND H100 v5

Microsoft continues to improve on the infrastructure described above.

Today, Microsoft announced new, massively scalable virtual machines that integrate the latest NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking.

Through these virtual machines, Microsoft can offer customers infrastructure that scales to the size of any AI task. According to Microsoft, Azure's new ND H100 v5 virtual machines give developers excellent performance while invoking thousands of GPUs at once.
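As a hypothetical illustration, the Azure SDK for Python can be used to check whether H100-based sizes are available in a region; the subscription ID and region below are placeholders:

```python
# Hypothetical sketch: list VM sizes in a region and look for H100-based ones.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

credential = DefaultAzureCredential()
client = ComputeManagementClient(credential, "<subscription-id>")

for size in client.virtual_machine_sizes.list(location="eastus"):
    if "H100" in size.name:  # e.g. the ND H100 v5 series
        print(size.name, size.number_of_cores, "cores,", size.memory_in_mb, "MB")
```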

Reference:

https://news.microsoft.com/source/features/ai/how-microsofts-bet-on-azure-unlocked-an-ai-revolution/

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).
