
Tens of thousands of Nvidia chips and hundreds of millions of dollars from Microsoft: a look at the supercomputer behind ChatGPT

2025-02-28 Update From: SLTechnology News & Howtos > IT Information


Shulou (Shulou.com) 11/24 Report --

Thanks to CTOnews.com reader "South China Wu Yanzu" for the tip! March 14 (Beijing time) news: the AI chatbot ChatGPT became popular worldwide almost as soon as it launched, but what outsiders may not know is that behind ChatGPT's intelligence sits an expensive supercomputer built by Microsoft.

Microsoft's supercomputer uses tens of thousands of Nvidia GPUs

When Microsoft invested $1 billion in ChatGPT developer OpenAI in 2019, it agreed to build a massive cutting-edge supercomputer for the AI research startup. The only problem: Microsoft had nothing like what OpenAI needed, and it was not entirely sure it could build something that big in its Azure cloud service without it breaking.

At the time, OpenAI was trying to train an ever-larger set of AI programs, known as models, which were ingesting greater volumes of data and learning more and more parameters. These parameters are the variables an AI system arrives at through training and retraining. That meant OpenAI needed long-term access to powerful cloud computing services.
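To make the idea of a learned parameter concrete, here is a minimal sketch. This is hypothetical illustration code, not anything OpenAI uses: a single parameter `w` is repeatedly nudged by gradient descent until it fits the training data. Real models learn billions of such parameters the same basic way.

```python
# Minimal sketch (hypothetical): a "parameter" is just a number the
# training loop adjusts to reduce error on the data.

def train_parameter(data, steps=100, lr=0.1):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0  # the parameter, before training
    for _ in range(steps):
        # gradient of mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad  # nudge the parameter against the gradient
    return w

# Data generated by y = 3x; training recovers w close to 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = train_parameter(data)
```

Scaling this single-variable loop to hundreds of billions of parameters is what drives the demand for the GPU clusters the article describes.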

Tens of thousands of chips, hundreds of millions in investment

To meet this challenge, Microsoft had to find ways to string together tens of thousands of Nvidia A100 graphics chips, the workhorse for training AI models, and to rethink how servers are positioned on racks to prevent power failures. Scott Guthrie, Microsoft's executive vice president of cloud computing and artificial intelligence, declined to say how much the project cost, but said it was "probably larger" than a few hundred million dollars.

"We built a system architecture that works at hyperscale and is reliable. That's why ChatGPT is possible,"said Nidhi Chappell, general manager of Microsoft Azure AI infrastructure." It's one model that comes out of it, and there will be many, many others in the future." "

ChatGPT was trained on the supercomputer

That technology helped OpenAI launch ChatGPT, which attracted more than a million users within days of its November debut and is now being folded into the products of other companies, from the firm run by billionaire hedge fund founder Ken Griffin to delivery company Instacart. As generative AI tools such as ChatGPT draw growing interest from businesses and consumers, cloud service providers such as Microsoft, Amazon, and Google will face mounting pressure to ensure their data centers can deliver the enormous computing power required.

Microsoft now uses the same set of resources it built for OpenAI to train and run its own large AI models, including the new Bing search bot launched last month. It also sells the system to other customers. And as part of Microsoft's expanded partnership with OpenAI, which includes an additional $10 billion investment, the software giant is already working on the next generation of AI supercomputers.

"We didn't want to make it a custom product, it started out as a custom product, but we always managed to make it a generic product so anyone who wanted to train a large language model could take advantage of the same improvements," Guthrie said in an interview. "

Training a massive AI model requires a large pool of interconnected graphics processing units in one place, which is how Microsoft assembled its AI supercomputer. Once a model is in use, answering all of the queries users pose, a workload known as inference, requires a slightly different setup. Microsoft also deploys graphics chips for inference, but those thousands of processors are geographically dispersed across the company's more than 60 data center regions. The company said in a blog post Monday that it is now adding Nvidia's latest H100 graphics chips to its AI workloads, along with the newest version of Nvidia's InfiniBand networking technology to share data faster.
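The split described above, where training needs all GPUs co-located while inference capacity is spread across many regions, can be sketched as a simple routing decision. Everything below (the region names, the latency numbers, and the `pick_inference_region` helper) is illustrative, not Azure's actual routing mechanism.

```python
# Hypothetical sketch of why inference is deployed differently from
# training: each user request is served from the lowest-latency region
# with GPU capacity, whereas training requires one giant co-located
# cluster. Region names and latencies are made up for illustration.

REGION_LATENCY_MS = {  # illustrative round-trip latencies from one user
    "westus": 35,
    "eastus": 80,
    "westeurope": 140,
}

def pick_inference_region(latencies):
    """Route an inference request to the lowest-latency region."""
    return min(latencies, key=latencies.get)

best = pick_inference_region(REGION_LATENCY_MS)
```

The design point: inference traffic is many small independent requests, so it benefits from geographic spread; training is one tightly coupled job, so it cannot be split across distant regions without the inter-chip communication becoming the bottleneck.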

Microsoft Azure cloud services

The new Bing search is currently still in preview, and Microsoft is gradually adding more users from its waiting list. Guthrie's team holds a daily meeting with about two dozen employees it calls the "pit crew," after the mechanics who service race cars mid-race. The team's job is to figure out how to bring more computing capacity online quickly and to fix problems as they pop up.

"It's a lot like a meet-up, like,'Hey, anybody has a good idea, let's put it on the table today and discuss it, figure it out okay, can we save a few minutes here? Can we save a few hours? A few days? Guthrie said.

Cloud services depend on thousands of different parts and items, including server components, piping, concrete for buildings, and various metals and minerals, and a delay or shortage in any one of them, no matter how small, can derail everything. Recently the team had to deal with a shortage of cable trays, the basket-like contraptions that hold the cables coming off machines. So they designed a new cable tray that Microsoft could manufacture itself or buy elsewhere. Guthrie said they are also working out how to squeeze as many servers as possible into existing data centers around the world so they don't have to wait for new buildings.

When OpenAI or Microsoft trains a large AI model, the work is split across all of the GPUs at once, and at certain points the units need to talk to each other to share the work they have done. For the AI supercomputer, Microsoft had to make sure the networking gear that handles communication among all of the chips could carry that load, and it had to develop software that makes the best use of the GPUs and the networking equipment. The company now has software that can train models with tens of trillions of parameters.
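The communication step described above, where GPUs periodically share what they have computed, is commonly implemented as an all-reduce over gradients in data-parallel training (the article does not name the exact collective Microsoft uses, so treat this as a generic illustration). The sketch below simulates it in pure Python, with no real GPUs or networking.

```python
# Simulated all-reduce (hypothetical, pure Python): after a training step,
# every worker's local gradients are averaged element-wise, and every
# worker ends up holding the same averaged result. This is the "share
# what they have done" step in data-parallel training.

def all_reduce_mean(per_gpu_grads):
    """Average gradients across workers; each worker gets the same copy."""
    n = len(per_gpu_grads)
    length = len(per_gpu_grads[0])
    averaged = [sum(g[i] for g in per_gpu_grads) / n for i in range(length)]
    # In a real cluster the network delivers this result to every GPU;
    # here we just hand each simulated worker its own copy.
    return [list(averaged) for _ in per_gpu_grads]

# Two simulated GPUs computed different gradients on different data shards.
synced = all_reduce_mean([[1.0, 2.0], [3.0, 4.0]])
```

At cluster scale this averaging traverses the interconnect on every step, which is why the article stresses that the network fabric (such as InfiniBand) must keep up with tens of thousands of chips.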

Because all of those machines can start up at the same time, Microsoft also had to think about where they are placed and where the power supplies sit. Otherwise, Guthrie says, you end up with the data center version of what happens when you switch on a microwave, a toaster, and a vacuum cleaner in the kitchen at the same time.
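That placement concern can be sketched as a capacity check: assign servers to power circuits so that even a simultaneous peak draw stays within each circuit's rating. The greedy strategy, the wattage figures, and both helper functions below are illustrative assumptions, not Microsoft's actual planning tooling.

```python
# Hypothetical sketch: when every machine can spike to peak power at
# once, placement must keep each circuit's simultaneous draw under its
# rated capacity. All numbers are illustrative.

def can_place(circuit_load_w, circuit_capacity_w, server_peak_w):
    """True if one more server's peak draw still fits on this circuit."""
    return circuit_load_w + server_peak_w <= circuit_capacity_w

def place_servers(circuit_capacities_w, server_peak_w, n_servers):
    """Greedily assign servers to the least-loaded circuit with room.
    Returns the load per circuit, or None if the servers don't fit."""
    loads = [0.0] * len(circuit_capacities_w)
    for _ in range(n_servers):
        candidates = [i for i in range(len(loads))
                      if can_place(loads[i], circuit_capacities_w[i],
                                   server_peak_w)]
        if not candidates:
            return None  # a simultaneous start would trip a breaker
        j = min(candidates, key=lambda i: loads[i])
        loads[j] += server_peak_w
    return loads
```

For example, four 400 W servers spread evenly across two 1000 W circuits, but a third 400 W server on a single 1000 W circuit would exceed its rating, so the placement is rejected.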

Alistair Speirs, Microsoft Azure's director of global infrastructure, said the company also has to make sure it can cool all of those machines and chips, using evaporation, outside air in cooler climates, and high-tech swamp coolers in hot climates.

Guthrie said Microsoft will keep developing custom server and chip designs and finding ways to optimize its supply chain for speed, efficiency, and cost savings.

"The models that amaze the world today are built on supercomputers that we started building a few years ago. The new model will be built on the new supercomputer we are training, which is bigger and more sophisticated. "
