2025-04-06 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)11/24 Report--
Microsoft built a dedicated supercomputer for ChatGPT, spending hundreds of millions of dollars on tens of thousands of A100s. Now Google has, for the first time, released the details of its own AI supercomputer: a 4,096-chip system delivers 10x the performance of the previous generation, and Google says the chip is 1.7x faster than the Nvidia A100. In addition, a chip meant to compete with the H100 is reportedly already in development.
Google deployed TPU v4, then its most powerful AI chip, in its own data centers as early as 2020, but it was not until April 4 of this year that Google first published the technical details of the AI supercomputer built on it.

Paper: https://arxiv.org/abs/2304.01433

According to the paper, TPU v4's performance is 2.1 times that of TPU v3, and after integrating 4,096 chips, the supercomputer's performance improves 10x.
Google also claims that its chip is faster and more energy-efficient than the Nvidia A100: for systems of the same size, the paper says TPU v4 delivers 1.7x the performance of the A100 while improving energy efficiency by 1.9x. In addition, Google's supercomputer is roughly 4.3x to 4.5x faster than one built on the Graphcore IPU Bow.
Google showed the TPU v4 package, as well as four packages mounted on a circuit board.

Like TPU v3, each TPU v4 contains two TensorCores (TC). Each TC contains four 128x128 matrix multiply units (MXU), a vector processing unit (VPU) with 128 lanes (16 ALUs per lane), and 16 MiB of vector memory (VMEM). The two TCs share 128 MiB of common memory (CMEM).
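The per-chip peak throughput follows from these figures with a bit of arithmetic; a minimal sketch, assuming a ~1.05 GHz clock (a figure from the TPU v4 paper, not stated in this article):

```python
# Back-of-the-envelope peak throughput for one TPU v4 chip, from the
# specs quoted above. CLOCK_HZ is an assumption, not from this article.
MXU_DIM = 128          # each MXU is a 128x128 systolic array
MXUS_PER_TC = 4        # four MXUs per TensorCore
TCS_PER_CHIP = 2       # two TensorCores per chip
CLOCK_HZ = 1.05e9      # assumed clock frequency

macs_per_cycle = MXU_DIM * MXU_DIM * MXUS_PER_TC * TCS_PER_CHIP
flops_per_cycle = 2 * macs_per_cycle   # one MAC = one multiply + one add
peak_flops = flops_per_cycle * CLOCK_HZ

print(f"~{peak_flops / 1e12:.0f} TFLOPS peak per chip")
```

Under these assumptions the estimate lands at roughly 275 TFLOPS per chip.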
Notably, the A100 became available around the same time as Google's fourth-generation TPU, so how do the two actually compare?
Google demonstrated its fastest MLPerf results on five DSA benchmarks: BERT, ResNet, DLRM, RetinaNet, and Mask R-CNN. Graphcore's IPU submitted results for BERT and ResNet.
The results of the systems on ResNet and BERT are shown below; the dotted lines between points are interpolations based on chip count. The MLPerf results for both TPU v4 and the A100 extend to larger systems than the IPU's (4,096 chips versus 256 chips).

For systems of similar size, TPU v4 is 1.15x faster than the A100 on BERT and about 4.3x faster than the IPU; on ResNet, TPU v4 is 1.67x and about 4.5x faster, respectively. In the MLPerf benchmarks, the A100 also drew 1.3x to 1.9x more power on average.
Can peak FLOPS predict actual performance? Many in machine learning treat peak floating-point operations per second as a good proxy for performance, but in practice it is not. For example, despite only a 1.10x advantage in peak FLOPS, TPU v4 is 4.3x to 4.5x faster than IPU Bow on same-size systems across the two MLPerf benchmarks. In another example, the A100's peak FLOPS is 1.13x that of TPU v4, yet for the same number of chips, TPU v4 is 1.15x to 1.67x faster.
The paper uses the Roofline model to show the relationship between peak FLOPS and memory bandwidth.
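The Roofline model's core idea is that attainable throughput is capped either by peak compute or by memory bandwidth times arithmetic intensity (FLOPs per byte moved). A minimal sketch, with illustrative placeholder numbers rather than the paper's measured values:

```python
import numpy as np

def attainable_flops(intensity, peak_flops, mem_bw_bytes):
    """Roofline: min(compute roof, bandwidth * arithmetic intensity)."""
    return np.minimum(peak_flops, mem_bw_bytes * intensity)

peak = 275e12   # e.g. a chip with 275 TFLOPS peak (illustrative)
bw = 1.2e12     # e.g. 1.2 TB/s memory bandwidth (illustrative)

# Low-intensity kernels sit on the bandwidth slope; high-intensity
# kernels hit the flat compute roof.
for i in [1, 10, 100, 1000]:
    print(f"intensity {i:4d} FLOPs/byte -> "
          f"{attainable_flops(i, peak, bw) / 1e12:.1f} TFLOPS")
```

This is why two chips with similar peak FLOPS can differ widely on real workloads: what matters is where each benchmark sits relative to the ridge point of the roof.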
So why didn't Google compare against Nvidia's latest H100? Google said it did not compare its fourth-generation chip with Nvidia's current flagship because the H100 was built with newer technology and launched after Google's chip.
However, Google hinted that it was working on a new TPU to compete with the Nvidia H100, but did not provide details. Google researcher Jouppi told Reuters that Google has a "production line for future chips."
TPU vs GPU

While ChatGPT and Bard battle it out in the open, two behemoths are working hard behind the scenes to keep them running: GPUs backed by Nvidia's CUDA, and Google's custom TPUs (tensor processing units). In other words, this is no longer just a confrontation between ChatGPT and Bard, but between TPU and GPU, and how efficiently each multiplies matrices.
Thanks to its hardware architecture, Nvidia's GPU is well suited to matrix multiplication, efficiently parallelizing the work across many CUDA cores. That is why, since 2012, training models on GPUs has been the consensus in deep learning, and it has not changed since.
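The reason matrix multiplication parallelizes so well is that each piece of the output depends on no other piece. A toy sketch of that idea (real GPUs tile far more finely than whole rows, and this thread-pool version is an illustration, not how CUDA works):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def matmul_rows_parallel(A, B):
    """Compute C = A @ B with each output row computed independently."""
    C = np.empty((A.shape[0], B.shape[1]))
    def one_row(i):
        C[i] = A[i] @ B   # row i reads A[i] and B only; no row-to-row deps
    with ThreadPoolExecutor() as pool:
        list(pool.map(one_row, range(A.shape[0])))
    return C

A, B = np.random.rand(64, 32), np.random.rand(32, 16)
assert np.allclose(matmul_rows_parallel(A, B), A @ B)
```

Because every output element is independent, thousands of cores can work on one multiplication at once, which is exactly the workload both GPUs and TPUs are built around.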
With the introduction of NVIDIA DGX, Nvidia can provide a one-stop hardware and software solution for almost all AI tasks, something competitors cannot offer for lack of the intellectual property.
Google, by contrast, launched its first-generation tensor processing unit (TPU) in 2016, pairing a custom ASIC (application-specific integrated circuit) optimized for tensor computation with its own TensorFlow framework. This gives the TPU an advantage in AI computing tasks beyond matrix multiplication, and even speeds up fine-tuning and inference.
In addition, researchers at Google DeepMind have found a way to discover better matrix multiplication algorithms: AlphaTensor.
However, even though Google has achieved strong results with in-house technology and emerging AI-compute optimizations, the long-standing, deep collaboration between Microsoft and Nvidia, drawing on each company's accumulated strengths, has expanded both sides' competitive advantage.
The fourth-generation TPU

The story goes back to the 2021 Google I/O conference, where Sundar Pichai first announced Google's latest AI chip, TPU v4.

"This is the fastest system we have ever deployed at Google, and a historic milestone for us."
Such chips have become a key point of competition among companies building AI supercomputers, as large language models such as Google's Bard and OpenAI's ChatGPT have exploded in parameter count. These models are far larger than any single chip can store; they are a huge "black hole" for compute. They must therefore be distributed across thousands of chips, which then work together for weeks or more to train the model.
PaLM, Google's largest publicly disclosed language model at 540 billion parameters, was trained over 50 days split across two 4,000-chip supercomputers.
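Why a single chip cannot hold such a model is simple arithmetic. A rough sketch, assuming bf16 weights (2 bytes per parameter) and 32 GiB of HBM per TPU v4 chip; both figures are assumptions about the hardware, not numbers from this article:

```python
# How much memory do 540B parameters need, versus one chip's HBM?
params = 540e9                 # PaLM's parameter count, from the article
bytes_per_param = 2            # assumed bf16 weights; optimizer state adds more
hbm_per_chip = 32 * 2**30      # assumed 32 GiB of HBM per TPU v4 chip

model_bytes = params * bytes_per_param
chips_for_weights = model_bytes / hbm_per_chip
print(f"weights alone need ~{model_bytes / 2**40:.2f} TiB, "
      f"i.e. >= {chips_for_weights:.0f} chips just to hold them")
```

Weights alone already need dozens of chips; optimizer state and activations during training multiply that several-fold, and the need for throughput pushes the count into the thousands.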
Google says its supercomputer can easily reconfigure the connections between chips, routing around problems and tuning for performance. Google researcher Norm Jouppi and Google Distinguished Engineer David Patterson wrote in a blog post about the system:

"Circuit switching makes it easy to bypass failed components. This flexibility even allows us to change the topology of the supercomputer interconnect to accelerate the performance of machine learning models."
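The idea behind bypassing failed components can be pictured by treating the interconnect as a graph and routing around a dead node. A toy sketch of the concept (a small ring, not Google's actual 3D torus topology or switching hardware):

```python
from collections import deque

def reachable(links, src, dst, failed=frozenset()):
    """BFS from src to dst, skipping chips in the `failed` set."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in links.get(node, []):
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                queue.append(nxt)
    return False

# A small ring of 6 "chips": traffic can flow either way around.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
assert reachable(ring, 0, 3)                     # healthy ring
assert reachable(ring, 0, 3, failed={1})         # route around failed chip 1
assert not reachable(ring, 0, 3, failed={1, 5})  # both directions cut
```

With circuit switching, the fabric can be rewired so a failed chip simply disappears from the topology, instead of taking a whole training job down with it.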
Although Google has only now released details about the supercomputer, it has been online since 2020 in a data center in Mayes County, Oklahoma.
Google said Midjourney used the system to train its model, and the latest V5 version has shown everyone how stunning its images can be.
Recently, Pichai told The New York Times that Bard will be moved from LaMDA to PaLM.
Now with the blessing of TPU v4, Bard will only become stronger.
References:
https://www.reuters.com/technology/google-says-its-ai-supercomputer-is-faster-greener-than-nvidia-2023-04-05/
https://analyticsindiamag.com/forget-chatgpt-vs-bard-the-real-battle-is-gpus-vs-tpus/
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).