
Comparison of NVIDIA Tesla/Quadro and GeForce GPUs


This resource was written by Microway based on data provided by NVIDIA and trusted media sources. All NVIDIA GPUs support general-purpose computing (GPGPU), but not all GPUs offer the same performance or support the same features. The consumer GeForce line, especially the GTX Titan cards, may look attractive to those running GPU-accelerated applications. However, it is wise to keep the differences between the product lines in mind: the professional Tesla and Quadro GPUs provide many capabilities that the consumer cards lack.

FP64 64-bit (double-precision) floating-point computation

Many applications require more accurate mathematical calculations. In these applications, data is represented by values twice as wide (using 64 bits instead of 32 bits). These larger values are called double-precision (64-bit); the less accurate values are called single-precision (32-bit). Although almost all NVIDIA GPU products support both single- and double-precision computing, double-precision performance is much lower on most consumer-grade GeForce GPUs. The following compares double-precision floating-point performance between GeForce and Tesla/Quadro GPUs; a short precision illustration follows the table.

NVIDIA GPU model: double-precision (64-bit) floating-point performance
GeForce GTX Titan X (Maxwell): up to 0.206 TFLOPS
GeForce GTX 1080 Ti: up to 0.355 TFLOPS
GeForce Titan Xp: up to 0.380 TFLOPS
GeForce Titan V: up to 6.875 TFLOPS
GeForce RTX 2080 Ti: estimated ~0.44 TFLOPS
Tesla K80: 1.87+ TFLOPS
Tesla P100*: 4.7 to 5.3 TFLOPS
Quadro GP100: 5.2 TFLOPS
Tesla V100*: 7 to 7.8 TFLOPS
Quadro GV100: 7.4 TFLOPS
Quadro RTX 6000 and 8000: ~0.5 TFLOPS
Tesla T4: estimated ~0.25 TFLOPS

* The exact value depends on the PCI-Express or SXM2 SKU.
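As a minimal illustration of the precision difference (a pure NumPy sketch of our own, no GPU required), a small term that survives 64-bit arithmetic is lost entirely in 32-bit:

    import numpy as np

    # float32 carries roughly 7 decimal digits of precision, so adding
    # 1e-8 to 1.0 rounds away; float64 (about 16 digits) retains it.
    x32 = np.float32(1.0) + np.float32(1e-8)
    x64 = np.float64(1.0) + np.float64(1e-8)

    print(x32 == 1.0)  # True: the small term is lost in single precision
    print(x64 == 1.0)  # False: double precision keeps the small term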

FP16 16-bit (half-precision) floating-point computation

Some applications do not require high precision (for example, neural network training/inference and certain HPC uses). Support for half-precision FP16 operations was introduced with the "Pascal" GPUs. This was previously the standard for deep learning / AI computation; however, deep learning workloads have since shifted to more complex operations (see Tensor Cores below). Although all NVIDIA "Pascal" and later GPUs support FP16, performance is significantly reduced on many gaming-focused GPUs. The following compares half-precision floating-point performance between GeForce and Tesla/Quadro GPUs; a short FP16 illustration follows the table.

NVIDIA GPU model: half-precision (16-bit) floating-point performance
GeForce GTX Titan X (Maxwell): N/A
GeForce GTX 1080 Ti: less than 0.177 TFLOPS
GeForce Titan Xp: less than 0.190 TFLOPS
GeForce Titan V: 27.5 TFLOPS
GeForce RTX 2080 Ti: 28.5 TFLOPS
Tesla K80: N/A
Tesla P100*: 18.7 to 21.2 TFLOPS
Quadro GP100: 20.7 TFLOPS
Tesla V100*: 28 to 31.4 TFLOPS
Quadro GV100: 29.6 TFLOPS
Quadro RTX 6000 and 8000: 32.6 TFLOPS
Tesla T4: 16.2 TFLOPS

* The exact value depends on the PCI-Express or SXM2 SKU.
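Similarly, a small NumPy sketch (CPU-only, illustrative rather than a performance measurement) shows FP16's limits: roughly three decimal digits of precision and a maximum value of 65504:

    import numpy as np

    # float16 spacing near 1.0 is 2**-10, so 1.0 + 0.001 is rounded to the
    # nearest representable value rather than stored exactly.
    print(np.float16(1.0) + np.float16(0.001))   # ~1.001 (rounded)

    # Values beyond float16's maximum (65504) overflow to infinity.
    print(np.float16(70000))                     # inf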

TensorFLOPS and deep learning performance

A new specialized Tensor Core unit was introduced with the "Volta" GPUs. It combines two FP16 inputs (multiplied into a full-precision product) with FP32 accumulation: precisely the operation used in deep learning training calculations. NVIDIA now rates Tensor Core GPUs with a new deep learning performance metric, a new unit called TensorTFLOPS.
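A minimal NumPy sketch (CPU-only; the matrices are arbitrary examples of ours) of the arithmetic pattern just described, with FP16 inputs, full-precision products, and FP32 accumulation (D = A x B + C):

    import numpy as np

    A = np.random.rand(4, 4).astype(np.float16)   # FP16 input matrices
    B = np.random.rand(4, 4).astype(np.float16)
    C = np.zeros((4, 4), dtype=np.float32)        # FP32 accumulator

    # Products are formed at full precision and accumulated in FP32,
    # mirroring the Tensor Core operation D = A*B + C.
    D = A.astype(np.float32) @ B.astype(np.float32) + C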

Tensor Cores are available only in "Volta" GPUs or later. For reference, where no TensorTFLOPS value exists, we list the maximum known deep learning performance at any precision. We believe that comparing performance across different precisions is poor scientific practice; however, we recognize the desire for at least a rough comparison of deep learning performance across GPU generations.

NVIDIA GPU model: TensorFLOPS (or maximum DL performance)
GeForce GTX Titan X (Maxwell): N/A TensorTFLOPS, ~6.1 TFLOPS FP32
GeForce GTX 1080 Ti: N/A TensorTFLOPS, ~11.3 TFLOPS FP32
GeForce Titan Xp: N/A TensorTFLOPS, ~12.1 TFLOPS FP32
GeForce Titan V: 110 TensorTFLOPS
GeForce RTX 2080 Ti: 56.9 TensorTFLOPS; 455.4 TOPS INT4 for inference
Tesla K80: N/A TensorTFLOPS, 5.6 TFLOPS FP32
Tesla P100*: N/A TensorTFLOPS, 18.7 to 21.2 TFLOPS FP16
Quadro GP100: N/A TensorTFLOPS, 20.7 TFLOPS FP16
Tesla V100*: 112 to 125 TensorTFLOPS
Quadro GV100: 118.5 TensorTFLOPS
Quadro RTX 6000 and 8000: 130.5 TensorTFLOPS; 522 TOPS INT4 for inference
Tesla T4: 65 TensorTFLOPS; 260 TOPS INT4 for inference

* The exact value depends on the PCI-Express or SXM2 SKU.

Error detection and correction

On a GPU running a computer game, a memory error usually causes no visible problem (for example, one pixel in one frame may be the wrong color); the user is unlikely even to notice. However, technical computing applications rely on the accuracy of the data returned by the GPU. For some applications, a single error can lead to a serious, obvious failure; for others, single-bit errors may not be easy to detect (plausible but incorrect results are returned). Titan GPUs include no error correction or error detection: if an error occurs, neither the GPU nor the system warns the user, and it is up to the user to catch it (whether through an application crash, obviously incorrect data, or subtly incorrect data). Such problems are not uncommon: our technicians regularly encounter memory errors on consumer gaming GPUs. NVIDIA Tesla GPUs can correct single-bit errors and detect and warn on double-bit errors. On the latest Tesla V100, Tesla T4, Tesla P100, and Quadro GV100/GP100 GPUs, ECC support covers not only the main HBM2 memory but also the register files, shared memory, L1 cache, and L2 cache.

Warranty and end-user license agreement

NVIDIA's warranty on GeForce GPU products explicitly states that GeForce products are not intended for installation in servers. Running a GeForce GPU in a server system voids the GPU's warranty and is at your own risk. From NVIDIA's manufacturer warranty website:

The warranted product is intended for consumer end-user use only, and is not intended for data center use and/or GPU cluster commercial deployments ("Enterprise Use"). Any use of the warranted product for Enterprise Use shall void this warranty.

The license agreement that ships with the driver software for NVIDIA GeForce products states:

No data center deployment. The software is not licensed for data center deployment, except that blockchain processing in a data center is permitted.

GPU memory performance

Compute-intensive applications require high-performance compute units, but fast access to data is equally critical. For many HPC applications, an improvement in compute performance does not help unless memory performance improves as well. For this reason, Tesla GPUs deliver better real-world performance than GeForce GPUs:

NVIDIA GPU model: GPU memory bandwidth
GeForce GTX Titan X (Maxwell): 336 GB/s
GeForce GTX 1080 Ti: 484 GB/s
GeForce Titan Xp: 548 GB/s
GeForce Titan V: 653 GB/s
GeForce RTX 2080 Ti: 616 GB/s
Tesla K80: 480 GB/s
Tesla P40: 346 GB/s
Tesla P100 12GB: 549 GB/s
Tesla P100 16GB: 732 GB/s
Quadro GP100: 717 GB/s
Tesla V100 16GB/32GB: 900 GB/s
Quadro GV100: 870 GB/s
Quadro RTX 6000 and 8000: 624 GB/s
Tesla T4: 320 GB/s

GPU memory size

In general, the more memory a system has, the faster it runs. For some HPC applications, a single run is not even possible without sufficient memory; for others, insufficient memory reduces the quality and fidelity of the results. Tesla GPUs offer up to twice the memory of GeForce GPUs:

GPU model: memory capacity
GeForce GTX 1080 Ti: 11 GB
GeForce Titan Xp: 12 GB
GeForce Titan V: 12 GB
GeForce RTX 2080 Ti: 11 GB
Tesla K80: 24 GB
Tesla P40: 24 GB
Tesla P100: 12 GB or 16 GB*
Quadro GP100: 16 GB*
Tesla V100: 16 GB or 32 GB*
Quadro GV100: 32 GB*
Quadro RTX 6000: 24 GB*
Quadro RTX 8000: 48 GB*
Tesla T4: 16 GB*

* Note that Tesla/Quadro unified memory allows GPUs to share each other's memory in order to load larger datasets.

PCI-E and NVLink: device-to-host and device-to-device throughput

One of the biggest potential bottlenecks is waiting for data to be transferred to the GPU, and additional bottlenecks appear when multiple GPUs run in parallel. Faster data transfer translates directly into faster application performance. GeForce GPUs connect via PCI-Express, with a theoretical peak throughput of 16 GB/s. NVIDIA Tesla/Quadro GPUs with NVLink provide faster connectivity: NVLink on NVIDIA's "Pascal" generation allows each GPU to communicate at up to 80 GB/s (160 GB/s bidirectional), and NVLink 2.0 on NVIDIA's "Volta" generation allows up to 150 GB/s (300 GB/s bidirectional). NVLink connections are supported between GPUs, and between CPUs and GPUs on supported OpenPOWER platforms.
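As a back-of-the-envelope sketch using the theoretical peak rates above (the 4 GB dataset size is our own example; real-world throughput is lower), here is the time to move a dataset over each link:

    # Time to move 4 GB over each interconnect at its theoretical peak rate.
    data_gb = 4.0
    for link, gb_per_s in [("PCI-Express 3.0 x16", 16.0),
                           ("NVLink (Pascal)", 80.0),
                           ("NVLink 2.0 (Volta)", 150.0)]:
        print(f"{link:20s} {data_gb / gb_per_s * 1000:6.1f} ms")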

Application software support

While some software packages can run on any GPU that supports CUDA, others are designed and optimized for the professional GPU series. Most professional packages officially support only NVIDIA Tesla and Quadro GPUs. Using a GeForce GPU may be possible, but the software vendor will not support it. In other cases, the application will not run at all when started on a GeForce GPU (for example, the software products of Schrödinger, LLC).

Operating system support

Although NVIDIA's GPU drivers are quite flexible, there are no GeForce drivers for the Windows Server operating systems. GeForce GPUs are supported only on Windows 7, Windows 8, and Windows 10. Groups using Windows Server should use NVIDIA's professional Tesla and Quadro GPU products. The Linux drivers, on the other hand, support all NVIDIA GPUs.

Product life cycle

Due to the nature of the consumer GPU market, GeForce products have a relatively short life cycle (usually no more than a year between product release and end of production). Projects that require a longer product life (such as those that may need replacement parts more than three years after purchase) should use professional GPUs. NVIDIA's professional Tesla and Quadro products have an extended life cycle and long-term manufacturer support (including end-of-life notices and final-buy opportunities before production stops). In addition, professional GPUs undergo a more thorough testing and validation process during production.

Power efficiency

GeForce GPUs are intended for consumer gaming and are not usually designed with power efficiency in mind. By contrast, Tesla GPUs are designed for large-scale deployments, where power efficiency matters greatly; this makes Tesla GPUs the better choice for large installations. For example, a GeForce GTX Titan X is well suited to desktop deep learning workloads, while in server deployments the Tesla P40 provides matching performance and double the memory capacity. Placed side by side, however, the Tesla consumes less power and generates less heat.

DMA engine

A GPU's direct memory access (DMA) engines allow fast data transfer between system memory and GPU memory. Because such transfers are part of any real-world application, their performance is critical to GPU acceleration. Slow transfers leave the GPU cores idle until the data arrives in GPU memory; likewise, slow returns leave the CPU waiting until the GPU has finished and handed back its results.

GeForce products have a single DMA engine* that can transfer data in only one direction at a time. While data is being uploaded to the GPU, any results computed by the GPU cannot be returned until the upload completes; likewise, results coming back from the GPU block any new data that needs to be uploaded. Tesla GPU products include dual DMA engines that remove this bottleneck: data can be transferred into and out of the GPU simultaneously, as sketched below.

* One GeForce GPU model, the GeForce GTX Titan X, offers dual DMA engines.
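A minimal sketch (assuming the numba and numpy packages and a CUDA-capable GPU; the array sizes are arbitrary) of how dual DMA engines are exploited: an upload and a download issued on separate CUDA streams can overlap on dual-engine hardware, while a single engine serializes them:

    import numpy as np
    from numba import cuda

    upload_stream = cuda.stream()
    download_stream = cuda.stream()

    # Pinned (page-locked) host memory is required for truly async copies.
    host_in = cuda.pinned_array(1_000_000, dtype=np.float32)
    host_out = cuda.pinned_array(1_000_000, dtype=np.float32)
    host_in[:] = 1.0

    dev_in = cuda.to_device(host_in, stream=upload_stream)    # host -> device
    dev_out = cuda.device_array(1_000_000, dtype=np.float32,
                                stream=download_stream)
    dev_out.copy_to_host(host_out, stream=download_stream)    # device -> host
    # (dev_out is uninitialized here; the copies are purely illustrative.)

    # With two DMA engines the copies above can proceed concurrently;
    # with one engine they execute back to back.
    cuda.synchronize()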

GPU Direct RDMA

NVIDIA's GPU-Direct technology can dramatically improve data transfer speeds between GPUs. A variety of capabilities fall under the GPU-Direct umbrella, but the RDMA capability provides the largest performance gain. Traditionally, sending data between the GPUs of a cluster required three memory copies (once to the GPU's system memory, once to the CPU's system memory, and once to the InfiniBand driver's memory). GPU Direct RDMA removes the system-memory copies, allowing the GPU to send data directly over InfiniBand to a remote system. In practice, this has reduced latency by up to 67% and increased bandwidth by 430% for small MPI message sizes [1]. In CUDA 8.0, NVIDIA introduced GPU Direct RDMA ASYNC, which allows the GPU to initiate RDMA transfers without any interaction with the CPU.

GeForce GPUs do not support GPU Direct RDMA. Although the MPI calls will still return successfully, the transfers will go through the standard memory-copy paths. The only form of GPU Direct supported by GeForce cards is GPU Direct Peer-to-Peer (P2P), which allows fast transfers within a single computer but does nothing for applications that run across multiple servers or compute nodes. Tesla GPUs fully support GPU Direct RDMA and the various other GPU Direct capabilities; they are the primary target of these features, and so are the most thoroughly tested and most frequently used in the field.

Hyper-Q

The Hyper-Q proxy for MPI and CUDA Streams allows multiple CPU threads or processes to launch work on a single GPU. This is particularly important for existing parallel applications written with MPI, since such code was designed to take advantage of multiple CPU cores. Allowing the GPU to accept work from every MPI thread running on the system can deliver a potentially significant performance boost, and it also reduces the amount of source-code re-architecting required to add GPU acceleration to an existing application. However, the only form of Hyper-Q supported by GeForce GPUs is Hyper-Q for CUDA Streams. This allows GeForce cards to efficiently accept and run parallel work from different CPU cores, but applications running across multiple computers cannot efficiently launch work on the GPU.
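A minimal sketch (assuming numba and a CUDA-capable GPU; the kernel and sizes are our own) of the CUDA Streams form of Hyper-Q, in which several CPU threads submit work to one GPU on independent streams that the hardware can schedule concurrently:

    import threading
    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(arr, factor):
        i = cuda.grid(1)
        if i < arr.size:
            arr[i] *= factor

    def worker(factor):
        stream = cuda.stream()                 # one stream per CPU thread
        dev = cuda.to_device(np.ones(1 << 20, dtype=np.float32), stream=stream)
        scale[4096, 256, stream](dev, factor)  # 4096 blocks x 256 threads
        stream.synchronize()

    threads = [threading.Thread(target=worker, args=(f,)) for f in (2.0, 3.0, 4.0)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()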

GPU health monitoring and management functions

Many health monitoring and GPU management features, critical to maintaining systems with multiple GPUs, are supported only on the professional Tesla GPUs. Health features not supported on GeForce GPUs include:

- NVML/nvidia-smi: monitors and manages the state and capabilities of each GPU. This enables GPU support in many third-party applications and tools such as Ganglia. Perl and Python bindings are also available.
- OOB (out-of-band monitoring via IPMI): allows the system to monitor GPU health, adjust fan speeds to properly cool the devices, and send alerts when a problem is found.
- InfoROM (persistent configuration and state data): provides the system with additional data about each GPU.
- NVHealthmon utility: provides cluster administrators with a ready-to-use GPU health tool.
- TCC: allows a GPU to be specifically set to display-only or compute-only modes.
- ECC: memory error detection and correction.

Cluster tools rely on the capabilities provided by NVIDIA NVML; roughly 60% of those capabilities are unavailable on GeForce. The table below gives a more detailed comparison of the NVML features supported on Tesla and GeForce GPUs (a short monitoring sketch follows the table notes):

Feature: Tesla / GeForce
Product name: yes / yes
Show GPU count: yes / yes
PCI-Express generation (e.g., 2.0 vs. 3.0): yes / no
PCI-Express link width (e.g., x4, x8, x16): yes / no
Current fan speed: yes / yes
Current temperature: yes / yes*
Current performance state: yes / no
Clock throttling status: yes / no
Current GPU utilization (percent): yes / yes
Current memory utilization (percent): yes / yes
GPU Boost capability: yes / yes^
ECC error detection/correction support: yes / no
List retired pages: yes / no
Current power draw: yes / no
Set power limit: yes / no
Current GPU clock speed: yes / no
Current memory clock speed: yes / no
Show available clock speeds: yes / no
Show available memory speeds: yes / no
Set GPU Boost speed (core clock and memory clock): yes / no
Show current compute processes: yes / no
Card serial number: yes / no
InfoROM image and objects: yes / no
Accounting capability (per-process resource usage): yes / no
PCI-Express IDs: yes / yes
NVIDIA driver version: yes / yes
NVIDIA VBIOS version: yes / yes

* The temperature cannot be read by the system platform, which means fan speeds cannot be adjusted accordingly.

^ GPU Boost is disabled during double-precision calculations. In addition, GeForce clock speeds are automatically reduced in certain situations.
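A minimal sketch (assuming the pynvml package, the Python bindings for NVML, plus an installed NVIDIA driver) of the per-GPU health queries above; several of these calls raise "not supported" errors on GeForce cards:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    print("Name:       ", pynvml.nvmlDeviceGetName(handle))
    print("Temperature:", pynvml.nvmlDeviceGetTemperature(
        handle, pynvml.NVML_TEMPERATURE_GPU), "C")

    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print("GPU util:   ", util.gpu, "%")
    print("Memory util:", util.memory, "%")

    # Power draw is reported in milliwatts; unsupported on most GeForce cards.
    print("Power draw: ", pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0, "W")

    pynvml.nvmlShutdown()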

GPU Boost

All the latest NVIDIA GPU products support GPU Boost, but their implementations differ according to the intended usage scenario. GeForce cards are built for interactive desktop use and gaming; Tesla GPUs are built for intensive, constant number crunching with high stability and reliability. Given the difference between the two use cases, GPU Boost works differently on Tesla than on GeForce.

How GPU Boost works on GeForce

On GeForce, the video card automatically determines its clock speed and voltage based on the GPU's temperature. Temperature is the appropriate independent variable, since heat generation affects fan speed. For less graphically demanding games or general desktop use, the end user enjoys a quieter computing experience. When running games that demand heavy GPU computation, however, GPU Boost automatically raises the voltage and clock speeds (and produces more noise).

How GPU Boost works on Tesla

On Tesla, by contrast, the GPU Boost level can likewise be determined by voltage and temperature, but it does not have to operate that way.

If preferred, the boost can instead be specified by the system administrator or the computational user: the desired clock speed can be set to a specific frequency. Rather than floating among various clock levels, the desired clock speed can be held statically unless the power-consumption threshold (TDP) is reached. This is an important consideration, because accelerators in HPC environments often need to stay synchronized with one another. The optional deterministic aspect of Tesla GPU Boost allows system administrators to determine the optimal clock speed and lock it in across all GPUs.
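A hedged sketch (again assuming pynvml and a Tesla-class GPU; setting application clocks requires administrator privileges and is rejected on GeForce) of locking the clocks to one deterministic pair, as described above:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Query the supported clock pairs, then pin the card to one of them
    # instead of letting GPU Boost float the clocks.
    mem_clocks = pynvml.nvmlDeviceGetSupportedMemoryClocks(handle)
    gfx_clocks = pynvml.nvmlDeviceGetSupportedGraphicsClocks(handle, mem_clocks[0])
    pynvml.nvmlDeviceSetApplicationsClocks(handle, mem_clocks[0], gfx_clocks[0])

    pynvml.nvmlShutdown()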

For applications that require both additional performance and determinism, the latest Tesla GPUs can be set to auto boost within synchronous boost groups. With auto boost enabled, each group of GPUs raises its clock speeds when headroom permits, and the group keeps its clocks synchronized with one another to ensure matching performance across the group. Groups are set up in NVIDIA's DCGM tool.

Source: https://www.microway.com/knowledge-center-articles/comparison-of-nvidia-geforce-gpus-and-nvidia-tesla-gpus/
