NVIDIA GPU: Difference between revisions

From HPCWIKI

Revision as of 12:22, 5 April 2023

GPU Tensor performance notes for RTX 4090

According to this thread, NVIDIA appears to have cut the tensor FP16 and TF32 operation rate in half, leaving the 4090 with lower tensor FP16 and TF32 performance than the 4080 16GB. This may have been done to keep the 4090 from cannibalizing Quadro/Tesla sales. When choosing a GPU, the 4090 therefore offers more memory but lower tensor performance than the 4080 16GB, even though it has more than twice the ray tracing performance of the 4080 12GB.

RTX 4090 RTX 4080 16GB RTX 4080 12GB RTX 3090 Ti
non-tensor FP32 TFLOPS 82.6 (206%) 48.7 (122%) 40.1 (100%) 40 (100%)
non-tensor FP16 TFLOPS 82.6 (206%) 48.7 (122%) 40.1 (100%) 40 (100%)
Tensor Cores 512 (152%) 304 (90%) 240 (71%) 336 (100%)
Optical flow TOPS 305 (242%) 305 (242%) 305 (242%) 126 (100%)
tensor FP16 w/ FP32 accumulate TFLOPS ** 165.2 (207%) 194.9 (244%) 160.4 (200%) 80 (100%)
tensor TF32 TFLOPS ** 82.6 (207%) 97.5 (244%) 80.2 (200%) 40 (100%)
Ray trace Cores 128 (152%) 76 (90%) 60 (71%) 84 (100%)
Ray trace TFLOPS 191 (245%) 112.7 (144%) 92.7 (119%) 78.1 (100%)
POWER (W) 450 (100%) 320 (71%) 285 (63%) 450 (100%)
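
The percentages in parentheses appear to express each value relative to the RTX 3090 Ti column (100%). A minimal Python sketch of that arithmetic follows, using the ray tracing row copied from the table. The function name relative_percent is illustrative and does not come from the source.

```python
def relative_percent(value: float, baseline: float) -> float:
    """Return value expressed as a percentage of baseline."""
    return 100.0 * value / baseline


# "Ray trace TFLOPS" row from the table; the RTX 3090 Ti is assumed to be the baseline.
ray_trace_tflops = {
    "RTX 4090": 191.0,
    "RTX 4080 16GB": 112.7,
    "RTX 4080 12GB": 92.7,
    "RTX 3090 Ti": 78.1,
}

baseline = ray_trace_tflops["RTX 3090 Ti"]
for name, tflops in ray_trace_tflops.items():
    # Rounded to whole percent, as in the table.
    print(f"{name}: {relative_percent(tflops, baseline):.0f}%")
```

Applying the same function to the other rows is a quick way to check which column a given percentage was computed against.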

NVIDIA GPU Architecture

nvcc sm flags and what they’re used for: When compiling with NVCC[1],

  • the arch flag ('-arch') specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
  • the gencode flag ('-gencode') allows PTX and binary code to be generated for additional architectures and can be repeated many times, once per architecture.
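
As a rough illustration of how repeated '-gencode' options are usually written, the Python sketch below builds an nvcc argument list from a set of SM version numbers. The helper name gencode_flags and the file names kernel.cu and kernel are hypothetical. Only the "-gencode arch=compute_XX,code=sm_XX" option format is taken from nvcc itself.

```python
def gencode_flags(sm_versions, ptx_for_newest=True):
    """Build repeated -gencode options for nvcc from integer SM versions (e.g. 70, 80)."""
    versions = sorted(set(sm_versions))
    flags = []
    for sm in versions:
        # One binary (cubin) target per requested architecture.
        flags += ["-gencode", f"arch=compute_{sm},code=sm_{sm}"]
    if ptx_for_newest and versions:
        newest = versions[-1]
        # Also embed PTX for the newest target so that later GPUs can JIT-compile it.
        flags += ["-gencode", f"arch=compute_{newest},code=compute_{newest}"]
    return flags


# Hypothetical source and output file names, for illustration only.
command = ["nvcc", "kernel.cu", "-o", "kernel"] + gencode_flags([70, 80, 86])
print(" ".join(command))
```

Embedding PTX only for the newest target is one common choice. It keeps the binary smaller than embedding PTX for every architecture while still leaving a path for GPUs released later.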

Matching CUDA arch and CUDA gencode for various NVIDIA architectures

Series Architecture (--arch) CUDA gencode (--sm) Compute Capability Notable Models Supported CUDA version Key Features
Tesla Tesla 1.0, 1.1, 1.2, 1.3 Tesla C870, C1060, M1060 First dedicated GPGPU series
Fermi Fermi sm_20 2.0, 2.1 GTX 400, GTX 500, Tesla 20-series, Quadro 4000/5000 CUDA 3.2 until CUDA 8 First to feature CUDA cores and support for ECC memory
  • SM20 or SM_20, compute_20 – GeForce 400, 500, 600, GT-630. Completely dropped from CUDA 10 onwards.
Kepler Kepler sm_30, sm_35, sm_37 3.0, 3.2, 3.5, 3.7 GTX 600, GTX 700, Tesla K-series, Quadro K-series CUDA 5 until CUDA 10 First to feature Dynamic Parallelism and Hyper-Q
  • SM30 or SM_30, compute_30 – Kepler architecture (e.g. generic Kepler, GeForce 700, GT-730). Adds support for unified memory programming. Completely dropped from CUDA 11 onwards.
  • SM35 or SM_35, compute_35 – Tesla K40. Adds support for dynamic parallelism. Deprecated from CUDA 11, will be dropped in future versions.
  • SM37 or SM_37, compute_37 – Tesla K80. Adds a few more registers. Deprecated from CUDA 11, will be dropped in future versions, strongly suggest replacing with a 32GB PCIe Tesla V100.
Maxwell Maxwell sm_50, sm_52, sm_53 5.0, 5.2, 5.3 GTX 750 Ti, GTX 900, Quadro M-series CUDA 6 until CUDA 11 First to support VR and 4K displays
  • SM50 or SM_50, compute_50 – Tesla/Quadro M series. Deprecated from CUDA 11, will be dropped in future versions, strongly suggest replacing with a Quadro RTX 4000 or A6000.
  • SM52 or SM_52, compute_52 – Quadro M6000, GeForce 900, GTX-970, GTX-980, GTX Titan X.
  • SM53 or SM_53, compute_53 – Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano.
Pascal Pascal sm_60, sm_61, sm_62 6.0, 6.1, 6.2 GTX 1000, Quadro P-series CUDA 8 and later First to support simultaneous multi-projection
  • SM60 or SM_60, compute_60 – Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
  • SM61 or SM_61, compute_61 – GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030 (GP108), GT 1010 (GP108), Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
  • SM62 or SM_62, compute_62 – Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2
Volta Volta sm_70, sm_72 (Xavier) 7.0, 7.2 Titan V, Tesla V100, Quadro GV100 CUDA 9 and later First to feature Tensor Cores and NVLink 2.0
  • SM70 or SM_70, compute_70 – DGX-1 with Volta, Tesla V100, GTX 1180 (GV104), Titan V, Quadro GV100
  • SM72 or SM_72, compute_72 – Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX
Turing Turing sm_75 7.5 RTX 2000, GTX 1600, Quadro RTX CUDA 10 and later First to feature Ray Tracing Cores and RTX technology
  • SM75 or SM_75, compute_75 – GTX/RTX Turing – GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4
  • Turing GPU
Ampere Ampere sm_80, sm_86, sm_87 (Orin) 8.0, 8.6, 8.7 RTX 3000, A-series CUDA 11.1 and later Features third-generation Tensor Cores and more
  • Ampere GPU
  • SM80 or SM_80, compute_80 – NVIDIA A100 (the name “Tesla” has been dropped – GA100), NVIDIA DGX-A100
  • SM86 or SM_86, compute_86 (from CUDA 11.1 onwards) Tesla GA10x cards, RTX Ampere – RTX 3080, GA102 – RTX 3090, RTX A2000, A3000, RTX A4000, A5000, A6000, NVIDIA A40, GA106 – RTX 3060, GA104 – RTX 3070, GA107 – RTX 3050, RTX A10, RTX A16, RTX A40, A2 Tensor Core GPU
  • SM87 or SM_87, compute_87 (from CUDA 11.4 onwards, introduced with PTX ISA 7.4 / Driver r470 and newer) – for Jetson AGX Orin and Drive AGX Orin only

Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.

Lovelace Ada Lovelace[2] sm_89 8.9 GeForce RTX 4070 Ti (AD104), GeForce RTX 4080 (AD103), GeForce RTX 4090 (AD102), Nvidia RTX 6000 Ada Generation (AD102, formerly Quadro), Nvidia L40 (AD102, formerly Tesla) CUDA 11.8 and later, cuDNN 8.6 and later

  • Fourth-Gen Tensor Cores increasing throughput by up to 5X, to 1.4 Tensor-petaFLOPS using the new FP8 Transformer Engine (like H100 model)
  • Third-generation RT Cores have twice the ray-triangle intersection throughput, increasing RT-TFLOP performance by over 2x
  • The new RT Cores also include a new Opacity Micromap (OMM) Engine and a new Displaced Micro-Mesh (DMM) Engine. The OMM Engine enables much faster ray tracing of alpha-tested textures often used for foliage, particles, and fences. The DMM Engine delivers up to 10X faster Bounding Volume Hierarchy (BVH) build time with up to 20X less BVH storage space, enabling real-time ray tracing of geometrically complex scenes
  • Shader Execution Reordering (SER) technology dynamically reorganizes these previously inefficient workloads into considerably more efficient ones. SER can improve shader performance for ray tracing operations by up to 3X, and in-game frame rates by up to 25%.
  • DLSS 3 is a revolutionary breakthrough in AI-powered graphics that massively boosts performance. Powered by the new fourth-gen Tensor Cores and Optical Flow Accelerator on GeForce RTX 40 Series GPUs, DLSS 3 uses AI to create additional high-quality frames
  • Graphics cards built upon the Ada architecture feature new eighth generation NVIDIA Encoders (NVENC) with AV1 encoding, enabling a raft of new possibilities for streamers, broadcasters, and video callers.
  • AV1 encoding is described as 40% more efficient than H.264, allowing users who stream at 1080p to raise their stream resolution to 1440p at the same bitrate and quality.
  • SM89 or SM_89, compute_89 – NVIDIA GeForce RTX 4090, RTX 4080, RTX 6000, Tesla L40
Hopper Hopper sm_90, sm_90a 9.0 CUDA 12 and later TODO
  • SM90 or SM_90, compute_90 – NVIDIA H100 (GH100)
  • SM90a or SM_90a, compute_90a – (for PTX ISA version 8.0) – adds acceleration for features like wgmma and setmaxnreg. This is required for NVIDIA CUTLASS
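
The table above can be read as a lookup from SM target to architecture, and its note that a binary compiled for 8.0 runs as is on 8.6 follows the usual rule that a GPU binary runs on devices with the same major compute capability and an equal or higher minor version. The Python sketch below encodes both ideas. The names SM_BY_ARCH, architecture_of and best_binary_target are illustrative, and architecture-specific targets such as sm_90a are deliberately left out because they do not follow that rule.

```python
# SM targets per architecture, as listed in the table above.
SM_BY_ARCH = {
    "Fermi": [20],
    "Kepler": [30, 35, 37],
    "Maxwell": [50, 52, 53],
    "Pascal": [60, 61, 62],
    "Volta": [70, 72],
    "Turing": [75],
    "Ampere": [80, 86, 87],
    "Ada Lovelace": [89],
    "Hopper": [90],
}


def architecture_of(sm: int) -> str:
    """Return the architecture name for an SM target such as 86."""
    for arch, targets in SM_BY_ARCH.items():
        if sm in targets:
            return arch
    raise KeyError(f"unknown SM target: sm_{sm}")


def best_binary_target(device_sm: int, compiled_sms):
    """Pick the newest compiled binary target the device can run without JIT.

    Assumes binary compatibility within one major version: a binary for X.y
    runs on a device X.z when z >= y. Returns None if nothing matches, in
    which case the application would have to rely on embedded PTX.
    """
    candidates = [
        sm for sm in compiled_sms
        if sm // 10 == device_sm // 10 and sm <= device_sm
    ]
    return max(candidates) if candidates else None


print(architecture_of(86))                   # architecture for sm_86
print(best_binary_target(86, [70, 80]))      # falls back to the sm_80 binary
print(best_binary_target(86, [70, 80, 86]))  # uses the sm_86 binary when it is present
```

The second and third calls illustrate the point about compute capability 8.6: the sm_80 binary is usable, but an explicit sm_86 target is selected when it has been compiled in.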

NVIDIA GPU Models

Model Architecture CUDA Cores Tensor Cores RT Cores Memory Size MIG Memory Type Memory Bandwidth TDP Launch Date
Tesla C870 Tesla 128 No No 1.5 GB GDDR3 GDDR3 76.8 GB/s 105W Jun 2006
Tesla C1060 Tesla 240 No No 4 GB GDDR3 GDDR3 102 GB/s 238W Dec 2008
Tesla M1060 Tesla 240 No No 4 GB GDDR3 GDDR3 102 GB/s 225W Dec 2008
Tesla M2050 Fermi 448 No No 3 GB GDDR5 GDDR5 148 GB/s 225W May 2010
Tesla M2070 Fermi 448 No No 6 GB GDDR5 GDDR5 150 GB/s 225W May 2010
Tesla K10 Kepler 3072 No No 8 GB GDDR5 GDDR5 320 GB/s 225W May 2012
Tesla K20 Kepler 2496 No No 5/6 GB GDDR5 GDDR5 208 GB/s 225W Nov 2012
Tesla K40 Kepler 2880 No No 12 GB GDDR5 GDDR5 288 GB/s 235W Nov 2013
Tesla K80 Kepler 4992 No No 24 GB GDDR5 GDDR5 480 GB/s 300W Nov 2014
Tesla M40 Maxwell 3072 No No 12 GB GDDR5 GDDR5 288 GB/s 250W Nov 2015
Tesla P4 Pascal 2560 No No 8 GB GDDR5 GDDR5 192 GB/s 75W Sep 2016
Tesla P40 Pascal 3840 No No 24 GB GDDR5X GDDR5X 480 GB/s 250W Sep 2016
Tesla V100 Volta 5120 640 No 16/32 GB HBM2 HBM2 900 GB/s 300W May 2017
Tesla T4 Turing 2560 320 No 16 GB
A100 PCIe Ampere 6912 432 No 40 GB HBM2 / 80 GB HBM2 HBM2 1555 GB/s 250W May 2020
A100 SXM4 Ampere 6912 432 No 40 GB HBM2 / 80 GB HBM2 HBM2 1555 GB/s 400W May 2020
A30 Ampere 7424 184 No 24 GB GDDR6 GDDR6 696 GB/s 165W Apr 2021
A40 Ampere 10752 336 No 48 GB GDDR6 GDDR6 696 GB/s 300W Apr 2021
A10 Ampere 10240 320 No 24 GB GDDR6 GDDR6 624 GB/s 150W Mar 2021
A16 Ampere 16384 512 No 48 GB GDDR6 GDDR6 768 GB/s 250W Mar 2021
A100 80GB Ampere 6912 432 No 80 GB HBM2 Up to 7 MIGs @ 10GB HBM2 1935 GB/s 300W Apr 2021
A100 40GB Ampere 6912 432 No 40 GB HBM2 Up to 7 MIGs @ 5GB HBM2 1555 GB/s 250W May 2020
A200 PCIe Ampere 10752 672 Yes 80 GB HBM2 / 160 GB HBM2 HBM2 2050 GB/s 400W Nov 2021
A200 SXM4 Ampere 10752 672 Yes 80 GB HBM2 / 160 GB HBM2 HBM2 2050 GB/s 400W Nov 2021
A5000 Ampere 8192 256 Yes 24 GB GDDR6 GDDR6 768 GB/s 230W Apr 2021
A4000 Ampere 6144 192 Yes 16 GB GDDR6 GDDR6 512 GB/s 140W Apr 2021
A3000 Ampere 3584 112 Yes 24 GB G
Titan RTX Turing 4608 576 Yes 24 GB GDDR6 GDDR6 672 GB/s 280W Dec 2018
GeForce RTX 3090 Ampere 10496 328 Yes 24 GB GDDR6X GDDR6X 936 GB/s 350W Sep 2020
GeForce RTX 3080 Ti Ampere 10240 320 Yes 12 GB GDDR6X GDDR6X 912 GB/s 350W May 2021
GeForce RTX 3080 Ampere 8704 272 Yes 10 GB GDDR6X GDDR6X 760 GB/s 320W Sep 2020
GeForce RTX 3070 Ti Ampere 6144 192 Yes 8 GB GDDR6X GDDR6X 608 GB/s 290W Jun 2021
GeForce RTX 3070 Ampere 5888 184 Yes 8 GB GDDR6 GDDR6 448 GB/s 220W Oct 2020
GeForce RTX 3060 Ti Ampere 4864 152 Yes 8 GB GDDR6 GDDR6 448 GB/s 200W Dec 2020
GeForce RTX 3060 Ampere 3584 112 Yes 12 GB GDDR6 GDDR6 360 GB/s 170W Feb 2021
Quadro RTX 8000 Turing 4608 576 Yes 48 GB GDDR6 GDDR6 624 GB/s 295W Aug 2018
Quadro RTX 6000 Turing 4608 576 Yes 24 GB GDDR6 GDDR6 432 GB/s 260W Aug 2018
Quadro RTX 5000 Turing 3072 384 Yes 16 GB GDDR6 GDDR6 448 GB/s 230W Nov 2018
Quadro RTX 4000 Turing 2304 288 Yes 8 GB GDDR6 GDDR6 416 GB/s 160W Nov 2018
Titan RTX (T-Rex) Turing 4608 576 Yes 24 GB
Titan V Volta 5120 640 12 GB HBM2 HBM2 652.8 GB/s 250W Dec 2017
Tesla V100 (PCIe) Volta 5120 640 16 GB HBM2 HBM2 900 GB/s 250W June 2017
Tesla V100 (SXM2) Volta 5120 640 16 GB HBM2 HBM2 900 GB/s 300W June 2017
Quadro GV100 Volta 5120 640 32 GB HBM2 HBM2 870 GB/s 250W Mar 2018
Tesla GV100 (SXM2) Volta 5120 640 32 GB HBM2 HBM2 900 GB/s 300W Mar 2018
DGX-1 (Volta) Volta 5120 640 16 x 32 GB HBM2 (512 GB total) HBM2 2.7 TB/s 3200W Mar 2018

NVIDIA Features by Architecture[3]

NVIDIA Flagship Gaming GPUs
VideoCardz.com AD102 GA102 TU102 GV100 GP102
Launch Year 2022 2020 2018 2017 2017
Architecture Ada Lovelace Ampere Turing Volta Pascal
Node TSMC 4N SAMSUNG 8N TSMC 12nm TSMC 12nm TSMC 16nm
Die Size 608 mm² 628 mm² 754 mm² 815 mm² 471 mm²
Transistors 76.3B 28.3B 18.6B 21.1B 12.0B
Trans. Density 125.5M TRAN/mm2 45.1M TRAN/mm2 24.7M TRAN/mm2 25.9M TRAN/mm2 25.5M TRAN/mm2
CUDA Cores 18432 10752 4608 5120 3840
Tensor Cores 576 Gen4 336 Gen3 576 Gen2 640 Gen1 none
RT Cores 144 Gen3 84 Gen2 72 Gen1 none none
Memory Bus GDDR6X 384-bit GDDR6X 384-bit GDDR6 384-bit HBM2 3072-bit GDDR5X 384-bit
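
The "Trans. Density" row can be derived from the "Transistors" and "Die Size" rows by simple division. The Python sketch below repeats that calculation with the figures from the table. The names CHIPS and density_millions_per_mm2 are illustrative only.

```python
# (transistor count, die size in mm^2), copied from the table above.
CHIPS = {
    "AD102": (76.3e9, 608),
    "GA102": (28.3e9, 628),
    "TU102": (18.6e9, 754),
    "GV100": (21.1e9, 815),
    "GP102": (12.0e9, 471),
}


def density_millions_per_mm2(transistors: float, die_mm2: float) -> float:
    """Transistor density in millions of transistors per square millimetre."""
    return transistors / die_mm2 / 1e6


for chip, (transistors, die_mm2) in CHIPS.items():
    print(f"{chip}: {density_millions_per_mm2(transistors, die_mm2):.1f}M transistors/mm2")
```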

NVIDIA Grace Architecture

NVIDIA has announced that it will partner with server manufacturers such as HPE, Atos, and Supermicro to build servers around its Arm-based Grace CPU architecture. These servers are expected to be available in the second half of 2023.

Architecture Key Features
Grace CPU-GPU integration, ARM Neoverse CPU, HBM2E memory
900 GB/s memory bandwidth, support for PCIe 5.0 and NVLink
10x performance improvement for certain HPC workloads
Energy efficiency improvements through unified memory space

Reference