NVIDIA GPU

From HPCWIKI
Revision as of 15:09, 19 March 2023 by Admin (talk | contribs)

NVIDIA GPU Architecture

nvcc sm flags and what they are used for: when compiling with NVCC[1],

  • the arch flag ('-arch') specifies the virtual GPU architecture (e.g. compute_70) that the CUDA source is compiled for.
  • the gencode flag ('-gencode') embeds code for additional real and virtual architectures in the same binary and can be repeated for different architectures.

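As a minimal sketch of how these flags are typically combined, the hypothetical helper below (not part of any NVIDIA tool) builds a '-gencode' flag per target architecture, plus a trailing PTX entry for the newest one so the binary can still be JIT-compiled on future GPUs. The compute capabilities used are examples only.

```python
def nvcc_gencode_flags(compute_capabilities):
    """Return nvcc -gencode flags for capability strings like ["70", "80", "86"]."""
    flags = []
    for cc in compute_capabilities:
        # SASS (real machine code) for each listed architecture
        flags.append(f"-gencode=arch=compute_{cc},code=sm_{cc}")
    # PTX for the newest listed architecture, for forward compatibility
    newest = compute_capabilities[-1]
    flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags

# Example: print a full compile command (kernel.cu is a placeholder file name)
print(" ".join(["nvcc"] + nvcc_gencode_flags(["70", "80", "86"]) + ["kernel.cu", "-o", "kernel"]))
```

Dropping the final PTX entry makes the binary smaller but ties it to the architectures it was compiled for.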

Matching CUDA arch and CUDA gencode for various NVIDIA architectures

Series | Architecture (-arch) | Gencode (sm) | Compute Capability | Notable Models | Key Features
Tesla | Tesla | sm_10–sm_13 | 1.0–1.3 | C870, C1060, M1060 | First dedicated GPGPU series
Fermi† | Fermi | sm_20, sm_21 | 2.0, 2.1 | GTX 400, GTX 500, Tesla 20-series, Quadro 4000/5000 | First to support ECC memory
Kepler† | Kepler | sm_30, sm_35, sm_37 | 3.0, 3.5, 3.7 | GTX 600, GTX 700, Tesla K-series, Quadro K-series | First to feature Dynamic Parallelism and Hyper-Q
Maxwell‡ | Maxwell | sm_50, sm_52, sm_53 | 5.0, 5.2, 5.3 | GTX 750, GTX 900, Quadro M-series | First to support VR and 4K displays
Pascal | Pascal | sm_60, sm_61, sm_62 | 6.0, 6.1, 6.2 | GTX 1000, Tesla P-series, Quadro P-series | First to support simultaneous multi-projection
Volta | Volta | sm_70, sm_72 (Xavier) | 7.0, 7.2 | Titan V, Tesla V100, Quadro GV100 | First to feature Tensor Cores and NVLink 2.0
Turing | Turing | sm_75 | 7.5 | RTX 2000, GTX 1600, Quadro RTX | First to feature Ray Tracing Cores and RTX technology
Ampere | Ampere | sm_80, sm_86, sm_87 (Orin) | 8.0, 8.6, 8.7 | RTX 3000, A-series | Third-generation Tensor Cores and more
Lovelace | Ada Lovelace[2] | sm_89 | 8.9 | GeForce RTX 4070 Ti (AD104), GeForce RTX 4080 (AD103), GeForce RTX 4090 (AD102), RTX 6000 Ada Generation (AD102, formerly Quadro), L40 (AD102, formerly Tesla) | See key features below

† Fermi support was dropped in CUDA 9; Kepler is deprecated from CUDA 11 onwards
‡ Maxwell is deprecated from CUDA 11.6 onwards
Ada Lovelace key features:

  • Fourth-generation Tensor Cores increase throughput by up to 5X, to 1.4 Tensor-petaFLOPS, using the new FP8 Transformer Engine (as on the H100)
  • Third-generation RT Cores have twice the ray-triangle intersection throughput, increasing RT-TFLOP performance by over 2X
  • The new RT Cores also include an Opacity Micromap (OMM) Engine and a Displaced Micro-Mesh (DMM) Engine. The OMM Engine enables much faster ray tracing of the alpha-tested textures often used for foliage, particles, and fences. The DMM Engine delivers up to 10X faster Bounding Volume Hierarchy (BVH) build times with up to 20X less BVH storage, enabling real-time ray tracing of geometrically complex scenes
  • Shader Execution Reordering (SER) dynamically reorganizes previously inefficient ray-tracing workloads into considerably more efficient ones. SER can improve shader performance for ray-tracing operations by up to 3X, and in-game frame rates by up to 25%
  • DLSS 3, powered by the fourth-generation Tensor Cores and the Optical Flow Accelerator on GeForce RTX 40 Series GPUs, uses AI to generate additional high-quality frames, massively boosting performance
  • Graphics cards built on the Ada architecture feature the eighth-generation NVIDIA Encoder (NVENC) with AV1 encoding, opening new possibilities for streamers, broadcasters, and video callers. AV1 is about 40% more efficient than H.264, allowing users streaming at 1080p to increase their stream resolution to 1440p at the same bitrate and quality
Hopper | Hopper | sm_90, sm_90a | 9.0 | H100 | Fourth-generation Tensor Cores with FP8 Transformer Engine
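To pick the right gencode value from the table above, the compute capability reported by the driver (e.g. via cudaGetDeviceProperties or `nvidia-smi --query-gpu=compute_cap`) can be converted mechanically; this helper name is illustrative, not part of any NVIDIA tool.

```python
def compute_cap_to_sm(cc: str) -> str:
    """Convert a compute capability string like "8.6" to its sm code, "sm_86"."""
    major, minor = cc.split(".")
    return f"sm_{major}{minor}"

print(compute_cap_to_sm("8.6"))  # Ampere, e.g. RTX 3080
print(compute_cap_to_sm("9.0"))  # Hopper, e.g. H100
```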

NVIDIA GPU Models

Model | Architecture | CUDA Cores | Tensor Cores | RT Cores | Memory | Memory Type | Bandwidth | TDP | Launch
Tesla C870 | Tesla | 128 | No | No | 1.5 GB | GDDR3 | 76.8 GB/s | 105W | Jun 2007
Tesla C1060 | Tesla | 240 | No | No | 4 GB | GDDR3 | 102 GB/s | 238W | Dec 2008
Tesla M1060 | Tesla | 240 | No | No | 4 GB | GDDR3 | 102 GB/s | 225W | Dec 2008
Tesla M2050 | Fermi | 448 | No | No | 3 GB | GDDR5 | 148 GB/s | 225W | May 2010
Tesla M2070 | Fermi | 448 | No | No | 6 GB | GDDR5 | 150 GB/s | 225W | May 2010
Tesla K10 | Kepler | 3072 | No | No | 8 GB | GDDR5 | 320 GB/s | 225W | May 2012
Tesla K20 | Kepler | 2496 | No | No | 5 GB | GDDR5 | 208 GB/s | 225W | Nov 2012
Tesla K40 | Kepler | 2880 | No | No | 12 GB | GDDR5 | 288 GB/s | 235W | Nov 2013
Tesla K80 | Kepler | 4992 | No | No | 24 GB | GDDR5 | 480 GB/s | 300W | Nov 2014
Tesla M40 | Maxwell | 3072 | No | No | 12 GB | GDDR5 | 288 GB/s | 250W | Nov 2015
Tesla P4 | Pascal | 2560 | No | No | 8 GB | GDDR5 | 192 GB/s | 75W | Sep 2016
Tesla P40 | Pascal | 3840 | No | No | 24 GB | GDDR5X | 346 GB/s | 250W | Sep 2016
Tesla V100 | Volta | 5120 | 640 | No | 16/32 GB | HBM2 | 900 GB/s | 300W | May 2017
Tesla T4 | Turing | 2560 | 320 | Yes | 16 GB | GDDR6 | 320 GB/s | 70W | Sep 2018
A100 PCIe | Ampere | 6912 | 432 | No | 40 GB | HBM2 | 1555 GB/s | 250W | May 2020
A100 SXM4 | Ampere | 6912 | 432 | No | 40 GB | HBM2 | 1555 GB/s | 400W | May 2020
A100 80GB | Ampere | 6912 | 432 | No | 80 GB | HBM2e | 2039 GB/s | 400W | Nov 2020
A30 | Ampere | 3584 | 224 | No | 24 GB | HBM2 | 933 GB/s | 165W | Apr 2021
A40 | Ampere | 10752 | 336 | Yes | 48 GB | GDDR6 | 696 GB/s | 300W | Apr 2021
A10 | Ampere | 9216 | 288 | Yes | 24 GB | GDDR6 | 600 GB/s | 150W | Apr 2021
A16 | Ampere | 5120 (4× 1280) | 160 (4× 40) | Yes | 64 GB (4× 16 GB) | GDDR6 | 4× 200 GB/s | 250W | Apr 2021
RTX A5000 | Ampere | 8192 | 256 | Yes | 24 GB | GDDR6 | 768 GB/s | 230W | Apr 2021
RTX A4000 | Ampere | 6144 | 192 | Yes | 16 GB | GDDR6 | 448 GB/s | 140W | Apr 2021
RTX A3000 | Ampere | 3584 | 112 | Yes | — | — | — | — | —
Titan RTX | Turing | 4608 | 576 | Yes | 24 GB | GDDR6 | 672 GB/s | 280W | Dec 2018
GeForce RTX 3090 | Ampere | 10496 | 328 | Yes | 24 GB | GDDR6X | 936 GB/s | 350W | Sep 2020
GeForce RTX 3080 Ti | Ampere | 10240 | 320 | Yes | 12 GB | GDDR6X | 912 GB/s | 350W | May 2021
GeForce RTX 3080 | Ampere | 8704 | 272 | Yes | 10 GB | GDDR6X | 760 GB/s | 320W | Sep 2020
GeForce RTX 3070 Ti | Ampere | 6144 | 192 | Yes | 8 GB | GDDR6X | 608 GB/s | 290W | Jun 2021
GeForce RTX 3070 | Ampere | 5888 | 184 | Yes | 8 GB | GDDR6 | 448 GB/s | 220W | Oct 2020
GeForce RTX 3060 Ti | Ampere | 4864 | 152 | Yes | 8 GB | GDDR6 | 448 GB/s | 200W | Dec 2020
GeForce RTX 3060 | Ampere | 3584 | 112 | Yes | 12 GB | GDDR6 | 360 GB/s | 170W | Feb 2021
Quadro RTX 8000 | Turing | 4608 | 576 | Yes | 48 GB | GDDR6 | 672 GB/s | 295W | Aug 2018
Quadro RTX 6000 | Turing | 4608 | 576 | Yes | 24 GB | GDDR6 | 672 GB/s | 260W | Aug 2018
Quadro RTX 5000 | Turing | 3072 | 384 | Yes | 16 GB | GDDR6 | 448 GB/s | 230W | Nov 2018
Quadro RTX 4000 | Turing | 2304 | 288 | Yes | 8 GB | GDDR6 | 416 GB/s | 160W | Nov 2018
Titan V | Volta | 5120 | 640 | No | 12 GB | HBM2 | 652.8 GB/s | 250W | Dec 2017
Tesla V100 (PCIe) | Volta | 5120 | 640 | No | 16 GB | HBM2 | 900 GB/s | 250W | Jun 2017
Tesla V100 (SXM2) | Volta | 5120 | 640 | No | 16 GB | HBM2 | 900 GB/s | 300W | Jun 2017
Quadro GV100 | Volta | 5120 | 640 | No | 32 GB | HBM2 | 870 GB/s | 250W | Mar 2018
Tesla V100 32GB (SXM2) | Volta | 5120 | 640 | No | 32 GB | HBM2 | 900 GB/s | 300W | Mar 2018
DGX-1 (Volta) | 8× Tesla V100 | 8× 5120 | 8× 640 | No | 8× 16 GB (128 GB total) | HBM2 | 8× 900 GB/s | 3200W | Mar 2018

NVIDIA Grace Architecture

NVIDIA has announced partnerships with server manufacturers such as HPE, Atos, and Supermicro to create servers built around the ARM-based Grace CPU. These servers are expected to be available in the second half of 2023.

Grace key features:

  • CPU–GPU integration with ARM Neoverse CPU cores
  • HBM2E memory with 900 GB/s memory bandwidth
  • Support for PCIe 5.0 and NVLink
  • Up to 10x performance improvement for certain HPC workloads
  • Energy-efficiency improvements through a unified memory space

References