NVIDIA GPU

From HPCWIKI
Revision as of 15:09, 19 March 2023 by Admin (talk | contribs)

NVIDIA GPU Architecture

nvcc sm flags and what they are used for: when compiling with NVCC[1],

  • the arch flag ('-arch') specifies the virtual GPU architecture (e.g. compute_70) that the CUDA source is compiled for.
  • the gencode flag ('-gencode') embeds code for additional real and virtual architectures in the same binary and can be repeated for different architectures.

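As a minimal sketch of how these flags are typically combined, the hypothetical helper below (not part of any NVIDIA tool) builds a '-gencode' flag per target architecture, plus a trailing PTX entry for the newest one so the binary can still be JIT-compiled on future GPUs. The compute capabilities used are examples only.

```python
def nvcc_gencode_flags(compute_capabilities):
    """Return nvcc -gencode flags for capability strings like ["70", "80", "86"]."""
    flags = []
    for cc in compute_capabilities:
        # SASS (real machine code) for each listed architecture
        flags.append(f"-gencode=arch=compute_{cc},code=sm_{cc}")
    # PTX for the newest listed architecture, for forward compatibility
    newest = compute_capabilities[-1]
    flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags

# Example: print a full compile command (kernel.cu is a placeholder file name)
print(" ".join(["nvcc"] + nvcc_gencode_flags(["70", "80", "86"]) + ["kernel.cu", "-o", "kernel"]))
```

Dropping the final PTX entry makes the binary smaller but ties it to the architectures it was compiled for.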

Matching CUDA arch and CUDA gencode for various NVIDIA architectures

Series | Architecture (-arch) | Gencode (sm) | Compute Capability | Notable Models | Key Features
Tesla | Tesla | sm_10–sm_13 | 1.0–1.3 | C870, C1060, M1060 | First dedicated GPGPU series
Fermi† | Fermi | sm_20, sm_21 | 2.0, 2.1 | GTX 400, GTX 500, Tesla 20-series, Quadro 4000/5000 | First to support ECC memory
Kepler† | Kepler | sm_30, sm_35, sm_37 | 3.0, 3.5, 3.7 | GTX 600, GTX 700, Tesla K-series, Quadro K-series | First to feature Dynamic Parallelism and Hyper-Q
Maxwell‡ | Maxwell | sm_50, sm_52, sm_53 | 5.0, 5.2, 5.3 | GTX 750, GTX 900, Quadro M-series | First to support VR and 4K displays
Pascal | Pascal | sm_60, sm_61, sm_62 | 6.0, 6.1, 6.2 | GTX 1000, Tesla P-series, Quadro P-series | First to support simultaneous multi-projection
Volta | Volta | sm_70, sm_72 (Xavier) | 7.0, 7.2 | Titan V, Tesla V100, Quadro GV100 | First to feature Tensor Cores and NVLink 2.0
Turing | Turing | sm_75 | 7.5 | RTX 2000, GTX 1600, Quadro RTX | First to feature Ray Tracing Cores and RTX technology
Ampere | Ampere | sm_80, sm_86, sm_87 (Orin) | 8.0, 8.6, 8.7 | RTX 3000, A-series | Third-generation Tensor Cores and more
Lovelace | Ada Lovelace[2] | sm_89 | 8.9 | GeForce RTX 4070 Ti (AD104), GeForce RTX 4080 (AD103), GeForce RTX 4090 (AD102), RTX 6000 Ada Generation (AD102, formerly Quadro), L40 (AD102, formerly Tesla) | See key features below

† Fermi support was dropped in CUDA 9; Kepler is deprecated from CUDA 11 onwards
‡ Maxwell is deprecated from CUDA 11.6 onwards
Ada Lovelace key features:

  • Fourth-generation Tensor Cores increase throughput by up to 5X, to 1.4 Tensor-petaFLOPS, using the new FP8 Transformer Engine (as on the H100)
  • Third-generation RT Cores have twice the ray-triangle intersection throughput, increasing RT-TFLOP performance by over 2X
  • The new RT Cores also include an Opacity Micromap (OMM) Engine and a Displaced Micro-Mesh (DMM) Engine. The OMM Engine enables much faster ray tracing of the alpha-tested textures often used for foliage, particles, and fences. The DMM Engine delivers up to 10X faster Bounding Volume Hierarchy (BVH) build times with up to 20X less BVH storage, enabling real-time ray tracing of geometrically complex scenes
  • Shader Execution Reordering (SER) dynamically reorganizes previously inefficient ray-tracing workloads into considerably more efficient ones. SER can improve shader performance for ray-tracing operations by up to 3X, and in-game frame rates by up to 25%
  • DLSS 3, powered by the fourth-generation Tensor Cores and the Optical Flow Accelerator on GeForce RTX 40 Series GPUs, uses AI to generate additional high-quality frames, massively boosting performance
  • Graphics cards built on the Ada architecture feature the eighth-generation NVIDIA Encoder (NVENC) with AV1 encoding, opening new possibilities for streamers, broadcasters, and video callers. AV1 is about 40% more efficient than H.264, allowing users streaming at 1080p to increase their stream resolution to 1440p at the same bitrate and quality
Hopper | Hopper | sm_90, sm_90a | 9.0 | H100 | Fourth-generation Tensor Cores with FP8 Transformer Engine
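To pick the right gencode value from the table above, the compute capability reported by the driver (e.g. via cudaGetDeviceProperties or `nvidia-smi --query-gpu=compute_cap`) can be converted mechanically; this helper name is illustrative, not part of any NVIDIA tool.

```python
def compute_cap_to_sm(cc: str) -> str:
    """Convert a compute capability string like "8.6" to its sm code, "sm_86"."""
    major, minor = cc.split(".")
    return f"sm_{major}{minor}"

print(compute_cap_to_sm("8.6"))  # Ampere, e.g. RTX 3080
print(compute_cap_to_sm("9.0"))  # Hopper, e.g. H100
```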

NVIDIA GPU Models

Model | Architecture | CUDA Cores | Tensor Cores | RT Cores | Memory | Memory Type | Bandwidth | TDP | Launch
Tesla C870 | Tesla | 128 | No | No | 1.5 GB | GDDR3 | 76.8 GB/s | 105W | Jun 2007
Tesla C1060 | Tesla | 240 | No | No | 4 GB | GDDR3 | 102 GB/s | 238W | Dec 2008
Tesla M1060 | Tesla | 240 | No | No | 4 GB | GDDR3 | 102 GB/s | 225W | Dec 2008
Tesla M2050 | Fermi | 448 | No | No | 3 GB | GDDR5 | 148 GB/s | 225W | May 2010
Tesla M2070 | Fermi | 448 | No | No | 6 GB | GDDR5 | 150 GB/s | 225W | May 2010
Tesla K10 | Kepler | 3072 | No | No | 8 GB | GDDR5 | 320 GB/s | 225W | May 2012
Tesla K20 | Kepler | 2496 | No | No | 5 GB | GDDR5 | 208 GB/s | 225W | Nov 2012
Tesla K40 | Kepler | 2880 | No | No | 12 GB | GDDR5 | 288 GB/s | 235W | Nov 2013
Tesla K80 | Kepler | 4992 | No | No | 24 GB | GDDR5 | 480 GB/s | 300W | Nov 2014
Tesla M40 | Maxwell | 3072 | No | No | 12 GB | GDDR5 | 288 GB/s | 250W | Nov 2015
Tesla P4 | Pascal | 2560 | No | No | 8 GB | GDDR5 | 192 GB/s | 75W | Sep 2016
Tesla P40 | Pascal | 3840 | No | No | 24 GB | GDDR5X | 346 GB/s | 250W | Sep 2016
Tesla V100 | Volta | 5120 | 640 | No | 16/32 GB | HBM2 | 900 GB/s | 300W | May 2017
Tesla T4 | Turing | 2560 | 320 | Yes | 16 GB | GDDR6 | 320 GB/s | 70W | Sep 2018
A100 PCIe | Ampere | 6912 | 432 | No | 40 GB | HBM2 | 1555 GB/s | 250W | May 2020
A100 SXM4 | Ampere | 6912 | 432 | No | 40 GB | HBM2 | 1555 GB/s | 400W | May 2020
A100 80GB | Ampere | 6912 | 432 | No | 80 GB | HBM2e | 2039 GB/s | 400W | Nov 2020
A30 | Ampere | 3584 | 224 | No | 24 GB | HBM2 | 933 GB/s | 165W | Apr 2021
A40 | Ampere | 10752 | 336 | Yes | 48 GB | GDDR6 | 696 GB/s | 300W | Apr 2021
A10 | Ampere | 9216 | 288 | Yes | 24 GB | GDDR6 | 600 GB/s | 150W | Apr 2021
A16 | Ampere | 5120 (4× 1280) | 160 (4× 40) | Yes | 64 GB (4× 16 GB) | GDDR6 | 4× 200 GB/s | 250W | Apr 2021
RTX A5000 | Ampere | 8192 | 256 | Yes | 24 GB | GDDR6 | 768 GB/s | 230W | Apr 2021
RTX A4000 | Ampere | 6144 | 192 | Yes | 16 GB | GDDR6 | 448 GB/s | 140W | Apr 2021
RTX A3000 | Ampere | 3584 | 112 | Yes | — | — | — | — | —
Titan RTX | Turing | 4608 | 576 | Yes | 24 GB | GDDR6 | 672 GB/s | 280W | Dec 2018
GeForce RTX 3090 | Ampere | 10496 | 328 | Yes | 24 GB | GDDR6X | 936 GB/s | 350W | Sep 2020
GeForce RTX 3080 Ti | Ampere | 10240 | 320 | Yes | 12 GB | GDDR6X | 912 GB/s | 350W | May 2021
GeForce RTX 3080 | Ampere | 8704 | 272 | Yes | 10 GB | GDDR6X | 760 GB/s | 320W | Sep 2020
GeForce RTX 3070 Ti | Ampere | 6144 | 192 | Yes | 8 GB | GDDR6X | 608 GB/s | 290W | Jun 2021
GeForce RTX 3070 | Ampere | 5888 | 184 | Yes | 8 GB | GDDR6 | 448 GB/s | 220W | Oct 2020
GeForce RTX 3060 Ti | Ampere | 4864 | 152 | Yes | 8 GB | GDDR6 | 448 GB/s | 200W | Dec 2020
GeForce RTX 3060 | Ampere | 3584 | 112 | Yes | 12 GB | GDDR6 | 360 GB/s | 170W | Feb 2021
Quadro RTX 8000 | Turing | 4608 | 576 | Yes | 48 GB | GDDR6 | 672 GB/s | 295W | Aug 2018
Quadro RTX 6000 | Turing | 4608 | 576 | Yes | 24 GB | GDDR6 | 672 GB/s | 260W | Aug 2018
Quadro RTX 5000 | Turing | 3072 | 384 | Yes | 16 GB | GDDR6 | 448 GB/s | 230W | Nov 2018
Quadro RTX 4000 | Turing | 2304 | 288 | Yes | 8 GB | GDDR6 | 416 GB/s | 160W | Nov 2018
Titan V | Volta | 5120 | 640 | No | 12 GB | HBM2 | 652.8 GB/s | 250W | Dec 2017
Tesla V100 (PCIe) | Volta | 5120 | 640 | No | 16 GB | HBM2 | 900 GB/s | 250W | Jun 2017
Tesla V100 (SXM2) | Volta | 5120 | 640 | No | 16 GB | HBM2 | 900 GB/s | 300W | Jun 2017
Quadro GV100 | Volta | 5120 | 640 | No | 32 GB | HBM2 | 870 GB/s | 250W | Mar 2018
Tesla V100 32GB (SXM2) | Volta | 5120 | 640 | No | 32 GB | HBM2 | 900 GB/s | 300W | Mar 2018
DGX-1 (Volta) | 8× Tesla V100 | 8× 5120 | 8× 640 | No | 8× 16 GB (128 GB total) | HBM2 | 8× 900 GB/s | 3200W | Mar 2018

NVIDIA Grace Architecture

NVIDIA has announced partnerships with server manufacturers such as HPE, Atos, and Supermicro to create servers built around the ARM-based Grace CPU. These servers are expected to be available in the second half of 2023.

Grace key features:

  • CPU–GPU integration with ARM Neoverse CPU cores
  • HBM2E memory with 900 GB/s memory bandwidth
  • Support for PCIe 5.0 and NVLink
  • Up to 10x performance improvement for certain HPC workloads
  • Energy-efficiency improvements through a unified memory space

References