NVIDIA GPU: Difference between revisions

Revision as of 12:22, 5 April 2023

GPU Tensor performance notes for RTX 4090

According to this thread, NVIDIA appears to have cut the tensor FP16 and TF32 operation rate in half on the RTX 4090, leaving it with lower tensor FP16 and TF32 throughput than the RTX 4080 16GB. This may have been done to keep the 4090 from cannibalizing Quadro/Tesla sales. So when choosing a GPU, the 4090 is the pick for memory capacity, at the cost of lower tensor performance than the 4080 16GB, even though the 4090 has more than twice the ray tracing performance of the 4080 12GB.

	RTX 4090	RTX 4080 16GB	RTX 4080 12GB	RTX 3090 Ti
Non-tensor FP32 TFLOPS	82.6 (206%)	48.7 (122%)	40.1 (100%)	40 (100%)
Non-tensor FP16 TFLOPS	82.6 (206%)	48.7 (122%)	40.1 (100%)	40 (100%)
Tensor Cores	512 (152%)	304 (90%)	240 (71%)	336 (100%)
Optical flow TOPS	305 (242%)	305 (242%)	305 (242%)	126 (100%)
tensor FP16 w/ FP32 accumulate TFLOPS **	165.2 (207%)	194.9 (244%)	160.4 (200%)	80 (100%)
tensor TF32 TFLOPS **	82.6 (207%)	97.5 (244%)	80.2 (200%)	40 (100%)
Ray trace Cores	128 (152%)	76 (90%)	60 (71%)	84 (100%)
Ray trace TFLOPS	191 (245%)	112.7 (144%)	92.7 (119%)	78.1 (100%)
Power (W)	450 (100%)	320 (71%)	285 (63%)	450 (100%)
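
The percentages in the table are each card's value relative to the RTX 3090 Ti baseline. A minimal Python sketch of that arithmetic for the tensor FP16 row follows; the helper name `relative_percent` and the dictionary layout are illustrative assumptions, and the numbers are copied from the table rather than measured.

```python
def relative_percent(value: float, baseline: float) -> float:
    """Return value expressed as a percentage of baseline."""
    return value / baseline * 100.0


# Tensor FP16 (FP32 accumulate) TFLOPS, copied from the table above.
tensor_fp16 = {
    "RTX 4090": 165.2,
    "RTX 4080 16GB": 194.9,
    "RTX 4080 12GB": 160.4,
    "RTX 3090 Ti": 80.0,
}

baseline = tensor_fp16["RTX 3090 Ti"]
for name, tflops in tensor_fp16.items():
    print(f"{name}: {relative_percent(tflops, baseline):.1f}% of RTX 3090 Ti")

# The 4090 compared directly with the 4080 16GB on the same metric.
ratio_4090_vs_4080 = tensor_fp16["RTX 4090"] / tensor_fp16["RTX 4080 16GB"]
print(f"RTX 4090 / RTX 4080 16GB: {ratio_4090_vs_4080:.2f}")
```

On these table values the 4090 lands at roughly 85% of the 4080 16GB for tensor FP16, which is the gap the note above describes.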

NVIDIA GPU Architecture

nvcc sm flags and what they're used for: when compiling with NVCC^[1],

the arch flag (`-arch`) specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
Gencodes (`-gencode`) allow for more PTX generations and can be repeated many times for different architectures.
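
As a rough illustration of repeating `-gencode` for several architectures, the Python sketch below assembles an nvcc argument list. The helper `gencode_flags`, the file name `kernel.cu`, and the output name `app` are hypothetical; only the `-gencode arch=compute_XX,code=sm_XX` form is nvcc syntax. The command is printed, not executed.

```python
def gencode_flags(sm_versions, include_ptx=True):
    """Build repeated -gencode arguments for the given SM versions.

    sm_versions: iterable of ints such as [80, 86].
    include_ptx: also embed PTX for the newest version so that
                 future GPUs can JIT-compile the kernels.
    """
    versions = sorted(set(sm_versions))
    flags = []
    for sm in versions:
        # One native binary (SASS) per requested architecture.
        flags += ["-gencode", f"arch=compute_{sm},code=sm_{sm}"]
    if include_ptx and versions:
        newest = versions[-1]
        # code=compute_XX keeps the PTX itself in the fat binary.
        flags += ["-gencode", f"arch=compute_{newest},code=compute_{newest}"]
    return flags


# Hypothetical build for A100 (sm_80) and RTX 30-series (sm_86).
command = ["nvcc", *gencode_flags([80, 86]), "-o", "app", "kernel.cu"]
print(" ".join(command))
```

Listing sm_86 explicitly alongside sm_80 follows the recommendation quoted in the Ampere row of the table below: an 8.0 binary runs on 8.6 hardware, but compiling for 8.6 is advised to benefit from the higher FP32 throughput.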

Matching CUDA arch and CUDA gencode for various NVIDIA architectures

Series	Architecture (--arch)	CUDA gencode (--sm)	Compute Capability	Notable Models	Supported CUDA version	Key Features
Tesla	Tesla		1.0, 1.1, 1.2, 1.3	Tesla C870, C1060, M1060		First dedicated GPGPU series
Fermi	Fermi	sm_20	2.0, 2.1	GTX 400, GTX 500, Tesla 20-series, Quadro 4000/5000	CUDA 3.2 until CUDA 8	First to feature CUDA cores and support for ECC memory. SM20 or `SM_20, compute_20` – GeForce 400, 500, 600, GT-630. Completely dropped from CUDA 9 onwards.
Kepler	Kepler	sm_30, sm_35, sm_37	3.0, 3.2, 3.5, 3.7	GTX 600, GTX 700, Tesla K-series, Quadro K-series	CUDA 5 until CUDA 10	First to feature Dynamic Parallelism and Hyper-Q. SM30 or `SM_30, compute_30` – Kepler architecture (e.g. generic Kepler, GeForce 700, GT-730). Adds support for unified memory programming. Completely dropped from CUDA 11 onwards. SM35 or `SM_35, compute_35` – Tesla K40. Adds support for dynamic parallelism. Deprecated from CUDA 11, will be dropped in future versions. SM37 or `SM_37, compute_37` – Tesla K80. Adds a few more registers. Deprecated from CUDA 11, will be dropped in future versions; strongly suggest replacing with a 32GB PCIe Tesla V100.
Maxwell	Maxwell	sm_50, sm_52, sm_53	5.0, 5.2, 5.3	GTX 900, Quadro M-series	CUDA 6 and later (deprecated from CUDA 11)	First to support VR and 4K displays. SM50 or `SM_50, compute_50` – Tesla/Quadro M series. Deprecated from CUDA 11, will be dropped in future versions; strongly suggest replacing with a Quadro RTX 4000 or A6000. SM52 or `SM_52, compute_52` – Quadro M6000, GeForce 900, GTX-970, GTX-980, GTX Titan X. SM53 or `SM_53, compute_53` – Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano.
Pascal	Pascal	sm_60, sm_61, sm_62	6.0, 6.1, 6.2	GTX 1000, Quadro P-series	CUDA 8 and later	First to support simultaneous multi-projection. SM60 or `SM_60, compute_60` – Quadro GP100, Tesla P100, DGX-1 (Generic Pascal). SM61 or `SM_61, compute_61` – GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030 (GP108), GT 1010 (GP108), Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2. SM62 or `SM_62, compute_62` – Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2.
Volta	Volta	sm_70, sm_72 (Xavier)	7.0, 7.2	Titan V, Tesla V100, Quadro GV100	CUDA 9 and later	First to feature Tensor Cores and NVLink 2.0. SM70 or `SM_70, compute_70` – DGX-1 with Volta, Tesla V100, Titan V, Quadro GV100. SM72 or `SM_72, compute_72` – Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX.
Turing	Turing	sm_75	7.5	RTX 2000, GTX 1600, Quadro RTX	CUDA 10 and later	First to feature Ray Tracing Cores and RTX technology. SM75 or `SM_75, compute_75` – GTX/RTX Turing – GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4.
Ampere	Ampere	sm_80, sm_86, sm_87 (Orin)	8.0, 8.6, 8.7	RTX 3000, A-series	CUDA 11 and later	Third-generation Tensor Cores. SM80 or `SM_80, compute_80` – NVIDIA A100 (the name “Tesla” has been dropped – GA100), NVIDIA DGX-A100. SM86 or `SM_86, compute_86` – (from CUDA 11.1 onwards) Tesla GA10x cards, RTX Ampere – RTX 3080, GA102 – RTX 3090, RTX A2000, A3000, RTX A4000, A5000, A6000, NVIDIA A40, GA106 – RTX 3060, GA104 – RTX 3070, GA107 – RTX 3050, RTX A10, RTX A16, RTX A40, A2 Tensor Core GPU. SM87 or `SM_87, compute_87` – (from CUDA 11.4 onwards, introduced with PTX ISA 7.4 / Driver r470 and newer) – for Jetson AGX Orin and Drive AGX Orin only. “Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.”
Lovelace	Ada Lovelace^[2]	sm_89	8.9	GeForce RTX 4070 Ti (AD104), GeForce RTX 4080 (AD103), GeForce RTX 4090 (AD102), Nvidia RTX 6000 Ada Generation (AD102, formerly Quadro), Nvidia L40 (AD102, formerly Tesla)	CUDA 11.8 and later; cuDNN 8.6 and later	Fourth-gen Tensor Cores increase throughput by up to 5X, to 1.4 Tensor-petaFLOPS, using the new FP8 Transformer Engine (as in the H100). Third-generation RT Cores have twice the ray-triangle intersection throughput, increasing RT-TFLOP performance by over 2x. The new RT Cores also include a new Opacity Micromap (OMM) Engine and a new Displaced Micro-Mesh (DMM) Engine. The OMM Engine enables much faster ray tracing of alpha-tested textures often used for foliage, particles, and fences. The DMM Engine delivers up to 10X faster Bounding Volume Hierarchy (BVH) build time with up to 20X less BVH storage space, enabling real-time ray tracing of geometrically complex scenes. Shader Execution Reordering (SER) dynamically reorganizes previously inefficient workloads into considerably more efficient ones; SER can improve shader performance for ray tracing operations by up to 3X, and in-game frame rates by up to 25%. DLSS 3, powered by the new fourth-gen Tensor Cores and the Optical Flow Accelerator on GeForce RTX 40 Series GPUs, uses AI to create additional high-quality frames. Graphics cards built on the Ada architecture feature new eighth-generation NVIDIA Encoders (NVENC) with AV1 encoding. AV1 is 40% more efficient than H.264 and allows users who are streaming at 1080p to increase their stream resolution to 1440p at the same bitrate and quality. SM89 or `SM_89, compute_89` – NVIDIA GeForce RTX 4090, RTX 4080, RTX 6000, Tesla L40.
Hopper	Hopper	sm_90, sm_90a(Thor)	9.0		CUDA 12 and later	TODO SM90 or `SM_90, compute_90` – NVIDIA H100 (GH100) SM90a or `SM_90a, compute_90a` – (for PTX ISA version 8.0) – adds acceleration for features like `wgmma` and `setmaxnreg`. This is required for NVIDIA CUTLASS

NVIDIA GPU Models

Model	Architecture	CUDA Cores	Tensor Cores	RT Cores	Memory Size	MIG	Memory Type	Memory Bandwidth	TDP	Launch Date
Tesla C870	Tesla	128	No	No	1.5 GB GDDR3		GDDR3	76.8 GB/s	105W	Jun 2006
Tesla C1060	Tesla	240	No	No	4 GB GDDR3		GDDR3	102 GB/s	238W	Dec 2008
Tesla M1060	Tesla	240	No	No	4 GB GDDR3		GDDR3	102 GB/s	225W	Dec 2008
Tesla M2050	Fermi	448	No	No	3 GB GDDR5		GDDR5	148 GB/s	225W	May 2010
Tesla M2070	Fermi	448	No	No	6 GB GDDR5		GDDR5	150 GB/s	225W	May 2010
Tesla K10	Kepler	3072	No	No	8 GB GDDR5		GDDR5	320 GB/s	225W	May 2012
Tesla K20	Kepler	2496	No	No	5/6 GB GDDR5		GDDR5	208 GB/s	225W	Nov 2012
Tesla K40	Kepler	2880	No	No	12 GB GDDR5		GDDR5	288 GB/s	235W	Nov 2013
Tesla K80	Kepler	4992	No	No	24 GB GDDR5		GDDR5	480 GB/s	300W	Nov 2014
Tesla M40	Maxwell	3072	No	No	12 GB GDDR5		GDDR5	288 GB/s	250W	Nov 2015
Tesla P4	Pascal	2560	No	No	8 GB GDDR5		GDDR5	192 GB/s	75W	Sep 2016
Tesla P40	Pascal	3840	No	No	24 GB GDDR5X		GDDR5X	480 GB/s	250W	Sep 2016
Tesla V100	Volta	5120	640	No	16/32 GB HBM2		HBM2	900 GB/s	300W	May 2017
Tesla T4	Turing	2560	320	No	16 GB
A100 PCIe	Ampere	6912	432	No	40 GB HBM2 / 80 GB HBM2		HBM2	1555 GB/s	250W	May 2020
A100 SXM4	Ampere	6912	432	No	40 GB HBM2 / 80 GB HBM2		HBM2	1555 GB/s	400W	May 2020
A30	Ampere	3584	224	No	24 GB HBM2		HBM2	933 GB/s	165W	Apr 2021
A40	Ampere	10752	336	Yes	48 GB GDDR6		GDDR6	696 GB/s	300W	Apr 2021
A10	Ampere	9216	288	Yes	24 GB GDDR6		GDDR6	600 GB/s	150W	Mar 2021
A16	Ampere	16384	512	No	48 GB GDDR6		GDDR6	768 GB/s	250W	Mar 2021
A100 80GB	Ampere	6912	432	No	80 GB HBM2	Up to 7 MIGs @ 10GB	HBM2	1935 GB/s	300W	Apr 2021
A100 40GB	Ampere	6912	432	No	40 GB HBM2	Up to 7 MIGs @ 5GB	HBM2	1555 GB/s	250W	May 2020
A200 PCIe	Ampere	10752	672	Yes	80 GB HBM2 / 160 GB HBM2		HBM2	2050 GB/s	400W	Nov 2021
A200 SXM4	Ampere	10752	672	Yes	80 GB HBM2 / 160 GB HBM2		HBM2	2050 GB/s	400W	Nov 2021
A5000	Ampere	8192	256	Yes	24 GB GDDR6		GDDR6	768 GB/s	230W	Apr 2021
A4000	Ampere	6144	192	Yes	16 GB GDDR6		GDDR6	512 GB/s	140W	Apr 2021
A3000	Ampere	3584	112	Yes	24 GB G
Titan RTX	Turing	4608	576	Yes	24 GB GDDR6		GDDR6	672 GB/s	280W	Dec 2018
GeForce RTX 3090	Ampere	10496	328	Yes	24 GB GDDR6X		GDDR6X	936 GB/s	350W	Sep 2020
GeForce RTX 3080 Ti	Ampere	10240	320	Yes	12 GB GDDR6X		GDDR6X	912 GB/s	350W	May 2021
GeForce RTX 3080	Ampere	8704	272	Yes	10 GB GDDR6X		GDDR6X	760 GB/s	320W	Sep 2020
GeForce RTX 3070 Ti	Ampere	6144	192	Yes	8 GB GDDR6X		GDDR6X	608 GB/s	290W	Jun 2021
GeForce RTX 3070	Ampere	5888	184	Yes	8 GB GDDR6		GDDR6	448 GB/s	220W	Oct 2020
GeForce RTX 3060 Ti	Ampere	4864	152	Yes	8 GB GDDR6		GDDR6	448 GB/s	200W	Dec 2020
GeForce RTX 3060	Ampere	3584	112	Yes	12 GB GDDR6		GDDR6	360 GB/s	170W	Feb 2021
Quadro RTX 8000	Turing	4608	576	Yes	48 GB GDDR6		GDDR6	672 GB/s	295W	Aug 2018
Quadro RTX 6000	Turing	4608	576	Yes	24 GB GDDR6		GDDR6	672 GB/s	260W	Aug 2018
Quadro RTX 5000	Turing	3072	384	Yes	16 GB GDDR6		GDDR6	448 GB/s	230W	Nov 2018
Quadro RTX 4000	Turing	2304	288	Yes	8 GB GDDR6		GDDR6	416 GB/s	160W	Nov 2018
Titan RTX (T-Rex)	Turing	4608	576	Yes	24 GB
Titan V	Volta	5120	640		12 GB HBM2		HBM2	652.8 GB/s	250W	Dec 2017
Tesla V100 (PCIe)	Volta	5120	640		16 GB HBM2		HBM2	900 GB/s	250W	June 2017
Tesla V100 (SXM2)	Volta	5120	640		16 GB HBM2		HBM2	900 GB/s	300W	June 2017
Quadro GV100	Volta	5120	640		32 GB HBM2		HBM2	870 GB/s	250W	Mar 2018
Tesla GV100 (SXM2)	Volta	5120	640		32 GB HBM2		HBM2	900 GB/s	300W	Mar 2018
DGX-1 (Volta)	Volta	5120	640		16 x 32 GB HBM2 (512 GB total)		HBM2	2.7 TB/s	3200W	Mar 2018

NVIDIA Features by Architecture^[3]

NVIDIA Flagship Gaming GPUs
VideoCardz.com	AD102	GA102	TU102	GV100	GP102
Launch Year	2022	2020	2018	2017	2017
Architecture	Ada Lovelace	Ampere	Turing	Volta	Pascal
Node	TSMC 4N	SAMSUNG 8N	TSMC 12nm	TSMC 12nm	TSMC 16nm
Die Size	608 mm²	628 mm²	754 mm²	815 mm²	471 mm²
Transistors	76.3B	28.3B	18.6B	21.1B	12.0B
Transistor Density	125.5M/mm²	45.1M/mm²	24.7M/mm²	25.9M/mm²	25.5M/mm²
CUDA Cores	18432	10752	4608	5120	3840
Tensor Cores	576 Gen4	336 Gen3	576 Gen2	640	–
RT Cores	144 Gen3	84 Gen2	72 Gen1	–	–
Memory Bus	GDDR6X 384-bit	GDDR6X 384-bit	GDDR6 384-bit	HBM2 3072-bit	GDDR5X 384-bit
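
The transistor density row follows directly from the transistor count and die size rows. A small Python sketch of that calculation is shown here; the function name `density_m_per_mm2` and the dictionary layout are illustrative assumptions, and the inputs are the figures from the table above.

```python
def density_m_per_mm2(transistors: float, die_mm2: float) -> float:
    """Transistor density in millions of transistors per mm²."""
    return transistors / die_mm2 / 1e6


# (transistor count, die size in mm²), copied from the table above.
dies = {
    "AD102": (76.3e9, 608),
    "GA102": (28.3e9, 628),
    "TU102": (18.6e9, 754),
    "GV100": (21.1e9, 815),
    "GP102": (12.0e9, 471),
}

for name, (transistors, area) in dies.items():
    print(f"{name}: {density_m_per_mm2(transistors, area):.1f}M per mm²")
```

Rounded to one decimal place, the computed values agree with the density row, which suggests that row was derived from the other two rather than sourced separately.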

NVIDIA Grace Architecture

NVIDIA has announced that it will partner with server manufacturers such as HPE, Atos, and Supermicro to build servers around Grace, its Arm-based CPU architecture. These servers are expected to be available in the second half of 2023.

Architecture	Key Features
Grace	CPU-GPU integration, Arm Neoverse CPU, LPDDR5X memory
	900 GB/s NVLink-C2C bandwidth, support for PCIe 5.0 and NVLink
	10x performance improvement for certain HPC workloads
	Energy efficiency improvements through unified memory space

Reference

[1] https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/

[2] https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture)

[3] https://videocardz.com/newz/nvidia-details-ad102-gpu-up-to-18432-cuda-cores-76-3b-transistors-and-608-mm%C2%B2
