NVIDIA GPU: Difference between revisions

Revision as of 10:12, 11 July 2023

HPCMATE supplies GPU models at every level, in air-cooled or liquid-cooled versions, for any type of server or workstation.

GPU Tensor performance notes for RTX 4090

According to a forum thread, NVIDIA appears to have cut the tensor FP16 and TF32 operation rate in half on the RTX 4090, leaving it with lower tensor FP16 and TF32 throughput than the RTX 4080 16GB. This may have been done to keep the 4090 from cannibalizing Quadro/Tesla sales. When choosing a GPU, the 4090 is therefore the pick for memory capacity and ray tracing (it has more than twice the ray tracing performance of the 4080 12GB), but by these figures it trails the 4080 16GB in tensor performance.

	RTX 4090	RTX 4080 16GB	RTX 4080 12GB	RTX 3090 Ti
Non-tensor FP32 TFLOPS	82.6 (206%)	48.7 (122%)	40.1 (100%)	40 (100%)
Non-tensor FP16 TFLOPS	82.6 (206%)	48.7 (122%)	40.1 (100%)	40 (100%)
Tensor Cores	512 (152%)	304 (90%)	240 (71%)	336 (100%)
Optical flow TOPS	305 (242%)	305 (242%)	305 (242%)	126 (100%)
Tensor FP16 w/ FP32 accumulate TFLOPS **	165.2 (207%)	194.9 (244%)	160.4 (200%)	80 (100%)
Tensor TF32 TFLOPS **	82.6 (207%)	97.5 (244%)	80.2 (200%)	40 (100%)
Ray trace Cores	128 (152%)	76 (90%)	60 (71%)	84 (100%)
Ray trace TFLOPS	191 (245%)	112.7 (144%)	92.7 (119%)	78.1 (100%)
Power (W)	450 (100%)	320 (71%)	285 (63%)	450 (100%)
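
The percentages in parentheses express each card's value relative to the RTX 3090 Ti column, which is the 100% baseline. A minimal Python sketch of that arithmetic, using the ray trace TFLOPS row from the table; the function name `relative_percent` is illustrative only and not part of any library:

```python
def relative_percent(value: float, baseline: float) -> int:
    """Return value as a whole-number percentage of baseline."""
    return round(value / baseline * 100)

# Ray trace TFLOPS row from the table; RTX 3090 Ti is the 100% baseline.
ray_trace_tflops = {
    "RTX 4090": 191.0,
    "RTX 4080 16GB": 112.7,
    "RTX 4080 12GB": 92.7,
    "RTX 3090 Ti": 78.1,
}

baseline = ray_trace_tflops["RTX 3090 Ti"]
percentages = {
    name: relative_percent(value, baseline)
    for name, value in ray_trace_tflops.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

The same function applies to any other row, as long as the RTX 3090 Ti value is passed as the baseline.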

NVIDIA GPU Architecture

nvcc sm flags and what they are used for. When compiling with NVCC^[1]:

The arch flag (`-arch`) specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
The gencode flag (`-gencode`) allows code to be generated for additional architectures, including PTX, and can be repeated once for each architecture to target.
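
As a rough illustration of how repeated `-gencode` options combine, the Python sketch below assembles an nvcc command line for several SM versions and adds a PTX entry for the newest one so that later GPUs can JIT-compile it. The helper name `gencode_flags` and the file names `app.cu` and `app` are hypothetical; only the `-gencode arch=compute_XX,code=sm_XX` syntax comes from nvcc itself.

```python
def gencode_flags(sm_versions, ptx_for_newest=True):
    """Build nvcc -gencode options for a list of SM versions, e.g. [70, 80, 86]."""
    flags = []
    for sm in sm_versions:
        # Native machine code (SASS) for this exact SM version.
        flags.append(f"-gencode arch=compute_{sm},code=sm_{sm}")
    if ptx_for_newest and sm_versions:
        newest = max(sm_versions)
        # Also embed PTX for the newest target, for forward compatibility.
        flags.append(f"-gencode arch=compute_{newest},code=compute_{newest}")
    return flags

flags = gencode_flags([70, 80, 86])
# app.cu and app are placeholder file names for this example.
command = "nvcc " + " ".join(flags) + " -o app app.cu"
print(command)
```

Listing fewer SM versions shortens build time and binary size, so in practice the list is usually limited to the GPUs actually deployed.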

Matching CUDA arch and CUDA gencode for various NVIDIA architectures

Series	Architecture (-arch)	CUDA gencode (sm_XX)	Compute Capability	Notable Models	Supported CUDA version	Key Features
Tesla	Tesla	sm_10, sm_11, sm_12, sm_13	1.0, 1.1, 1.2, 1.3	C870, C1060		First dedicated GPGPU series
Fermi	Fermi	sm_20	2.0, 2.1	GTX 400, GTX 500, Tesla 20-series, Quadro 4000/5000	CUDA 3.2 until CUDA 8	First to feature CUDA cores and support for ECC memory. SM20 or `SM_20, compute_20` – GeForce 400, 500, 600, GT-630. Completely dropped from CUDA 9 onwards.
Kepler	Kepler	sm_30, sm_35, sm_37	3.0, 3.2, 3.5, 3.7	GTX 600, GTX 700, Tesla K-series, Quadro K-series	CUDA 5 until CUDA 10 (sm_30) or CUDA 11 (sm_35, sm_37)	First to feature Dynamic Parallelism and Hyper-Q. SM30 or `SM_30, compute_30` – Kepler architecture (e.g. generic Kepler, GeForce 700, GT-730). Adds support for unified memory programming. Completely dropped from CUDA 11 onwards. SM35 or `SM_35, compute_35` – Tesla K40. Adds support for dynamic parallelism. Deprecated from CUDA 11, will be dropped in future versions. SM37 or `SM_37, compute_37` – Tesla K80. Adds a few more registers. Deprecated from CUDA 11, will be dropped in future versions; replacing it with a 32GB PCIe Tesla V100 is strongly suggested.
Maxwell	Maxwell	sm_50, sm_52, sm_53	5.0, 5.2, 5.3	GTX 750, GTX 900, Quadro M-series	CUDA 6 and later	First to support VR and 4K displays. SM50 or `SM_50, compute_50` – Tesla/Quadro M series. Deprecated from CUDA 11, will be dropped in future versions; replacing it with a Quadro RTX 4000 or A6000 is strongly suggested. SM52 or `SM_52, compute_52` – Quadro M6000, GeForce 900, GTX-970, GTX-980, GTX Titan X. SM53 or `SM_53, compute_53` – Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano.
Pascal	Pascal	sm_60, sm_61, sm_62	6.0, 6.1, 6.2	GTX 1000, Quadro P-series	CUDA 8 and later	First to support simultaneous multi-projection. SM60 or `SM_60, compute_60` – Quadro GP100, Tesla P100, DGX-1 (generic Pascal). SM61 or `SM_61, compute_61` – GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030 (GP108), GT 1010 (GP108), Titan Xp, Tesla P40, Tesla P4, discrete GPU on the NVIDIA Drive PX2. SM62 or `SM_62, compute_62` – integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2.
Volta	Volta	sm_70, sm_72 (Xavier)	7.0, 7.2	Titan V, Tesla V100, Quadro GV100	CUDA 9 and later	First to feature Tensor Cores and NVLink 2.0. SM70 or `SM_70, compute_70` – DGX-1 with Volta, Tesla V100, GTX 1180 (GV104), Titan V, Quadro GV100. SM72 or `SM_72, compute_72` – Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX.
Turing	Turing	sm_75	7.5	RTX 2000, GTX 1600, Quadro RTX	CUDA 10 and later	First to feature Ray Tracing Cores and RTX technology. SM75 or `SM_75, compute_75` – GTX/RTX Turing: GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4.
Ampere	Ampere	sm_80, sm_86, sm_87 (Orin)	8.0, 8.6, 8.7	RTX 3000, A-series	CUDA 11 and later	Features third-generation Tensor Cores. SM80 or `SM_80, compute_80` – NVIDIA A100 (the name "Tesla" has been dropped; GA100), NVIDIA DGX-A100. SM86 or `SM_86, compute_86` (from CUDA 11.1 onwards) – Tesla GA10x cards, RTX Ampere: RTX 3080, GA102 – RTX 3090, RTX A2000, A3000, RTX A4000, A5000, A6000, NVIDIA A40, GA106 – RTX 3060, GA104 – RTX 3070, GA107 – RTX 3050, RTX A10, RTX A16, RTX A40, A2 Tensor Core GPU. SM87 or `SM_87, compute_87` (from CUDA 11.4 onwards, introduced with PTX ISA 7.4 / driver r470 and newer) – for Jetson AGX Orin and Drive AGX Orin only. "Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput."
Lovelace	Ada Lovelace^[2]	sm_89	8.9	GeForce RTX 4070 Ti (AD104), GeForce RTX 4080 (AD103), GeForce RTX 4090 (AD102), NVIDIA RTX 6000 Ada Generation (AD102, formerly Quadro), NVIDIA L40 (AD102, formerly Tesla)	CUDA 11.8 and later, cuDNN 8.6 and later	Fourth-generation Tensor Cores increase throughput by up to 5X, to 1.4 Tensor-petaFLOPS, using the new FP8 Transformer Engine (as in the H100). Third-generation RT Cores have twice the ray-triangle intersection throughput, increasing RT-TFLOP performance by over 2X. The new RT Cores also include a new Opacity Micromap (OMM) Engine and a new Displaced Micro-Mesh (DMM) Engine. The OMM Engine enables much faster ray tracing of alpha-tested textures often used for foliage, particles, and fences. The DMM Engine delivers up to 10X faster Bounding Volume Hierarchy (BVH) build time with up to 20X less BVH storage space, enabling real-time ray tracing of geometrically complex scenes. Shader Execution Reordering (SER) dynamically reorganizes previously inefficient workloads into considerably more efficient ones; it can improve shader performance for ray tracing operations by up to 3X, and in-game frame rates by up to 25%. DLSS 3, powered by the fourth-generation Tensor Cores and the Optical Flow Accelerator on GeForce RTX 40 Series GPUs, uses AI to create additional high-quality frames. Graphics cards built on the Ada architecture feature eighth-generation NVIDIA Encoders (NVENC) with AV1 encoding, which is 40% more efficient than H.264 and lets users streaming at 1080p increase their stream resolution to 1440p at the same bitrate and quality. SM89 or `SM_89, compute_89` – NVIDIA GeForce RTX 4090, RTX 4080, RTX 6000, Tesla L40.
Hopper^[3]	Hopper	sm_90, sm_90a	9.0	H100 (GH100)	CUDA 12 and later	SM90 or `SM_90, compute_90` – NVIDIA H100 (GH100). SM90a or `SM_90a, compute_90a` (for PTX ISA version 8.0) – adds acceleration for features like `wgmma` and `setmaxnreg`. This is required for NVIDIA CUTLASS.
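
The naming rule behind the table is mechanical: a compute capability of X.Y corresponds to the real target `sm_XY` and the virtual (PTX) target `compute_XY`. The Python sketch below captures that rule together with a small lookup of architecture names for recent compute capabilities. The names `targets_for` and `ARCH_BY_CC` are illustrative, and the lookup is deliberately partial.

```python
# Partial lookup of architecture names by compute capability (Pascal onwards).
ARCH_BY_CC = {
    "6.0": "Pascal", "6.1": "Pascal", "6.2": "Pascal",
    "7.0": "Volta", "7.2": "Volta",
    "7.5": "Turing",
    "8.0": "Ampere", "8.6": "Ampere", "8.7": "Ampere",
    "8.9": "Ada Lovelace",
    "9.0": "Hopper",
}

def targets_for(compute_capability: str) -> tuple[str, str]:
    """Map a compute capability such as '8.6' to ('sm_86', 'compute_86')."""
    major, minor = compute_capability.split(".")
    suffix = f"{int(major)}{int(minor)}"
    return f"sm_{suffix}", f"compute_{suffix}"

for cc in ("7.5", "8.6", "8.9"):
    sm, compute = targets_for(cc)
    print(f"{cc}: {ARCH_BY_CC[cc]} -> {sm}, {compute}")
```

Architecture-specific variants such as `sm_90a` do not follow this plain rule and would need to be handled separately.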

NVIDIA GPU Models

Model	Architecture	CUDA Cores	Tensor Cores	RT Cores	NVLink	FF	Memory Size	MIG^[4]	Memory Bandwidth	TDP	Launch Date
H100-SXM5	Hopper (GH100)	16896	4th Gen 528	No		SXM5	80GB HBM3 50 MB L2 cache	7@10GB	3.35TB/s	700W	Jan 2023
H100-PCIe^[5]^[6]	Hopper (GH100)	14592	4th Gen 456	No		PCIe Gen 5 x16	80 GB HBM2e 50 MB L2 cache	7@10GB	2TB/s	300~350W	Jan 2023
Tesla C1060	Tesla	240	No	No			4 GB GDDR3		102 GB/s	238W	Dec 2008
Tesla K10	Kepler	3072	No	No			8 GB GDDR5		320 GB/s	225W	May 2012
Tesla K20	Kepler	2496	No	No			5/6 GB GDDR5		208 GB/s	225W	Nov 2012
Tesla K40	Kepler	2880	No	No			12 GB GDDR5		288 GB/s	235W	Nov 2013
Tesla K80	Kepler	4992	No	No			24 GB GDDR5		480 GB/s	300W	Nov 2014
Tesla M40	Maxwell	3072	No	No			12 GB GDDR5		288 GB/s	250W	Nov 2015
Tesla P4	Pascal	2560	No	No			8 GB GDDR5		192 GB/s	75W	Sep 2016
Tesla P40	Pascal	3840	No	No			24 GB GDDR5X		480 GB/s	250W	Sep 2016
Tesla V100	Volta	5120	640	No			16/32 GB HBM2		900 GB/s	300W	May 2017
Tesla T4	Turing	2560	320	No			16 GB
A100 PCIe	Ampere (GA100)	6912	432	No			40 GB HBM2 / 80 GB HBM2		1555 GB/s	250W	May 2020
A100 SXM4	Ampere (GA100)	6912	432	No			40 GB HBM2 / 80 GB HBM2	7	1555 GB/s	400W	May 2020
A30	Ampere (GA100)	3584	224	No			24 GB HBM2	4	933 GB/s	165W	Apr 2021
A40^[7]	Ampere	10752	336	84	NVLink 112.5 GB/s (bidirectional); PCIe Gen4 64 GB/s	PCIe, 4.4" (H) x 10.5" (L), dual slot, passive	48 GB GDDR6 with ECC		696 GB/s	300W	Apr 2021
A10	Ampere	9216	288	72			24 GB GDDR6		600 GB/s	150W	Mar 2021
A16^[8]	Ampere	5120	3rd Gen 160	40		PCIe Gen4 x16	64 GB GDDR6		800 GB/s	250W	Mar 2021
A100 80GB	Ampere (GA100)	6912	432	No			80 GB HBM2e	7@10GB	1935 GB/s	300W	Apr 2021
A100 40GB	Ampere (GA100)	6912	432	No			40 GB HBM2	7@5GB	1555 GB/s	250W	May 2020
A200 PCIe	Ampere	10752	672	Yes			80 GB HBM2 / 160 GB HBM2		2050 GB/s	400W	Nov 2021
A200 SXM4	Ampere	10752	672	Yes			80 GB HBM2 / 160 GB HBM2		2050 GB/s	400W	Nov 2021
RTX 6000 Ada^[9]	Ada Lovelace	18176	568	142			48 GB GDDR6		960 GB/s	300W	Jan 2023
RTX A6000^[10]	Ampere	10752	336	84			48 GB GDDR6		768 GB/s	300W
A5000	Ampere	8192	256	Yes			24 GB GDDR6		768 GB/s	230W	Apr 2021
A4000^[11]	Ampere	6144	192	Yes			16 GB GDDR6		448 GB/s	140W	Apr 2021
A3000	Ampere	3584	112	Yes			24 GB G
Titan RTX	Turing	4608	576	Yes			24 GB GDDR6		672 GB/s	280W	Dec 2018
GeForce RTX 4090	Ada Lovelace	16384	512	Yes, 128			24 GB GDDR6X		1008 GB/s	450W
GeForce RTX 3090 Ti	Ampere	10752	336	84			24 GB GDDR6X		1008 GB/s	450W
GeForce RTX 3090	Ampere	10496	328	Yes			24 GB GDDR6X		936 GB/s	350W	Sep 2020
GeForce RTX 3080 Ti	Ampere	10240	320	Yes			12 GB GDDR6X		912 GB/s	350W	May 2021
GeForce RTX 3080	Ampere	8704	272	Yes			10 GB GDDR6X		760 GB/s	320W	Sep 2020
GeForce RTX 3070 Ti	Ampere	6144	192	Yes			8 GB GDDR6X		608 GB/s	290W	Jun 2021
GeForce RTX 3070	Ampere	5888	184	Yes			8 GB GDDR6		448 GB/s	220W	Oct 2020
GeForce RTX 3060 Ti	Ampere	4864	152	Yes			8 GB GDDR6		448 GB/s	200W	Dec 2020
GeForce RTX 3060	Ampere	3584	112	Yes			12 GB GDDR6		360 GB/s	170W	Feb 2021
Quadro RTX 8000	Turing	4608	576	Yes			48 GB GDDR6		672 GB/s	295W	Aug 2018
Quadro RTX 6000	Turing	4608	576	Yes			24 GB GDDR6		672 GB/s	260W	Aug 2018
Tesla L40^[12]	Ada Lovelace	18176	4th Gen 568	3rd Gen 142		PCIe Gen4 x16	48 GB GDDR6 with ECC		864 GB/s	300W	Jan 2023
Quadro RTX 5000	Turing	3072	384	Yes			16 GB GDDR6		448 GB/s	230W	Nov 2018
Quadro RTX 4000	Turing	2304	288	Yes			8 GB GDDR6		416 GB/s	160W	Nov 2018
Titan RTX (T-Rex)	Turing	4608	576	Yes			24 GB GDDR6		672 GB/s	280W
Titan V	Volta	5120	640				12 GB HBM2		652.8 GB/s	250W	Dec 2017
Tesla V100 (PCIe)	Volta	5120	640	No			32/16 GB HBM2		900 GB/s	250W	June 2017
Tesla V100 (SXM2)	Volta	5120	640	No			32/16 GB HBM2		900 GB/s	300W	June 2017
Quadro GV100	Volta	5120	640	No			32 GB HBM2		870 GB/s	250W	Mar 2018
Tesla GV100 (SXM2)	Volta	5120	640	No			32 GB HBM2		900 GB/s	300W	Mar 2018
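
One way to compare rows of this table is memory bandwidth per watt of TDP. The Python sketch below does this for four models, with bandwidth and TDP values copied from the table (3.35 TB/s is written as 3350 GB/s). The metric and the name `bandwidth_per_watt` are illustrative choices for this example, not an NVIDIA-defined figure of merit.

```python
# (model, memory bandwidth in GB/s, TDP in W), values copied from the table.
MODELS = [
    ("H100-SXM5", 3350.0, 700.0),  # 3.35 TB/s expressed in GB/s
    ("A100 80GB", 1935.0, 300.0),
    ("Tesla V100 (SXM2)", 900.0, 300.0),
    ("GeForce RTX 3090", 936.0, 350.0),
]

def bandwidth_per_watt(bandwidth_gbs: float, tdp_w: float) -> float:
    """Memory bandwidth (GB/s) available per watt of TDP."""
    return bandwidth_gbs / tdp_w

# Sort from most to least bandwidth per watt.
ranking = sorted(
    ((name, bandwidth_per_watt(bw, tdp)) for name, bw, tdp in MODELS),
    key=lambda item: item[1],
    reverse=True,
)

for name, ratio in ranking:
    print(f"{name}: {ratio:.2f} GB/s per W")
```

TDP is a rated ceiling rather than measured power draw, so the ratio is only a coarse comparison.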

Timeline of NVIDIA GPU microarchitectures

Architecture	Start	End
Fahrenheit	1998	2000
Celsius	1999	2003
Kelvin	2001	2003
Rankine	2003	2005
Curie	2003	2013
Tesla	2006	2010
Tesla 2.0	2007	2013
Fermi	2010	2016
VLIW Vec4	2010	2013
Fermi 2.0	2010	2016
Kepler	2012	2018
Kepler 2.0	2013	2015
Maxwell	2014	2017
Maxwell 2.0	2014	2019
Pascal	2016	2021
Volta	2017	2020
Turing	2018	2022
Ampere	2020	2023
Hopper	2022	2023
Ada Lovelace	2022	2023

NVIDIA Features by Architecture^[13]

NVIDIA GPU Architectures
	AD102	GA102	GA100	TU102	GV100	GP102	GP100
Launch Year	2022	2020	2020	2018	2017	2017	–
Architecture	Ada Lovelace	Ampere	Ampere	Turing	Volta	Pascal	Pascal
Form Factor	–	–	SXM4/PCIe	–	SXM2/PCIe	–	SXM/PCIe
TDP	–	–	400W	–	300W	–	300W
Node	TSMC 4N	Samsung 8N	–	TSMC 12nm	TSMC 12nm	TSMC 16nm	–
CUDA Cores	18432	10752	–	4608	5120	3840	–
Tensor Cores	576 Gen4	336 Gen3	–	576 Gen2	640	–	–
RT Cores	144 Gen3	84 Gen2	–	72 Gen1	–	–	–
Memory Bus	GDDR6X 384-bit	GDDR6X 384-bit	–	GDDR6 384-bit	HBM2 3072-bit	GDDR5X 384-bit	–
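
The Memory Bus row relates to peak memory bandwidth through a simple formula: bus width in bits times the per-pin data rate in Gbps, divided by 8 bits per byte. A minimal Python sketch follows. The per-pin data rates used (21 Gbps for GDDR6X, 1.7 Gbps for HBM2) are assumptions chosen for the example and are not listed in the table.

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    bus_width_bits: memory bus width in bits (from the Memory Bus row).
    data_rate_gbps: per-pin data rate in Gbit/s (assumed, not from the table).
    """
    return bus_width_bits * data_rate_gbps / 8

# 384-bit GDDR6X at an assumed 21 Gbps per pin.
gddr6x = peak_bandwidth_gbs(384, 21.0)
# 3072-bit HBM2 at an assumed 1.7 Gbps per pin.
hbm2 = peak_bandwidth_gbs(3072, 1.7)

print(f"384-bit GDDR6X @ 21 Gbps: {gddr6x:.1f} GB/s")
print(f"3072-bit HBM2 @ 1.7 Gbps: {hbm2:.1f} GB/s")
```

This is why a per-pin rate quoted in Gbps should not be read as a memory bandwidth in GB/s: the two differ by the bus width factor.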

NVIDIA Grace Architecture

NVIDIA has announced partnerships with server manufacturers such as HPE, Atos, and Supermicro to build servers based on Grace, its Arm-based CPU architecture. These servers are expected to be available in the second half of 2023, at which point HPCMATE will begin offering them through local and global partners.

Architecture	Key Features
Grace	CPU-GPU integration, Arm Neoverse CPU, LPDDR5X memory
	900 GB/s NVLink-C2C CPU-GPU interconnect, support for PCIe 5.0 and NVLink
	10x performance improvement for certain HPC workloads
	Energy efficiency improvements through unified memory space

Reference

[1] https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/

[2] https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture)

[3] https://www.nvidia.com/en-us/data-center/h100/

[4] https://docs.nvidia.com/datacenter/tesla/mig-user-guide/

[5] https://www.nvidia.com/content/dam/en-zz/Solutions/gtcs22/data-center/h100/PB-11133-001_v01.pdf

[6] https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet

[7] https://images.nvidia.com/content/Solutions/data-center/a40/nvidia-a40-datasheet.pdf

[8] https://images.nvidia.com/content/Solutions/data-center/vgpu-a16-datasheet.pdf

[9] https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/proviz-print-rtx6000-datasheet-web-2504660.pdf

[10] https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/proviz-print-nvidia-rtx-a6000-datasheet-us-nvidia-1454980-r9-web%20(1).pdf

[11] https://www.nvidia.com/content/dam/en-zz/Solutions/gtcs21/rtx-a4000/nvidia-rtx-a4000-datasheet.pdf

[12] https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/support-guide/NVIDIA-L40-Datasheet-January-2023.pdf

[13] https://videocardz.com/newz/nvidia-details-ad102-gpu-up-to-18432-cuda-cores-76-3b-transistors-and-608-mm%C2%B2
