CUDA

Compile CUDA code

When you compile CUDA code, you should always specify exactly one `-arch` flag, matching the GPU cards you use most. This enables a faster runtime, because code generation for that architecture occurs during compilation.

If you only mention `-gencode` but omit the `-arch` flag, GPU code generation is left to the JIT compiler in the CUDA driver.

To speed up CUDA compilation, reduce the number of irrelevant `-gencode` flags. However, sometimes you may want better CUDA backwards compatibility, which calls for a more comprehensive set of `-gencode` flags.
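The trade-off between build time and compatibility can be sketched as a small helper that assembles the flag list from only the compute capabilities you care about. This is a minimal sketch, not part of nvcc or any NVIDIA tool: the name `build_gencode_flags` and its `ptx_for_newest` parameter are hypothetical, and it only handles purely numeric capabilities such as "70" or "86".

```python
def build_gencode_flags(capabilities, ptx_for_newest=True):
    """Return nvcc -gencode flags for the given compute capabilities.

    `capabilities` is an iterable of numeric strings such as "70" or "86".
    Hypothetical helper for illustration; it only formats strings.
    """
    # Sort numerically and drop duplicates so the newest target comes last.
    caps = sorted(set(capabilities), key=int)
    if not caps:
        raise ValueError("at least one compute capability is required")

    # One machine-code target per architecture.
    flags = [f"-gencode=arch=compute_{cc},code=sm_{cc}" for cc in caps]

    # Optionally keep PTX for the newest architecture so newer GPUs can JIT it.
    if ptx_for_newest:
        newest = caps[-1]
        flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags


if __name__ == "__main__":
    # Print the flags in the backslash-continued layout used on this page.
    print(" \\\n".join(build_gencode_flags(["70", "75", "80", "86"])))
```

Fewer entries in the list mean fewer compilation passes; more entries mean a larger binary that runs natively on more cards.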

Sample nvcc gencode and arch Flags in GCC

According to NVIDIA:

The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one must be PTX to provide Ampere compatibility.
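To make the two clauses concrete, the following sketch splits a flag into its `arch=` and `code=` parts and checks whether any flag in a list retains PTX. The function names `parse_gencode` and `keeps_ptx` are hypothetical, and the parser assumes the simple single-target form used on this page (it does not handle a bracketed list of several `code=` targets).

```python
def parse_gencode(flag):
    """Split one -gencode flag into (arch, code).

    Assumes the simple form "-gencode=arch=compute_XX,code=YYY".
    """
    prefix = "-gencode="
    if not flag.startswith(prefix):
        raise ValueError(f"not a -gencode flag: {flag!r}")
    # "arch=compute_80,code=sm_80" -> {"arch": "compute_80", "code": "sm_80"}
    clauses = dict(item.split("=", 1) for item in flag[len(prefix):].split(","))
    return clauses["arch"], clauses["code"]


def keeps_ptx(flags):
    """True if at least one flag has a PTX back-end target (code=compute_XX)."""
    return any(parse_gencode(f)[1].startswith("compute_") for f in flags)


if __name__ == "__main__":
    flags = [
        "-gencode=arch=compute_80,code=sm_80",
        "-gencode=arch=compute_80,code=compute_80",
    ]
    print(parse_gencode(flags[0]))
    print(keeps_ptx(flags))
```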

Sample flags for generation on CUDA 7.0 for maximum compatibility with all cards from the era:

-arch=sm_30 \
-gencode=arch=compute_20,code=sm_20 \
-gencode=arch=compute_30,code=sm_30 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_52,code=compute_52

Sample flags for generation on CUDA 8.1 for maximum compatibility with cards predating Volta:

-arch=sm_30 \
-gencode=arch=compute_20,code=sm_20 \
-gencode=arch=compute_30,code=sm_30 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_61,code=compute_61

Sample flags for generation on CUDA 9.2 for maximum compatibility with Volta cards:

-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_70,code=compute_70

Sample flags for generation on CUDA 10.1 for maximum compatibility with V100 and T4 Turing cards:

-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75

Sample flags for generation on CUDA 11.0 for maximum compatibility with V100 and T4 Turing cards, plus Ampere A100 (sm_80):

-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80

Sample flags for generation on CUDA 11.7 for maximum compatibility with V100 and T4 Turing cards, while also supporting the newer RTX 3080 and Drive AGX Orin:

-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86

Sample flags for generation on CUDA 11.4 for best performance with RTX 3080 cards:

-arch=sm_80 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86

Sample flags for generation on CUDA 12 for best performance with GeForce RTX 4080 cards:

-arch=sm_89 \
-gencode=arch=compute_89,code=sm_89 \
-gencode=arch=compute_89,code=compute_89

Sample flags for generation on CUDA 12 (PTX ISA version 8.0) for best performance with NVIDIA H100 (Hopper) GPUs, and no backwards compatibility for previous generations:

-arch=sm_90 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90a,code=sm_90a \
-gencode=arch=compute_90a,code=compute_90a

To add more compatibility for Hopper GPUs and some backwards compatibility:

-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90,code=compute_90

Using TORCH_CUDA_ARCH_LIST for PyTorch

If you’re using PyTorch, you can set the architectures with the TORCH_CUDA_ARCH_LIST environment variable during installation, like this:

$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6" python3 setup.py install

Note that while you can specify every single arch in this variable, each one prolongs the build time, because kernels have to be compiled for every architecture.

You can also tell PyTorch to generate PTX code that is forward-compatible with newer cards by adding a +PTX suffix to the most recent architecture you specify:

$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6+PTX" python3 build_my_extension.py
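To relate this variable to the nvcc flags shown earlier, here is a rough sketch of how an arch list string maps onto `-gencode` flags. It is an illustration only, not PyTorch's actual implementation: the function name `arch_list_to_gencode` is hypothetical, and it assumes space-separated entries in the form used above.

```python
def arch_list_to_gencode(arch_list):
    """Translate a TORCH_CUDA_ARCH_LIST-style string into -gencode flags.

    Hypothetical helper: each entry such as "8.6" yields a machine-code
    target, and an entry ending in "+PTX" additionally yields a PTX target.
    """
    flags = []
    for entry in arch_list.split():
        wants_ptx = entry.endswith("+PTX")
        version = entry[: -len("+PTX")] if wants_ptx else entry
        cc = version.replace(".", "")  # "8.6" -> "86"
        flags.append(f"-gencode=arch=compute_{cc},code=sm_{cc}")
        if wants_ptx:
            flags.append(f"-gencode=arch=compute_{cc},code=compute_{cc}")
    return flags


if __name__ == "__main__":
    for flag in arch_list_to_gencode("7.0 7.5 8.0 8.6+PTX"):
        print(flag)
```

Each extra entry adds one more compilation target, which is why long lists slow the build down.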

Using Cmake for TensorRT

If you’re compiling TensorRT with CMake, drop the sm_ and compute_ prefixes and refer only to the compute capabilities instead.

Example for Tesla V100 and Volta cards in general:

cmake <...> -DGPU_ARCHS="70"

Example for NVIDIA RTX 2070 and Tesla T4:

cmake <...> -DGPU_ARCHS="75"

Example for NVIDIA A100:

cmake <...> -DGPU_ARCHS="80"

Example for NVIDIA RTX 3080 and A100 together:

cmake <...> -DGPU_ARCHS="80 86"

Example for NVIDIA H100:

cmake <...> -DGPU_ARCHS="90"
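The prefix-dropping rule can be sketched as a short conversion from nvcc-style target names to the value passed in `-DGPU_ARCHS`. The helper name `to_gpu_archs` is hypothetical and exists only for this illustration; it performs string manipulation and does not invoke CMake.

```python
def to_gpu_archs(targets):
    """Convert names like "sm_80" or "compute_86" into a GPU_ARCHS value."""
    archs = []
    for target in targets:
        # Strip whichever known prefix is present, if any.
        for prefix in ("sm_", "compute_"):
            if target.startswith(prefix):
                target = target[len(prefix):]
                break
        # Keep first-seen order and skip duplicates.
        if target not in archs:
            archs.append(target)
    return " ".join(archs)


if __name__ == "__main__":
    value = to_gpu_archs(["sm_80", "sm_86", "compute_86"])
    print(f'-DGPU_ARCHS="{value}"')
```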

Using Cmake for CUTLASS with Hopper GH100

cmake .. -DCUTLASS_NVCC_ARCHS=90a

What does `"Value 'sm_86' is not defined for option 'gpu-architecture'"` mean?

If you get an error that looks like this:

nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'

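You probably have an older version of CUDA and/or the driver installed. nvcc rejects any architecture that its own toolkit release does not know about, so this error usually means the installed toolkit predates sm_86. Upgrade the toolkit, and upgrade the driver to 450.36.06 or later, to support sm_8x cards such as the A100 and RTX 3080.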
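When this error is buried in a long build log, a small script can pull out the architecture that nvcc rejected. The sketch below assumes the message has exactly the wording shown above; the function name `rejected_arch` is hypothetical.

```python
import re

# Matches the nvcc message quoted above and captures the rejected value.
_NVCC_ARCH_ERROR = re.compile(
    r"nvcc fatal\s*:\s*Value '([^']+)' is not defined for option 'gpu-architecture'"
)


def rejected_arch(log_text):
    """Return the architecture nvcc rejected, or None if the log has no such error."""
    match = _NVCC_ARCH_ERROR.search(log_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    log = "nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'"
    print(rejected_arch(log))
```

Knowing the rejected value tells you which `-arch` or `-gencode` entry to remove, or which architecture your upgraded toolkit needs to support.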

CUDA Compatibility

CUDA Version	cuDNN Version	NCCL Version	NVIDIA GPU Driver Version	Compute Capability Support
CUDA 12.0
CUDA 11.8
CUDA 11.7
CUDA 11.5	8.3.x	2.10.x	510.39 or later	Compute Capability 3.0 to 8.6
CUDA 11.4	8.2.x	2.10.x	470.42.01 or later	Compute Capability 3.0 to 8.6
CUDA 11.3	8.2.x	2.10.x	465.19.01 or later	Compute Capability 3.0 to 8.6
CUDA 11.2	8.1.x	2.9.x	460.32.03 or later	Compute Capability 3.0 to 8.6
CUDA 11.1	8.0.x	2.9.x	455.23.04 or later	Compute Capability 3.0 to 8.6
CUDA 11.0	7.6.x	2.8.x	450.36.06 or later	Compute Capability 3.0 to 8.6
CUDA 10.2	7.6.x	2.7.x	440.33 or later	Compute Capability 3.0 to 7.5
CUDA 10.1	7.6.x	2.4.x	418.39 or later	Compute Capability 3.0 to 7.5
CUDA 10.0	7.4.x	2.2.x	410.48 or later	Compute Capability 3.0 to 7.5
CUDA 9.2	7.2.x	2.1.x	396.26 or later	Compute Capability 3.0 to 7.5
CUDA 9.1	7.1.x	2.0.x	390.46 or later	Compute Capability 3.0 to 7.5
CUDA 9.0	7.0.x	1.3.x	384.81 or later	Compute Capability 3.0 to 7.5
CUDA 8.0	6.0.x	1.3.x	375.26 or later	Compute Capability 2.0 to 6.2
CUDA 7.5	5.1.x	1.3.x	352.31 or later	Compute Capability 2.0 to 5.2
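The driver column of the table can be turned into a quick check. This is a sketch built only from the rows above (rows without data are omitted); the names `MIN_DRIVER`, `version_tuple` and `driver_is_sufficient` are hypothetical, and the installed driver version is passed in by hand rather than queried from the system.

```python
# Minimum driver version per CUDA release, copied from the table above.
MIN_DRIVER = {
    "11.5": "510.39", "11.4": "470.42.01", "11.3": "465.19.01",
    "11.2": "460.32.03", "11.1": "455.23.04", "11.0": "450.36.06",
    "10.2": "440.33", "10.1": "418.39", "10.0": "410.48",
    "9.2": "396.26", "9.1": "390.46", "9.0": "384.81",
    "8.0": "375.26", "7.5": "352.31",
}


def version_tuple(version):
    """Turn "470.42.01" into (470, 42, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_is_sufficient(cuda_version, installed_driver):
    """True if the installed driver meets the table's minimum for this CUDA release."""
    required = MIN_DRIVER[cuda_version]  # KeyError for releases not in the table
    return version_tuple(installed_driver) >= version_tuple(required)


if __name__ == "__main__":
    print(driver_is_sufficient("11.4", "470.42.01"))
    print(driver_is_sufficient("11.0", "440.33"))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings, where "96.1" would sort after "410.48".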
