CUDA
Compile CUDA code
When you compile CUDA code, compile with a single -arch flag that matches the GPU card you use most. This gives the fastest runtime behaviour, because native code generation for that architecture happens during compilation rather than at application load time.
If you only specify -gencode and omit the -arch flag, GPU code generation is deferred to the JIT compiler in the CUDA driver, which compiles the embedded PTX when the application is loaded.
To speed up CUDA compilation, reduce the number of irrelevant -gencode flags. Conversely, if you want better backwards compatibility with older cards, add more comprehensive -gencode flags.
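As a concrete illustration, here is a minimal kernel and a hedged example invocation (the file name hello.cu and the chosen architectures are arbitrary for this sketch) that produces a fat binary with native code for sm_70 and sm_80 plus PTX for forward compatibility:

// hello.cu -- minimal kernel used only to illustrate the flags discussed above.
// Example build (architectures chosen arbitrarily for this sketch):
//   nvcc -arch=sm_70 \
//        -gencode=arch=compute_70,code=sm_70 \
//        -gencode=arch=compute_80,code=sm_80 \
//        -gencode=arch=compute_80,code=compute_80 \
//        -o hello hello.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello_kernel() {
    // Device-side printf works on all architectures covered in this article.
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello_kernel<<<2, 4>>>();   // launch 2 blocks of 4 threads
    cudaDeviceSynchronize();    // wait for the kernel (and its printf) to finish
    return 0;
}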
Sample nvcc gencode and arch flags
According to NVIDIA:
The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one must be PTX to provide Ampere compatibility.
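If you want to verify which targets were actually retained in a compiled binary, the cuobjdump tool that ships with the CUDA toolkit can list them (my_app is a placeholder for your executable or object file):

$ cuobjdump --list-elf my_app
$ cuobjdump --list-ptx my_app

The first command lists the cubin (SASS) images embedded per code=sm_XX target; the second lists the PTX images available for JIT compilation.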
- Sample flags for generation on CUDA 7.0 for maximum compatibility with all cards from that era:
-arch=sm_30 \
-gencode=arch=compute_20,code=sm_20 \
-gencode=arch=compute_30,code=sm_30 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_52,code=compute_52
- Sample flags for generation on CUDA 8.1 for maximum compatibility with cards predating Volta:
-arch=sm_30 \
-gencode=arch=compute_20,code=sm_20 \
-gencode=arch=compute_30,code=sm_30 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_61,code=compute_61
- Sample flags for generation on CUDA 9.2 for maximum compatibility with Volta cards:
-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_70,code=compute_70
- Sample flags for generation on CUDA 10.1 for maximum compatibility with V100 and T4 Turing cards:
-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75
- Sample flags for generation on CUDA 11.0 for maximum compatibility with V100 and T4 Turing cards:
-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80
- Sample flags for generation on CUDA 11.7 for maximum compatibility with V100 and T4 Turing cards, while also supporting the newer RTX 3080 and Drive AGX Orin:
-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86
- Sample flags for generation on CUDA 11.4 for best performance with RTX 3080 cards:
-arch=sm_80 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86
- Sample flags for generation on CUDA 12 for best performance with GeForce RTX 4080 cards:
-arch=sm_89 \
-gencode=arch=compute_89,code=sm_89 \
-gencode=arch=compute_89,code=compute_89
- Sample flags for generation on CUDA 12 (PTX ISA version 8.0) for best performance with NVIDIA H100 (Hopper) GPUs, and no backwards compatibility for previous generations:
-arch=sm_90 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90a,code=sm_90a \
-gencode=arch=compute_90a,code=compute_90a
- To target Hopper GPUs while keeping backwards compatibility with earlier generations:
-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90,code=compute_90
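If you are unsure which architecture value matches your own GPU, you can query its compute capability with the standard CUDA runtime API; a minimal sketch (the file name query_cc.cu is arbitrary):

// query_cc.cu -- print the compute capability of every visible GPU.
// Example build: nvcc -o query_cc query_cc.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major/prop.minor map to sm_<major><minor>, e.g. 8 and 6 -> sm_86
        printf("GPU %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}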
Using TORCH_CUDA_ARCH_LIST for PyTorch
If you’re using PyTorch, you can set the target architectures during installation with the TORCH_CUDA_ARCH_LIST environment variable, like this:
$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6" python3 setup.py install
Note that while you can specify every single architecture in this variable, each one you add prolongs the build time, because the kernels have to be compiled for every architecture listed.
You can also tell PyTorch to generate PTX code that is forward compatible with newer cards by adding a +PTX suffix to the most recent architecture you specify:
$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6+PTX" python3 build_my_extension.py
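To double-check which architectures an installed PyTorch build actually contains, recent PyTorch versions expose torch.cuda.get_arch_list(); the exact output depends on your build:

$ python3 -c "import torch; print(torch.cuda.get_arch_list())"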
Using CMake for TensorRT
If you’re compiling TensorRT with CMake, drop the sm_ and compute_ prefixes and refer to the compute capability numbers only.
Example for Tesla V100 and Volta cards in general:
cmake <...> -DGPU_ARCHS="70"
Example for NVIDIA RTX 2070 and Tesla T4:
cmake <...> -DGPU_ARCHS="75"
Example for NVIDIA A100:
cmake <...> -DGPU_ARCHS="80"
Example for NVIDIA RTX 3080 and A100 together:
cmake <...> -DGPU_ARCHS="80 86"
Example for NVIDIA H100:
cmake <...> -DGPU_ARCHS="90"
Using CMake for CUTLASS with Hopper GH100
cmake .. -DCUTLASS_NVCC_ARCHS=90a
What does "Value 'sm_86' is not defined for option 'gpu-architecture'" mean?
If you get an error that looks like this:
nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'
This error comes from nvcc, so it usually means your CUDA toolkit is too old for the requested architecture (and the driver may be too old as well). For sm_8x cards, upgrade to at least CUDA 11.0 with driver 450.36.06 or later for sm_80 (A100), and to CUDA 11.1 with driver 455.23.04 or later for sm_86 (RTX 3080); see the compatibility table below.
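To see what you currently have installed, compare the toolkit and driver versions; both commands below are standard parts of the CUDA toolkit and driver respectively:

$ nvcc --version
$ nvidia-smi

The first reports the CUDA toolkit (nvcc) version used for compilation; the second reports the installed driver version and the GPUs it can see.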
CUDA Compatibility
| CUDA Version | cuDNN Version | NCCL Version | NVIDIA GPU Driver Version | Compute Capability Support |
|---|---|---|---|---|
| CUDA 12.0 | | | | |
| CUDA 11.8 | | | | |
| CUDA 11.7 | | | | |
| CUDA 11.5 | 8.3.x | 2.10.x | 510.39 or later | Compute Capability 3.0 to 8.6 |
| CUDA 11.4 | 8.2.x | 2.10.x | 470.42.01 or later | Compute Capability 3.0 to 8.6 |
| CUDA 11.3 | 8.2.x | 2.10.x | 465.19.01 or later | Compute Capability 3.0 to 8.6 |
| CUDA 11.2 | 8.1.x | 2.9.x | 460.32.03 or later | Compute Capability 3.0 to 8.6 |
| CUDA 11.1 | 8.0.x | 2.9.x | 455.23.04 or later | Compute Capability 3.0 to 8.6 |
| CUDA 11.0 | 7.6.x | 2.8.x | 450.36.06 or later | Compute Capability 3.0 to 8.6 |
| CUDA 10.2 | 7.6.x | 2.7.x | 440.33 or later | Compute Capability 3.0 to 7.5 |
| CUDA 10.1 | 7.6.x | 2.4.x | 418.39 or later | Compute Capability 3.0 to 7.5 |
| CUDA 10.0 | 7.4.x | 2.2.x | 410.48 or later | Compute Capability 3.0 to 7.5 |
| CUDA 9.2 | 7.2.x | 2.1.x | 396.26 or later | Compute Capability 3.0 to 7.5 |
| CUDA 9.1 | 7.1.x | 2.0.x | 390.46 or later | Compute Capability 3.0 to 7.5 |
| CUDA 9.0 | 7.0.x | 1.3.x | 384.81 or later | Compute Capability 3.0 to 7.5 |
| CUDA 8.0 | 6.0.x | 1.3.x | 375.26 or later | Compute Capability 2.0 to 6.2 |
| CUDA 7.5 | 5.1.x | 1.3.x | 352.31 or later | Compute Capability 2.0 to 5.2 |