CUDA

From HPCWIKI

Compile CUDA code

When you compile CUDA code, you should always specify exactly one ‘-arch’ flag, matching the GPU cards you use most. This gives a faster runtime, because GPU code generation happens during compilation rather than when the program starts.

If you specify only ‘-gencode’ flags and omit the ‘-arch’ flag, GPU code generation is left to the JIT compiler in the CUDA driver.

To speed up CUDA compilation, reduce the number of irrelevant ‘-gencode’ flags. However, you may sometimes want better backwards compatibility with older GPUs, which means adding a more comprehensive set of ‘-gencode’ flags.
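The trade-off above (one `-gencode` pair per architecture you really need, plus PTX for the newest one) can be sketched as a small helper. This is only an illustration: the function name `gencode_flags` and its input format (compute capabilities as digit strings such as "70") are assumptions made for this example, not part of nvcc.

```python
def gencode_flags(capabilities, ptx_for_newest=True):
    """Build nvcc -gencode flags for a list of compute capabilities.

    capabilities: digit strings such as "70" or "86" (hypothetical input format).
    """
    # Sort numerically and drop duplicates so the newest architecture comes last.
    caps = sorted(set(capabilities), key=int)
    # One cubin (code=sm_XX) per requested architecture.
    flags = [f"-gencode=arch=compute_{cc},code=sm_{cc}" for cc in caps]
    if ptx_for_newest and caps:
        # Also keep PTX for the newest architecture so the driver can JIT it
        # on GPUs that are newer than anything in the list.
        newest = caps[-1]
        flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags


if __name__ == "__main__":
    # Print the flags in the backslash-continued layout used in this article.
    print(" \\\n".join(gencode_flags(["70", "75", "80"])))
```

Each extra entry in the list adds one more compilation pass, so keeping the list short is what shortens the build.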

Sample nvcc gencode and arch Flags in GCC

According to NVIDIA:

The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one must be PTX to provide Ampere compatibility.
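To make the arch=/code= distinction concrete, here is a minimal sketch that splits a flag into its two clauses and reports whether the back-end target is a cubin or PTX. The names `parse_gencode` and `retains_ptx` are hypothetical, and the parser only handles the single-target form used in this article.

```python
import re

# Matches the single-target form used in this article, e.g.
# -gencode=arch=compute_80,code=sm_80
GENCODE_RE = re.compile(
    r"^-gencode=arch=(compute_\w+),code=((?:sm|compute)_\w+)$"
)


def parse_gencode(flag):
    """Split one -gencode flag into its front-end and back-end targets."""
    # Tolerate surrounding whitespace and a trailing line-continuation backslash.
    cleaned = flag.strip().rstrip("\\").strip()
    match = GENCODE_RE.match(cleaned)
    if match is None:
        raise ValueError(f"not a single-target -gencode flag: {flag!r}")
    arch, code = match.groups()
    # code=compute_XX keeps PTX in the binary; code=sm_XX keeps a cubin.
    kind = "ptx" if code.startswith("compute_") else "cubin"
    return {"arch": arch, "code": code, "kind": kind}


def retains_ptx(flags):
    """Return True if at least one flag embeds PTX in the resulting binary."""
    return any(parse_gencode(f)["kind"] == "ptx" for f in flags)
```

Running `retains_ptx` over a flag list is a quick way to check the "at least one must be PTX" condition from the quote above.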

  • Sample flags for generation on CUDA 7.0 for maximum compatibility with all cards from the era:
-arch=sm_30 \
-gencode=arch=compute_20,code=sm_20 \
-gencode=arch=compute_30,code=sm_30 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_52,code=compute_52
  • Sample flags for generation on CUDA 8.0 for maximum compatibility with cards predating Volta:
-arch=sm_30 \
-gencode=arch=compute_20,code=sm_20 \
-gencode=arch=compute_30,code=sm_30 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_61,code=compute_61
  • Sample flags for generation on CUDA 9.2 for maximum compatibility with Volta cards:
-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_70,code=compute_70
  • Sample flags for generation on CUDA 10.1 for maximum compatibility with V100 and T4 Turing cards:
-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75
  • Sample flags for generation on CUDA 11.0 for maximum compatibility with V100 and T4 Turing cards, while also supporting the A100 (sm_80):
-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80
  • Sample flags for generation on CUDA 11.7 for maximum compatibility with V100 and T4 Turing cards, while also supporting the newer RTX 3080 and Drive AGX Orin:
-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86
  • Sample flags for generation on CUDA 11.4 for best performance with RTX 3080 cards:
-arch=sm_80 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86
  • Sample flags for generation on CUDA 12 for best performance with GeForce RTX 4080 cards:
-arch=sm_89 \
-gencode=arch=compute_89,code=sm_89 \
-gencode=arch=compute_89,code=compute_89
  • Sample flags for generation on CUDA 12 (PTX ISA version 8.0) for best performance with NVIDIA H100 (Hopper) GPUs, and no backwards compatibility for previous generations:
-arch=sm_90 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90a,code=sm_90a \
-gencode=arch=compute_90a,code=compute_90a
  • Sample flags for Hopper GPUs that also keep some backwards compatibility with previous generations:
-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90,code=compute_90
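To reason about which of the flag sets above covers a given card, the following sketch applies the usual compatibility rules as I understand them: a cubin (sm_XY) runs on GPUs of the same major compute capability with an equal or higher minor version, PTX (compute_XY) can be JIT-compiled for any equal or newer compute capability, and architecture-specific targets with an "a" suffix (such as sm_90a) only run on exactly that compute capability. The function name `can_run` is hypothetical, and the rules are simplified (for example, they ignore Tegra-specific restrictions).

```python
def _split(target):
    """Split e.g. 'sm_86' into ('sm', 8, 6, False); 'a' marks arch-specific."""
    kind, _, num = target.partition("_")
    arch_specific = num.endswith("a")
    digits = num.rstrip("a")
    # The last digit is the minor version, everything before it the major.
    return kind, int(digits[:-1]), int(digits[-1]), arch_specific


def can_run(code_targets, device_cc):
    """Check whether any code= target can execute on a device.

    code_targets: values of the code= clauses, e.g. ["sm_70", "compute_70"].
    device_cc: (major, minor) compute capability of the GPU, e.g. (7, 5).
    """
    dev_major, dev_minor = device_cc
    for target in code_targets:
        kind, major, minor, arch_specific = _split(target)
        if arch_specific:
            # sm_90a / compute_90a: exact match only, no forward compatibility.
            if (major, minor) == (dev_major, dev_minor):
                return True
            continue
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True  # cubin: same major, equal or newer minor
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True  # PTX: JIT-compiled by the driver for newer GPUs
    return False
```

For example, `can_run(["sm_70"], (8, 0))` is False while `can_run(["sm_70", "compute_70"], (8, 0))` is True, which is why each sample above ends with a `code=compute_XX` entry.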

Using TORCH_CUDA_ARCH_LIST for PyTorch

If you’re using PyTorch, you can set the target architectures with the TORCH_CUDA_ARCH_LIST environment variable during installation, like this:

$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6" python3 setup.py install

Note that while you can specify every single arch in this variable, each one will prolong the build time, as kernels will have to be compiled for every architecture.

You can also tell PyTorch to generate PTX code that is forward compatible with newer cards by adding a +PTX suffix to the most recent architecture you specify:

$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6+PTX" python3 build_my_extension.py
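As a rough illustration of how such a list relates to the nvcc flags shown earlier, the sketch below expands a TORCH_CUDA_ARCH_LIST-style string into `-gencode` flags. It reflects my understanding of the mapping and is not PyTorch's own code; the function name `arch_list_to_flags` is hypothetical, and accepting semicolons as separators is an assumption.

```python
def arch_list_to_flags(arch_list):
    """Expand a string such as "7.0 7.5 8.6+PTX" into nvcc -gencode flags."""
    flags = []
    # Assumption: entries are separated by spaces or semicolons.
    for entry in arch_list.replace(";", " ").split():
        wants_ptx = entry.endswith("+PTX")
        version = entry[: -len("+PTX")] if wants_ptx else entry
        num = version.replace(".", "")  # "8.6" -> "86"
        # Every entry gets a cubin for its own architecture.
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if wants_ptx:
            # +PTX additionally embeds PTX for forward compatibility.
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags
```

With the list from the example above, only the 8.6 entry would produce the extra `code=compute_86` flag.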

Using CMake for TensorRT

If you’re compiling TensorRT with CMake, drop the sm_ and compute_ prefixes and refer only to the compute capabilities instead.

Example for Tesla V100 and Volta cards in general:

cmake <...> -DGPU_ARCHS="70"

Example for NVIDIA RTX 2070 and Tesla T4:

cmake <...> -DGPU_ARCHS="75"

Example for NVIDIA A100:

cmake <...> -DGPU_ARCHS="80"

Example for NVIDIA RTX 3080 and A100 together:

cmake <...> -DGPU_ARCHS="80 86"

Example for NVIDIA H100:

cmake <...> -DGPU_ARCHS="90"
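The prefix-dropping rule can be sketched as a tiny conversion from nvcc-style target names to a GPU_ARCHS value. The helper name `to_gpu_archs` is hypothetical and exists only for this example.

```python
def to_gpu_archs(targets):
    """Convert names like "sm_80" or "compute_86" to a GPU_ARCHS string."""
    numbers = []
    for target in targets:
        # Keep only the part after the sm_ / compute_ prefix.
        num = target.split("_", 1)[-1]
        if num not in numbers:  # avoid listing an architecture twice
            numbers.append(num)
    return " ".join(numbers)


if __name__ == "__main__":
    # Build the CMake option from a mixed list of nvcc target names.
    print(f'-DGPU_ARCHS="{to_gpu_archs(["sm_80", "compute_80", "sm_86"])}"')
```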

Using CMake for CUTLASS with Hopper GH100

cmake .. -DCUTLASS_NVCC_ARCHS=90a

What does "Value 'sm_86' is not defined for option 'gpu-architecture'" mean?

If you get an error that looks like this:

nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'

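You probably have an older version of CUDA and/or the driver installed. The error comes from nvcc, so the installed CUDA toolkit does not know this architecture yet: sm_80 requires CUDA 11.0 or later, and sm_86 requires CUDA 11.1 or later. Upgrade the toolkit, together with a more recent driver (at least 450.36.06), to support sm_8x cards like the A100 and RTX 3080.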

CUDA Compatibility

CUDA Version  cuDNN Version  NCCL Version  NVIDIA GPU Driver Version  Compute Capability Support
CUDA 12.0
CUDA 11.8
CUDA 11.7
CUDA 11.5     8.3.x          2.10.x        510.39 or later            3.0 to 8.6
CUDA 11.4     8.2.x          2.10.x        470.42.01 or later         3.0 to 8.6
CUDA 11.3     8.2.x          2.10.x        465.19.01 or later         3.0 to 8.6
CUDA 11.2     8.1.x          2.9.x         460.32.03 or later         3.0 to 8.6
CUDA 11.1     8.0.x          2.9.x         455.23.04 or later         3.0 to 8.6
CUDA 11.0     7.6.x          2.8.x         450.36.06 or later         3.0 to 8.6
CUDA 10.2     7.6.x          2.7.x         440.33 or later            3.0 to 7.5
CUDA 10.1     7.6.x          2.4.x         418.39 or later            3.0 to 7.5
CUDA 10.0     7.4.x          2.2.x         410.48 or later            3.0 to 7.5
CUDA 9.2      7.2.x          2.1.x         396.26 or later            3.0 to 7.5
CUDA 9.1      7.1.x          2.0.x         390.46 or later            3.0 to 7.5
CUDA 9.0      7.0.x          1.3.x         384.81 or later            3.0 to 7.5
CUDA 8.0      6.0.x          1.3.x         375.26 or later            2.0 to 6.2
CUDA 7.5      5.1.x          1.3.x         352.31 or later            2.0 to 5.2
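Checking an installed driver against the table amounts to a component-wise version comparison. The sketch below copies a few of the minimum driver versions from the table above into a dictionary; the names `MIN_DRIVER` and `driver_supports` are hypothetical, and the installed driver version is assumed to be supplied by the caller as a string.

```python
# Minimum driver versions copied from a few rows of the table above.
MIN_DRIVER = {
    "11.5": "510.39",
    "11.4": "470.42.01",
    "11.0": "450.36.06",
    "10.2": "440.33",
    "9.0": "384.81",
}


def version_tuple(version):
    """Turn "470.42.01" into (470, 42, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(cuda_version, installed_driver):
    """Return True if the installed driver meets the table's minimum."""
    required = MIN_DRIVER[cuda_version]  # KeyError for versions not listed
    return version_tuple(installed_driver) >= version_tuple(required)
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "440.33" as newer than "1000.1".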
