Enable AMD CPU with Multi-GPU System

From HPCWIKI

AMD CPUs cause deadlocks with multi-GPU in a single system

A deadlock issue has been reported[1] when running multi-GPU DDP training on systems with AMD CPUs, regardless of whether the GPUs are Nvidia or AMD Instinct. AMD has also reported that multi-GPU environments can fail due to deadlocks caused by limitations of the IOMMU enablement[2].

The deadlock is easy to reproduce: it occurs when a user tries to run DistributedDataParallel (DDP) on an AMD CPU system with multiple GPUs.

What is DDP?

When you’re training a model on more than one GPU, there are different ways to distribute the model across the GPUs. DDP replicates the model on each GPU and then trains each on a subset of the data. This means that each model replica can just train as fast as it can on the data it’s given.

However, after every backward pass the gradients computed on each replica must be synchronized (averaged) across all the GPUs, so that every replica applies the same update when the optimiser makes a step. DDP works by spawning multiple processes, typically one per GPU (to avoid the GIL), and handles the communication between the devices when synchronizing gradients.
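The synchronization step described above can be illustrated with a toy sketch in plain Python. This is not PyTorch's DDP implementation: the function names (local_gradient, all_reduce_mean, ddp_step) and the one-parameter model y = w * x are assumptions made for illustration only. Two "replicas" each compute a gradient on their own shard, the gradients are averaged, and both replicas apply the same update.

```python
# Toy illustration of data-parallel gradient averaging (not PyTorch's DDP code).
# Model: y = w * x, loss = mean((w*x - y)^2), so dL/dw = mean(2*(w*x - y)*x).

def local_gradient(w, shard):
    """Gradient of the mean squared error on one replica's shard of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the all-reduce step: every replica receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def ddp_step(weights, shards, lr):
    """One synchronized step: local backward, gradient averaging, identical update."""
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, synced)]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[:2], data[2:]]   # two hypothetical "GPUs", equal-sized shards
weights = [0.0, 0.0]            # replicas start from identical parameters

for _ in range(50):
    weights = ddp_step(weights, shards, lr=0.05)

print(weights)  # both replicas hold the same weight, close to 2.0
```

In real DDP training the averaging is done by a collective communication call between GPU processes, and it is this inter-device communication that hangs when the deadlock occurs.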

Solution - Disable IOMMU

To solve the problem, you need to disable the IOMMU.

The IOMMU (input-output memory management unit) is a hardware component, usually switched on or off in the BIOS, that translates the addresses used by devices such as GPUs into physical memory addresses. In other words, it controls how the GPU and other devices access system memory.
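Before changing anything, it can help to check whether the IOMMU is currently active. The following is a minimal sketch, assuming a Linux system that exposes /proc/cmdline and /sys/kernel/iommu_groups; the helper names parse_iommu_params and count_iommu_groups are hypothetical and chosen for this example only.

```python
from pathlib import Path

def parse_iommu_params(cmdline):
    """Return IOMMU-related kernel parameters found in a kernel command line string."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in ("iommu", "amd_iommu", "intel_iommu"):
            params[key] = value
    return params

def count_iommu_groups(groups_dir="/sys/kernel/iommu_groups"):
    """Count IOMMU groups listed in sysfs; 0 usually means no IOMMU is active."""
    path = Path(groups_dir)
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir() if entry.is_dir())

if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    # Fall back to an empty string on systems without /proc/cmdline.
    cmdline = cmdline_file.read_text() if cmdline_file.exists() else ""
    print("IOMMU kernel parameters:", parse_iommu_params(cmdline))
    print("IOMMU groups:", count_iommu_groups())
```

If no IOMMU parameter appears on the command line but groups are still listed, the IOMMU is most likely enabled by the BIOS and kernel defaults.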


There are two ways to disable IOMMU:

  1. Disable IOMMU in the BIOS of your motherboard
  2. Disable IOMMU through a Linux kernel parameter
# Edit the GRUB_CMDLINE_LINUX_DEFAULT value in the /etc/default/grub file
# To disable IOMMU, add the kernel parameter "amd_iommu=off"
# To enable IOMMU, add the kernel parameters "amd_iommu=on iommu=pt"

# Then update GRUB and reboot
sudo grub-mkconfig -o /boot/grub/grub.cfg   # on Ubuntu, same as: sudo update-grub
sudo reboot

# CentOS provides grub2-mkconfig instead of the grub-mkconfig used on Ubuntu
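
The manual edit of GRUB_CMDLINE_LINUX_DEFAULT can also be scripted. The sketch below only transforms text in memory and does not write to /etc/default/grub. The function name add_kernel_param is hypothetical, and the code assumes the value is enclosed in double quotes on a single line, which is common but not guaranteed.

```python
import re

def add_kernel_param(grub_text, param):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Any existing setting of the same key is removed first, so switching
    amd_iommu=on to amd_iommu=off does not leave both on the line.
    """
    key = param.split("=", 1)[0]
    # Assumption: the value is double-quoted and sits on one line.
    pattern = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)

    def rewrite(match):
        kept = [t for t in match.group(2).split() if t.split("=", 1)[0] != key]
        return match.group(1) + " ".join(kept + [param]) + match.group(3)

    new_text, count = pattern.subn(rewrite, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT not found")
    return new_text

# Illustrative file content, not read from a real system.
example = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(example, "amd_iommu=off"))
```

After reviewing the result and writing it back to /etc/default/grub yourself, regenerate the GRUB configuration and reboot as shown above.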

Others

References