Enable AMD CPU with Multi-GPU System
AMD CPUs can cause deadlocks with multiple GPUs in a single system
There is a reported deadlock issue [1] when running multi-GPU DDP training on systems with AMD CPUs, regardless of whether the GPUs are Nvidia or AMD Instinct. AMD has also reported that multi-GPU environments can fail due to deadlocks caused by limitations of the IOMMU enablement [2].
The deadlock is easy to reproduce by running DistributedDataParallel (DDP) training on an AMD CPU system with multiple GPUs.
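Because a deadlocked job never exits on its own, one way to confirm the problem is to run the training under a time limit. The following is a minimal sketch, assuming GNU coreutils `timeout` is available; the helper name `run_with_hang_check` and the script name `train_ddp.py` are hypothetical placeholders, not part of any library.

```shell
# Run a command under a time limit and report whether it finished or hung.
# Usage: run_with_hang_check <seconds> <command> [args...]
# (hypothetical helper; relies on GNU coreutils `timeout`)
run_with_hang_check() {
    local limit="$1"
    shift
    local status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # `timeout` exits with 124 when it had to stop the command
        echo "hung: no exit within ${limit}s"
    elif [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "failed: exit status $status"
    fi
    return "$status"
}

# Hypothetical usage: train_ddp.py stands in for your own DDP training script.
# run_with_hang_check 600 torchrun --standalone --nproc_per_node=2 train_ddp.py
```

A healthy but slow run also hits the limit, so choose a limit well above the expected run time of a short test job.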
What is DDP?
When you’re training a model on more than one GPU, there are different ways to distribute the model across the GPUs. DDP replicates the model on each GPU and then trains each on a subset of the data. This means that each model replica can just train as fast as it can on the data it’s given.
However, after each backward pass the gradients must be synchronised (averaged) across all the GPUs before the optimiser makes a step, so that every replica applies the same update. DDP runs one process per GPU (to avoid the GIL) and handles the communication between the hardware when synchronising gradients.
Solution - Disable IOMMU
To resolve the deadlock, you need to disable the IOMMU.
The IOMMU (input-output memory management unit) is a hardware component, usually toggled in the BIOS, that translates the virtual addresses used by your GPU or other devices into physical memory addresses. It’s all about making the memory from the CPU and GPU work better together.
There are two ways to disable the IOMMU:
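Before changing anything, it can help to check whether the IOMMU is currently active. The sketch below assumes a Linux system that exposes IOMMU groups under `/sys/kernel/iommu_groups`; that directory is normally empty when no IOMMU is in use. The function name `count_iommu_groups` is a hypothetical helper.

```shell
# Count IOMMU groups under a sysfs directory.
# Defaults to /sys/kernel/iommu_groups; a different path can be passed for testing.
count_iommu_groups() {
    local dir="${1:-/sys/kernel/iommu_groups}"
    if [ -d "$dir" ]; then
        # Each group appears as one numbered subdirectory
        find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
    else
        echo 0
    fi
}

groups_found="$(count_iommu_groups)"
if [ "$groups_found" -gt 0 ]; then
    echo "IOMMU appears active: $groups_found groups"
else
    echo "No IOMMU groups found; IOMMU appears disabled"
fi
```

This is only an indication: the group count shows whether devices are being managed by an IOMMU, not which kernel parameters were used.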
- Disable IOMMU in the BIOS of your motherboard
- Disable IOMMU through a Linux kernel parameter
# Edit the GRUB_CMDLINE_LINUX_DEFAULT value in /etc/default/grub
# To disable IOMMU, add the kernel parameter "amd_iommu=off"
# To enable IOMMU, add the kernel parameters "amd_iommu=on iommu=pt"
# Then regenerate the GRUB config and reboot
# (on Ubuntu, "sudo update-grub" is equivalent; CentOS uses grub2-mkconfig instead of grub-mkconfig)
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot
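After the reboot, you may want to confirm that the kernel actually booted with the intended parameter. The following is a small sketch that classifies a kernel command line using only the parameter strings mentioned above; the function name `iommu_cmdline_state` is a hypothetical helper, and reading `/proc/cmdline` assumes a Linux system.

```shell
# Classify the AMD IOMMU setting found in a kernel command line string.
# Prints one of: off, passthrough, on, unset.
iommu_cmdline_state() {
    # Pad with spaces so each parameter can be matched as a whole word
    local cmdline=" $1 "
    case "$cmdline" in
        *" amd_iommu=off "*) echo "off" ;;
        *" iommu=pt "*)      echo "passthrough" ;;
        *" amd_iommu=on "*)  echo "on" ;;
        *)                   echo "unset" ;;
    esac
}

# On a running Linux system, inspect the parameters the kernel booted with.
if [ -r /proc/cmdline ]; then
    iommu_cmdline_state "$(cat /proc/cmdline)"
fi
```

If this prints "unset" after you edited /etc/default/grub, the GRUB config was probably not regenerated, or was written to a different path than the one your bootloader reads.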
Others
- For more about IOMMU in Linux, this article from Lenovo provides a very detailed report.