Enable AMD CPU with Multi-GPU System


AMD CPUs can cause deadlocks with multiple GPUs in a single system

There is a reported deadlock issue[1] when running multi-GPU DDP training on AMD CPU systems, regardless of whether the GPUs are NVIDIA or AMD Instinct. AMD has also reported that multi-GPU environments can fail due to deadlocks caused by limitations of the IOMMU enablement[2].

The deadlock can be easily reproduced by running DistributedDataParallel (DDP) (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) on an AMD CPU system with multiple GPUs.
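
As a rough sketch, the hang can be reproduced by launching a DDP training script (for example one based on the tutorial linked above) with torchrun; the script name and GPU count below are placeholders for your own setup.

#Hypothetical reproduction: run a DDP script on all local GPUs of a single node
#"ddp_example.py" is a placeholder; set --nproc_per_node to the number of GPUs installed
torchrun --standalone --nproc_per_node=4 ddp_example.py
#On an affected AMD CPU system with the IOMMU enabled, the worker processes may hang
#during gradient synchronization instead of completing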

What is DDP?

When you train a model on more than one GPU, there are several ways to distribute the work across them. DDP replicates the model on each GPU and trains each replica on a different subset of the data, so every replica can process its share of the data as fast as it can.

However, after every backward pass the gradients must be synchronized across all GPUs so that every replica applies the same optimizer step. DDP does this by spawning one process per GPU (to avoid the Python GIL) and handling the inter-process communication needed to average the gradients.

Solution - Disable IOMMU

To solve it, you need to disable the IOMMU.

The IOMMU (Input-Output Memory Management Unit) is a hardware feature, typically controlled from the BIOS, that remaps device-visible (DMA) addresses to physical memory addresses for the GPU and other peripherals. In effect it acts as an MMU for devices, letting the CPU and GPUs share memory safely.
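
Before changing anything, it can help to check whether the IOMMU is currently active. A minimal sketch using standard Linux tools (exact output varies by distribution and BIOS settings):

#Show the kernel command line currently in effect
cat /proc/cmdline
#Look for AMD IOMMU (AMD-Vi) messages in the kernel log
sudo dmesg | grep -i -e AMD-Vi -e iommu
#If the IOMMU is active, one or more entries appear here
ls /sys/class/iommu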


There are two ways to disable the IOMMU:

  1. Disable the IOMMU in the BIOS of your motherboard
  2. Disable the IOMMU through a Linux kernel parameter, using the commands below
#To enable IOMMU in passthrough mode on Ubuntu
sudo bash -c 'echo "GRUB_CMDLINE_LINUX=\"amd_iommu=on iommu=pt\"" >> /etc/default/grub'
#To disable IOMMU on Ubuntu
sudo bash -c 'echo "GRUB_CMDLINE_LINUX=\"amd_iommu=off\"" >> /etc/default/grub'
#Note: appending adds a new GRUB_CMDLINE_LINUX line; the last assignment in the file wins
#Regenerate the GRUB configuration, then reboot
sudo update-grub
sudo reboot

#To enable IOMMU in passthrough mode on CentOS
sudo bash -c 'echo "GRUB_CMDLINE_LINUX=\"amd_iommu=on iommu=pt\"" >> /etc/default/grub'
#To disable IOMMU on CentOS
sudo bash -c 'echo "GRUB_CMDLINE_LINUX=\"amd_iommu=off\"" >> /etc/default/grub'
#Regenerate the GRUB configuration (UEFI path shown), then reboot
sudo grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
sudo reboot


You can also disable the IOMMU (add "amd_iommu=off") or enable it in passthrough mode (add "amd_iommu=on iommu=pt") by editing the GRUB_CMDLINE_LINUX_DEFAULT variable in /etc/default/grub instead.
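
For example, the relevant line in /etc/default/grub might end up looking like the sketch below; "quiet splash" stands in for whatever default options your distribution already sets and should be kept.

#Example /etc/default/grub entry with the IOMMU disabled
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"
#After regenerating the GRUB configuration and rebooting, confirm the parameter took effect
cat /proc/cmdline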


References

[1] https://github.com/pytorch/pytorch/issues/52142
[2] https://community.amd.com/t5/knowledge-base/iommu-advisory-for-multi-gpu-environments/ta-p/477468