Enable AMD CPU with Multi-GPU System

From HPCWIKI
Latest revision as of 13:53, 6 June 2024

AMD CPUs cause deadlocks with multiple GPUs in a single system

There is a reported deadlock issue[1] when running multi-GPU DDP training on AMD CPU systems, regardless of whether the GPUs are Nvidia or AMD Instinct. AMD has also reported that multi-GPU environments fail due to deadlocks caused by limitations of IOMMU enablement[2].

The deadlock is easy to reproduce by running DistributedDataParallel (DDP) on an AMD CPU system with multiple GPUs.

What is DDP?

When you’re training a model on more than one GPU, there are different ways to distribute the model across the GPUs. DDP replicates the model on each GPU and then trains each on a subset of the data. This means that each model replica can just train as fast as it can on the data it’s given.

However, every time a backward pass computes gradients, they must be synchronised (averaged) across all the GPUs before the optimiser makes its step, so that every replica stays identical. DDP works by spawning multiple processes (to avoid the Python GIL) and handles the communication between the devices when synchronising gradients.

Solution - Disable IOMMU

To solve it, you need to disable IOMMU.

The IOMMU (input-output memory management unit) is a hardware component, usually toggled in the BIOS, that maps the virtual addresses seen by your GPU or other devices to physical memory addresses. It’s all about making memory access between the CPU and the GPU work better together.
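Before changing anything, it can help to see which `amd_iommu` setting the running kernel was booted with. The following is a minimal sketch, not an official tool: `iommu_cmdline_value` is a hypothetical helper name, and it only inspects a kernel command-line string for an `amd_iommu=` parameter. It does not tell you what the BIOS setting is.

```shell
# Hypothetical helper: print the value of the last amd_iommu= parameter
# found in a kernel command line, or "unset" if none is present.
# The body runs in a subshell so its variables and "set -f" do not leak.
iommu_cmdline_value() (
    set -f                      # disable globbing while word-splitting $1
    value="unset"
    for word in $1; do
        case "$word" in
            amd_iommu=*) value="${word#amd_iommu=}" ;;
        esac
    done
    printf '%s\n' "$value"
)
```

On a running Linux system it could be called as `iommu_cmdline_value "$(cat /proc/cmdline)"`. A result of `unset` means the kernel default is in effect, which on many systems leaves the IOMMU under control of the BIOS setting.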


There are two ways to disable IOMMU:

  1. Disable IOMMU in the BIOS of your motherboard
  2. Disable IOMMU through a Linux kernel parameter
# Edit the GRUB_CMDLINE_LINUX_DEFAULT value in /etc/default/grub
# To disable IOMMU, add the kernel parameter "amd_iommu=off"
# To enable IOMMU, add the kernel parameters "amd_iommu=on iommu=pt"

# Then regenerate the GRUB config and reboot
# (on Ubuntu, "sudo update-grub" does the same thing)
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot

# CentOS uses grub2-mkconfig instead of grub-mkconfig
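
The manual edit of GRUB_CMDLINE_LINUX_DEFAULT described above could also be scripted. The sketch below is one possible approach under stated assumptions: `set_amd_iommu` is a hypothetical helper, it assumes the file contains a single double-quoted `GRUB_CMDLINE_LINUX_DEFAULT="..."` line, and it only touches the `amd_iommu` parameter (it does not add `iommu=pt`). It keeps a `.bak` copy of the original file.

```shell
# Hypothetical helper: set amd_iommu=<value> inside GRUB_CMDLINE_LINUX_DEFAULT.
# Usage: set_amd_iommu <grub-defaults-file> <value>
# Assumes the variable is on one line and its value is double-quoted.
set_amd_iommu() {
    file="$1"
    value="$2"
    cp "$file" "$file.bak"      # keep a backup of the original
    if grep -Eq '^GRUB_CMDLINE_LINUX_DEFAULT=".*amd_iommu=' "$file"; then
        # An amd_iommu= parameter already exists: replace its value in place.
        sed -E "/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/amd_iommu=[^ \"]*/amd_iommu=${value}/" \
            "$file.bak" > "$file"
    else
        # No amd_iommu= parameter yet: append it before the closing quote.
        sed -E "s/^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"/\1 amd_iommu=${value}\"/" \
            "$file.bak" > "$file"
    fi
}
```

It is safer to try this on a copy of /etc/default/grub first and compare the result with the `.bak` file. Editing the real file requires a root shell, and the change only takes effect after regenerating the GRUB config and rebooting as shown above.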

Others

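References

  [1] https://github.com/pytorch/pytorch/issues/52142
  [2] https://community.amd.com/t5/knowledge-base/iommu-advisory-for-multi-gpu-environments/ta-p/477468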