FAQ

From HPCWIKI

Revision as of 12:49, 25 April 2023

System reboot: software or hardware?

If your system reboots randomly without leaving any kernel logs, the most likely cause is an unstable power supply, even if the PSU appears to be working normally.

Beyond the PSU, the kernel.panic parameter helps narrow the cause of a reboot down to a software issue or a hardware issue.

If kernel.panic is 0, automatic reboot on panic is turned off; a positive value is the number of seconds the kernel waits before rebooting.

Running sysctl -w kernel.panic=0 turns it off, if it is not already off.


If kernel.panic is 0 and your server still reboots itself, the cause is most likely a hardware issue. If setting it to 0 stops the automatic rebooting, the reboots were caused by a watchdog timer or another software issue.
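
The check above can be sketched as a small script. This is only an illustration: describe_panic is a hypothetical helper name, and the script assumes a Linux system where /proc/sys/kernel/panic mirrors the value reported by sysctl kernel.panic.

```shell
#!/bin/sh
# Interpret a kernel.panic value.
# describe_panic is a hypothetical helper, not a standard tool.
describe_panic() {
    case "$1" in
        ''|*[!0-9-]*) echo "unrecognized value: $1" ;;
        0)            echo "auto-reboot on panic is off" ;;
        -*)           echo "reboot immediately on panic" ;;
        *)            echo "auto-reboot $1 s after a panic" ;;
    esac
}

# Read the current setting; empty if the file is not available.
current=$(cat /proc/sys/kernel/panic 2>/dev/null)
describe_panic "$current"

# To turn auto-reboot off until the next boot (needs root):
#   sudo sysctl -w kernel.panic=0
```

A change made with sysctl -w does not survive a reboot, so it has to be applied again after each restart while the problem is being traced.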

Failed to load plugin io.containerd and could not use snapshotter

  • Reason: an informational warning from the snapshotter[1] (containerd's image storage layer), which offers many backends to choose from
  • Impact: the warning does not affect the operation of the system as a whole
  • Solution:

1. Disable the snapshotter plugins you don't need by updating the config file for your system and restarting containerd, for example:

# /etc/containerd/config.toml
disabled_plugins = ["cri", "btrfs"]

2. To use ZFS, mount a ZFS dataset on /var/lib/containerd/io.containerd.snapshotter.v1.zfs

3. To use btrfs, mount a btrfs filesystem on /var/lib/containerd/io.containerd.snapshotter.v1.btrfs

4. To use aufs, load its kernel module with modprobe, as explained in the error log
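
To see which snapshotters containerd actually failed to load before deciding what to disable, something like the following may help. It is a sketch under assumptions: unloaded_snapshotters is a hypothetical helper, and the filter assumes that ctr plugins ls prints whitespace-separated TYPE, ID, PLATFORMS and STATUS columns, with the status in the last column.

```shell
#!/bin/sh
# Filter `ctr plugins ls`-style output read from stdin down to the
# snapshotter plugins whose status is not "ok".
# Assumed column layout: TYPE ID PLATFORMS STATUS.
unloaded_snapshotters() {
    awk '$1 == "io.containerd.snapshotter.v1" && $NF != "ok" { print $2, $NF }'
}

# Run against the live daemon when ctr is installed (usually needs root).
if command -v ctr >/dev/null 2>&1; then
    ctr plugins ls | unloaded_snapshotters
fi
```

Each output line is a snapshotter ID followed by its status, which gives the candidates for disabled_plugins.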

Could not select device driver "" with capabilities: GPU

  • Reason: nvidia-container-toolkit is not installed, or the installed package is corrupt
  • Solution: install or reinstall nvidia-container-toolkit, then restart the Docker daemon
$ distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

$ sudo systemctl restart docker
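
After the restart, a quick sanity check can confirm that the pieces are in place. The script below is a minimal sketch: check_gpu_stack is a hypothetical helper, and it only tests whether the expected binaries are on PATH, not whether they work.

```shell
#!/bin/sh
# Report which pieces of the GPU container stack are on PATH.
# check_gpu_stack is a hypothetical helper, not part of any NVIDIA tool.
check_gpu_stack() {
    for tool in nvidia-smi nvidia-container-cli docker; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

check_gpu_stack

# End-to-end check once everything is found (the image is an assumption):
#   sudo docker run --rm --gpus all ubuntu nvidia-smi
```

If the docker run line prints the usual nvidia-smi table, the original error should no longer appear.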

PyTorch FAQ

  • How to get CUDA compute capability of a GPU?
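    • $python -c "import torch; print(torch.cuda.get_device_capability())" # (major, minor) compute capability of the current GPU
    • $python -c "import torch; print(torch.cuda.get_arch_list())" # architectures this PyTorch build was compiled for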

Show List Of Network Cards on Linux

  • lspci command : List all PCI devices.
    • #lspci | grep -E -i --color 'network|ethernet'
    • #lspci | grep -E -i --color 'network|ethernet|wireless|wi-fi'
  • lshw command : Linux identify Ethernet interfaces and NIC hardware.
    • #lshw -class network
    • $sudo lshw -class network -short
  • dmidecode command : List all hardware data from BIOS.
  • ifconfig command : Legacy network configuration tool, superseded by ip.
    • $ifconfig -a
  • ip command : Recommended replacement for ifconfig.
    • $ip link show
    • $ip a
    • $ip a show wlp82s0
    • $ip -br -c link show # List all interfaces with link status, MAC address, etc.
    • $ip -br -c addr show # Similar list with IP addresses instead of MAC addresses
  • hwinfo command : Probe Linux for network cards.
    • $sudo hwinfo --network --short
  • ethtool command : See NIC/card driver and settings on Linux.
    • $sudo ethtool -i eno1
    • $sudo ethtool -i enp0s31f6
  • /proc/net/dev file : The dev pseudo-file contains network device status information: the number of received and sent packets, the number of errors and collisions, and other basic statistics.
    • $cat /proc/net/dev
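
The raw /proc/net/dev table is wide and hard to scan, so a small filter can pull out the counters mentioned above. This is a sketch only: net_summary is a hypothetical helper, and the field positions assume the usual layout of two header lines followed by one line per interface with eight receive and eight transmit counters.

```shell
#!/bin/sh
# Summarize a /proc/net/dev-style table read from stdin.
# net_summary is a hypothetical helper name.
net_summary() {
    awk 'NR > 2 {
        sub(/:/, " ")   # split "eth0:123" into "eth0 123"; awk re-splits fields
        printf "%s rx_packets=%s rx_errs=%s tx_packets=%s tx_errs=%s colls=%s\n",
               $1, $3, $4, $11, $12, $15
    }'
}

# Print the summary for this machine when the file is readable.
if [ -r /proc/net/dev ]; then
    net_summary < /proc/net/dev
fi
```

Running it twice a few seconds apart and comparing the error counters is a simple way to spot an interface that is actively dropping traffic.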

Failed to set iommu for container: Invalid argument

A VM configured with a vGPU that supports SR-IOV may fail to start with this error. The issue occurs when PCIe AER (Advanced Error Reporting) support is disabled in the BIOS settings of the server; enabling AER in the BIOS should resolve it.

Reference