FAQ

From HPCWIKI

Latest revision as of 10:12, 15 May 2023

System reboot - by software or hardware?

If your system reboots randomly without leaving kernel logs, the most likely cause is an unstable power supply, even if the PSU appears to be working well.

Beyond the PSU, we can narrow the cause of a reboot down to a software or a hardware issue with the kernel.panic parameter.

If the kernel.panic system parameter is 0, automatic reboot on panic is turned off. A positive value is the number of seconds the kernel waits before rebooting, and a negative value makes it reboot immediately.

With sysctl -w kernel.panic=0 you can turn automatic reboot off, if it is not already off.

If this is set to 0 and your server still reboots itself, the cause is most likely a hardware issue. If this stops the automatic rebooting, then we know the reboot is caused by a watchdog timer or another software issue.
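
The check above can be sketched as a small shell helper. This is a minimal sketch, not a definitive tool: the function name describe_kernel_panic is made up for illustration, the handling of negative values follows the kernel's sysctl documentation, and the journalctl hint assumes a systemd host with a persistent journal.

```shell
# Describe what a given kernel.panic value means.
# (describe_kernel_panic is a hypothetical helper name.)
describe_kernel_panic() {
  case "$1" in
    0)  echo "automatic reboot on panic is off" ;;
    -*) echo "reboot immediately on panic" ;;
    *)  echo "reboot $1 seconds after a panic" ;;
  esac
}

# Show the current setting when the proc file is readable.
if [ -r /proc/sys/kernel/panic ]; then
  describe_kernel_panic "$(cat /proc/sys/kernel/panic)"
fi

# After an unexpected reboot, look for a panic trace in the previous boot
# (assumes systemd with a persistent journal):
#   journalctl -k -b -1 | tail -n 50
```

If the previous boot's kernel log ends abruptly with no panic trace, that is consistent with the power or hardware explanation above.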

Failed to load plugin io.containerd and could not use snapshotter

  • Reason: a warning or informational message from the snapshotter[1] (image storage) plugins. containerd offers many snapshotter choices, and the ones it cannot load are reported in the log.
  • Impact: the warning does not affect the operation of the system.
  • Solution:

1. Disable the snapshotter plugins you don't need by updating the config file for your system and restarting containerd, for example:

# /etc/containerd/config.toml
disabled_plugins = ["cri", "btrfs"]

2. To use ZFS, mount a ZFS dataset on /var/lib/containerd/io.containerd.snapshotter.v1.zfs

3. To use btrfs, mount a btrfs filesystem on /var/lib/containerd/io.containerd.snapshotter.v1.btrfs

4. For aufs, load the kernel module with modprobe, as explained in the error log
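
Before changing the config, it can help to see whether the snapshotter directories from steps 2 and 3 are actually mounted. The following is a rough sketch: the helper names snapshotter_dir and check_snapshotter_mount are hypothetical, and it assumes the mountpoint utility from util-linux is installed.

```shell
# Root directory containerd uses for a given snapshotter
# (path pattern taken from steps 2 and 3 above).
snapshotter_dir() {
  printf '/var/lib/containerd/io.containerd.snapshotter.v1.%s\n' "$1"
}

# Report whether that directory is currently a mount point.
check_snapshotter_mount() {
  dir=$(snapshotter_dir "$1")
  if mountpoint -q "$dir" 2>/dev/null; then
    echo "$1: $dir is a mount point"
  else
    echo "$1: $dir is not a mount point"
  fi
}

check_snapshotter_mount zfs
check_snapshotter_mount btrfs

# After editing /etc/containerd/config.toml, restart and list plugin status:
#   sudo systemctl restart containerd
#   sudo ctr plugins ls
```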

Could not select device driver "" with capabilities: GPU

  • Reason: nvidia-container-toolkit is not installed, or the installed package is corrupt
  • Solution: install or reinstall nvidia-container-toolkit, then restart the Docker daemon
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

$ sudo systemctl restart docker
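
The repository URL above depends on the distribution string, so it is worth checking what that string evaluates to on your host. Here is a small sketch: distribution_id is a hypothetical helper name, and the optional file argument exists only so that the function can be tried against a sample file.

```shell
# Print ID followed by VERSION_ID from an os-release style file.
# Runs in a subshell so the sourced variables do not leak into the caller.
distribution_id() {
  ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Show the value for this host when the file exists.
if [ -r /etc/os-release ]; then
  distribution_id
fi

# Once docker has been restarted, a quick check that GPUs are visible
# inside a container (assumes the NVIDIA driver is installed on the host):
#   sudo docker run --rm --gpus all ubuntu nvidia-smi
```

If the printed string does not match a directory published by the repository, the curl step that downloads the list file will not return a valid source list.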

Pytorch FAQ

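  • How to get the CUDA compute capability of a GPU?
    • $ python -c "import torch; print(torch.cuda.get_device_capability())" # (major, minor) capability of the current GPU
    • $ python -c "import torch; print(torch.cuda.get_arch_list())" # architectures this PyTorch build was compiled for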

Show List Of Network Cards on Linux

  • lspci command: list all PCI devices.
    • # lspci | grep -E -i --color 'network|ethernet'
    • # lspci | grep -E -i --color 'network|ethernet|wireless|wi-fi'
  • lshw command: identify Ethernet interfaces and NIC hardware.
    • # lshw -class network
    • $ sudo lshw -class network -short
  • dmidecode command: list all hardware data from BIOS.
  • ifconfig command: outdated network configuration tool.
    • $ ifconfig -a
  • ip command: recommended replacement for ifconfig.
    • $ ip link show
    • $ ip a
    • $ ip a show wlp82s0
    • $ ip -br -c link show # list all interfaces with link status, MAC address, etc.
    • $ ip -br -c addr show # similar list with IP addresses instead of MAC addresses
  • hwinfo command: probe Linux for network cards.
    • $ sudo hwinfo --network --short
  • ethtool command: see NIC driver and settings on Linux.
    • $ sudo ethtool -i eno1
    • $ sudo ethtool -i enp0s31f6
  • /proc/net/dev file: this pseudo-file contains network device status information, namely the number of received and sent packets, the number of errors and collisions, and other basic statistics.
    • $ cat /proc/net/dev
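
The raw /proc/net/dev table is wide, so a short awk filter can pull out just the packet counters per interface. This is a minimal sketch: net_dev_summary is a hypothetical helper name, and it assumes the usual layout of two header lines followed by one line per interface.

```shell
# Print interface name with received and transmitted packet counts.
# $1: file to read (defaults to /proc/net/dev).
net_dev_summary() {
  awk 'NR > 2 {
         sub(/:/, " ")   # split "eth0:" from the first counter
         printf "%s rx_packets=%s tx_packets=%s\n", $1, $3, $11
       }' "${1:-/proc/net/dev}"
}

# Summarise the live counters when the file is readable.
if [ -r /proc/net/dev ]; then
  net_dev_summary
fi
```

After the colon is replaced, field 3 is the receive packet count and field 11 is the transmit packet count.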

Failed to set iommu for container: Invalid argument

A VM configured with a vGPU that supports SR-IOV may fail to start with this error. This issue occurs when PCIe AER (Advanced Error Reporting) support is disabled in the BIOS settings of the server, so enabling AER in the BIOS should resolve it.

Identify network driver on Linux

There are several ways to identify the network driver on your system[2]. One of the shortest is to identify the device on the PCI bus and then find the driver used by that device.

#Identify PCI device using lspci
$ lspci | grep -i eth
e4:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
e4:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

#Identify device driver using the PCI bus number
$ find /sys | grep 'drivers.*e4'
/sys/bus/pci/drivers/igb/0000:e4:00.0
/sys/bus/pci/drivers/igb/0000:e4:00.1

#Use ethtool to get more information 
$ ethtool -i <interfacename>
driver: igb
version: 5.6.0-k
firmware-version: 1.63, 0x80000a05
expansion-rom-version: 
bus-info: 0000:e4:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
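
As an alternative to searching all of /sys, the driver can be read from the interface's sysfs symlink. This is a minimal sketch: nic_driver is a hypothetical helper name, the optional second argument exists only so that the function can be tried against a fake tree, and it assumes readlink supports -f.

```shell
# Print the kernel driver bound to a network interface.
# $1: interface name, $2: sysfs root (defaults to /sys).
nic_driver() {
  link="${2:-/sys}/class/net/$1/device/driver"
  [ -e "$link" ] || return 1          # virtual interfaces have no device
  basename "$(readlink -f "$link")"   # the symlink target ends in the driver name
}

# List every interface with its driver.
for path in /sys/class/net/*; do
  [ -e "$path" ] || continue
  ifname=${path##*/}
  drv=$(nic_driver "$ifname") || drv="(no device driver)"
  printf '%s: %s\n' "$ifname" "$drv"
done
```

On the system shown above, an interface backed by the I350 controller would be expected to report igb, matching the ethtool output.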

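Reference

[2] https://unix.stackexchange.com/questions/41817/linux-how-to-find-the-device-driver-used-for-a-device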