Kernel tips and tricks
Revision as of 10:22, 18 February 2024
APEI (ACPI Platform Error Interface) generic hardware error[1]
- Identify the specific hardware error source using mcelog or the APEI kernel log (/var/log/kern.log.x)
- Update System Firmware - Manufacturers often release firmware updates to address hardware error-related issues. Applying the latest firmware (BIOS/UEFI/device firmware) can resolve known APEI generic hardware error problems
- Kernel Parameter Tuning - the “apei=off” kernel parameter can be used to disable APEI error handling altogether, although this is not recommended as it may compromise system stability
- Hardware Diagnostics - memtest86 for memory testing or CPU stress tests for CPU-related issues
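As a rough illustration of the first step, the sketch below filters a kernel log for lines that look like hardware error reports. The function name `filter_hw_errors` is hypothetical, and the search patterns are assumptions based on common kernel message prefixes, so they may need adjusting for a given system.

```shell
#!/bin/sh
# Keep only log lines that look like hardware error reports.
# The patterns are assumptions (common kernel message prefixes), not a complete list.
filter_hw_errors() {
    grep -E -i 'hardware error|APEI|GHES|mce:'
}

# Typical use (reading the kernel log may require root):
#   dmesg | filter_hw_errors
#   filter_hw_errors < /var/log/kern.log
```

The filter reads standard input, so it can be pointed at the live kernel ring buffer or at a rotated log file alike.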
sched: RT throttling activated
Symptoms
The "RT throttling activated" message means that the operating system scheduler has identified some real-time (RT) threads that are hogging the CPU and starving other threads. The scheduler throttles those real-time tasks to keep the operating system from becoming unresponsive.[2]
From the kernel's point of view, throttling kicks in when RT threads have occupied the CPU for 950 ms out of every 1 s (the budget and period are defined by /proc/sys/kernel/sched_rt_runtime_us and /proc/sys/kernel/sched_rt_period_us), regardless of whether the RT thread is a business (application) thread or some other unknown thread. The current Linux kernel only prints "sched: RT throttling activated" when RT throttling happens, so it is hard to tell which RT thread is responsible.
Put simply, normal threads cannot get any CPU time, and at that moment the kernel logs 'sched: RT throttling activated'.
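Since the log message does not name the offending thread, one way to narrow it down is to list the threads that currently run under a real-time scheduling class. The sketch below assumes the procps `ps` format fields `pid,tid,cls,rtprio,comm`; the helper name `filter_rt_threads` is hypothetical. It only shows candidates, not which thread actually triggered the throttling.

```shell
#!/bin/sh
# Keep the header line plus rows whose scheduling class (3rd column)
# is FF (SCHED_FIFO) or RR (SCHED_RR).
filter_rt_threads() {
    awk 'NR == 1 || $3 == "FF" || $3 == "RR"'
}

# Typical use:
#   ps -eLo pid,tid,cls,rtprio,comm | filter_rt_threads
```

Kernel threads such as the per-CPU migration threads will normally show up in this list as well, so the output still needs to be read with some care.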
- Linux patch to print more information: for further analysis, a Linux kernel patch is available that prints the current RT task when RT throttling is activated, which helps identify the responsible RT thread right away.
- To limit real-time tasks to only 50% CPU usage and use a larger period, the values can be changed with the following commands:[3]
# echo 2000000 > /proc/sys/kernel/sched_rt_period_us
# echo 1000000 > /proc/sys/kernel/sched_rt_runtime_us
- Real-time throttling is disabled when the real-time task runtime has the same length as the period. Writing `-1` into `sched_rt_runtime_us` achieves this automatically:
# echo -1 > /proc/sys/kernel/sched_rt_runtime_us
This mechanism is already implemented in mainline Linux.
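To sanity-check a pair of values before writing them, the CPU share they give to real-time tasks can be computed with a small helper. This is only a sketch: `rt_budget_percent` is a hypothetical name, and the validity rule (runtime must not exceed the period, with `-1` meaning no limit) is an assumption about how the kernel treats these settings.

```shell
#!/bin/sh
# Print the share of each period available to RT tasks as an integer percent.
# Arguments: runtime_us period_us (the values of the two sysctl files).
rt_budget_percent() {
    runtime=$1
    period=$2
    if [ "$runtime" -eq -1 ]; then
        # -1 means throttling is disabled
        echo "unlimited"
    elif [ "$period" -le 0 ] || [ "$runtime" -lt 0 ] || [ "$runtime" -gt "$period" ]; then
        # Assumed to be rejected by the kernel as well
        echo "invalid"
        return 1
    else
        echo $(( runtime * 100 / period ))
    fi
}

# Typical use with the live values:
#   rt_budget_percent "$(cat /proc/sys/kernel/sched_rt_runtime_us)" \
#                     "$(cat /proc/sys/kernel/sched_rt_period_us)"
```

With a runtime of 950000 and a period of 1000000 the helper prints 95, and with the 1000000 and 2000000 values from the commands above it prints 50.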
perf interrupt took too long in system log
Symptoms
The Linux kernel gathers samples using the ‘perf’ performance monitor and tries to do so without affecting latencies. The samples include interrupt times. If interrupts take too long, a message similar to this is printed:
system log shows "perf interrupt took too long (aaa > bbb), lowering kernel.perf_event_max_sample_rate to ccc"
This essentially means that the machine was stuck on an interrupt for a long time. This can have a number of causes[4], including:
- A disk IO interrupt taking too long can be caused by a faulty, slow or overloaded disk. Alternatively, it can be caused by an issue with a disk or RAID controller.
- A network IO interrupt taking too long is most often caused by suboptimal network drivers. Alternatively, it can be caused by network issues, although protocol switching should theoretically prevent that.
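To see how often the message appears and how far the sample rate has been lowered, the three numbers can be pulled out of the log. The sketch below assumes the message wording quoted above; the helper name `parse_perf_throttle` is hypothetical.

```shell
#!/bin/sh
# For every matching log line, print three fields:
# measured duration, allowed duration, new sample rate (aaa bbb ccc).
parse_perf_throttle() {
    sed -n -E 's/.*interrupt took too long \(([0-9]+) > ([0-9]+)\), lowering kernel\.perf_event_max_sample_rate to ([0-9]+).*/\1 \2 \3/p'
}

# Typical use (reading the kernel log may require root):
#   dmesg | parse_perf_throttle
```

Lines that do not match are dropped, so the output can be counted or sorted directly to see the trend over time.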
Troubleshooting
Disk IO can easily be checked and confirmed with disk IO statistics (sysstat's sar and/or iostat). If disk IO is not the reason for the slow interrupts, network IO is the next most likely cause and needs to be checked on the network and/or kernel side.
If there is no issue with the kernel drivers, the network is most likely at fault, most probably at the first hop. This then needs to be checked on the network side.
Impact
This should not be a concern. The message comes from the Linux perf performance-monitoring code, which is included in the kernel. The kernel automatically determines the sample rate that can be used without impacting system performance too much, and it logs this even when the perf tool isn't active, or even installed. Messages like this are triggered by high(er) system load or a CPU that is scaling.[5]
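The sample rate the kernel has currently settled on can be read back from the sysctl named in the message. A minimal sketch follows; `show_perf_sample_rate` is a hypothetical helper, and its optional path argument exists only so that the function can be tried against a test file.

```shell
#!/bin/sh
# Print the current value of kernel.perf_event_max_sample_rate.
# Optional argument: an alternative file to read (for testing only).
show_perf_sample_rate() {
    file=${1:-/proc/sys/kernel/perf_event_max_sample_rate}
    if [ -r "$file" ]; then
        echo "perf_event_max_sample_rate: $(cat "$file")"
    else
        echo "cannot read $file" >&2
        return 1
    fi
}

# Typical use:
#   show_perf_sample_rate
```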
References
- [1] https://medium.com/@nothanjack/dealing-with-apei-generic-hardware-error-source-problems-in-linux-a8ee8a67c8c1
- [2] https://www.dell.com/support/kbdoc/ko-kr/000167765/scaleio-resource-contention-troubleshooting
- [3] https://wiki.linuxfoundation.org/realtime/documentation/technical_basics/sched_rt_throttling
- [4] https://discuss.aerospike.com/t/what-does-interrupt-took-too-long-mean/5818
- [5] https://bbs.archlinux.org/viewtopic.php?id=187636