Linux Kernel Tuning
Swap Space
Swap space in Linux is used when physical memory (RAM) is full. Swap space can be a dedicated swap partition (recommended), a swap file, or a combination of swap partitions and swap files. Note that Btrfs did not support swap files before Linux 5.0, and still places restrictions on them.
Although there are no firm guidelines for sizing swap space, the recommended amount is best treated as a function of the system's memory workload rather than of the amount of RAM, since modern systems often have hundreds of gigabytes of RAM.
Red Hat's recommendation for swap space is:
| Amount of RAM in the system | Recommended swap space | Recommended swap space if allowing for hibernation |
|---|---|---|
| ⩽ 2 GB | 2 times the amount of RAM | 3 times the amount of RAM |
| > 2 GB – 8 GB | Equal to the amount of RAM | 2 times the amount of RAM |
| > 8 GB – 64 GB | At least 4 GB | 1.5 times the amount of RAM |
| > 64 GB | At least 4 GB | Hibernation not recommended |
See details at DLS validation > Generating a virtual memory pressure
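Since swap space can be a swap file as well as a partition, here is a minimal sketch of creating and enabling one, assuming a 4 GB file at /swapfile (the size and path are examples; adjust them for your system):
# Create a 4 GB swap file and restrict its permissions
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
# Format it as swap and enable it
sudo mkswap /swapfile
sudo swapon /swapfile
# To keep it across reboots, add an entry to /etc/fstab
echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab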
Linux kernel tuning for GlusterFS[1]
vm.swappiness
vm.swappiness is a tunable kernel parameter that controls how much the kernel favors swap over RAM. At the source code level, it is also described as the tendency to steal mapped memory. A high swappiness value means the kernel will be more apt to unmap mapped pages; a low swappiness value means the opposite, the kernel will be less apt to unmap them. In other words, the higher the vm.swappiness value, the more the system will swap.
Heavy swapping has very undesirable effects when large chunks of data are being swapped in and out of RAM. Many have argued for setting the value high, but in my experience setting it to '0' gives a performance increase.
This is consistent with the details here, but again, these changes should be driven by testing and by the user's own due diligence for their applications. Heavily loaded streaming applications should set this value to '0'; changing it to '0' improves the system's responsiveness.
vm.swappiness takes a value between 0 and 100 to change the balance between swapping applications and freeing cache. At 100, the kernel will always prefer to find inactive pages and swap them out; in other cases, whether a swapout occurs depends on how much application memory is in use and how poorly the cache is doing at finding and releasing inactive items.[2]
To check the swappiness value
cat /proc/sys/vm/swappiness
To change swapping behavior, use either echo or sysctl
sudo sysctl vm.swappiness=10
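The equivalent change with echo would look roughly like the following (tee is used because the redirection itself needs root privileges):
echo 10 | sudo tee /proc/sys/vm/swappiness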
To make a change permanent, edit the configuration file /etc/sysctl.conf
vm.swappiness=10
then apply it with sudo sysctl --load=/etc/sysctl.conf
vm.vfs_cache_pressure
This option controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects.
At the default value of vfs_cache_pressure=100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due to memory pressure and this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes.
With GlusterFS, many users with a lot of storage and many small files easily end up using a lot of RAM on the server side due to 'inode/dentry' caching, which leads to decreased performance as the kernel keeps crawling through data structures on a system with 40GB of RAM. Raising this value above 100 has helped many users achieve fair caching and a more responsive kernel.
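For example, a value such as 120 (an illustrative number; tune it based on your own testing) can be set like this:
# Check the current value
cat /proc/sys/vm/vfs_cache_pressure
# Raise it at runtime (example value)
sudo sysctl vm.vfs_cache_pressure=120
# Persist the change across reboots
echo "vm.vfs_cache_pressure=120" | sudo tee -a /etc/sysctl.conf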
vm.dirty_background_ratio
vm.dirty_ratio
The first of the two (vm.dirty_background_ratio) defines the percentage of memory that can become dirty before background flushing of the pages to disk starts. Until this percentage is reached, no pages are flushed to disk; once flushing starts, it is done in the background without disrupting the running foreground processes.
The second parameter (vm.dirty_ratio) defines the percentage of memory that can be occupied by dirty pages before a forced flush starts. When the percentage of dirty pages reaches this threshold, writes become synchronous: the processes doing I/O are blocked and not allowed to continue until the operations they have requested have actually been performed and the data is on disk. On high-performance I/O machines this causes a problem, because data caching is cut away and every process doing I/O blocks waiting for the disk. The result is a large number of hung processes, which leads to high load, an unstable system, and poor performance.
Lowering these from their default values causes data to be flushed to disk more frequently, in smaller batches, instead of accumulating in RAM. This helps large-memory systems, which would otherwise flush a 45G-90G pagecache to disk at once, causing huge wait times for front-end applications and decreasing overall responsiveness and interactivity.
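As an illustration, values along these lines are sometimes used on write-heavy, large-memory servers (the numbers are examples only and should be validated against your own workload):
# Start background writeback earlier and force synchronous writes sooner (example values)
sudo sysctl vm.dirty_background_ratio=5
sudo sysctl vm.dirty_ratio=10
# Persist the settings across reboots
echo "vm.dirty_background_ratio=5" | sudo tee -a /etc/sysctl.conf
echo "vm.dirty_ratio=10" | sudo tee -a /etc/sysctl.conf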
"1" > /proc/sys/vm/pagecache
Page cache is a disk cache which holds data from files and executable programs, i.e. pages with the actual contents of files or block devices. The page cache (disk cache) is used to reduce the number of disk reads. A value of '1' means that only 1% of RAM is used for the page cache, so most data is fetched from disk rather than from RAM. The lower the percentage, the more the system favors reclaiming unmapped pagecache memory over mapped memory. High values (like the default value of 100) are not recommended for databases.
# The pagecache parameter can be changed in the proc file system without a reboot
echo "40" | sudo tee /proc/sys/vm/pagecache
# or use sysctl(8) to change it
sudo sysctl -w vm.pagecache=40
# To make the change persist across reboots, add it to /etc/sysctl.conf
echo "vm.pagecache=40" | sudo tee -a /etc/sysctl.conf
To control the percentage of total memory used for the page cache, change the pagecache kernel parameter.
"deadline" > /sys/block/sdc/queue/scheduler
The I/O scheduler is the component of the Linux kernel that decides how read and write requests are queued for the underlying device. Theoretically, 'noop' is better with a smart RAID controller, because Linux knows nothing about the (physical) disk geometry, so it can be efficient to let the controller, which is aware of the geometry, handle the requests as soon as possible. In practice, however, 'deadline' seems to enhance performance. You can read more about the schedulers in the Linux kernel source documentation: linux/Documentation/block/*iosched.txt. I have also seen 'read' throughput increase during mixed operations (many writes).
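To see which schedulers are available for a device and to switch at runtime (using /dev/sdc from the example above; on newer kernels that use blk-mq the equivalent scheduler is called 'mq-deadline'):
# List the available schedulers; the active one is shown in [brackets]
cat /sys/block/sdc/queue/scheduler
# Switch to the deadline scheduler at runtime
echo "deadline" | sudo tee /sys/block/sdc/queue/scheduler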
"256" > /sys/block/sdc/queue/nr_requests
This is the number of I/O requests that can be buffered by the I/O scheduler before they are sent to the disk. On some controllers the internal queue size (queue_depth) is larger than the I/O scheduler's nr_requests, so the scheduler doesn't get much of a chance to properly order and merge the requests. The deadline and CFQ schedulers like nr_requests to be set to 2 times the value of queue_depth, which is the default for a given controller. Ordering and merging the requests helps the scheduler stay responsive under heavy load.
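For example, the controller's queue depth can be read back and nr_requests sized to roughly twice that value (sdc is the example device; SCSI devices expose queue_depth under /sys/block/<dev>/device):
# Check the device's queue depth
cat /sys/block/sdc/device/queue_depth
# Set nr_requests to about twice the queue depth
echo "256" | sudo tee /sys/block/sdc/queue/nr_requests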
echo "16" > /proc/sys/vm/page-cluster
page-cluster controls the number of pages which are written to swap in a single attempt. It defines the swap I/O size, in the above example adding '16' as per the RAID stripe size of 64k. This wouldn't make sense after you have used swappiness=0, but if you defined swappiness=10 or 20, then using this value helps when your have a RAID stripe size of 64k.
blockdev --setra 4096 /dev/<device> (e.g. sdb, hdc, or a device-mapper device)
Default block device settings often result in terrible performance with many RAID controllers. Setting read-ahead to 4096 * 512-byte sectors, as above, increases throughput at least for streamed copies: it saturates the drive's integrated cache by reading ahead during the time the kernel spends preparing I/O, and it may pull in cached data that the next read will request. Too much read-ahead can kill random I/O on huge files, however, if it consumes potentially useful drive time or loads data beyond the caches.
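To check the current read-ahead value and change it (device names are examples; 4096 sectors of 512 bytes is 2 MB):
# Show the current read-ahead setting, in 512-byte sectors
sudo blockdev --getra /dev/sdb
# Set read-ahead to 4096 sectors (2 MB)
sudo blockdev --setra 4096 /dev/sdb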
A few other miscellaneous changes, recommended at the filesystem level but not yet tested, are listed below. Make sure that your filesystem knows about the stripe size and the number of disks in the array, e.g. for a RAID5 array with a stripe size of 64K and 6 disks (effectively 5, because in every stripe set one disk holds parity). These suggestions are built on theoretical assumptions and gathered from various other blogs/articles provided by RAID experts.
-> ext4 fs, 5 disks, 64K stripe, units in 4K blocks
mkfs -t ext4 -E stride=$((64/4)) /dev/<device>
-> xfs, 5 disks, 64K stripe, units in 512-byte sectors
mkfs -t xfs -d sunit=$((64*2)) -d swidth=$((5*64*2)) /dev/<device>
You may want to consider increasing the above stripe sizes for streaming large files.
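If you want to verify the alignment afterwards, the recorded stripe parameters can be read back (device and mount point are examples):
# For ext4: show the RAID stride and stripe width stored in the superblock
sudo tune2fs -l /dev/<device> | grep -iE 'stride|stripe'
# For XFS: sunit and swidth are reported by xfs_info for a mounted filesystem
xfs_info /mountpoint | grep -i sunit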
WARNING: The above changes are highly subjective and depend heavily on the type of application. This article doesn't guarantee any benefits whatsoever without prior due diligence from the user for their respective applications. Apply the changes only if you expect an increase in overall system responsiveness or if they resolve ongoing issues.
References
- http://dom.as/2008/02/05/linux-io-schedulers/
- http://www.nextre.it/oracledocs/oraclemyths.html
- https://lkml.org/lkml/2006/11/15/40
- http://misterd77.blogspot.com/2007/11/3ware-hardware-raid-vs-linux-software.html