THP (Transparent Huge Pages)

From HPCWIKI

Transparent Huge Pages (THP) is a Linux memory management feature that reduces the overhead of Translation Lookaside Buffer (TLB) lookups on machines with large amounts of memory by using larger memory pages.


THP can be enabled system-wide or restricted to certain tasks, or even to memory ranges inside a task's address space. Unless THP is completely disabled, the khugepaged daemon scans memory and collapses sequences of basic pages into huge pages.

On a Linux system, memory is managed in blocks known as pages. A base page is typically 4096 bytes, so 1 MiB of memory is equal to 256 pages and 1 GiB of memory is equal to 262,144 pages.
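The page arithmetic above can be checked with a small shell sketch. The helper name pages_for_bytes is hypothetical and exists only for this illustration; getconf PAGESIZE reports the base page size of the running system, with 4096 assumed as a fallback.

```shell
# pages_for_bytes BYTES PAGE_SIZE
# Hypothetical helper: print how many pages of PAGE_SIZE bytes are needed
# to hold BYTES, rounding up to a whole page.
pages_for_bytes() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Base page size of the running system (falls back to 4096 if getconf is missing).
page_size=$(getconf PAGESIZE 2>/dev/null || echo 4096)
echo "base page size: ${page_size} bytes"

# Page counts for 4096-byte pages.
echo "pages in 1 MiB: $(pages_for_bytes $((1024 * 1024)) 4096)"
echo "pages in 1 GiB: $(pages_for_bytes $((1024 * 1024 * 1024)) 4096)"
```

With 4096-byte pages, the two counts come out as 256 and 262144.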

CPUs have a built-in memory management unit that contains a list of these pages.

There are two ways to enable the system to manage large amounts of memory:

  1. Increase the number of page table entries in the hardware memory management unit
  2. Increase the page size   

The first method is expensive, since the hardware memory management unit in a modern processor supports only hundreds or thousands of page table entries. Additionally, hardware and memory management algorithms that work well with thousands of pages (megabytes of memory) may have difficulty performing well with millions (or even billions) of pages. This results in performance issues: when an application needs to use more memory pages than the memory management unit supports, the system falls back to slower, software-based memory management, which causes the entire system to run more slowly. This can affect performance-critical computing applications that deal with large memory working sets.
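To see why a larger page size helps, consider how much memory a fixed number of hardware entries can cover. The sketch below is only illustrative: the helper tlb_reach_kib and the entry count of 1536 are assumptions made for this example, not values taken from any particular processor.

```shell
# tlb_reach_kib ENTRIES PAGE_KIB
# Hypothetical helper: memory (in KiB) that ENTRIES hardware entries can
# cover when each entry maps one page of PAGE_KIB KiB.
tlb_reach_kib() {
    echo $(( $1 * $2 ))
}

entries=1536   # assumed entry count, chosen only for illustration

echo "4 KiB pages: $(tlb_reach_kib "$entries" 4) KiB covered"
echo "2 MiB pages: $(tlb_reach_kib "$entries" 2048) KiB covered"
```

Under these assumptions the same number of entries covers 6 MiB with 4 KiB pages and 3 GiB with 2 MiB pages.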

THP by Workload

Type Comments on THP

Database
Database workloads often perform poorly with THP enabled, because they tend to have sparse rather than contiguous memory access patterns. When running MongoDB on Linux, THP should be disabled for best performance.[1]
THP, a Linux memory management feature, often slows down database performance.[2]

Oracle
Transparent HugePages can cause memory allocation delays during runtime. To avoid performance issues, Oracle recommends that you disable Transparent HugePages on all Oracle Database servers.[3]

Configure Huge Pages

Huge pages are blocks of memory that come in 2 MB and 1 GB sizes. The page tables used by the 2 MB pages are suitable for managing multiple gigabytes of memory, whereas the page tables of 1 GB pages are best for scaling to terabytes of memory. Huge pages require contiguous areas of memory, so allocating them at boot is the most reliable method, since memory has not yet become fragmented.
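As a rough sizing aid, the number of huge pages needed to back a given amount of memory can be computed as follows. This is a minimal sketch: the helper hugepages_needed is hypothetical, and the 8 GiB target is an arbitrary example. The default huge page size is read from the Hugepagesize field of /proc/meminfo, with 2048 kB assumed when that field is unavailable.

```shell
# hugepages_needed BYTES HUGEPAGE_KB
# Hypothetical helper: print how many huge pages of HUGEPAGE_KB kB are
# needed to back BYTES of memory, rounding up.
hugepages_needed() {
    hp_bytes=$(( $2 * 1024 ))
    echo $(( ($1 + hp_bytes - 1) / hp_bytes ))
}

# Default huge page size in kB; assume 2048 if /proc/meminfo has no such field.
hp_kb=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo 2>/dev/null)
hp_kb=${hp_kb:-2048}

# Example target: 8 GiB (arbitrary value for illustration).
echo "huge pages needed for 8 GiB: $(hugepages_needed $((8 * 1024 * 1024 * 1024)) "$hp_kb")"
```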

Check THP status

Check the Transparent Huge Pages status with:

cat /sys/kernel/mm/transparent_hugepage/enabled

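Meaning of output

always madvise [never]

The value in square brackets is the active setting. always enables THP system-wide, madvise enables it only for memory regions that request it with madvise(MADV_HUGEPAGE), and never disables it. In the output shown here, THP is disabled.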
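For scripting, the active mode can be extracted from that output. The function name thp_active_mode below is hypothetical; it simply prints whichever word is enclosed in square brackets on its standard input.

```shell
# thp_active_mode
# Hypothetical helper: read the contents of the "enabled" file on stdin
# and print the bracketed (currently active) mode.
thp_active_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    echo "active THP mode: $(thp_active_mode < "$thp_file")"
else
    echo "THP control file not found on this system"
fi
```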

Disable THP temporarily without rebooting

# echo never >  /sys/kernel/mm/transparent_hugepage/enabled

The transparent_hugepage/enabled setting only affects future behavior. To make a change fully effective, restart any application that could have been using huge pages. This also applies to the regions registered in khugepaged.

Disable THP using GRUB

Edit /etc/default/grub to add transparent_hugepage=never to the GRUB_CMDLINE_LINUX_DEFAULT option

 GRUB_CMDLINE_LINUX_DEFAULT="transparent_hugepage=never quiet splash"

After that, run the update-grub command. A reboot is required for the change to take effect.
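After the reboot, one way to confirm that the option reached the kernel is to look for it in /proc/cmdline. The sketch below assumes a hypothetical helper named cmdline_thp_setting, which prints the value of transparent_hugepage= from a kernel command line given on standard input.

```shell
# cmdline_thp_setting
# Hypothetical helper: read a kernel command line on stdin and print the
# value of the transparent_hugepage= option (the last one if repeated).
cmdline_thp_setting() {
    tr ' ' '\n' | sed -n 's/^transparent_hugepage=//p' | tail -n 1
}

if [ -r /proc/cmdline ]; then
    setting=$(cmdline_thp_setting < /proc/cmdline)
    echo "boot-time THP setting: ${setting:-not set on the command line}"
fi
```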

Disable THP using rc.local

Edit /etc/rc.local and put the following script before exit 0:

if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi

Disable THP using systemd service

[Unit]
Description=Disable Transparent Huge Pages (THP)
DefaultDependencies=no
After=sysinit.target local-fs.target
# List any service that must start only after THP has been disabled
Before=mongod.service
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never | tee /sys/kernel/mm/transparent_hugepage/enabled > /dev/null'
[Install]
WantedBy=basic.target

After creating the unit file, reload the systemd daemon (systemctl daemon-reload), then start and enable the service.

THP kernel parameters

Parameter Description

hugepages
Defines the number of persistent huge pages configured in the kernel at boot time. The default value is 0.

It is only possible to allocate (or deallocate) huge pages if there are sufficient physically contiguous free pages in the system. Pages reserved by this parameter cannot be used for other purposes. Default size huge pages can be dynamically allocated or deallocated by changing the value of the /proc/sys/vm/nr_hugepages file.

In a NUMA system, huge pages assigned with this parameter are divided equally between nodes. You can assign huge pages to specific nodes at runtime by changing the value of the node's /sys/devices/system/node/node_id/hugepages/hugepages-1048576kB/nr_hugepages file.

For more information, read the relevant kernel documentation, which is installed in /usr/share/doc/kernel-doc-kernel_version/Documentation/vm/hugetlbpage.txt by default. This documentation is available only if the kernel-doc package is installed.

hugepagesz
Defines the size of persistent huge pages configured in the kernel at boot time. Valid values are 2 MB and 1 GB. The default value is 2 MB.

default_hugepagesz
Defines the default size of persistent huge pages configured in the kernel at boot time. Valid values are 2 MB and 1 GB. The default value is 2 MB.
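The per-node files mentioned above can be inspected with a short loop. This is a sketch under assumptions: the helper node_hugepages_file is hypothetical, node IDs are taken from the directory names under /sys/devices/system/node, and the 2048 kB size is only an example.

```shell
# node_hugepages_file NODE_ID SIZE_KB
# Hypothetical helper: build the path of the per-node nr_hugepages file
# for NUMA node NODE_ID and huge page size SIZE_KB (in kB).
node_hugepages_file() {
    echo "/sys/devices/system/node/node${1}/hugepages/hugepages-${2}kB/nr_hugepages"
}

# Print the number of 2048 kB huge pages currently assigned to each node.
for node_dir in /sys/devices/system/node/node[0-9]*; do
    [ -d "$node_dir" ] || continue
    id=${node_dir##*node}
    f=$(node_hugepages_file "$id" 2048)
    if [ -r "$f" ]; then
        echo "node${id}: $(cat "$f") huge pages of 2048 kB"
    fi
done
```

Writing a new count to one of these files (as root) changes the allocation for that node only.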

Monitoring parameters

File Field Description Notes

/proc/meminfo

AnonHugePages
The amount of memory (reported in kB) backed by anonymous transparent huge pages currently used by the system.

To identify which applications are using anonymous transparent huge pages, read /proc/PID/smaps and sum the AnonHugePages fields of each mapping.

Note that reading the smaps file is expensive, and reading it frequently will incur overhead.

ShmemPmdMapped
The amount of memory (reported in kB) backed by file transparent huge pages mapped to userspace.

To identify which applications are mapping file transparent huge pages, read /proc/PID/smaps and sum the FilePmdMapped fields of each mapping.
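Summing a field across all mappings of one process can be sketched as follows. The helper sum_smaps_field is hypothetical, and the example inspects the shell's own PID purely for illustration; reading another user's smaps file normally requires root.

```shell
# sum_smaps_field FIELD
# Hypothetical helper: read smaps-formatted text on stdin and print the
# total, in kB, of every line whose first column is "FIELD:".
sum_smaps_field() {
    awk -v field="$1:" '$1 == field { total += $2 } END { print total + 0 }'
}

pid=$$   # example target: the current shell
if [ -r "/proc/$pid/smaps" ]; then
    echo "PID $pid AnonHugePages: $(sum_smaps_field AnonHugePages < "/proc/$pid/smaps") kB"
fi
```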

/proc/vmstat
The following counters can be used to monitor how successfully the system is providing huge pages for use.
thp_fault_alloc
is incremented every time a huge page is successfully allocated to handle a page fault.
thp_collapse_alloc
is incremented by khugepaged when it has found a range of pages to collapse into one huge page and has successfully allocated a new huge page to store the data.
thp_fault_fallback
is incremented if a page fault fails to allocate a huge page and instead falls back to using small pages.
thp_fault_fallback_charge
is incremented if a page fault fails to charge a huge page and instead falls back to using small pages even though the allocation was successful.
thp_collapse_alloc_failed
is incremented if khugepaged found a range of pages that should be collapsed into one huge page but failed the allocation.
thp_file_alloc
is incremented every time a file huge page is successfully allocated.
thp_file_fallback
is incremented if a file huge page is attempted to be allocated but fails and instead falls back to using small pages.
thp_file_fallback_charge
is incremented if a file huge page cannot be charged and instead falls back to using small pages even though the allocation was successful.
thp_file_mapped
is incremented every time a file huge page is mapped into user address space.
thp_split_page
is incremented every time a huge page is split into base pages. This can happen for a variety of reasons, but a common one is that a huge page is old and is being reclaimed. This action implies splitting all PMDs that the page is mapped with.
thp_split_page_failed
is incremented if the kernel fails to split a huge page. This can happen if the page was pinned by somebody.
thp_deferred_split_page
is incremented when a huge page is put onto the split queue. This happens when a huge page is partially unmapped and splitting it would free up some memory. Pages on the split queue are going to be split under memory pressure.
thp_split_pmd
is incremented every time a PMD is split into a table of PTEs. This can happen, for instance, when an application calls mprotect() or munmap() on part of a huge page. It doesn't split the huge page, only the page table entry.
thp_zero_page_alloc
is incremented every time a huge zero page used for THP is successfully allocated. Note that it doesn't count every mapping of the huge zero page, only its allocation.
thp_zero_page_alloc_failed
is incremented if the kernel fails to allocate a huge zero page and falls back to using small pages.
thp_swpout
is incremented every time a huge page is swapped out in one piece without splitting.
thp_swpout_fallback
is incremented if a huge page has to be split before swapout, usually because the kernel failed to allocate contiguous swap space for the huge page.

As the system ages, allocating huge pages may become expensive, because the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in /proc/vmstat to help monitor this overhead.

compact_stall
is incremented every time a process stalls to run memory compaction so that a huge page is free for use.
compact_success
is incremented if the system compacted memory and freed a huge page for use.
compact_fail
is incremented if the system tries to compact memory but fails.
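The counters above can be pulled out of /proc/vmstat and turned into a simple health indicator. The sketch below is one possible approach, not an official tool: the helper thp_fault_fallback_pct is hypothetical and reports what share of THP page-fault allocations fell back to small pages.

```shell
# thp_fault_fallback_pct
# Hypothetical helper: read /proc/vmstat-formatted text on stdin and print
# thp_fault_fallback as a percentage of all THP page-fault attempts
# (thp_fault_alloc + thp_fault_fallback), or "n/a" if there were none.
thp_fault_fallback_pct() {
    awk '
        $1 == "thp_fault_alloc"    { ok = $2 }
        $1 == "thp_fault_fallback" { fb = $2 }
        END {
            total = ok + fb
            if (total == 0) { print "n/a" }
            else            { printf "%.1f\n", 100 * fb / total }
        }'
}

if [ -r /proc/vmstat ]; then
    # Show the THP and compaction counters described in this section.
    grep -E '^(thp_|compact_(stall|success|fail))' /proc/vmstat || true
    echo "THP fault fallback: $(thp_fault_fallback_pct < /proc/vmstat) %"
fi
```

A rising fallback percentage together with a growing compact_stall count suggests that memory fragmentation is making huge pages hard to allocate.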

Reference