NFS


NFS is the most widely used HPC filesystem. It is very easy to set up and performs reasonably well as primary storage for small to medium clusters, and it can even serve larger clusters. One of the most common questions about NFS configuration is how to tune it for performance and management, and which options are typically used. It is therefore important to know the NFS export and mount options, especially when you are facing a performance or functional issue with an NFS mount over the network.

Basic commands

Command          Description                            Runs on
# exportfs -r    Re-export your shares                  Server
# exportfs -a    Export all your shares                 Server
# exportfs -v    Verify the NFS share permissions       Server
$ nfsstat -m     Verify the current NFS mount options   Client
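
For example, running nfsstat -m on a client prints each NFS mount and the options in effect. The output below is only an illustration; the server name, path, and option values are placeholders:

$ nfsstat -m
/mnt/nas from server:/nas
 Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none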

NFS export on Server[1]

NFS export options are the permissions applied on the NFS server when an NFS share is created in /etc/exports.

Here are the most common (and important) options that an administrator must understand; the full list is available in the man pages (man 5 exports).

Export options (NFS server defaults in parentheses):

secure / insecure (default: secure)
  NFSv4 uses only port 2049; to check the list of ports used by NFSv3, run rpcinfo -p on the server. With secure, the port number from which the client requests a mount must be lower than 1024. To allow the client to use any available free port, add insecure to the NFS share.

rw / ro (default: rw)
  ro means read-only access to the NFS share; rw means read-write access to the NFS share.

root_squash / no_root_squash (default: root_squash)
  "Squash" literally means to squash (destroy) the power of the remote root user. root_squash prevents remote root users from having superuser (root) privileges on remote NFS-mounted volumes; no_root_squash allows the root user on the NFS client host to access the NFS-mounted directory with the same rights and privileges that the superuser would normally have.

all_squash / no_all_squash (default: no_all_squash)
  all_squash maps all user IDs (UIDs) and group IDs (GIDs) to the anonymous user. It is useful for NFS-exported public FTP directories, news spool directories, and the like.

sync / async (default: sync)
  With sync, replies to requests are sent only after the changes have been committed to stable storage. async allows the NFS server to violate the NFS protocol and reply to requests before any changes made by that request have been committed to stable storage. Using the async option usually improves performance, but at the cost that an unclean server restart (i.e. a crash) can cause data to be lost or corrupted.
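
As a sketch, an /etc/exports that combines the options above might look like the following (the paths and the client network are placeholders):

# home directories: read-write, writes committed before replying, remote root mapped to the anonymous user
/export/home  10.0.0.0/24(rw,sync,root_squash,secure)
# public read-only area: all remote users mapped to the anonymous user
/export/pub   *(ro,sync,all_squash)

Run exportfs -r afterwards to re-export the shares.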

Check exports list and options

# with the following export configured
$ cat /etc/exports
/nas * (rw,sync,no_subtree_check)

$ rpcinfo -p | grep -i nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100003    3   udp   2049  nfs

# detailed export status; note that the space between '*' and the option list
# means '*' is exported with the default options, which is why the output
# below shows ro and the other defaults rather than rw,sync
$ sudo exportfs -v
/nas   <world>(ro,wdelay,root_squash,no_subtree_check,sec=sys,ro,secure,root_squash,no_all_squash)

NFS mount on Client[2]

# mount -t nfs -o [options] remote:/nfs /mount

Mount options (NFS client defaults in parentheses):

nfsvers=n / vers=n (default: 3)
  The version of the NFS protocol to use. By default, the local NFS client attempts to mount the file system using NFS version 3; if the NFS server does not support version 3, the file system is mounted using version 2. If you know that the NFS server does not support version 3, specify vers=2 to save time during the mount, because the client will not attempt to use version 3 before falling back to version 2.

rw / ro (default: rw)
  • Use rw (read/write) for data that users need to modify. For you to mount a directory read/write, the NFS server must export it read/write.
  • Use ro (read-only) for data you do not want users to change. A directory that is automounted from several servers should be read-only, to keep versions identical on all servers.

suid / nosuid (default: suid)
  • Specify suid if you want to allow mounted programs that have setuid permission to run with the permissions of their owners, regardless of who starts them. If a program with setuid permission is owned by root, it will run with root permissions, regardless of who starts it.
  • Specify nosuid to protect your system against setuid programs that may run as root and damage your system.

hard / soft (default: hard)
  • Specify hard if users will be writing to the mounted directory or running programs located in it. When NFS tries to access a hard-mounted directory, it keeps trying until it succeeds or someone interrupts its attempts. If the server goes down, any processes using the mounted directory hang until the server comes back up, and then continue processing without errors. Interruptible hard mounts may be interrupted with CTRL-C or kill (see the intr option, later).
  • Specify soft if the server is unreliable and you want to prevent systems from hanging when the server is down. When NFS tries to access a soft-mounted directory, it gives up and returns an error message after trying retrans times (see the retrans option, later). Any processes using the mounted directory will return errors if the server goes down.

nconnect=<value> (maximum: 16)
  With an NFS-over-TCP mount of one or more NFS shares from an individual NFS server, the traditional behavior is that all those mounts share one TCP connection if they use the same NFS protocol version. Under a high NFS workload at the client, this connection sharing may result in lower performance or unnecessary bottlenecks.[3] In Linux kernel 5.3 (and higher), the nconnect option allows multiple TCP connections for a single NFS mount.

intr / nointr (default: intr)
  • Specify intr if users are not likely to damage critical data by manually interrupting an NFS request. If a hard mount is interruptible, a user may press CTRL-C or issue the kill command to interrupt an NFS mount that is hanging indefinitely because a server is down.
  • Specify nointr if users might damage critical data by manually interrupting an NFS request, and you would rather have the system hang while the server is down than risk losing data between the client and the server.
  Note: in Linux kernel 2.6.25 (and higher), the intr and nointr mount options are deprecated. If you use the hard option on modern Linux kernels, you must use the kill -9 (SIGKILL) command to interrupt a stuck NFS mount.

fg / bg (default: fg)
  • Specify fg (foreground) for directories that are necessary for the client machine to boot or operate correctly. If a foreground mount fails, it is retried in the foreground until it succeeds or is interrupted. All automounted directories are mounted in the foreground; you cannot specify the bg option with automounted directories.
  • Specify bg (background) for directories that are not necessary for the client to boot or operate correctly. Background mounts that fail are retried in the background, allowing the mount process to consider the mount complete and go on to the next one. If you have two machines configured to mount directories from each other, configure the mounts on one of the machines as background mounts; that way, if both systems try to boot at once, they will not become deadlocked, each waiting to mount directories from the other. The bg option cannot be used with automounted directories.

devs / nodevs (default: devs)
  • Specify devs if you are mounting device files from a server whose device files will work correctly on the client. The devs option allows you to use NFS-mounted device files to read and write to devices from the NFS client. It is useful for maintaining a standard, centralized set of device files, if all your systems are configured similarly.
  • Specify nodevs if device files mounted from a server will not work correctly for reading and writing to devices on the NFS client. The nodevs option generates an error if a process on the NFS client tries to read or write to an NFS-mounted device file.

timeo=n (default: 0.7, i.e. 0.07 seconds)
  The timeout, in tenths of a second, that the NFS client waits on the NFS server before retransmitting a packet (no ACK received); if timeo is 5, the NFS client waits 0.5 seconds before retransmitting. If an NFS request (a read or write request to a mounted directory) times out, the timeout value is doubled and the request is retransmitted. After the NFS request has been retransmitted the number of times specified by the retrans option, a soft mount returns an error and a hard mount retries the request. The maximum timeo value is 30 (3 seconds). Try doubling the timeo value if you see several "server not responding" messages within a few minutes; this can happen because you are mounting directories across a gateway, because your server is slow, or because your network is busy with heavy traffic.

retrans=n (default: 4)
  The number of times the NFS client retransmits an NFS request (a read or write request to a mounted directory) after it times out, waiting the timeo interval between tries. If the value is 5, the client resends the RPC packet five times; if the NFS server does not respond after the last attempt, you get a "server not responding" message. If the request does not succeed after n retransmissions, a soft mount returns an error and a hard mount retries the request. Increase the retrans value for a directory that is soft-mounted from a server that has frequent, short periods of downtime; this gives the server sufficient time to recover, so the soft mount does not return an error.

retry=n (default: 1)
  The number of times the NFS client attempts to mount a directory after the first attempt fails. If you specify intr, you can interrupt the mount before n retries; if you specify nointr, you must wait until n retries have been made, until the mount succeeds, or until you reboot the system. If mounts are failing because your server is very busy, increasing the retry value may fix the problem.

rsize=n (default: 8192)
  The number of bytes the NFS client requests from the NFS server in a single read request. If packets are being dropped between the client and the server, decrease rsize to 4096 or 2048. To find out whether packets are being dropped, issue the nfsstat -rc command at the HP-UX prompt; if the timeout and retrans values returned by this command are high but the badxid number is close to zero, packets are being dropped somewhere in the network.

wsize=n (default: 8192)
  The number of bytes the NFS client sends to the NFS server in a single write request. If packets are being dropped between the client and the server, decrease wsize to 4096 or 2048, using the same nfsstat -rc diagnosis as for rsize above.

-O (overlay mount; default: not set)
  Allows the file system to be mounted over an existing mount point, making the underlying file system inaccessible. If you attempt to mount a file system over an existing mount point without the -O option, the mount fails with the error "device busy".
  Caution: using the -O mount option can put your system in a confusing state. The -O option allows you to hide local data under an NFS mount point without receiving any warning, and local data hidden beneath an NFS mount point will not be backed up during regular system backups.
  On HP-UX, the -O option is valid only for NFS-mounted file systems. For this reason, if you specify the -O option, you must also specify the -F nfs option to the mount command or the nfs file system type in the /etc/fstab file.

remount (default: not set)
  If the file system is mounted read-only, this option remounts it read/write. This allows you to change the access permissions from read-only to read/write without forcing everyone to leave the mounted directory or killing all processes using it.

noac (default: not set)
  Prevents the NFS client from caching attributes for the mounted directory. Specify noac for a directory that will be used frequently by many NFS clients; it ensures that the file and directory attributes on the server are up to date, because no changes are cached on the clients. However, if many NFS clients using the same NFS server all disable attribute caching, the server may become overloaded with attribute requests and updates. You can also use the actimeo option to set all the caching timeouts to a small number of seconds, like 1 or 3. If you specify noac, do not specify the other caching options.

nocto (default: not set)
  Suppresses the retrieval of fresh attributes when opening a file. Specify nocto for a file or directory that never changes, to decrease the load on your network.

acdirmax=n (default: 60)
  The maximum number of seconds a directory's attributes are cached on the NFS client. When this timeout period expires, the client flushes its attribute cache, and if the attributes have changed, the client sends them to the NFS server. For a directory that rarely changes or that is owned and modified by only one user, like a user's home directory, you can decrease the load on your network by setting acdirmax=120 or higher.

acdirmin=n (default: 30)
  The minimum number of seconds a directory's attributes are cached on the NFS client. If the directory is modified before this timeout expires, the timeout period is extended by acdirmin seconds. For a directory that rarely changes or that is owned and modified by only one user, like a user's home directory, you can decrease the load on your network by setting acdirmin=60 or higher.

acregmax=n (default: 60)
  The maximum number of seconds a file's attributes are cached on the NFS client. When this timeout period expires, the client flushes its attribute cache, and if the attributes have changed, the client sends them to the NFS server. For a file that rarely changes or that is owned and modified by only one user, like a file in a user's home directory, you can decrease the load on your network by setting acregmax=120 or higher.

actimeo=n (default: not set)
  Setting actimeo to n seconds is equivalent to setting acdirmax, acdirmin, acregmax, and acregmin to n seconds. Set actimeo=1 or actimeo=3 for a directory that is used and modified frequently by many NFS clients; this ensures that the file and directory attributes are kept reasonably up to date, even if they are changed frequently from various client locations. Set actimeo=120 or higher for a directory that rarely or never changes. If you set the actimeo value, do not set the acdirmax, acdirmin, acregmax, or acregmin values.

grpid (default: not set)
  Forces a newly created file in the mounted file system to inherit the group ID of the parent directory. By default, a newly created file inherits the effective group ID of the calling process, unless the setgid bit is set on the parent directory; if the setgid bit is set, the new file inherits the group ID of the parent directory.

lock / nolock (default: lock)
  Selects whether to use the NLM sideband protocol to lock files on the server. If neither option is specified (or if lock is specified), NLM locking is used for this mount point. When the nolock option is used, applications can lock files, but such locks provide exclusion only against other applications running on the same client; remote applications are not affected by these locks. NLM locking must be disabled with the nolock option when using NFS to mount /var, because /var contains files used by the NLM implementation on Linux. The nolock option is also required when mounting exports on NFS servers that do not support the NLM protocol.

local_lock=mechanism (default: none; supported in kernels 2.6.37 and later)
  Specifies whether to use local locking for either or both of the flock and POSIX locking mechanisms; mechanism can be one of all, flock, posix, or none. The Linux NFS client provides a way to make locks local: applications can then lock files, but such locks provide exclusion only against other applications running on the same client, and remote applications are not affected by these locks.
  If all is specified, the client assumes that both flock and POSIX locks are local.
  If flock is specified, the client assumes that only flock locks are local and uses the NLM sideband protocol to lock files when POSIX locks are used.
  If posix is specified, the client assumes that POSIX locks are local and uses the NLM sideband protocol to lock files when flock locks are used.
  To support legacy flock behavior similar to that of NFS clients older than 2.6.12, use Samba, as Samba maps Windows share-mode locks as flock. Since NFS clients newer than 2.6.12 implement flock by emulating POSIX locks, this would otherwise result in conflicting locks.
  NOTE: when used together, the local_lock mount option is overridden by the nolock/lock mount options.
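
Putting several of these options together, an /etc/fstab entry for a Linux client might look like the sketch below (the server name, export path, and mount point are placeholders, and the values are illustrative rather than recommendations):

# NFSv4.2, 1MB transfers, 4 TCP connections, hard mount
server:/nas  /mnt/nas  nfs  vers=4.2,rw,hard,nconnect=4,rsize=1048576,wsize=1048576,timeo=600,retrans=2,_netdev  0  0

After editing /etc/fstab, mount it with "sudo mount /mnt/nas" and verify the options in effect with "nfsstat -m".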

Optimizing NFS Performance[4]

Tuning for performance is a loaded question because performance is defined by so many different variables, the most important of which is how to measure performance.

Tuning options (grouped by target):

Synchronous vs asynchronous (target: NFS performance)
  See the sync/async export options above. To take effect, remount the existing mount point. Recommendation: sync, for data integrity.

Number of NFS daemons (nfsd)
  One way to determine whether more NFS threads would help performance is to check the data in /proc/net/rpc/nfsd (on Ubuntu) for the load on the NFS daemons. The output line that starts with th lists the number of threads, and the last 10 numbers are a histogram of the number of seconds the first 10% of threads were busy, the second 10%, and so on (a separate page explains how to parse the full contents of /proc/net/rpc/nfsd). Ideally, you want the last two numbers to be zero or close to zero, indicating that the threads are busy and you are not "wasting" any threads. If the last two numbers are fairly high, you should add NFS daemons, because the NFS server has become the bottleneck; if the last two, three, or four numbers are zero, then some threads are probably not being used.[5]
  On Ubuntu, RPCNFSDCOUNT in the /etc/default/nfs-kernel-server file sets the number of NFS daemons for the server. In addition, when tuning how many threads are needed, you can look at /proc/fs/nfsd/pool_stats.
  To take effect, reboot the system.[6] Recommendation: 256 threads on a 16-core/128 GB server, 64 threads on an 8-core/64 GB server.
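
A sketch of checking and raising the thread count on an Ubuntu NFS server (the histogram values below are illustrative):

# check the thread-usage histogram; the line starts with 'th'
$ grep th /proc/net/rpc/nfsd
th 8 0 1572.26 83.71 11.07 0.00 0.00 0.00 0.00 0.00 9.32 14.63

# the last numbers are non-zero, so raise the daemon count and restart the
# NFS server (a full reboot, as noted above, also works)
$ sudo sed -i 's/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=64/' /etc/default/nfs-kernel-server
$ sudo systemctl restart nfs-kernel-server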

Block size setting
  Two NFS client options specify the size of the data chunks for writing (wsize) and reading (rsize). If you don't specify the chunk sizes, the defaults are determined by the versions of NFS and the kernel being used. The best way to check the current chunk size is to run the following command on the NFS client and look for the wsize and rsize values:

$ cat /proc/mounts
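
For instance, to check a single mount (the mount point is a placeholder and the values are illustrative):

$ grep /mnt/nas /proc/mounts
server:/nas /mnt/nas nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2 0 0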

Timeout and retransmission
  On congested networks, you often see retransmissions of RPC packets. A good way to tell is to run the nfsstat -r command and look for the column labeled retrans. If the number is large, the network is likely very congested; in that case, you might want to increase the values of timeo and retrans to increase the number of tries and the amount of time between RPC tries. Although taking this action will slow down NFS performance, it might help even out the network traffic so that congestion is reduced. In my experience, getting rid of congestion and dropped packets can result in better, more even performance.
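
A sketch of that workflow on a client (the counter values are illustrative):

# look at the retrans column of the client RPC statistics
$ nfsstat -r
Client rpc stats:
calls      retrans    authrefrsh
1482918    6501       1482920

# if retrans is large, unmount and remount with a longer timeout and more retries
$ sudo umount /mnt/nas
$ sudo mount -t nfs -o timeo=1200,retrans=5 server:/nas /mnt/nas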

FS-Cache
  The FS-Cache facility caches NFS client requests on a local storage device, such as a hard drive or SSD, helping improve NFS read I/O: for data that resides on the local NFS client, the NFS server does not have to be contacted. To use NFS caching, you must enable it explicitly by adding the option -o fsc to the mount command or in /etc/fstab:

# mount <nfs-share>:/ </mount/point> -o fsc

  Any data access to </mount/point> will then go through the NFS cache unless the file is opened for direct I/O or a write I/O is performed. The important thing to remember is that FS-Cache only works if the I/O is a read; it cannot help with direct I/O (read or write) or write requests. However, there are plenty of cases in which FS-Cache can help. For example, if you have an application that needs to read from a database or file and you are running a large number of copies of the same application, FS-Cache might help, because each node can cache the database or file.
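
On Linux, FS-Cache is backed by the cachefilesd daemon, so a sketch of the setup might look like this (the package, service, and file names assume Ubuntu's packaging):

# install and enable the local cache daemon
$ sudo apt install cachefilesd
$ sudo sed -i 's/^#RUN=yes/RUN=yes/' /etc/default/cachefilesd
$ sudo systemctl enable --now cachefilesd

# mount with caching and confirm the share is listed with FSC enabled
$ sudo mount -t nfs -o fsc server:/nas /mnt/nas
$ cat /proc/fs/nfsfs/volumes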

Filesystem-independent mount options
  The Linux mount command has a number of options that are independent of the filesystem and might improve performance:
  • noatime – inode access times are not updated on the filesystem. This can help performance because the access time of the file is not updated every time a file is accessed.
  • nodiratime – the directory inode is not updated on the filesystem when it is accessed. This can help performance in the same way as not updating the file access time.
  • relatime – inode access times are relative to the modify or change time for the file, so the access time is updated only if the previous atime (access time) was earlier than the modify or change time.
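
These options go in the same option list as the NFS-specific ones, for example in /etc/fstab (the paths are placeholders):

server:/nas  /mnt/nas  nfs  vers=4.2,hard,noatime,nodiratime  0  0
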
System memory (target: system tuning)
  If you choose to use asynchronous NFS mode, you will need more memory to take advantage of async, because the NFS server will first store the I/O request in memory, respond to the NFS client, and then retire the I/O by having the filesystem write it to stable storage. Therefore, you need as much memory as possible to get the best performance.

MTU
  Changing the network MTU (maximum transmission unit) is also a good way to affect performance, although it is not an NFS tunable. The MTU size can be very important because it determines packet fragmentation on the network: if your chunk size is 8 KB and the MTU is 1500, it takes six Ethernet frames to transmit the 8 KB; if you increase the MTU to 9000 (9,000 bytes), the number of Ethernet frames drops to one. A study by Dell a few years back examined the effect of an MTU of 1500 compared with an MTU of 9000; using Netperf, they found that the bandwidth increased by about 33% when an MTU of 9000 was used. Fortunately, most switches can accommodate an MTU of 9000 (commonly called "jumbo frames").
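
A quick way to test jumbo frames (the interface name is a placeholder; make the change persistent in your network configuration, e.g. netplan on Ubuntu):

# set the MTU temporarily, then verify a 9000-byte frame passes unfragmented
$ sudo ip link set dev eth0 mtu 9000
$ ping -M do -s 8972 server    # 8972 = 9000 minus 28 bytes of IP/ICMP headers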

TCP tuning on the server
  The NFS daemons on the NFS server share the same socket input and output queues, so if the queues are larger, all of the NFS daemons have more buffer and can send and receive data much faster. The values to increase are /proc/sys/net/core/rmem_default (the default size of the read queue in bytes) and /proc/sys/net/core/rmem_max (the maximum size of the read queue in bytes), together with the corresponding write-queue values wmem_default and wmem_max. To make the values survive reboots, enter them in the proper form in the /etc/sysctl.conf file or in a file under /etc/sysctl.d/.
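
A sketch of both steps (the 1MB values are illustrative, not a recommendation):

# apply immediately
$ sudo sysctl -w net.core.rmem_default=1048576 net.core.rmem_max=1048576
$ sudo sysctl -w net.core.wmem_default=1048576 net.core.wmem_max=1048576

# persist across reboots
$ printf '%s\n' 'net.core.rmem_default=1048576' 'net.core.rmem_max=1048576' \
    'net.core.wmem_default=1048576' 'net.core.wmem_max=1048576' \
    | sudo tee /etc/sysctl.d/90-nfs-server.conf
$ sudo sysctl --system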

Subtree checking (target: NFS management/policy)
  Adding subtree_check to an export on the NFS server forces the server to check that each file being accessed is actually contained within the exported directory. Many people are of the opinion that subtree_check can have a big effect on performance, but the final determination is whether performance is more important than security for your configuration and situation. For security, it is recommended to export directories that sit on a separate partition or separate drive, to prevent a rogue user from guessing a file handle to anything outside the filesystem.

Root squashing
  If you want root to have access to an NFS-mounted filesystem, you can add the no_root_squash option to /etc/exports to allow root access. Just be aware that if someone reboots your system to gain root access, it is then possible for them to copy (steal) data.

Setting Block Size to Optimize Transfer Speeds

The mount command options rsize and wsize specify the size of the chunks of data that the client and server pass back and forth to each other.

Mount Options Example[7]

  • In Linux kernel 5.3 (and higher), the nconnect option allows multiple TCP connections for a single NFS mount. Note: currently, the maximum number of concurrent TCP connections is 16.
  • In Linux kernel 2.6.25 (and higher), the intr and nointr mount options are deprecated. If you use the hard option on modern Linux kernels, you must use the kill -9 (SIGKILL) command to interrupt a stuck NFS mount.

Condition: client with server-side Network Lock Manager (NLM) enabled
mount -t nfs -o rsize=65536,wsize=65536,intr,hard,tcp,rdirplus,readahead=128 server:/path mountpath

Condition: client with local locking enforced
mount -t nfs -o rsize=65536,wsize=65536,intr,hard,tcp,locallocks,rdirplus,readahead=128 server:/path mountpath

(Note: the locallocks and readahead options are not documented in the Linux nfs(5) man page and come from other NFS clients, such as the macOS mount_nfs command; on a Linux client, local locking is selected with local_lock=all, as described above.)


References