Network performance test


Speedtest

Using the Speedtest.net CLI tool to measure broadband upload and download speed

#Debian family including Ubuntu
curl -s https://packagecloud.io/install/repositories/ookla/speedtest-cli/script.deb.sh | sudo bash
sudo apt-get install speedtest

#redhat family
curl -s https://packagecloud.io/install/repositories/ookla/speedtest-cli/script.rpm.sh | sudo bash
sudo yum install speedtest

$speedtest

Throughput

Throughput, also known as data transfer rate or digital bandwidth consumption, denotes the achieved average useful bit rate in a computer network over a physical communication link. It is therefore measured below the network layer and above the physical layer, i.e. at the data link layer of the OSI 7-layer model.

     Note: In network terminology, one kbit/s means 1,000 bit/s and not 1,024 bit/s
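
As a rough illustration of measuring the achieved throughput on a link, the interface byte counters in sysfs can be sampled over time (eth0 is a placeholder interface name):

#sample received bytes one second apart and print the achieved bit rate
rx1=$(cat /sys/class/net/eth0/statistics/rx_bytes); sleep 1
rx2=$(cat /sys/class/net/eth0/statistics/rx_bytes)
echo "RX throughput: $(( (rx2 - rx1) * 8 )) bit/s"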

Transactions

A transaction is defined as a single reply for a single request. In this case, request/response performance is quoted as “transactions/s” for a given request and response size.

Round trip time

Round trip time (RTT) is the total amount of time that a packet takes to reach the target destination and get back to the source address.

The round-trip delay is typically between 1 ms and 100 ms and can be measured using ping or traceroute.
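
For instance, ping reports the round trip time in its rtt min/avg/max/mdev summary line (the peer address 192.168.1.6 is the same example host used in the Netperf tests below):

#send 5 ICMP echo requests and read the RTT summary line
$ping -c 5 192.168.1.6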

Bandwidth delay product

Bandwidth Delay Product (BDP) is an approximation of the amount of data that can be in flight in the network during a time slice.

To compute the BDP, it is necessary to know the speed of the slowest link in the path and the Round Trip Time (RTT) for the same path, where the bandwidth of a link is expressed in Gbit/s and the RTT in ms.

The BDP is very important in TCP/IP for tuning the buffers on the receiver and sender sides. Both sides need to have a buffer bigger than the BDP in order to allow the maximum available throughput; otherwise packets can be dropped because of a lack of free space.
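
As a quick worked example (the 10 Gbit/s link speed and 10 ms RTT are illustrative values, not measurements), BDP = 10 Gbit/s × 0.010 s = 100 Mbit ≈ 12.5 MB, so the socket buffers should be at least that large:

#BDP in bytes = bandwidth (bit/s) * RTT (s) / 8
$echo $(( 10 * 1000000000 / 8 * 10 / 1000 ))
12500000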

Jumbo Frames

Jumbo frames are Ethernet frames that can carry more than the standard 1500 bytes of payload. It was defined that jumbo frames can carry up to 9000 bytes, but some devices can support up to 16128 bytes per frame (called Super Jumbo Frames).

Almost all 10 Gbit/s switches support jumbo frames, and newer switches also support super jumbo frames.


The size of the frame directly impacts:

  • network performance, which increases because the relative amount of overhead is smaller
  • the interface Maximum Transfer Unit (MTU), an adapter-specific configuration that must be set on each node in a network; all interfaces in the same network should have the same MTU in order to communicate properly, and a mismatched MTU on a specific node can cause serious issues.

The MTU size can be changed with $ifconfig <interface> mtu <size>
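
On systems where ifconfig is no longer installed, the same change can be made with the ip tool (the interface name is a placeholder; 9000 bytes enables jumbo frames):

#set a 9000-byte MTU on the interface
$sudo ip link set dev eth0 mtu 9000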

Jumbo frames not only reduce I/O overhead on end-hosts, they also improve the responsiveness of TCP by accelerating the congestion window increase by a factor of six compared to the standard MTU.

Setting the MTU to a high value does not mean that all of your traffic will use jumbo packets. For example, a normal SSH session is almost insensitive to an MTU change, since almost all SSH packets are sent using small frames.

Remote DMA

Remote Direct Memory Access (RDMA) is an extension of DMA that allows data to move from the memory of one computer into another computer’s memory, bypassing the kernel.

On Linux there is a project called OpenRDMA that provides an implementation of RDMA service layers.

The RDMA layer is already supported by common applications such as NFS.

Transmission queue size

The transmission queue is the buffer that holds packets that are scheduled to be sent to the card.

The default value of 1000 packets may not be enough, and around 3000 can be ideal, depending on the network characteristics.
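
The queue length can be inspected and changed with ip link; the interface name and the value 3000 below are examples:

#show the current qlen of the interface
$ip link show dev eth0
#raise the transmission queue to 3000 packets
$sudo ip link set dev eth0 txqueuelen 3000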

SMP IRQ affinity

A CPU and an I/O device communicate through interrupts. On an SMP system, which CPU handles a given interrupt is very important for performance. In general, a round-robin algorithm (IRQ balancing) is used to choose the CPU that will handle a specific interrupt for a specific card.

/proc/interrupts displays all the interrupt lines and which CPU handled the interrupts generated on each line. To achieve the best performance, it is recommended that all the interrupts generated by a device queue be handled by the same CPU, instead of being IRQ-balanced, since the same IRQ handler function can then remain in that CPU's cache.


We can stop IRQ balancing on an SMP system using the following command

$ systemctl disable --now irqbalance.service

After disabling IRQ balancing, we can bind an interrupt line to a specific CPU through /proc/irq/<IRQ number>/smp_affinity, and this can be changed at any time on the fly. The content of smp_affinity is a hexadecimal bitmask that represents a group of CPUs, for example CPU0 (0x1) and CPU2 (0x4).
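
For example, to pin one interrupt line to CPU2 (the IRQ number 30 and the interface name are hypothetical; look up the real numbers in /proc/interrupts first):

#find the IRQ numbers used by the NIC queues
$grep eth0 /proc/interrupts
#bind IRQ 30 to CPU2 (bitmask 0x4)
$sudo sh -c "echo 4 > /proc/irq/30/smp_affinity"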

Taskset affinity

On a multi-stream setup, it is advised to bind a task to one or a few CPUs, as described above for IRQs, to prevent the task from migrating from one CPU to another from time to time, which causes cache misses.

In order to bind a task to a CPU, the taskset command can be used as follows

$ taskset -p 0x1 <PID>
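
A process can also be started already pinned to a CPU set; taskset -c takes a CPU list instead of a mask (the CPU number and the netperf command are just examples):

#launch netperf pinned to CPU0
$taskset -c 0 netperf -H 192.168.1.6 -t TCP_STREAM -l 10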

Interrupt coalescence

Interrupt coalescing mechanisms reduce the number of IRQs generated by the network card by grouping several packets into a single interrupt.

Interrupt coalescing can be enabled using the ethtool tool with the -C parameter, as long as the network driver supports it. The modinfo command can be used to discover the relevant module parameters, depending on the NIC.
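
A minimal sketch, assuming the driver of eno1np0 supports coalescing (the frame and microsecond values are illustrative, not a tuned recommendation):

#show the current coalescing settings
$sudo ethtool -c eno1np0
#raise an interrupt only after 64 received frames or 100 microseconds, whichever comes first
$sudo ethtool -C eno1np0 rx-frames 64 rx-usecs 100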

Since kernel version 2.6, the “New API” (NAPI) has been part of the Linux kernel, aiming to improve the performance of high-speed networking and avoid interrupt storms.

Offload features

According to the 1 Hz per bit rule of thumb, one hertz of CPU is required to send or receive one bit of TCP/IP traffic; for example, five gigabits per second of traffic in a network requires around five GHz of CPU to handle that traffic.

This implies that two entire cores of a 2.5 GHz multi-core processor are required to handle the TCP/IP processing associated with five gigabits per second of TCP/IP traffic. Since Ethernet is bidirectional, sending and receiving 10 Gbit/s needs eight 2.5 GHz cores to drive a 10 Gbit/s Ethernet network link.

As the link speed grows, more CPU cycles are required to handle all the traffic in TCP/IP.

The good news is that network manufacturers started to offload a lot of repetitive tasks to the network card itself, without involving the CPU.

Based on whether TCP state is kept on the adapter, manufacturers classify offload features into two types: stateful and stateless.

ethtool on Linux

Offload features can be checked and be configured using the ethtool tool.

Command Description
ethtool -s ethX speed 25000 autoneg off Force the speed to 25G. If the link is up on one port, the driver does not allow the other port to be set to a different speed.
ethtool -i ethX Output includes driver, firmware, and package version.
ethtool -k ethX Show offload features.
ethtool -K ethX tso off Turn off TSO.
ethtool -K ethX gro off lro off Turn off GRO/LRO.
ethtool -g ethX Show ring sizes.
ethtool -G ethX rx N Set ring sizes.
ethtool -S ethX Get statistics.
ethtool -l ethX Show number of rings.
ethtool -c ethX Display the current interrupt coalescing settings of a network device.
ethtool -C ethX rx-frames N Set interrupt coalescing. Other parameters supported are rx-usecs, rx-frames, rx-usecs-irq, rx-frames-irq, tx-usecs, tx-frames, tx-usecs- irq, tx-frames-irq.
ethtool -L ethX rx 0 tx 0 combined M Set number of rings.
ethtool -x ethX Show RSS flow hash indirection table and RSS key.
ethtool -s ethX autoneg on speed 10000 duplex full Enable Autoneg.
ethtool --show-eee ethX Show EEE state.
ethtool --set-eee ethX eee off Disable EEE.
ethtool --set-eee ethX eee on tx-lpi off Enable EEE, but disable LPI.
ethtool -L ethX combined 1 rx 0 tx 0 Disable RSS. Set the combined channels to 1.
ethtool -K ethX ntuple off Disable Accelerated RFS by disabling ntuple filters.
ethtool -K ethX ntuple on Enable Accelerated RFS.
ethtool -t ethX Performs various diagnostic self-tests.
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries; echo 2048 > /sys/class/net/ethX/queues/rx-X/rps_flow_cnt Enable RFS for ring X.
sysctl -w net.core.busy_read=50 This sets the time to read the device's receive ring to 50 μsecs. For socket applications waiting for data to arrive, using this method can decrease latency by 2 or 3 μs typically at the expense of higher CPU utilization.
echo 4 > /sys/class/net/<NAME>/device/sriov_numvfs Enable SR-IOV with four VFs on the named interface.
ip link set ethX vf 0 mac 00:12:34:56:78:9a Set VF MAC address.
ip link set ethX vf 0 state enable Set VF link state for VF 0.
ip link set ethX vf 0 vlan 100 Set VLAN 100 for VF 0.
modprobe 8021q; ip link add link <NAME> name <VLAN Interface Name> type vlan id <VLAN ID> Create a VLAN interface on top of the named interface.

Example:

modprobe 8021q; ip link add link ens3 name ens3.2 type vlan id 2

#check which features are available
$ sudo ethtool -k eno1np0
Features for eno1np0:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: on
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: on
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: on
tls-hw-record: off [fixed]

With these features listed we can check and enable RX or TX checksum offload.

# to enable a feature, use the -K (capital) parameter with on | off
$ sudo ethtool -K eno1np0 tx on
$ sudo ethtool -K eno1np0 rx on

# enable Scatter and gather
$sudo ethtool -K eno1np0 sg on

# enable TCP Segmentation Offload (TSO)
$sudo ethtool -K eno1np0 tso on

# enable Large Receive Offload (LRO)
$sudo ethtool -K eno1np0 lro on

# enable Generic Segmentation Offload (GSO)
$sudo ethtool -K eno1np0 gso on

With TX and RX checksum offload enabled, the amount of CPU saved depends on the packet size. Small packets see little or no savings with this option, while large packets see larger savings.

  • On PCI-X gigabit adapters, it is possible to save around five percent of CPU utilization when using a 1500-byte MTU. On the other hand, when using a 9000-byte MTU the savings are approximately 15%.[1]
  • LRO support is specific to each device driver and is usually enabled using a module parameter, such as lro_enable for the s2io driver.
  • GSO, like TSO, is only effective if the MTU is significantly less than the maximum value of 64 KB.

Kernel setting

Most network-related kernel settings can be changed using sysctl through the /proc filesystem.

# For a temporary setting and test
$sudo sysctl -w <param>=<value>

# To apply settings at boot, use /etc/sysctl.conf or a file under /etc/sysctl.d/

# To see all the parameters related to networking
$sudo sysctl -a | grep net

# To load settings from a custom file
$sudo sysctl -p <file>

# Increase the maximum RX buffer size
$sudo sysctl -w net.core.rmem_max=16777216
# enable TCP window scaling
$sudo sysctl -w net.ipv4.tcp_window_scaling=1
# Disable timestamps; this removes the timestamp option (eight bytes of data) from the TCP header, overhead that affects throughput and CPU usage
$sudo sysctl -w net.ipv4.tcp_timestamps=0
# TCP FIN timeout setting
# The default value is 60 s; a value below 10 s closes connections faster and makes resources available for new connections sooner
$sudo sysctl -w net.ipv4.tcp_fin_timeout=5

# disabling TCP SACK (TCP Selective Acknowledgment)
$sudo sysctl -w net.ipv4.tcp_sack=0

In addition to SACK, the Nagle algorithm is a TCP/IP feature that groups small outgoing messages into a bigger segment. The Nagle algorithm is usually enabled on standard sockets and can be disabled by setting TCP_NODELAY on the socket.


Some important factors to consider
Parameter Description Recommended value for 10G
net.ipv4.tcp_low_latency The default value is 0 (off). For workloads or environments where latency is a higher priority 1
net.ipv4.tcp_window_scaling Enable window scaling to support larger TCP window sizes for improved throughput 1
net.ipv4.tcp_max_tw_buckets The default value is 262,144. When network demands are high and the environment is less exposed to external threats 450,000
net.ipv4.tcp_timestamps Enable TCP timestamps for more accurate RTT (Round Trip Time) measurement and improved congestion control 1
net.ipv4.tcp_sack Enable Selective Acknowledgments (SACK) to improve TCP performance in the presence of packet loss 1
net.ipv4.tcp_no_metrics_save Disable TCP metrics saving so that the system does not cache and reuse metrics from closed connections 1
net.ipv4.tcp_max_syn_backlog Increase the maximum number of pending connections that can be waiting to be accepted 4096 or higher
net.core.rmem_max and net.core.wmem_max Increase the maximum socket receive and send buffer sizes 1048576 (1 MB) or higher.
net.core.netdev_max_backlog Increase the maximum backlog size for incoming network packets 30000 or higher

TCP memory

All of the following values should be tuned so that the maximum size is bigger than the BDP; otherwise packets can be dropped because of buffer overflow.

net.ipv4.tcp_rmem (the size in bytes of receive buffer used by TCP sockets)

net.ipv4.tcp_wmem (the amount of memory in bytes reserved for sendbuffers)

net.ipv4.tcp_mem (total TCP buffer space allocatable, in units of pages)
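
As a sketch, for the 12.5 MB BDP from the worked example above, the maximum value of each min/default/max triple could be raised to 16 MB (the exact numbers are illustrative, not a tuned recommendation):

# min, default and max buffer sizes in bytes; the max should exceed the BDP
$sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
$sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"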


Other parameters are also worth considering to increase performance; the kernel documentation ip-sysctl.txt[2] describes all of these values: net.core.rmem_max, net.core.wmem_max, net.core.rmem_default, net.core.wmem_default, net.core.optmem_max, and net.core.netdev_max_backlog.

Netperf[3] Performance test

Netperf supports a vast number of test cases, but only three are widely used: TCP_STREAM, TCP_MAERTS and UDP_STREAM. The difference between TCP_STREAM and TCP_MAERTS is that in the first, the traffic flows from the client to the server, while in TCP_MAERTS the traffic flows from the server to the client. Thus, running TCP_STREAM and TCP_MAERTS in parallel generates a full-duplex test.

#Netperf running for 10 seconds
$netperf -t TCP_STREAM -H 192.168.1.6 -l 10

#Netperf running with TCP_RR
$netperf -H 192.168.1.6 -t TCP_RR -l 10

#with the -c and -C options to enable CPU utilization reporting and show the asymmetry in CPU loading
$netperf -T0,0 -C -c


Script to run multiple tests with Netperf

#!/bin/bash
NUMBER=8
TMPFILE=$(mktemp)
DURATION=10
PEER=192.168.1.6
pids=""
echo '' > $TMPFILE

for i in $(seq $NUMBER)
do
    echo "Start test ${i}" 
    netperf -H $PEER -l $DURATION -t TCP_STREAM -- >> $TMPFILE &
    pids="$pids $!"
    netperf -H $PEER -l $DURATION -t TCP_MAERTS -- >> $TMPFILE &
    pids="$pids $!"
done

wait $pids

echo -n "Total throughput (10^6bits/sec) result: "
cat $TMPFILE | awk '{sum += $5} END{print sum}'
  • for transaction performance, the test option can be TCP_RR or UDP_RR
  • for request/response test cases, test-specific options can be appended after --, e.g. netperf -t TCP_RR -- -h lists them (see the example below)
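
A request/response run with explicit sizes could look like the following; the -r option after -- sets the request and response sizes in bytes (64 and 1024 are example values):

#TCP request/response test with 64-byte requests and 1024-byte responses
$netperf -H 192.168.1.6 -t TCP_RR -l 10 -- -r 64,1024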

Pktgen[4]

Pktgen can generate high-bandwidth traffic without burdening the CPU, especially because it is a kernel-space application, so Pktgen is used to test the transmission flow of a network device driver or to generate ordinary packets to test other network devices such as routers or bridges.

The DPDK documentation shows in detail how to run tests with Pktgen.
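
A rough sketch of the in-kernel pktgen /proc interface follows; the interface name, packet count, destination IP and MAC are placeholders, and the full command set is described in the kernel's pktgen documentation:

#load the module and attach a device to the first pktgen thread
$sudo modprobe pktgen
$sudo sh -c "echo 'rem_device_all' > /proc/net/pktgen/kpktgend_0"
$sudo sh -c "echo 'add_device eth0' > /proc/net/pktgen/kpktgend_0"
#configure the packet stream on the device
$sudo sh -c "echo 'count 100000' > /proc/net/pktgen/eth0"
$sudo sh -c "echo 'pkt_size 64' > /proc/net/pktgen/eth0"
$sudo sh -c "echo 'dst 192.168.1.6' > /proc/net/pktgen/eth0"
$sudo sh -c "echo 'dst_mac 00:11:22:33:44:55' > /proc/net/pktgen/eth0"
#start transmitting
$sudo sh -c "echo 'start' > /proc/net/pktgen/pgctrl"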

Mpstat

Mpstat monitors SMP CPU usage and is a very helpful tool to discover whether a CPU is overloaded and how the load is distributed across CPUs during a network performance test. It also displays statistics on the number of IRQs raised in a time frame and which CPU handled them (see SMP IRQ affinity above).
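
For example, to sample per-CPU utilization and per-CPU interrupt counts once per second (mpstat comes with the sysstat package):

#per-CPU utilization, one report per second
$mpstat -P ALL 1
#per-CPU interrupt statistics, one report per second
$mpstat -I CPU 1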

TCP Congestion protocols

On Linux it is possible to change the congestion avoidance algorithm on the fly, without even a reboot.

Among the algorithms shipped with Linux, IBM recommends the cubic algorithm instead of the default reno.[5] Other modules installed afterwards could be better suited to a specific workload environment. For a more comprehensive list of algorithms that may be available for the Linux distribution being used, see:

https://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm#Algorithms
#To list all the algorithms available to the system
$ cat /proc/sys/net/ipv4/tcp_available_congestion_control
reno cubic

#To see which algorithm the system is currently using
$ cat /proc/sys/net/ipv4/tcp_congestion_control
cubic

#To change algorithm
$sudo sh -c "echo cubic > /proc/sys/net/ipv4/tcp_congestion_control"

RENO

CUBIC

FAST

References