Optimize TensorFlow
Optimizing target CPU flags
Optimize TensorFlow for the target CPU by turning on every computational optimization the CPU supports.
Compiler optimization flags
Are you wondering how much of a difference those instructions make on your machine between the generic pip packages and a custom-optimized build?
Pre-built pip packages do not enable every optimization flag your machine is capable of, and they may not be tuned for your machine. One example is the GCC optimization level, set with
gcc -O<number>
(the letter O, not the digit zero).
- -O0: Turns off optimization entirely. Fast compile times, good for debugging. This is the default if you don't specify a level, which is probably not what you want.
- -O1: Basic optimization level.
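- -O2: Recommended for most things. SSE / AVX instructions may be used where the target flags allow them, but loops are not vectorized aggressively.
- -O3: Highest numbered optimization level. Adds more aggressive loop vectorization and inlining, so SSE / AVX units are used more fully when the target flags enable them.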
- -Os: Small size. Basically enables -O2 options which do not increase size. Can be useful for machines that have limited storage and/or CPUs with small cache sizes.
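What each level actually enables depends on the GCC version, so it can help to ask the compiler itself. The sketch below assumes GCC is installed and on the PATH as gcc; enabled_opts is a hypothetical helper name introduced only for this illustration.

```shell
#!/bin/sh
# Print the names of the optimization flags GCC reports as enabled.
# Reads the output of `gcc -Q --help=optimizers` on stdin.
enabled_opts() {
  awk '$2 == "[enabled]" { print $1 }' | sort
}

# List the flags that -O3 enables on top of -O2 (skipped if gcc is missing).
if command -v gcc >/dev/null 2>&1; then
  o2=$(mktemp)
  o3=$(mktemp)
  gcc -Q --help=optimizers -O2 | enabled_opts > "$o2"
  gcc -Q --help=optimizers -O3 | enabled_opts > "$o3"
  # comm -13 keeps only lines that appear in the second sorted file.
  comm -13 "$o2" "$o3"
  rm -f "$o2" "$o3"
fi
```

The list differs between GCC releases, so treat the result as a description of your own toolchain rather than of GCC in general.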
If you are building TensorFlow on the same machine that will run it, you can just use -O3 -march=native. If you are building on a different machine, run the command below on the machine that will run TensorFlow to see which flags GCC would set, then pass them when configuring TF.
# This is the output on my machine:
$ gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
/usr/libexec/gcc/x86_64-pc-linux-gnu/7.3.0/cc1 -E -quiet -v - -march=znver1 -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -msse4a -mcx16 -msahf -mmovbe -maes -msha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mclflushopt -mxsavec -mxsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mno-clwb -mmwaitx -mclzero -mno-pku -mno-rdpid --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=512 -mtune=znver1
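One way to carry these flags over to the build machine is to turn the cc1 line into one Bazel --copt option per flag. This is a minimal sketch, not the only way to configure TF: cc1_to_copts is a hypothetical helper, the sample line is an abbreviated copy of the output above, and the Bazel target name is an assumption that varies between TensorFlow releases. The script only prints the build command; it does not run it.

```shell
#!/bin/sh
# Convert a cc1 line (as printed by the gcc -march=native probe) into
# one --copt=<flag> per machine flag. Reads the line on stdin.
# Only tokens starting with -m are kept; the --param cache hints are skipped.
cc1_to_copts() {
  tr -s ' ' '\n' | grep '^-m' | sed 's/^/--copt=/' | paste -s -d ' ' -
}

# Abbreviated sample of the probe output (assumption: shortened for readability).
sample='/usr/libexec/gcc/x86_64-pc-linux-gnu/7.3.0/cc1 -E -quiet -v - -march=znver1 -mavx2 -mfma -mno-avx512f --param l1-cache-size=32 -mtune=znver1'

copts=$(printf '%s\n' "$sample" | cc1_to_copts)

# Print the resulting build command (target name is an assumption).
echo "bazel build --config=opt --copt=-O3 $copts //tensorflow/tools/pip_package:build_pip_package"
```

In practice you would pipe the full probe output from the target machine into the helper instead of the shortened sample.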