Optimize TensorFlow

From HPCWIKI
Revision as of 11:06, 24 March 2023 by Admin

Are you wondering how much of a difference CPU-specific instructions make on your machine when you compare the generic pip packages with a build optimized for your system?

Pre-built pip packages do not enable every optimization your machine is capable of, and they may not be tuned for your particular CPU. One example is the GCC optimization level, selected with a flag of the form:

gcc -O<number> (the letter O, not the digit zero).

  • -O0: Turns off optimization entirely. Compiles fast and is good for debugging. This is GCC's default when no -O flag is given, which you probably don't want for a production build.
  • -O1: Basic optimization level.
  • -O2: Recommended for most things. Loop auto-vectorization is limited or off at this level (depending on the GCC version), so SSE / AVX are used only partially.
  • -O3: Highest standard optimization level. Also vectorizes loops aggressively and can use the full SSE / AVX register set that the -march / -m flags allow.
  • -Os: Optimize for size. Basically enables the -O2 options that do not increase code size. Can be useful for machines with limited storage and/or CPUs with small caches.
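
To see what a given level actually turns on with your GCC version, you can ask the compiler itself: `gcc -Q -O<level> --help=optimizers` prints each optimization flag with an `[enabled]` or `[disabled]` marker. The sketch below is only an illustration and not part of the original instructions. The helper name `enabled_flags` is hypothetical, and it assumes the usual layout where the marker is the last field on the line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `gcc -Q --help=optimizers` output on stdin and
# prints the names of the flags marked [enabled], one per line.
enabled_flags() {
  awk '$NF == "[enabled]" { print $1 }'
}

# Example usage (needs gcc; left commented out so that defining the helper
# has no side effects):
#   gcc -Q -O2 --help=optimizers | enabled_flags | sort > o2.txt
#   gcc -Q -O3 --help=optimizers | enabled_flags | sort > o3.txt
#   comm -13 o2.txt o3.txt   # flags enabled at -O3 but not at -O2
```

Comparing the two lists this way shows which optimizations -O3 adds over -O2 on your compiler, which varies between GCC releases.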


If you are building TensorFlow on the same machine that will run it, you can just use -O3 -march=native. If you are building on a different machine, run the command below on the machine that will run TensorFlow to see which flags GCC would set there, then pass those flags when configuring TF on the build machine.

# This is the output on my machine:
$ gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
/usr/libexec/gcc/x86_64-pc-linux-gnu/7.3.0/cc1 -E -quiet -v - -march=znver1
-mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -msse4a -mcx16 -msahf -mmovbe
-maes -msha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi
-mno-sgx -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm
-mno-hle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave
-mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf
-mno-prefetchwt1 -mclflushopt -mxsavec -mxsaves -mno-avx512dq -mno-avx512bw
-mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps
-mno-avx5124vnniw -mno-clwb -mmwaitx -mclzero -mno-pku -mno-rdpid --param
l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=512
-mtune=znver1
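
One way to carry these flags over to a build on another machine is to turn them into Bazel `--copt` options. The following is a minimal sketch under stated assumptions, not the method this page prescribes: the helper name `gcc_flags_to_copts` is hypothetical, it keeps only tokens that start with `-m` (so `-march`, `-mtune` and the individual `-m...` / `-mno-...` switches), and it deliberately drops the `--param` cache-size values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the cc1 command line printed by
# `gcc -march=native -E -v -` on stdin and prints one --copt=<flag>
# per line for every machine flag (tokens starting with -m).
gcc_flags_to_copts() {
  tr -s '[:space:]' '\n' | grep '^-m' | sed 's/^/--copt=/'
}

# Example usage on the target machine (needs gcc; commented out so that
# defining the helper has no side effects):
#   gcc -march=native -E -v - </dev/null 2>&1 | grep cc1 | gcc_flags_to_copts
#
# The resulting options can then be appended to your usual
# `bazel build -c opt ...` command line on the build machine.
```

Passing the explicit flag list instead of -march=native matters here because -march=native would be resolved against the build machine's CPU, not the machine that will run TensorFlow.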