Optimize TensorFlow

From HPCWIKI
Latest revision as of 11:50, 29 October 2023

Optimizing target CPU flags

Optimize TensorFlow for your CPU's features by enabling all of the computational optimizations the CPU provides.
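
Before rebuilding, it helps to check which SIMD extensions the CPU actually reports. The snippet below is a minimal sketch that assumes a Linux system exposing /proc/cpuinfo. The helper name has_cpu_flag is hypothetical, and the flag list is only an illustrative selection of x86 flag names, not the full set TensorFlow can use.

```shell
#!/usr/bin/env bash
# Sketch: report whether this Linux machine advertises some SIMD-related CPU flags.
# Assumes /proc/cpuinfo exists; has_cpu_flag is a hypothetical helper name.

# Succeeds if the given flag appears as a whole word in /proc/cpuinfo.
has_cpu_flag() {
  grep -q -w -- "$1" /proc/cpuinfo
}

# Illustrative selection of x86 flag names as spelled in /proc/cpuinfo.
for flag in sse4_1 sse4_2 avx avx2 fma avx512f; do
  if has_cpu_flag "$flag"; then
    echo "$flag: yes"
  else
    echo "$flag: no"
  fi
done
```

Flags reported as "yes" here are candidates for a custom build; the pre-built pip package may not use all of them.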

Compiler optimization flags

Are you wondering how much of a difference those extra CPU instructions make on your machine when you compare the generic pip packages with a build optimized for your own system?

Pre-built pip packages do not enable every optimization flag your machine supports, and the flags they do set may not be tuned for your CPU. One example is the GCC optimization level, selected with

gcc -O<number> (the letter O, not the digit zero):

  • -O0: Turns off optimization entirely. Fast compile times, good for debugging. This is the default if you don't specify a level, which is probably not what you want.
  • -O1: Basic optimization level.
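  • -O2: Recommended for most things. Loop vectorization is off (GCC 11 and older) or limited to very cheap cases (GCC 12 and newer), so SSE / AVX may not be fully used.
  • -O3: Highest standard optimization level. Adds more aggressive loop optimizations and full loop vectorization, so it can make better use of the SIMD extensions (SSE, AVX) that the target flags, such as -march, enable.
  • -Os: Optimize for size. Enables the -O2 optimizations that do not typically increase code size. Can be useful for machines with limited storage or CPUs with small caches.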

If you are building TensorFlow on the same machine that will run it, you can simply use -O3 -march=native. If you are building on a different machine, run the command below on the target machine to see which flags GCC sets for -march=native, and then pass those flags when configuring TensorFlow.

# This is the output on my machine:
$ gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
/usr/libexec/gcc/x86_64-pc-linux-gnu/7.3.0/cc1 -E -quiet -v - -march=znver1
-mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -msse4a -mcx16 -msahf -mmovbe
-maes -msha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi
-mno-sgx -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm
-mno-hle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave
-mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf
-mno-prefetchwt1 -mclflushopt -mxsavec -mxsaves -mno-avx512dq -mno-avx512bw
-mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps
-mno-avx5124vnniw -mno-clwb -mmwaitx -mclzero -mno-pku -mno-rdpid --param
l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=512
-mtune=znver1
