TensorFlow on Aurora
TensorFlow is a popular, open-source deep learning framework developed and released by Google. The TensorFlow home page has more information about TensorFlow, which you can refer to. For trouble shooting on Polaris, please contact [email protected].
Provided Installation
TensorFlow is already preinstalled on Aurora, available in the frameworks
module. To use it from a compute node, load the module:
Then you can import
TensorFlow as usual, the following is an output from the
frameworks
module:
A simple but useful check could be to use TensorFlow to get device information on a compute node. You can do this the following way:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:1', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:2', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:3', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:4', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:5', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:6', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:7', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:8', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:9', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:10', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:11', device_type='XPU')]
Note that, here tf.config
return 12 tiles of 6 cards (the number of GPU
resources on an Aurora compute node), and treat each tile as a device. The user
can choose to set the environmental variable ZE_FLAT_DEVICE_HIERARCHY
with
appropriate values to achieve desired behavior, as described in the
Level Zero Specification documentation.
This environment variable is equivalent to the ITEX_TILE_AS_DEVICE
, which is
to be deprecated soon.
Intel extension for TensorFLow is has been made publicly available as an open-source project at GitHub.
Please consult the following resources for additional details and useful tutorials:
TensorFlow Best Practices on Aurora
Single Device Performance
To expose one particular device out of the 6 available on a compute node, this environmental variable should be set
export ZE_AFFINITY_MASK=0.0,0.1
# The values taken by this variable follows the syntax `Device.Sub-device`
More information and details are available through Level Zero Specification Documentation - Affinity Mask
Single Node Performance
When running TensorFlow applications, we have found the following practices to be generally, if not universally, useful and encourage you to try some of these techniques to boost performance of your own applications.
Reduced Precision
Use Reduced Precision, whenever the application allows. Reduced Precision is
available on Intel Max 1550 and is supported with TensorFlow operations. In
general, the way to do this is via the tf.keras.mixed_precision
Policy, as
descibed in the
mixed precision documentation
Intel's extension for TensorFlow is fully compatible with the Keras mixed
precision API in TensorFlow. It also provides an advanced auto mixed precision
feature. For example, you can just set two environment variables to get the
performance benefit from low-precision data type FP16
/BF16
without changing the
application code.
export ITEX_AUTO_MIXED_PRECISION=1
export ITEX_AUTO_MIXED_PRECISION_DATA_TYPE="BFLOAT16" # or "FLOAT16"
keras.Model.fit
), you will also
need to apply loss scaling.
TensorFlow's graph API
Use TensorFlow's graph API to improve efficiency of operations. TensorFlow is,
in general, an imperative language but with function decorators like
@tf.function
you can trace functions in your code. Tracing replaces your
python function with a lower-level, semi-compiled TensorFlow Graph. More
information about the tf.function
interface is available
here.
When possible, use jit_compile
, but be aware of sharp bits when using
tf.function
: python expressions that aren't tensors are often replaced as
constants in the graph, which may or may not be your intention.
There is an experimental feature, which allows for aggressive fusion of kernels through oneDNN Graph API. Intel's extension for TensorFlow can offload performance critical graph partitions to oneDNN library to get more aggressive graph optimizations. It can be done by setting this environmental variable:
This feature is experimental, and actively under development.TF32
Math Mode
The Intel Xe Matrix Extensions (Intel XMX) engines in Intel Max 1550 Xe-HPC
GPUs natively support TF32
math mode. Through intel extension for tensorflow
you can enable it by setting the following environmental variable:
XLA Compilation (Planned/Upcoming)
XLA is the Accelerated Linear Algebra library that is available in TensorFlow
and critical in software like JAX. XLA will compile a tf.Graph
object,
generated with tf.function
or similar, and perform optimizations like
operation-fusion. XLA can give impressive performance boosts with almost no
user changes except to set an environment variable TF_XLA_FLAGS=--tf_xla_auto_jit=2
.
If your code is complex, or has dynamically sized tensors (tensors where the
shape changes every iteration), XLA can be detrimental: the overhead for
compiling functions can be large enough to mitigate performance improvements.
XLA is particularly powerful when combined with reduced precision,
yielding speedups > 100% in some models.
Intel provides initial intel GPU support for TensorFlow models with XLA acceleration through Intel Extension for OpenXLA. Full TensorFlow and PyTorch support is planned for development.
A simple example
A simple example on how to use Intel GPU with TensorFlow is the following:
Multi-GPU / Multi-Node Scale Up
TensorFlow is compatible with scaling up to multiple GPUs per node, and across multiple nodes. Good performance with tensorFlow has been seen with horovod in particular. For details, please see the Horovod documentation. Some Aurora specific details might be helpful to you.
Environment Variables
The following environmental variables should be set on the batch submission script (PBSPro script) in the case of attempting to run beyond 16 nodes.
oneCCL environment variables
We have identified a set of environment settings that typically provides better performance or addresses potential application hangs and crashes at large scale. This particular setup is still experimental, and it might change as the environment variable settings are refined. The users are encouraged to check this page regularly.
export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE_SCALEOUT=direct:0-1048576;rabenseifner:1048577-max # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale
export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600
export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1
export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
The following additional set of environment variable setup might be application dependent. Users are encourage to try to set them and see whether they help their applications.
ulimit -c unlimited
export FI_MR_ZE_CACHE_MONITOR_ENABLED=0
export FI_MR_CACHE_MONITOR=disabled
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_DEFAULT_CQ_SIZE=1048576
export FI_CXI_CQ_FILL_PERCENT=30
export INTELGT_AUTO_ATTACH_DISABLE=1
export PALS_PING_PERIOD=240
export PALS_RPC_TIMEOUT=240
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 #to solve the sync send issue in Horovod seg fault
export CCL_ATL_SYNC_COLL=1 #to avoid potential hang at large scale
export CCL_OP_SYNC=1 #to avoid potential hang at large scale
These environment variable settings will probably be included in the framework module file in the future. But for now, users need to explicitly set these in the submission script.
CPU Affinity
The CPU affinity should be set manually through mpiexec. You can do this the following way:
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
mpiexec ... --cpu-bind=${CPU_BIND}
These bindings should be use along with the following oneCCL and Horovod environment variable settings:
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
CCL_WORKER_AFFINITY="5,13,21,29,37,45,57,65,73,81,89,97"
When running 12 ranks per node with these settings the framework
s use 3 cores,
with Horovod tightly coupled with the framework
s using one of the 3 cores, and
oneCCL using a separate core for better performance, eg. with rank 0 the
framework
s would use cores 2,3,4, Horovod would use core 4, and oneCCL would
use core 5.
Each workload may perform better with different settings. The criteria for choosing the cpu bindings are:
- Binding for GPU and NIC affinity – To bind the ranks to cores on the proper socket or NUMA nodes.
- Binding for cache access – This is the part that will change per application and some experimentation is needed.
Note
This setup is a work in progress, and based on observed performance. The recommended settings are likely to changed with new framework
releases.
Distributed Training
Distributed training with TensorFlow on Aurora is facilitated through Horovod, using Intel Optimization for Horovod.
The key steps in performing distributed training are laid out in the following example:
Detailed implementation of the same example is here:
A suite of detailed and well documented examples is part of Intel's optimization for Horovod repository:
A simple Job Script
Below we give a simple job script:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 |
|