TensorFlow on Polaris
TensorFlow is a popular, open-source deep learning framework developed and released by Google. Refer to the TensorFlow home page for more information. For troubleshooting on Polaris, please contact [email protected].
Installation on Polaris
TensorFlow is already pre-installed on Polaris, available in the `conda` module. To use it from a compute node, please do:
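```bash
module load conda
conda activate
```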
Then, you can load TensorFlow in Python as usual (the example below uses the `conda/2024-04-29` module):
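```python
>>> import tensorflow as tf
>>> tf.__version__   # reports the TensorFlow version provided by the module
```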
This installation of TensorFlow was built from source, and the CUDA libraries it uses are found via the `CUDA_HOME` environment variable (set by the `conda/2024-04-29` module):
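```bash
echo $CUDA_HOME   # prints the CUDA toolkit path used by this TensorFlow build
```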
If you need to build applications that use this version of TensorFlow and CUDA, we recommend using these CUDA libraries to ensure compatibility. We periodically update the TensorFlow release; updates will come in the form of new versions of the `conda` module.
TensorFlow is also available through NVIDIA containers that have been translated to Apptainer containers. For more information about containers, please see the Containers documentation page.
TensorFlow Best Practices on Polaris
Single Node Performance
When running TensorFlow applications, we have found the following practices to be generally, if not universally, useful and encourage you to try some of these techniques to boost performance of your own applications.
- Use reduced precision. Reduced precision is available on the A100 via tensor cores and is supported with TensorFlow operations. In general, the way to do this is via the `tf.keras.mixed_precision` Policy, as described in the mixed precision documentation. If you use a custom training loop (and not `keras.Model.fit`), you will also need to apply loss scaling.
- Use TensorFlow's graph API to improve the efficiency of operations. TensorFlow is, in general, an imperative language, but with function decorators like `@tf.function` you can trace functions in your code. Tracing replaces your Python function with a lower-level, semi-compiled TensorFlow graph. More information about the `tf.function` interface is available here. When possible, use `jit_compile`, but be aware of sharp bits when using `tf.function`: Python expressions that aren't tensors are often replaced as constants in the graph, which may or may not be your intention.
- Use XLA compilation on your code. XLA is the Accelerated Linear Algebra library that is available in TensorFlow and critical in software like JAX. XLA will compile a `tf.Graph` object, generated with `tf.function` or similar, and perform optimizations like operation fusion. XLA can give impressive performance boosts with almost no user changes except setting the environment variable `TF_XLA_FLAGS=--tf_xla_auto_jit=2`. If your code is complex or has dynamically sized tensors (tensors whose shape changes every iteration), XLA can be detrimental: the overhead of compiling functions can be large enough to outweigh the performance improvements. XLA is particularly powerful when combined with reduced precision, yielding speedups of more than 100% in some models. A sketch combining these techniques follows this list.
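As a rough illustration of the points above, here is a minimal sketch combining a mixed-precision policy with XLA compilation via `jit_compile` (the model and data are placeholders, not a tuned benchmark; with a custom training loop you would instead decorate the step with `@tf.function(jit_compile=True)` and apply loss scaling as described in the mixed precision documentation):

```python
import tensorflow as tf

# Reduced precision: compute in float16 on the A100 tensor cores while
# keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# A toy model standing in for a real network.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128,)),
    tf.keras.layers.Dense(512, activation="relu"),
    # Keep the final activations in float32 for numerical stability.
    tf.keras.layers.Dense(10, dtype="float32"),
])

# jit_compile=True asks XLA to compile the train step; with keras.Model.fit,
# loss scaling for the float16 policy is applied automatically.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,
)

# Random placeholder data.
x = tf.random.normal((1024, 128))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=256, epochs=1)
```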
Multi-GPU / Multi-Node Scale up
TensorFlow is compatible with scaling up to multiple GPUs per node and across multiple nodes. Good scaling performance has been seen up to the entire Polaris system, > 2048 GPUs. Good performance with TensorFlow has been seen with Horovod in particular. For details, please see the Horovod documentation. Some Polaris-specific details that may be helpful to you:
- CPU affinity can improve performance, particularly for the data loading process. In particular, we encourage users to try their scaling measurements by manually setting the CPU affinity via `mpiexec`, such as with `--cpu-bind verbose,list:0,8,16,24` or `--cpu-bind depth -d 16`.
- NCCL settings: We have done extensive performance tests and identified the following best environment setup:
```bash
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
```
This setup can lead to a 2-3x performance improvement for some communication workloads. For details, please refer to https://github.com/argonne-lcf/alcf-nccl-tests.
Warning
For some applications, such as Megatron-DeepSpeed, enabling the AWS plugin can cause hangs or NCCL timeout issues. If so, please disable it.
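One way to do this, assuming the plugin is enabled purely by the `NCCL_NET` and `NCCL_COLLNET_ENABLE` settings above, is to revert them (a sketch, not the only option):

```bash
# Assumption: undoing the plugin-related settings above falls back to
# NCCL's default transport selection.
unset NCCL_NET
export NCCL_COLLNET_ENABLE=0
```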
- CUDA device setting: it works best when you limit the visible devices to only one GPU. Note that if you import `mpi4py` or `horovod` and then do something like `os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank()`, it may not actually work! You must set the `CUDA_VISIBLE_DEVICES` environment variable prior to initializing MPI, which happens in `horovod.init()` as well as implicitly in `from mpi4py import MPI`. On Polaris specifically, you can use the environment variable `PMI_LOCAL_RANK` (as well as `PMI_LOCAL_SIZE`) to learn information about the node-local MPI ranks; see the sketch after this list.
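Putting the last point together, here is a minimal sketch of pinning one GPU per rank, assuming a Horovod-based script launched with `mpiexec` (the fallback default of `"0"` is illustrative):

```python
import os

# PMI_LOCAL_RANK is set by the Polaris launcher. This assignment must happen
# before anything initializes MPI or CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("PMI_LOCAL_RANK", "0")

# Only import the frameworks after CUDA_VISIBLE_DEVICES is set.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Each rank now sees exactly one GPU, which TensorFlow names GPU:0.
print(f"rank {hvd.rank()}: {tf.config.list_physical_devices('GPU')}")
```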
TensorFlow Dataloaders
It is crucial to enable multiple workers in the data pipeline for best performance. For details, please refer to https://www.tensorflow.org/guide/data_performance
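For example, a minimal `tf.data` pipeline with parallel workers and prefetching might look like this (the `parse_fn` and the random input data are placeholders):

```python
import tensorflow as tf

def parse_fn(x):
    # Placeholder for per-sample decoding / augmentation work.
    return tf.cast(x, tf.float32) / 255.0

dataset = (
    tf.data.Dataset.from_tensor_slices(
        tf.random.uniform((1024, 32, 32, 3), maxval=255)
    )
    # AUTOTUNE lets tf.data choose the number of parallel workers.
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(128)
    # Overlap host-side data preparation with device compute.
    .prefetch(tf.data.AUTOTUNE)
)

for batch in dataset.take(1):
    print(batch.shape)
```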