TensorFlow on Polaris
TensorFlow is a popular, open-source deep learning framework developed and released by Google. Refer to the TensorFlow home page for more information. For troubleshooting on Polaris, please contact [email protected].
Installation on Polaris
TensorFlow is already pre-installed on Polaris, available in the `conda` module. To use it from a compute node, please do:
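```bash
module load conda
conda activate
```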
Then, you can load TensorFlow in Python as usual (the example below uses the `conda/2024-04-29` module):
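```python
>>> import tensorflow as tf
>>> tf.__version__   # reports the TensorFlow version provided by the module
```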
This installation of TensorFlow was built from source, and the CUDA libraries it uses are found via the `CUDA_HOME` environment variable (set by the `conda/2024-04-29` module):
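```bash
echo $CUDA_HOME   # prints the CUDA toolkit path used by this TensorFlow build
```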
If you need to build applications that use this version of TensorFlow and CUDA, we recommend using these CUDA libraries to ensure compatibility. We periodically update the TensorFlow release; updates will come in the form of new versions of the `conda` module.
TensorFlow is also available through NVIDIA containers that have been translated to Apptainer containers. For more information about containers, please see the Containers documentation page.
TensorFlow Best Practices on Polaris
Single Node Performance
When running TensorFlow applications, we have found the following practices to be generally, if not universally, useful and encourage you to try some of these techniques to boost performance of your own applications.
- Use reduced precision. Reduced precision is available on the A100 via tensor cores and is supported with TensorFlow operations. In general, the way to do this is via the `tf.keras.mixed_precision` Policy, as described in the mixed precision documentation. If you use a custom training loop (and not `keras.Model.fit`), you will also need to apply loss scaling.
- Use TensorFlow's graph API to improve the efficiency of operations. TensorFlow is, in general, an imperative language, but with function decorators like `@tf.function` you can trace functions in your code. Tracing replaces your Python function with a lower-level, semi-compiled TensorFlow graph. More information about the `tf.function` interface is available here. When possible, use `jit_compile`, but be aware of sharp bits when using `tf.function`: Python expressions that aren't tensors are often replaced as constants in the graph, which may or may not be your intention.
- Use XLA compilation on your code. XLA is the Accelerated Linear Algebra library that is available in TensorFlow and critical in software like JAX. XLA will compile a `tf.Graph` object, generated with `tf.function` or similar, and perform optimizations like operation fusion. XLA can give impressive performance boosts with almost no user changes except setting the environment variable `TF_XLA_FLAGS=--tf_xla_auto_jit=2`. If your code is complex or has dynamically sized tensors (tensors whose shape changes every iteration), XLA can be detrimental: the overhead of compiling functions can be large enough to outweigh the performance improvements. XLA is particularly powerful when combined with reduced precision, yielding speedups of more than 100% in some models. A sketch combining these techniques follows this list.
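As a rough illustration of the points above, here is a minimal sketch combining a mixed-precision policy with XLA compilation via `jit_compile` (the model and data are placeholders, not a tuned benchmark; with a custom training loop you would instead decorate the step with `@tf.function(jit_compile=True)` and apply loss scaling as described in the mixed precision documentation):

```python
import tensorflow as tf

# Reduced precision: compute in float16 on the A100 tensor cores while
# keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# A toy model standing in for a real network.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128,)),
    tf.keras.layers.Dense(512, activation="relu"),
    # Keep the final activations in float32 for numerical stability.
    tf.keras.layers.Dense(10, dtype="float32"),
])

# jit_compile=True asks XLA to compile the train step; with keras.Model.fit,
# loss scaling for the float16 policy is applied automatically.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,
)

# Random placeholder data.
x = tf.random.normal((1024, 128))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=256, epochs=1)
```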
Multi-GPU / Multi-Node Scale up
TensorFlow is compatible with scaling up to multiple GPUs per node and across multiple nodes. Good scaling performance has been seen up to the entire Polaris system, > 2048 GPUs. Good performance with TensorFlow has been seen with Horovod in particular. For details, please see the Horovod documentation. Some Polaris-specific details that may be helpful to you:
- CPU affinity can improve performance, particularly for the data loading process. In particular, we encourage users to try their scaling measurements by manually setting the CPU affinity via `mpiexec`, such as with `--cpu-bind verbose,list:0,8,16,24` or `--cpu-bind depth -d 16`.
- NCCL settings: We have done extensive performance tests and identified the following best environment setup:
```bash
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
```
This setup can lead to a 2-3x performance improvement for some communication workloads. For details, please refer to https://github.com/argonne-lcf/alcf-nccl-tests.
Warning
For some applications, such as Megatron-DeepSpeed, enabling the AWS plugin can cause hangs or NCCL timeout issues. If so, please disable it.
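One way to do this, assuming the plugin is enabled purely by the `NCCL_NET` and `NCCL_COLLNET_ENABLE` settings above, is to revert them (a sketch, not the only option):

```bash
# Assumption: undoing the plugin-related settings above falls back to
# NCCL's default transport selection.
unset NCCL_NET
export NCCL_COLLNET_ENABLE=0
```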
- CUDA device setting: it works best when you limit the visible devices to only one GPU. Note that if you import `mpi4py` or `horovod` and then do something like `os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank()`, it may not actually work! You must set the `CUDA_VISIBLE_DEVICES` environment variable prior to initializing MPI, which happens in `horovod.init()` as well as implicitly in `from mpi4py import MPI`. On Polaris specifically, you can use the environment variable `PMI_LOCAL_RANK` (as well as `PMI_LOCAL_SIZE`) to learn information about the node-local MPI ranks; see the sketch after this list.
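Putting the last point together, here is a minimal sketch of pinning one GPU per rank, assuming a Horovod-based script launched with `mpiexec` (the fallback default of `"0"` is illustrative):

```python
import os

# PMI_LOCAL_RANK is set by the Polaris launcher. This assignment must happen
# before anything initializes MPI or CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("PMI_LOCAL_RANK", "0")

# Only import the frameworks after CUDA_VISIBLE_DEVICES is set.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Each rank now sees exactly one GPU, which TensorFlow names GPU:0.
print(f"rank {hvd.rank()}: {tf.config.list_physical_devices('GPU')}")
```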
TensorFlow Dataloaders
It is crucial to enable multiple workers in the data pipeline for best performance. For details, please refer to https://www.tensorflow.org/guide/data_performance
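For example, a minimal `tf.data` pipeline with parallel workers and prefetching might look like this (the `parse_fn` and the random input data are placeholders):

```python
import tensorflow as tf

def parse_fn(x):
    # Placeholder for per-sample decoding / augmentation work.
    return tf.cast(x, tf.float32) / 255.0

dataset = (
    tf.data.Dataset.from_tensor_slices(
        tf.random.uniform((1024, 32, 32, 3), maxval=255)
    )
    # AUTOTUNE lets tf.data choose the number of parallel workers.
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(128)
    # Overlap host-side data preparation with device compute.
    .prefetch(tf.data.AUTOTUNE)
)

for batch in dataset.take(1):
    print(batch.shape)
```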