PyTorch on Polaris
PyTorch is a popular, open-source deep learning framework developed and released by Facebook (now Meta). The PyTorch home page has more information about PyTorch. For troubleshooting on Polaris, please contact the ALCF support team.
Installation on Polaris
PyTorch is already installed on Polaris and is available through the conda module, which you can load from a compute node.
You can then import PyTorch in Python as usual; the details on this page were taken from the conda/2024-04-29 module.
This installation of PyTorch was built from source, and the CUDA libraries it uses are found via the CUDA_HOME environment variable (as checked with the conda/2024-04-29 module).
If you need to build applications that use this version of PyTorch and CUDA, we recommend using these CUDA libraries to ensure compatibility. We periodically update the PyTorch release, and updates arrive as new versions of the conda module.
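Before building against this installation, it can help to confirm which PyTorch build and CUDA toolkit are active in the current environment. The following is a minimal sketch, not an official ALCF tool: describe_install is a hypothetical helper name, and it only reads the CUDA_HOME variable plus the standard torch.__version__ and torch.version.cuda attributes when PyTorch is importable.

```python
import importlib.util
import os


def describe_install(environ=os.environ):
    """Collect the CUDA_HOME path and, if available, PyTorch build details."""
    info = {"cuda_home": environ.get("CUDA_HOME")}  # None if the variable is unset
    # Only import torch when it is installed, so the helper also runs elsewhere.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch_version"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    return info


if __name__ == "__main__":
    for key, value in describe_install().items():
        print(f"{key}: {value}")
```

If the reported CUDA version and the toolkit under CUDA_HOME disagree, that is usually a sign that a different module or environment is active than the one intended.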
PyTorch is also available through NVIDIA containers that have been translated to Apptainer containers. For more information about containers, please see the containers documentation page.
PyTorch Best Practices on Polaris
Single Node Performance
When running PyTorch applications, we have found the following practices to be generally, if not universally, useful and encourage you to try some of these techniques to boost performance of your own applications.
- Use reduced precision. Reduced precision is available on the A100 via Tensor Cores and is supported by PyTorch operations. In general, the way to do this is via the PyTorch Automatic Mixed Precision (AMP) package, as described in the mixed precision documentation. In PyTorch, users generally need to manage casting and loss scaling manually, though context managers and function decorators provide easy tools to do this.
- PyTorch has a JIT module as well as backends to support op fusion, similar to TensorFlow's tf.function tools. However, PyTorch JIT capabilities are newer and may not yield performance improvements. Please see TorchScript for more information.
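To make the reduced-precision item concrete, here is a minimal sketch of one mixed-precision training step using the torch.autocast context manager and a gradient scaler. The model, data shapes, and the train_step helper are placeholders chosen for illustration, not part of any Polaris application. The sketch falls back to bfloat16 on CPU so it can also run without a GPU.

```python
import torch
from torch import nn


def train_step(model, optimizer, scaler, inputs, targets, device_type):
    """Run one optimizer step with the forward pass under autocast."""
    optimizer.zero_grad(set_to_none=True)
    # float16 uses the A100 Tensor Cores; bfloat16 is the CPU fallback here.
    dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device_type, dtype=dtype):
        outputs = model(inputs)
    # Compute the loss in float32 for numerical stability.
    loss = (outputs.float() - targets).pow(2).mean()
    # The scaler multiplies the loss to avoid float16 gradient underflow,
    # then unscales the gradients before the optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


device_type = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 4).to(device_type)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Loss scaling is only needed for float16 on GPU; disabled otherwise.
scaler = torch.cuda.amp.GradScaler(enabled=(device_type == "cuda"))

inputs = torch.randn(8, 16, device=device_type)
targets = torch.randn(8, 4, device=device_type)
loss_value = train_step(model, optimizer, scaler, inputs, targets, device_type)
print(f"loss after one step: {loss_value:.4f}")
```

Recent PyTorch releases also expose the scaler as torch.amp.GradScaler and may emit a deprecation warning for the torch.cuda.amp spelling; which form to prefer depends on the PyTorch version in the loaded conda module.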
Multi-GPU / Multi-Node Scale up
PyTorch is compatible with scaling up to multiple GPUs per node, and across multiple nodes. Good scaling performance has been seen up to the entire Polaris system (more than 2048 GPUs). Good performance with PyTorch has been seen with both DDP and Horovod. For details, please see the Horovod documentation or the Distributed Data Parallel documentation. Some Polaris-specific details that may be helpful to you:
- CPU affinity can improve performance, particularly for the data loading process. In particular, we encourage users to try their scaling measurements by manually setting the CPU affinity via mpiexec, such as with --cpu-bind verbose,list:0,8,16,24 or --cpu-bind depth -d 16.
- NCCL settings: We have done extensive performance tests and identified the following as the best environment setup.
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
This setup can lead to 2-3x performance improvement for some communication workloads. For details, please refer to: https://github.com/argonne-lcf/alcf-nccl-tests.
Warning
For some applications, such as Megatron-DeepSpeed, enabling the AWS plugin can cause hangs or NCCL timeouts. If this happens, please disable the plugin.
- CUDA device setting: it works best when you limit the visible devices to only one GPU. Note that if you import mpi4py or horovod, and then do something like os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank(), it may not actually work! You must set the CUDA_VISIBLE_DEVICES environment variable before MPI is initialized, which happens in horovod.init() as well as implicitly in from mpi4py import MPI. On Polaris specifically, you can use the environment variable PMI_LOCAL_RANK (as well as PMI_LOCAL_SIZE) to learn information about the node-local MPI ranks.
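The ordering requirement in the CUDA device item can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed recipe: pin_gpu_to_local_rank is a hypothetical helper name, and the sketch assumes one MPI rank per GPU so that the node-local rank can be used directly as the device index. Note that environment variable values must be strings.

```python
import os


def pin_gpu_to_local_rank(environ=os.environ):
    """Expose only the GPU that matches this rank's node-local index.

    Returns the local rank as an int, or None when PMI_LOCAL_RANK is not
    set (for example when running outside of mpiexec).
    """
    local_rank = environ.get("PMI_LOCAL_RANK")
    if local_rank is None:
        return None
    # Values in os.environ must be strings, so the rank is stored as-is.
    environ["CUDA_VISIBLE_DEVICES"] = local_rank
    return int(local_rank)


# Call this first, before anything that initializes MPI or CUDA.
rank_on_node = pin_gpu_to_local_rank()

# Only after the variable is set should MPI-initializing packages be imported:
# from mpi4py import MPI
# import horovod.torch as hvd
```

With this in place, each process sees a single device, so code that selects device 0 picks the GPU assigned to that rank.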
DeepSpeed
DeepSpeed is also available and usable on Polaris. For more information, please see the DeepSpeed documentation directly.
PyTorch DataLoader and multi-node Horovod
For best performance, it is crucial to enable multiple workers in the data loader so that compute overlaps with I/O and the dataset is loaded concurrently. This can be set by tuning the num_workers parameter of DataLoader (see https://pytorch.org/docs/stable/data.html). In our experience, a value of 4 or 8 generally gives the best performance. Because of the total number of CPU cores available on a node, the maximum number of workers one can choose is 16. It is always best to tune this value and find the optimal setup for your own application.
Aside from this, one also has to make sure that the worker threads are spread over different CPU cores. To do this, specify the CPU binding as depth and choose a depth value larger than num_workers in the mpiexec command, for example with --cpu-bind depth -d 16.
Before 2024, enabling multiple workers would cause a fatal hang, but this has been addressed after an OS upgrade on Polaris.
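The relationship between num_workers and the binding depth can be sketched with a small helper that builds the mpiexec binding flags. This is only an illustration: cpu_bind_flags, launch_command, and the script name train.py are hypothetical, and the cap of 16 is an assumption taken from the -d 16 example and the 16-worker limit mentioned on this page.

```python
import shlex

MAX_DEPTH = 16  # assumption: per-rank limit implied by the `-d 16` example


def cpu_bind_flags(num_workers, max_depth=MAX_DEPTH):
    """Return mpiexec flags that bind each rank to enough cores.

    The depth is one more than num_workers (room for the main process),
    capped at max_depth.
    """
    if not 0 <= num_workers <= max_depth:
        raise ValueError(f"num_workers must be between 0 and {max_depth}")
    depth = min(num_workers + 1, max_depth)
    return ["--cpu-bind", "depth", "-d", str(depth)]


def launch_command(script, num_workers):
    """Build a shell-quoted command line; other mpiexec options are omitted."""
    return shlex.join(["mpiexec", *cpu_bind_flags(num_workers), "python", script])


print(launch_command("train.py", num_workers=8))
```

The same num_workers value would then be passed to DataLoader in the training script, so that the number of worker processes and the cores reserved for each rank stay consistent.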