PyTorch on Polaris
PyTorch is a popular, open-source deep learning framework developed and released by Facebook. The PyTorch home page has more information about the framework. For troubleshooting on Polaris, please contact [email protected].
Installation on Polaris
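PyTorch is already installed on Polaris and is available in the conda module. To use it from a compute node, load the module and activate its base environment, for example with module load conda followed by conda activate.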
Then you can import PyTorch in Python as usual; the version you see depends on the conda module that is loaded (this page was written against conda/2022-07-19).
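A quick way to confirm which PyTorch build the loaded module provides is to query it from Python. The sketch below is illustrative only: the helper name describe_pytorch is hypothetical, and no output is shown because the values depend on the module version and on whether the code runs on a GPU node.

```python
import importlib.util


def describe_pytorch():
    """Return basic facts about the PyTorch installation visible to this interpreter."""
    # find_spec lets the check degrade gracefully when no environment is active.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # False on nodes without a visible GPU, such as most login nodes.
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


info = describe_pytorch()
for key, value in info.items():
    print(f"{key}: {value}")
```

If installed is reported as False, the conda environment is most likely not active in the current shell.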
This installation of PyTorch was built from source, and the CUDA libraries it uses can be located via the CUDA_HOME environment variable. Its value is specific to the loaded conda module (conda/2022-07-19 when this page was written).
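To see where CUDA_HOME points in the current environment, it can be read from Python as well. This is a minimal sketch: the helper name cuda_library_dirs is hypothetical, and the lib64 and lib subdirectory names are an assumption about a typical CUDA toolkit layout, not something confirmed for the Polaris installation.

```python
import os
from pathlib import Path


def cuda_library_dirs(env=None):
    """Return (CUDA_HOME, existing library directories under it)."""
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        # The variable is unset, e.g. because the conda module is not loaded.
        return None, []

    root = Path(cuda_home)
    # Assumed layout: toolkits commonly keep shared libraries in lib64 or lib.
    dirs = [str(root / name) for name in ("lib64", "lib") if (root / name).is_dir()]
    return cuda_home, dirs


cuda_home, lib_dirs = cuda_library_dirs()
print("CUDA_HOME:", cuda_home)
print("library directories found:", lib_dirs)
```

The directories found this way are the ones to pass to a compiler or build system when linking against the same CUDA libraries as PyTorch.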
If you need to build applications that use this version of PyTorch and CUDA, we recommend using these CUDA libraries to ensure compatibility. We periodically update the PyTorch release; updates arrive as new versions of the conda module.
PyTorch is also available through NVIDIA containers that have been translated to Apptainer containers. For more information about containers, please see the containers documentation page.
PyTorch Best Practices on Polaris
Single Node Performance
When running PyTorch applications, we have found the following practices to be generally, if not universally, useful and encourage you to try some of these techniques to boost performance of your own applications.
- Use Reduced Precision. Reduced precision is available on the A100 via Tensor Cores and is supported by PyTorch operations. In general, the way to do this is via the PyTorch Automatic Mixed Precision (AMP) package, as described in the mixed precision documentation. In PyTorch, users generally need to manage casting and loss scaling manually, though context managers and function decorators provide easy tools to do this.
- PyTorch has a JIT module as well as backends to support op fusion, similar to TensorFlow's tf.function tools. However, PyTorch JIT capabilities are newer and may not yield performance improvements. Please see TorchScript for more information.
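The reduced-precision advice above can be sketched as a small training step that uses the autocast context manager for casting and a gradient scaler for loss scaling. The model, data, and hyperparameters here are made up for illustration. The sketch falls back to bfloat16 autocast on CPU with the scaler disabled, so it also runs on a machine without a GPU; on an A100 it uses float16 with loss scaling.

```python
import torch
from torch import nn


def train_step(model, optimizer, scaler, x, y, device_type, amp_dtype):
    """Run one optimization step with automatic mixed precision."""
    optimizer.zero_grad(set_to_none=True)
    # Inside autocast, eligible ops (e.g. linear layers) run in reduced precision.
    with torch.autocast(device_type=device_type, dtype=amp_dtype):
        pred = model(x)
        # Compute the loss in float32 for numerical stability.
        loss = nn.functional.mse_loss(pred.float(), y)
    # The scaler multiplies the loss to avoid float16 gradient underflow;
    # when disabled (CPU fallback) these calls pass straight through.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


def run_demo(steps=5):
    use_cuda = torch.cuda.is_available()
    device_type = "cuda" if use_cuda else "cpu"
    amp_dtype = torch.float16 if use_cuda else torch.bfloat16

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device_type)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

    # Synthetic data, for illustration only.
    x = torch.randn(64, 16, device=device_type)
    y = torch.randn(64, 1, device=device_type)
    return [train_step(model, optimizer, scaler, x, y, device_type, amp_dtype)
            for _ in range(steps)]


losses = run_demo()
print(losses)
```

Whether this yields a speedup depends on the model: layers dominated by large matrix multiplications or convolutions tend to benefit most from Tensor Cores.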
Multi-GPU / Multi-Node Scale up
PyTorch scales to multiple GPUs per node and across multiple nodes. Good scaling performance has been observed up to the entire Polaris system (more than 2048 GPUs), with both DDP and Horovod. For details, please see the Horovod documentation or the Distributed Data Parallel documentation. Some Polaris-specific details that may be helpful to you:
- CPU affinity and NCCL settings can improve scaling performance, particularly at the largest scales. In particular, we encourage users to try their scaling measurements with the following settings:
  - Set the environment variable NCCL_COLLNET_ENABLE=1
  - Set the environment variable NCCL_NET_GDR_LEVEL=PHB
  - Manually set the CPU affinity via mpiexec, such as with --cpu-bind verbose,list:0,8,16,24
- Horovod and DDP work best when you limit the visible devices to only one GPU. Note that if you import mpi4py or horovod and then do something like os.environ["CUDA_VISIBLE_DEVICES"] = str(hvd.local_rank()), it may not actually work! You must set the CUDA_VISIBLE_DEVICES environment variable before MPI is initialized, which happens in horovod.init() as well as implicitly in from mpi4py import MPI. On Polaris specifically, you can use the environment variable PMI_LOCAL_RANK (as well as PMI_LOCAL_SIZE) to learn information about the node-local MPI ranks.
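The ordering requirement in the last item can be sketched as follows: read PMI_LOCAL_RANK, set CUDA_VISIBLE_DEVICES, and only then import the MPI-backed libraries. The helper names are hypothetical. The sketch assumes one MPI rank per GPU, so that the node-local rank is a valid GPU index, and it treats the NCCL values listed above as defaults that a job script may override.

```python
import os


def apply_nccl_settings(env=None):
    """Set the NCCL variables suggested above unless the job already set them."""
    env = os.environ if env is None else env
    env.setdefault("NCCL_COLLNET_ENABLE", "1")
    env.setdefault("NCCL_NET_GDR_LEVEL", "PHB")


def select_local_gpu(env=None):
    """Restrict this process to the GPU matching its node-local MPI rank.

    Returns the local rank, or None when PMI_LOCAL_RANK is not set
    (for example when the script is not launched through mpiexec).
    """
    env = os.environ if env is None else env
    local_rank = env.get("PMI_LOCAL_RANK")
    if local_rank is None:
        return None
    # Environment values must be strings; assumes one rank per GPU.
    env["CUDA_VISIBLE_DEVICES"] = str(int(local_rank))
    return int(local_rank)


# Both calls must run before anything initializes MPI or CUDA.
apply_nccl_settings()
local_rank = select_local_gpu()

# Only after this point import the MPI-backed libraries, for example:
#   from mpi4py import MPI
#   import horovod.torch as hvd
#   hvd.init()
```

Once CUDA_VISIBLE_DEVICES names a single device, that GPU appears to the process as device 0, so code that selects a device by local rank should use index 0 instead.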
DeepSpeed
DeepSpeed is also available and usable on Polaris. For more information, please see the DeepSpeed documentation directly.
PyTorch DataLoader and multi-node Horovod
Please note that there is a bug that causes a hang when using PyTorch's multithreaded data loaders with distributed training across multiple nodes. To work around this, NVIDIA recommends setting num_workers=0 in the dataloader configuration, which serializes data loading.
For more details, see Polaris Known Issues.
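As a minimal sketch of the workaround, the loader below is built with num_workers=0, so batches are produced in the main process. The dataset is synthetic and the batch size is arbitrary; both are for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic dataset: 32 samples with 8 features and a binary label.
features = torch.randn(32, 8)
labels = torch.randint(0, 2, (32,))
dataset = TensorDataset(features, labels)

# num_workers=0 loads every batch in the main process (no worker subprocesses).
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)

batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)
```

Because loading is serialized, input pipelines with heavy preprocessing may become the bottleneck; it is worth measuring throughput after applying the workaround.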