PyTorch on Aurora
PyTorch is a popular, open-source deep learning framework developed and released by Facebook. Refer to the PyTorch home page for more information. For troubleshooting on Aurora, please contact [email protected].
Provided Installation
PyTorch is already installed on Aurora with GPU support and available through the frameworks module. To use it from a compute node, please load the following modules:
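module load frameworks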
Then you can import PyTorch as usual. For example, you can verify the installation provided by the frameworks module:
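import torch
print(torch.__version__)  # print the PyTorch version provided by the frameworks module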
A simple but useful check is to use PyTorch to get device information on a compute node. You can do this, for example, through the torch.xpu interface that intel_extension_for_pytorch adds to PyTorch:
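import torch
import intel_extension_for_pytorch as ipex  # extends torch with the xpu backend

# Query the Intel GPUs visible to PyTorch on this node
print(f"GPU availability: {torch.xpu.is_available()}")
print(f"Number of tiles = {torch.xpu.device_count()}")
current_device = torch.xpu.current_device()
print(f"Current tile = {current_device}")
print(f"Current device ID = {torch.xpu.device(current_device)}")
print(f"Device properties = {torch.xpu.get_device_properties(current_device)}")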
Output of the above code block:
GPU availability: True
Number of tiles = 12
Current tile = 0
Current device ID = <intel_extension_for_pytorch.xpu.device object at 0x1540a9f25790>
Device properties = _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', type='gpu', driver_version='1.3.30872', total_memory=65536MB, max_compute_units=448, gpu_eu_count=448, gpu_subslice_count=56, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
Note that, in addition to the torch module, you need to import the intel_extension_for_pytorch module. The default mode in IPEX for counting the available devices on a compute node treats each tile (also called a "Sub-device") as a torch device, hence the code block above is expected to output 12. If you want to get the number of GPUs (also called "Devices" or "cards") as an output, you may declare the following environment variable:
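export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE  # count each GPU (two tiles) as one device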
With this environment variable set, we expect the output to be 6, the number of GPUs available on an Aurora compute node. All the API calls involving torch.cuda should be replaced with torch.xpu, as shown in the example above.
Tip
It is highly recommended to import intel_extension_for_pytorch right after import torch, prior to importing other packages (from Intel's "Getting Started" doc).
Intel Extension for PyTorch has been made publicly available as an open-source project on GitHub. Please consult the following resources for additional details and useful tutorials:
- PyTorch's webpage for Intel extension
- Intel's IPEX GitHub repository
- Intel's IPEX Documentation
PyTorch Best Practices on Aurora
Single Device Performance
By default, each tile is mapped to one PyTorch device, giving a total of 12 devices per node, as seen above. To map a PyTorch device to one particular GPU Device out of the 6 available on a compute node, these environment variables should be set:
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0
# or, equivalently, following the syntax `Device.Sub-device`
export ZE_AFFINITY_MASK=0.0,0.1
These settings expose only Device: 0 and Sub-devices: 0, 1, i.e. the two tiles of GPU 0, as a single PyTorch device. This is particularly important in setting a performance benchmarking baseline.
Setting the above environment variables after loading the frameworks module, you can check that each PyTorch device is now mapped to one GPU:
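import torch
import intel_extension_for_pytorch as ipex

# With ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE and ZE_AFFINITY_MASK=0, a single
# device is visible, and its properties correspond to a full GPU (two tiles)
print(torch.xpu.device_count())
print(torch.xpu.get_device_properties(0))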
Example output
1
_XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', type='gpu', driver_version='1.3.30872', total_memory=131072MB, max_compute_units=896, gpu_eu_count=896, gpu_subslice_count=112, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
More information and details are available through the Level Zero Specification Documentation - Affinity Mask
Single Node Performance
When running PyTorch applications, we have found the following practices to be generally, if not universally, useful and encourage you to try some of these techniques to boost performance of your own applications.
- Use Reduced Precision. Reduced precision is available on the Intel Max 1550 and is supported with PyTorch operations. In general, the way to do this is via the PyTorch Automatic Mixed Precision (AMP) package, as described in the mixed precision documentation. In PyTorch, users generally need to manage casting and loss scaling manually, though context managers and function decorators can provide easy tools to do this (see the sketch after this list).
- PyTorch has a JIT module as well as backends to support op fusion, similar to TensorFlow's tf.function tools. Please see TorchScript for more information.
- torch.compile will be available through the next frameworks release.
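A minimal mixed-precision sketch on the XPU backend, assuming a PyTorch/IPEX pairing where torch.autocast accepts device_type="xpu" (the toy model and tensor shapes are illustrative only):

import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(1024, 1024).to("xpu")
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(32, 1024, device="xpu")
target = torch.randn(32, 1024, device="xpu")

# Eligible operations inside the autocast region run in bfloat16;
# with bf16 (unlike fp16) no gradient loss scaling is required
with torch.autocast(device_type="xpu", dtype=torch.bfloat16):
    output = model(data)
    loss = torch.nn.functional.mse_loss(output, target)

loss.backward()
optimizer.step()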
Multi-GPU / Multi-Node Scale Up
PyTorch is compatible with scaling up to multiple GPUs per node, and across multiple nodes. Good performance with PyTorch has been seen with both Distributed Data Parallel (DDP) and Horovod. For details, please see the Distributed Data Parallel documentation or the Horovod documentation. Some of the Aurora-specific details below might be helpful to you:
Environment Variables
The following environment variables should be set in the batch submission script (PBS Pro script) when attempting to run on more than 16 nodes.
oneCCL environment variables
We have identified a set of environment settings that typically provides better performance or addresses potential application hangs and crashes at large scale. This particular setup is still experimental, and it might change as the environment variable settings are refined. Users are encouraged to check this page regularly.
export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE_SCALEOUT="direct:0-1048576;rabenseifner:1048577-max" # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale
export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600
export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1
export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
The following additional environment variable settings might be application dependent. Users are encouraged to try setting them and see whether they help their applications.
ulimit -c unlimited
export FI_MR_ZE_CACHE_MONITOR_ENABLED=0
export FI_MR_CACHE_MONITOR=disabled
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_DEFAULT_CQ_SIZE=1048576
export FI_CXI_CQ_FILL_PERCENT=30
export INTELGT_AUTO_ATTACH_DISABLE=1
export PALS_PING_PERIOD=240
export PALS_RPC_TIMEOUT=240
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 # to solve the sync send issue causing Horovod seg faults
export CCL_ATL_SYNC_COLL=1 # to avoid potential hang at large scale
export CCL_OP_SYNC=1 # to avoid potential hang at large scale
These environment variable settings will probably be included in the frameworks module file in the future, but for now users need to set them explicitly in the submission script.
In order to run an application with the TF32 precision type, one must set an environment parameter that selects TF32 math mode, as opposed to the default FP32; this control is provided through the intel_extension_for_pytorch module.
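A minimal sketch, assuming the FP32 math-mode controls documented for recent IPEX releases (the IPEX_FP32_MATH_MODE environment variable and the ipex.set_fp32_math_mode API); check the IPEX documentation for the exact names in your installed version:

import torch
import intel_extension_for_pytorch as ipex

# Assumption: recent IPEX releases expose an FP32 math-mode switch; setting
# IPEX_FP32_MATH_MODE=TF32 in the environment before launch should be equivalent
ipex.set_fp32_math_mode(mode=ipex.FP32MathMode.TF32, device="xpu")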
CPU Affinity
The CPU affinity should be set manually through mpiexec. You can do this in the following way:
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
mpiexec ... --cpu-bind=${CPU_BIND}
These bindings should be used along with the following oneCCL and Horovod environment variable settings:
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
CCL_WORKER_AFFINITY="5,13,21,29,37,45,57,65,73,81,89,97"
When running 12 ranks per node with these settings, the frameworks use 3 cores, with Horovod tightly coupled with the frameworks using one of the 3 cores, and oneCCL using a separate core for better performance. For example, with rank 0 the frameworks would use cores 2, 3, 4, Horovod would use core 4, and oneCCL would use core 5.
Each workload may perform better with different settings. The criteria for choosing the CPU bindings are:
- Binding for GPU and NIC affinity – To bind the ranks to cores on the proper socket or NUMA nodes.
- Binding for cache access – This is the part that will change per application and some experimentation is needed.
Important: This setup is a work in progress and is based on observed performance. The recommended settings are likely to change with new frameworks releases.
Distributed Training
Distributed training with PyTorch on Aurora is facilitated through both DDP and Horovod. DDP training is accelerated using the oneAPI Collective Communications Library Bindings for PyTorch (oneCCL Bindings for PyTorch). The extension supports FP32 and BF16 data types. More detailed information and examples are available at the Intel oneCCL repo, formerly known as torch-ccl.
The key steps in performing distributed training using oneccl_bindings_for_pytorch
are the following:
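A minimal sketch of these steps is given below, assuming 12 ranks per node launched with mpiexec; the launcher-provided environment variables (PMI_RANK, PMI_SIZE, PALS_LOCAL_RANKID) and the default MASTER_ADDR/MASTER_PORT values are assumptions that may need adjusting for your setup:

import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch as ipex   # XPU device support
import oneccl_bindings_for_pytorch           # registers the "ccl" backend

# Rank and world size as typically exported by the MPI launcher (assumption)
rank = int(os.environ.get("PMI_RANK", 0))
world_size = int(os.environ.get("PMI_SIZE", 1))
local_rank = int(os.environ.get("PALS_LOCAL_RANKID", 0))
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

# Initialize the process group with the oneCCL ("ccl") backend
dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)

# Pin each rank to one XPU device (tile) and wrap the model with DDP
device = torch.device(f"xpu:{local_rank % torch.xpu.device_count()}")
model = torch.nn.Linear(128, 128).to(device)
model = torch.nn.parallel.DistributedDataParallel(model)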
A detailed example of the full procedure with a toy model is given here:
Below we give an example PBS job script:
example_torch_dist_training.sh
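A minimal sketch of such a submission script, assuming 2 nodes, 12 ranks per node, and the frameworks module; the project name, queue, and training script name (train_ddp.py) are placeholders:

#!/bin/bash -l
#PBS -A <your_project>          # placeholder: your project allocation
#PBS -q <queue>                 # placeholder: e.g. a debug or production queue
#PBS -l select=2
#PBS -l walltime=00:30:00

cd ${PBS_O_WORKDIR}
module load frameworks

# oneCCL / runtime settings from the sections above (abbreviated)
export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_KVS_MODE=mpi

NNODES=$(wc -l < ${PBS_NODEFILE})
RANKS_PER_NODE=12               # one rank per GPU tile
NRANKS=$(( NNODES * RANKS_PER_NODE ))

CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
export HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
export CCL_WORKER_AFFINITY="5,13,21,29,37,45,57,65,73,81,89,97"

# train_ddp.py is a placeholder for your training script
mpiexec -n ${NRANKS} --ppn ${RANKS_PER_NODE} --cpu-bind=${CPU_BIND} \
    python train_ddp.py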