DeepSpeed
The base conda environment on Polaris comes with Microsoft's DeepSpeed pre-installed. Instructions for using/cloning the base environment can be found here.
A batch submission script for the following example is available here.
We describe below the steps needed to get started with DeepSpeed on Polaris. We focus on the `cifar` example provided in the DeepSpeedExamples repository, though this approach should be generally applicable for running any model with DeepSpeed support.
Running DeepSpeed on Polaris
Note
The instructions below should be run directly from a compute node.
Explicitly, to request an interactive job (from `polaris-login`):
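A minimal sketch of such a request is shown below; the project name is a placeholder, and the queue, node count, walltime, and filesystems should be adjusted for your allocation:

```bash
# request a 2-node interactive job on the debug queue (values are illustrative)
qsub -I \
    -A <PROJECT> \
    -q debug \
    -l select=2 \
    -l walltime=01:00:00 \
    -l filesystems=home:eagle
```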
Refer to job scheduling and execution for additional information.
- Load the `conda` module and activate the base environment (see the combined sketch below):
- Clone microsoft/DeepSpeedExamples and navigate into the directory:
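Together, these two steps might look like the following sketch; the module/environment names follow the base-environment instructions linked above, and the location of the cifar example inside the repository may differ between checkouts:

```bash
module load conda
conda activate base
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/cifar   # newer checkouts place this under DeepSpeedExamples/training/cifar
```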
Launching DeepSpeed
- Get the total number of available GPUs (see the sketch below):
    - Count the number of lines in `$PBS_NODEFILE` (1 host per line)
    - Count the number of GPUs available on the current host
    - `NGPUS="$((${NHOSTS}*${NGPU_PER_HOST}))"`
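Concretely, this might look like the following (the same commands appear again in the troubleshooting block at the end of this page):

```bash
NHOSTS=$(wc -l < "${PBS_NODEFILE}")        # number of allocated hosts
NGPU_PER_HOST=$(nvidia-smi -L | wc -l)     # GPUs visible on the current host
NGPUS="$((${NHOSTS}*${NGPU_PER_HOST}))"    # total GPUs across the job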
- Launch with `mpiexec` (see the sketch below):
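A sketch of such a launch, assuming the variables from the previous step and the `ds_config.json` shipped with the cifar example (the exact script arguments may differ between versions of the example):

```bash
mpiexec \
    --verbose \
    --envall \
    -n "${NGPUS}" \
    --ppn "${NGPU_PER_HOST}" \
    --hostfile="${PBS_NODEFILE}" \
    python3 cifar10_deepspeed.py \
        --deepspeed \
        --deepspeed_config ds_config.json
```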
- Create a DeepSpeed-compliant `hostfile`, specifying the `hostname` and number of GPUs (`slots`) for each of our available workers (see the sketch below):
- Create a `.deepspeed_env` file containing the environment variables our workers will need access to:
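A `hostfile` for the first step above might be built directly from `$PBS_NODEFILE`, assuming 4 GPUs per node as on Polaris:

```bash
# one "<hostname> slots=4" line per allocated node
cat "${PBS_NODEFILE}" > hostfile
sed -e 's/$/ slots=4/' -i hostfile
```

and a minimal `.deepspeed_env` might simply forward the paths and proxy settings of the current shell (which variables you actually need depends on your job):

```bash
echo "PATH=${PATH}" >> .deepspeed_env
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> .deepspeed_env
echo "http_proxy=${http_proxy}" >> .deepspeed_env
echo "https_proxy=${https_proxy}" >> .deepspeed_env
```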
Warning
The `.deepspeed_env` file expects each line to be of the form `KEY=VALUE`. Each of these will then be set as an environment variable on each available worker specified in our `hostfile`.
We can then run the `cifar10_deepspeed.py` module using DeepSpeed:
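For example, using the `deepspeed` launcher with the `hostfile` created above (flag names follow the cifar example's argument parser; adjust if your version of the example differs):

```bash
deepspeed --hostfile=hostfile cifar10_deepspeed.py \
    --deepspeed \
    --deepspeed_config ds_config.json
```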
Depending on the details of your specific job, it may be necessary to modify the provided `ds_config.json`.

If you encounter the error

`AssertionError: Micro batch size per gpu: 0 has to be greater than 0`

you can modify the `"train_batch_size": 16` variable in the provided `ds_config.json` to the (total) number of available GPUs, and explicitly set `"gradient_accumulation_steps": 1`, as shown below.
```bash
$ export NHOSTS=$(wc -l < "${PBS_NODEFILE}")
$ export NGPU_PER_HOST=$(nvidia-smi -L | wc -l)
$ export NGPUS="$((${NHOSTS}*${NGPU_PER_HOST}))"
$ echo $NHOSTS $NGPU_PER_HOST $NGPUS
24 4 96
$ # replace "train_batch_size" with $NGPUS in ds_config.json
$ # and write to `ds_config-polaris.json`
$ sed \
    "s/$(cat ds_config.json | grep batch | cut -d ':' -f 2)/ ${NGPUS},/" \
    ds_config.json \
    > ds_config-polaris.json
$ cat ds_config-polaris.json
{
    "train_batch_size": 96,
    "gradient_accumulation_steps": 1,
    ...
}
```
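You can then point the launcher at the updated file, e.g. by passing `--deepspeed_config ds_config-polaris.json` in place of the original `ds_config.json` in the launch commands above.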