
Argonne Leadership Computing Facility


The base conda environment on ThetaGPU comes with Microsoft's DeepSpeed pre-installed. Instructions for using / cloning the base environment can be found here.

We describe below the steps needed to get started with DeepSpeed on ThetaGPU.

We focus on the cifar example provided in the DeepSpeedExamples repository, though this approach should be generally applicable for running any model with DeepSpeed support.

Running DeepSpeed on ThetaGPU


The instructions below should be run directly from a compute node. To request an interactive job (from thetalogin):

qsub-gpu -A <project> -n 2 -t 01:00 -q full-node \
    --attrs="filesystems=home,grand,eagle,theta-fs0:ssds=required" \
    -I

Refer to GPU Node Queue and Policy.

  1. Load conda module and activate base environment:

    module load conda ; conda activate base

  2. Clone microsoft/DeepSpeedExamples and navigate into the directory:

    git clone https://github.com/microsoft/DeepSpeedExamples.git
    cd DeepSpeedExamples/cifar

  3. Our newer conda environments should come with DeepSpeed pre-installed, but if your environment has no deepspeed, it can be installed with pip (see note 2 at the end of this page):

    $ which deepspeed
    deepspeed not found
    $ python3 -m pip install --upgrade pip setuptools wheel
    $ DS_BUILD_OPS=1 python3 -m pip install deepspeed
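
The check-then-install step above can be wrapped in a small guard so that pip only runs when the deepspeed command is actually missing. This is a minimal sketch rather than part of the ALCF setup: the function names have_command and ensure_deepspeed are hypothetical, and it assumes the PyPI package name deepspeed.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: install DeepSpeed with pip only when it is missing.
# DS_BUILD_OPS=1 asks DeepSpeed to pre-build its C++/CUDA ops at install time.
ensure_deepspeed() {
    if have_command deepspeed; then
        echo "deepspeed found at: $(command -v deepspeed)"
    else
        python3 -m pip install --upgrade pip setuptools wheel
        DS_BUILD_OPS=1 python3 -m pip install deepspeed
    fi
}

# Usage (after activating the conda environment):
#   ensure_deepspeed
```

Keeping the install behind a check avoids rebuilding DeepSpeed in an environment that already provides it.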

Launching DeepSpeed

  1. Get total number of available GPUs:

    1. Count the number of lines in $COBALT_NODEFILE (one host per line)
    2. Count the number of GPUs available on the current host
    3. Multiply the two to get the total: NGPUS = NHOSTS * NGPU_PER_HOST

      NHOSTS=$(wc -l < "${COBALT_NODEFILE}")
      NGPU_PER_HOST=$(nvidia-smi -L | wc -l)
      NGPUS=$((${NHOSTS}*${NGPU_PER_HOST}))
  2. Launch with mpirun (see note 1 at the end of this page):

    mpirun \
        -n "${NGPUS}" \
        -npernode "${NGPU_PER_HOST}" \
        --hostfile "${COBALT_NODEFILE}" \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x http_proxy \
        -x https_proxy \
        python3 \
            <training_script>.py \
            --deepspeed_config ds_config-1.json
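
The GPU count from step 1 can also be computed by a small function that fails early when the node file is missing or empty, rather than passing a zero count on to mpirun. A minimal sketch, assuming one hostname per line in the node file; the function name count_total_gpus is hypothetical, and the optional second argument exists only so the per-host GPU count can be supplied on machines without nvidia-smi.

```shell
# Hypothetical helper: total GPUs = (hosts in node file) x (GPUs per host).
# Usage: count_total_gpus <nodefile> [gpus_per_host]
# The body runs in a subshell, so its variables do not leak into the caller.
count_total_gpus() (
    nodefile="$1"
    # Default to counting the GPUs listed by nvidia-smi on the current host.
    ngpu_per_host="${2:-$(nvidia-smi -L | wc -l)}"
    if [ ! -r "${nodefile}" ]; then
        echo "node file not readable: ${nodefile}" >&2
        exit 1
    fi
    # Count non-empty lines (one host per line).
    nhosts=$(awk 'NF { n++ } END { print n + 0 }' "${nodefile}")
    if [ "${nhosts}" -eq 0 ]; then
        echo "node file is empty: ${nodefile}" >&2
        exit 1
    fi
    echo $(( nhosts * ngpu_per_host ))
)

# Usage inside a job:
#   NGPUS=$(count_total_gpus "${COBALT_NODEFILE}")
```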

  1. Alternatively, to launch with the deepspeed launcher instead of mpirun, create a DeepSpeed-compliant hostfile that specifies the hostname and number of GPUs (slots) for each available worker:

    cat "${COBALT_NODEFILE}" > hostfile
    sed -e 's/$/ slots=8/' -i hostfile

  2. Create a .deepspeed_env containing the environment variables our workers will need access to:

    echo "PATH=${PATH}" >> .deepspeed_env
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> .deepspeed_env
    echo "http_proxy=${http_proxy}" >> .deepspeed_env
    echo "https_proxy=${https_proxy}" >> .deepspeed_env


The .deepspeed_env file expects each line to be of the form KEY=VALUE. Each of these will then be set as environment variables on each available worker specified in our hostfile.
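
The two files described above can be generated by small functions, which makes the steps repeatable within a job script. This is a sketch under stated assumptions, not the documented ALCF procedure: make_hostfile and write_deepspeed_env are hypothetical names, the env file is truncated on each call (so re-running does not append duplicate KEY=VALUE lines), and variables that are unset or not exported are skipped rather than written with an empty value.

```shell
# Hypothetical helper: emit "<hostname> slots=<n>" for each host in the node file.
# Usage: make_hostfile <nodefile> <slots_per_host> <output_file>
make_hostfile() (
    awk -v slots="$2" 'NF { print $1 " slots=" slots }' "$1" > "$3"
)

# Hypothetical helper: write KEY=VALUE lines for the named, exported variables.
# Usage: write_deepspeed_env <output_file> VAR [VAR ...]
write_deepspeed_env() (
    out="$1"
    shift
    : > "${out}"   # truncate so repeated calls do not accumulate entries
    for name in "$@"; do
        # printenv fails for variables that are not in the environment,
        # in which case the variable is skipped.
        if value=$(printenv "${name}"); then
            printf '%s=%s\n' "${name}" "${value}" >> "${out}"
        fi
    done
)

# Usage inside a job:
#   make_hostfile "${COBALT_NODEFILE}" "${NGPU_PER_HOST}" hostfile
#   write_deepspeed_env .deepspeed_env PATH LD_LIBRARY_PATH http_proxy https_proxy
```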

We can then run the module using DeepSpeed:

Launch with DeepSpeed
deepspeed --hostfile=hostfile <training_script>.py \
    --deepspeed \
    --deepspeed_config ds_config.json

AssertionError: Micro batch size per gpu: 0 has to be greater than 0

Depending on the details of your specific job, it may be necessary to modify the provided ds_config.json.

If you encounter an error:

thetagpu23: AssertionError: Micro batch size per gpu: 0 has to be greater than 0
you can change the "train_batch_size": 16 entry in the provided ds_config.json to the total number of available GPUs and explicitly set "gradient_accumulation_steps": 1, as shown below.
$ export NHOSTS=$(wc -l < "${COBALT_NODEFILE}")
$ export NGPU_PER_HOST=$(nvidia-smi -L | wc -l)
$ export NGPUS="$((${NHOSTS}*${NGPU_PER_HOST}))"
$ echo "${NHOSTS} ${NGPU_PER_HOST} ${NGPUS}"
2 8 16
$ # replace "train_batch_size" with $NGPUS in ds_config.json
$ # and write to `ds_config-polaris.json`
$ sed \
    "s/$(cat ds_config.json| grep batch | cut -d ':' -f 2)/ ${NGPUS},/" \
    ds_config.json \
    > ds_config-polaris.json
$ cat ds_config-polaris.json
    "train_batch_size": 16,
    "gradient_accumulation_steps": 1,

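The error comes from how the batch sizes relate: DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * (number of GPUs), so when the per-GPU value is not given it is derived by integer division and can round down to 0. The sketch below works through that arithmetic and shows a more targeted sed substitution that only touches the named key. Both function names are hypothetical, and set_json_int assumes the key already exists in the file with an integer value.

```shell
# Hypothetical helper: per-GPU micro batch implied by the other settings.
# Usage: micro_batch_per_gpu <train_batch_size> <grad_accum_steps> <ngpus>
micro_batch_per_gpu() (
    train_batch_size="$1"
    grad_accum="$2"
    ngpus="$3"
    # Shell integer division rounds down, mirroring how the value can reach 0.
    echo $(( ${train_batch_size} / (${grad_accum} * ${ngpus}) ))
)

# Hypothetical helper: replace the integer value of one quoted JSON key.
# Matching on the full quoted key avoids touching other "batch" entries.
# Usage: set_json_int <key> <value> <input_file>   (result goes to stdout)
set_json_int() (
    key="$1"
    value="$2"
    infile="$3"
    sed -E "s/(\"${key}\"[[:space:]]*:[[:space:]]*)[0-9]+/\1${value}/" "${infile}"
)

# Usage inside a job:
#   micro_batch_per_gpu 16 1 "${NGPUS}"
#   set_json_int train_batch_size "${NGPUS}" ds_config.json > ds_config-polaris.json
```

With 16 GPUs, a train_batch_size of 16 and gradient_accumulation_steps of 1 give a micro batch of 1 per GPU; the same batch size spread over 32 GPUs gives 0, which is the condition the assertion rejects.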
  1. The flag -x ENVIRONMENT_VARIABLE ensures the $ENVIRONMENT_VARIABLE will be set in the launched processes. 

  2. Additional details for installing DeepSpeed can be found in the DeepSpeed docs: Installation Details