
Argonne Leadership Computing Facility

Instructions for gpt-neox:

We include below a set of instructions to get EleutherAI/gpt-neox running on Polaris.

A batch submission script for the following example is available here.


The instructions below should be run directly from a compute node.

For example, to request an interactive job (from a Polaris login node):

$ qsub -I -A <project> -q debug-scaling -l select=2 -l walltime=01:00:00

Refer to job scheduling and execution for additional information.

  1. Load and activate the base conda environment:

    module load conda
    conda activate base

  2. We've installed the requirements for running gpt-neox into a virtual environment. To activate this environment,

    source /soft/datascience/venvs/polaris/2022-09-08/bin/activate

  3. Clone the EleutherAI/gpt-neox repository if it doesn't already exist:

    git clone https://github.com/EleutherAI/gpt-neox

  4. Navigate into the gpt-neox directory:

    cd gpt-neox


    The remaining instructions assume you're inside the gpt-neox directory.

  5. Create a DeepSpeed-compliant hostfile (each line has the form "hostname slots=N"):

    cat $PBS_NODEFILE > hostfile
    sed -e 's/$/ slots=4/' -i hostfile
    export DLTS_HOSTFILE=hostfile 
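As a sanity check, the sed rule above can be exercised on a stand-in node list (the node names below are hypothetical, and nodefile.example / hostfile.example are throwaway names, not files the instructions require):

```shell
# Simulate $PBS_NODEFILE with two hypothetical Polaris node names
printf 'x3006c0s13b0n0\nx3006c0s13b1n0\n' > nodefile.example

# Apply the same transformation as step 5: append " slots=4" to each line
sed -e 's/$/ slots=4/' nodefile.example > hostfile.example

cat hostfile.example
# each line now ends in "slots=4", one line per node
```

Each node on Polaris has 4 GPUs, which is why slots=4 is used.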

  6. Create a .deepspeed_env file to ensure a consistent environment across all workers:

    echo "PATH=${PATH}" > .deepspeed_env
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> .deepspeed_env
    echo "http_proxy=${http_proxy}" >> .deepspeed_env
    echo "https_proxy=${https_proxy}" >> .deepspeed_env

  7. Prepare data:

    python3 prepare_data.py -d ./data

  8. Train:

    python3 ./deepy.py train.py -d configs small.yml local_setup.yml


If your training seems to be getting stuck at

Using /home/user/.cache/torch_extensions as PyTorch extensions root...

there may be a leftover .lock file from an aborted build. Removing either the entire torch extensions cache or the affected extension's sub-directory forces a clean rebuild on the next attempt.
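For example, the stale locks can be removed selectively rather than deleting the whole cache. TORCH_EXTENSIONS_DIR is the environment variable PyTorch honors for overriding the extensions root; the default path below matches the message above:

```shell
# Delete leftover .lock files under the torch extensions root
ext_root="${TORCH_EXTENSIONS_DIR:-$HOME/.cache/torch_extensions}"
mkdir -p "$ext_root"   # avoid a find error if the directory was already removed
find "$ext_root" -name '*.lock' -print -delete
```

This keeps previously compiled extensions in place, so only the interrupted build is redone.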