Instructions for `gpt-neox`:

We include below a set of instructions to get EleutherAI/gpt-neox running on Polaris.

A batch submission script for the following example is available here.

Warning

The instructions below should be run directly from a compute node.

Explicitly, to request an interactive job (from polaris-login):

```
qsub -I -A <project> -q debug-scaling -l select=2 -l walltime=01:00:00
```

Refer to job scheduling and execution for additional information.

Load and activate the base conda environment:
```
module load conda
conda activate base
```
We've installed the requirements for running gpt-neox into a virtual environment. To activate this environment:
```
source /soft/datascience/venvs/polaris/2022-09-08/bin/activate
```
Clone the EleutherAI/gpt-neox repository if it doesn't already exist:
```
git clone https://github.com/EleutherAI/gpt-neox
```
Navigate into the gpt-neox directory:
```
cd gpt-neox
```
Note

The remaining instructions assume you're inside the gpt-neox directory.

Create a DeepSpeed-compliant hostfile (each line is formatted as `hostname slots=N`):

```
cat $PBS_NODEFILE > hostfile
sed -e 's/$/ slots=4/' -i hostfile
export DLTS_HOSTFILE=hostfile
```
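
To make the hostfile format concrete, here is a minimal sketch that applies the same `sed` transformation to a stand-in node list and checks the result. The helper name `make_hostfile` and the hostnames `node-a` and `node-b` are hypothetical and used only for illustration; on Polaris you would pass `$PBS_NODEFILE` instead.

```shell
# Sketch only: make_hostfile is a hypothetical helper, not part of
# gpt-neox or DeepSpeed.
make_hostfile() {
    nodefile="$1"   # one hostname per line, e.g. "$PBS_NODEFILE"
    slots="$2"      # number of GPUs to use per node
    # Append " slots=N" to the end of every hostname line.
    sed -e "s/\$/ slots=${slots}/" "$nodefile"
}

# Stand-in node list; the hostnames are made up for the demonstration.
demo_nodefile="$(mktemp)"
demo_hostfile="$(mktemp)"
printf '%s\n' node-a node-b > "$demo_nodefile"

make_hostfile "$demo_nodefile" 4 > "$demo_hostfile"
cat "$demo_hostfile"

# Sanity check: every line should look like "<hostname> slots=<number>".
if grep -Evq '^[^ ]+ slots=[0-9]+$' "$demo_hostfile"; then
    echo "malformed hostfile line found" >&2
fi
```

Keeping the transformation in a function makes it easy to inspect the generated file before exporting `DLTS_HOSTFILE`.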

Create a `.deepspeed_env` file to ensure a consistent environment across all workers:
```
# The redirection must sit outside the quotes, otherwise echo just prints it.
echo "PATH=${PATH}" > .deepspeed_env
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> .deepspeed_env
echo "http_proxy=${http_proxy}" >> .deepspeed_env
echo "https_proxy=${https_proxy}" >> .deepspeed_env
```
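
The same file can be generated with a small loop, which avoids repeating the redirection on every line. This is a sketch under assumptions: `write_env_file` is a hypothetical helper, and the demonstration writes to a temporary file rather than `.deepspeed_env` so it can be run safely anywhere.

```shell
# Sketch only: write_env_file is a hypothetical helper, not a DeepSpeed tool.
write_env_file() {
    out="$1"
    shift
    : > "$out"                       # create or truncate the output file
    for name in "$@"; do
        # Look up the current value of the variable called $name
        # (empty if it is unset).
        eval "value=\${$name-}"
        printf '%s=%s\n' "$name" "$value" >> "$out"
    done
}

# Demonstration: write to a temporary file and show the result.
demo_env="$(mktemp)"
write_env_file "$demo_env" PATH LD_LIBRARY_PATH http_proxy https_proxy
cat "$demo_env"
```

To produce the real file, the call would be `write_env_file .deepspeed_env PATH LD_LIBRARY_PATH http_proxy https_proxy` from inside the gpt-neox directory.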

Prepare data:
```
python3 prepare_data.py -d ./data
```

Train:

```
python3 ./deepy.py train.py -d configs small.yml local_setup.yml
```

Danger

If your training seems to be getting stuck at

```
Using /home/user/.cache/torch_extensions as PyTorch extensions root...
```

there may be a leftover lock file from an aborted build. Removing either the whole `~/.cache` directory or just its `torch_extensions` sub-directory should force a clean build on the next attempt.
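
As a narrower alternative to deleting the whole directory, the sketch below removes only the lock files. It rests on assumptions: `clear_extension_locks` is a hypothetical helper, the default path is taken from the log message above (or from `TORCH_EXTENSIONS_DIR` if it is set), and lock files are assumed to be named `lock` or to end in `.lock`. The demonstration runs against a throwaway directory with a made-up extension name, not your real cache.

```shell
# Sketch only: clear_extension_locks is a hypothetical helper.
clear_extension_locks() {
    # Use the given directory, else TORCH_EXTENSIONS_DIR, else the default
    # location shown in the log message.
    root="${1:-${TORCH_EXTENSIONS_DIR:-$HOME/.cache/torch_extensions}}"
    [ -d "$root" ] || return 0
    # Print each lock file that is found, then delete it.
    find "$root" -type f \( -name 'lock' -o -name '*.lock' \) \
        -print -exec rm -f {} +
}

# Demonstration on a temporary directory ("example_ext" is made up).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/example_ext"
touch "$demo_root/example_ext/lock" "$demo_root/example_ext/build.ninja"

clear_extension_locks "$demo_root"
```

If training still hangs after the lock files are gone, removing the whole extensions directory as described above is the more thorough option.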
