Running GPT-2 on Multiple Nodes
This example runs GPT-2 with 1.5B parameters on two (2) nodes. Each node has eight (8) RDUs, for a total of sixteen (16) RDUs.
Create a Directory
Establish Script
Using your favorite editor, create the file ''.
Copy the contents of
Make the script executable:
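A minimal sketch of this step, assuming the standard chmod utility. The SCRIPT variable below points at a temporary stand-in file created only so the example runs on its own; it is a placeholder, so substitute the name of the script you created above.

```shell
#!/usr/bin/env bash
# Stand-in for the script created above (hypothetical placeholder file).
SCRIPT="$(mktemp)"
printf '#!/bin/bash\necho "hello"\n' > "${SCRIPT}"

# Add execute permission.
chmod +x "${SCRIPT}"

# The mode string in the listing should now include "x".
ls -l "${SCRIPT}"
```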
Multiple Nodes
The script contains the sbatch command:
/usr/local/bin/sbatch --output=${HOME}/slurm-%A.out --ntasks 32 --gres=rdu:1 --ntasks-per-node 16 --nodes 2 --cpus-per-task=8 /data/ANL/scripts/ ${1} >> ${OUTPUT_PATH} 2>&1
The sbatch --nodes argument specifies the number of nodes to use.
--nodes 2: Run the job on two nodes.
The other sbatch arguments are:
--ntasks 32: The total number of tasks in the job.
--ntasks-per-node 16: The number of tasks to run on each node.
--gres=rdu:1: Indicates the model fits on a single RDU.
--cpus-per-task=8: The number of CPUs allocated to each task.
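The relationship between these values can be checked with a little arithmetic. The sketch below copies the numbers from the sbatch command and derives the task and CPU totals; the variable names are illustrative and are not part of the script.

```shell
#!/usr/bin/env bash
# Values copied from the sbatch command above.
nodes=2
ntasks=32
ntasks_per_node=16
cpus_per_task=8

# --ntasks should equal --nodes multiplied by --ntasks-per-node.
expected_ntasks=$(( nodes * ntasks_per_node ))

# CPUs requested on each node and across the whole job.
cpus_per_node=$(( ntasks_per_node * cpus_per_task ))
total_cpus=$(( ntasks * cpus_per_task ))

echo "expected ntasks: ${expected_ntasks}"   # 32
echo "CPUs per node:   ${cpus_per_node}"     # 128
echo "total CPUs:      ${total_cpus}"        # 256
```

If you change --nodes, adjust --ntasks to match so that the product with --ntasks-per-node stays consistent.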
The script accepts an optional first parameter to specify the log directory.
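One common way for a shell script to handle an optional first parameter is default-value expansion. The sketch below is an assumption about how such handling could look, not the script's actual code: the function name choose_logdir and the timestamp default are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the log directory from an optional first argument.
choose_logdir() {
    local default_dir
    # Assumed default (a timestamp); the real script may use something else.
    default_dir="$(date +%Y%m%d.%H%M)"
    # ${1:-default} expands to the default when $1 is unset or empty.
    echo "${1:-${default_dir}}"
}

choose_logdir "my_log_dir"   # prints: my_log_dir
choose_logdir                # prints the timestamp default
```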
Run the script:
The output can be found at /data/ANL/results/$(hostname)/${USER}/${LOGDIR}/${MODEL_NAME}.out. The actual path will be displayed on the screen.
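To see how the path template expands, the sketch below assembles it from its parts. The LOGDIR and MODEL_NAME values are hypothetical placeholders; the real script sets them itself.

```shell
#!/usr/bin/env bash
# Placeholder values for illustration only.
LOGDIR="my_log_dir"      # hypothetical
MODEL_NAME="my_model"    # hypothetical

# Same template as in the text above.
OUTPUT_PATH="/data/ANL/results/$(hostname)/${USER}/${LOGDIR}/${MODEL_NAME}.out"
echo "${OUTPUT_PATH}"

# To follow the log while the job runs:
# tail -f "${OUTPUT_PATH}"
```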