Megatron-DeepSpeed
Below are instructions for launching distributed training with Microsoft's Megatron-DeepSpeed, along with a brief overview of the parallelism strategies and optimizations it supports.
Note: We maintain a forked version at `argonne-lcf/Megatron-DeepSpeed` that has some helper scripts for launching and setting various training options.
Setup
- Load `conda` and activate the base environment (see the setup sketch after the launch example below)
- Clone `argonne-lcf/Megatron-DeepSpeed` and navigate into it
- Make a virtual environment (on top of the base conda environment)
- Install the missing dependency
- Launch training:
```bash
# ---- launch training -----------------------
# - MODEL_SIZE_KEY: defined in ALCF/model.sh
# - other args: defined in ALCF/args.sh
# ---------------------------------------------
MODEL_SIZE_KEY="GPT25B" \
    SEQ_LEN=4096 \
    USE_FLASH_ATTN_V2=1 \
    MICRO_BATCH=1 \
    GAS=1 \
    SP_TYPE="megatron" \
    ZERO_STAGE=1 \
    ./ALCF/train-gpt3.sh
```
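The commands for the first four steps are not shown in this excerpt. The following is a minimal sketch of that setup sequence, assuming an ALCF-style `conda` module and a Python `venv` layered on top of the base environment; the unnamed missing dependency is left as a placeholder:

```bash
# Minimal setup sketch (assumptions: an ALCF-style `conda` module is available
# and the virtual environment lives inside the repository; adjust as needed).
module load conda
conda activate base

git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
cd Megatron-DeepSpeed

# Virtual environment on top of the base conda environment, reusing its packages.
python3 -m venv venv --system-site-packages
source venv/bin/activate

# Placeholder: the missing dependency is not named in this excerpt; install
# whichever package your first run reports as missing.
python3 -m pip install "<missing-dependency>"
```

With this environment active, the `./ALCF/train-gpt3.sh` invocation shown above can be run from the repository root.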
Helper Scripts
- `ALCF/train-gpt3.sh`: Main entry point for training. This script automatically sources the rest of the required `ALCF/*.sh` scripts below.
- `ALCF/model.sh`: Contains some example model architectures for GPT3-style models.
- `ALCF/args.sh`: Logic for parsing and setting up runtime options for Megatron and DeepSpeed.
- `ALCF/setup.sh`: Locates and activates the virtual environment to be used and ensures MPI variables are set properly.
- `ALCF/launch.sh`: Identifies available resources and builds the command to be run, i.e. figures out how many {nodes, GPUs per node, total GPUs} to pass to `mpi{run,exec}`, then uses this to build `mpiexec <mpiexec-args> python3 pretrain_gpt.py`.
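For a sense of what `ALCF/launch.sh` assembles, the sketch below shows one way a PBS-style job could count resources and build that `mpiexec` command; the variable names and flags here are illustrative assumptions, not the script's actual contents:

```bash
# Hypothetical sketch of building the launch command on a PBS-managed cluster.
NHOSTS=$(wc -l < "${PBS_NODEFILE}")        # nodes allocated to this job
NGPU_PER_HOST=$(nvidia-smi -L | wc -l)     # GPUs on each node
NGPUS=$(( NHOSTS * NGPU_PER_HOST ))        # total GPUs == total MPI ranks

# One rank per GPU; exact mpiexec flags depend on the system's MPI launcher.
mpiexec -n "${NGPUS}" --ppn "${NGPU_PER_HOST}" --hostfile "${PBS_NODEFILE}" \
    python3 pretrain_gpt.py "$@"
```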