Megatron-DeepSpeed
Below are instructions for launching distributed training with Microsoft's Megatron-DeepSpeed, along with a brief overview of the parallelism strategies and optimizations it supports.
Note: We maintain a forked version at `argonne-lcf/Megatron-DeepSpeed` that has some helper scripts for launching and setting various training options.
Setup
- Load `conda` and activate the base environment (see the setup sketch after the launch example below)
- Clone `argonne-lcf/Megatron-DeepSpeed` and navigate into it
- Make a virtual environment (on top of the base conda environment)
- Install the missing dependency
- Launch training:
```bash
# ---- launch training -----------------------
# - MODEL_SIZE_KEY: defined in ALCF/model.sh
# - other args: defined in ALCF/args.sh
# ---------------------------------------------
MODEL_SIZE_KEY="GPT25B" \
    SEQ_LEN=4096 \
    USE_FLASH_ATTN_V2=1 \
    MICRO_BATCH=1 \
    GAS=1 \
    SP_TYPE="megatron" \
    ZERO_STAGE=1 \
    ./ALCF/train-gpt3.sh
```
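The commands for the first four steps are not shown in this excerpt. The following is a minimal sketch of that setup sequence, assuming an ALCF-style `conda` module and a Python `venv` layered on top of the base environment; the unnamed missing dependency is left as a placeholder:

```bash
# Minimal setup sketch (assumptions: an ALCF-style `conda` module is available
# and the virtual environment lives inside the repository; adjust as needed).
module load conda
conda activate base

git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
cd Megatron-DeepSpeed

# Virtual environment on top of the base conda environment, reusing its packages.
python3 -m venv venv --system-site-packages
source venv/bin/activate

# Placeholder: the missing dependency is not named in this excerpt; install
# whichever package your first run reports as missing.
python3 -m pip install "<missing-dependency>"
```

With this environment active, the `./ALCF/train-gpt3.sh` invocation shown above can be run from the repository root.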
Helper Scripts
- `ALCF/train-gpt3.sh`: Main entry point for training. This script automatically sources the rest of the required `ALCF/*.sh` scripts below.
- `ALCF/model.sh`: Contains some example model architectures for GPT3-style models.
- `ALCF/args.sh`: Logic for parsing and setting up runtime options for Megatron and DeepSpeed.
- `ALCF/setup.sh`: Locates and activates the virtual environment to be used and ensures MPI variables are set properly.
- `ALCF/launch.sh`: Identifies available resources and builds the command to be run, i.e. figures out how many {nodes, GPUs per node, total GPUs} to pass to `mpi{run,exec}`, then uses this to build `mpiexec <mpiexec-args> python3 pretrain_gpt.py`.
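For a sense of what `ALCF/launch.sh` assembles, the sketch below shows one way a PBS-style job could count resources and build that `mpiexec` command; the variable names and flags here are illustrative assumptions, not the script's actual contents:

```bash
# Hypothetical sketch of building the launch command on a PBS-managed cluster.
NHOSTS=$(wc -l < "${PBS_NODEFILE}")        # nodes allocated to this job
NGPU_PER_HOST=$(nvidia-smi -L | wc -l)     # GPUs on each node
NGPUS=$(( NHOSTS * NGPU_PER_HOST ))        # total GPUs == total MPI ranks

# One rank per GPU; exact mpiexec flags depend on the system's MPI launcher.
mpiexec -n "${NGPUS}" --ppn "${NGPU_PER_HOST}" --hostfile "${PBS_NODEFILE}" \
    python3 pretrain_gpt.py "$@"
```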