This page describes how to launch distributed training with Microsoft's Megatron-DeepSpeed and briefly covers the parallelism strategies and optimizations it supports.


We maintain a fork at argonne-lcf/Megatron-DeepSpeed that adds helper scripts for launching training and setting training options.


  1. Load conda and activate base environment:

    # load conda + activate base env
    module load conda/2023-10-04 ; conda activate base
  2. Clone argonne-lcf/Megatron-DeepSpeed and navigate into it:

    # clone + navigate into Megatron-DeepSpeed repo
    git clone
    cd Megatron-DeepSpeed
  3. Make virtual environment (on top of base conda):

    # make virtual environment (on top of base conda)
    mkdir -p venvs/polaris/2023-10-04
    python3 -m venv venvs/polaris/2023-10-04 --system-site-packages
    source venvs/polaris/2023-10-04/bin/activate
  4. Install missing dependency:

    # install missing dependency
    python3 -m pip install "git+"
  5. Launch training:

    # ---- launch training -----------------------
    # - MODEL_SIZE_KEY: defined in ALCF/
    # - other args: defined in ALCF/
    # ---------------------------------------------
        SEQ_LEN=4096 \
        USE_FLASH_ATTN_V2=1 \
        MICRO_BATCH=1 \
        GAS=1 \
        SP_TYPE="megatron" \
        ZERO_STAGE=1 \
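
The batch-related settings above combine into an effective global batch size. The following is a minimal sketch, assuming that `MICRO_BATCH` is the per-GPU micro-batch size and `GAS` is the number of gradient accumulation steps (neither is defined on this page). The function name `global_batch_size` and the data-parallel rank count are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: global_batch_size is an illustrative helper, not part of the repo.
# Assumed meanings: MICRO_BATCH = per-GPU micro-batch size,
#                   GAS         = gradient accumulation steps.

# Print micro_batch * gas * dp_ranks, where dp_ranks is the number of
# data-parallel ranks (an assumed input, not something set above).
global_batch_size() {
    local micro_batch="$1"
    local gas="$2"
    local dp_ranks="$3"
    echo $(( micro_batch * gas * dp_ranks ))
}
```

For example, `global_batch_size 1 1 8` prints `8`: with `MICRO_BATCH=1` and `GAS=1`, eight data-parallel ranks would process eight samples per optimizer step.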

Helper Scripts


Main entry point for training. This script automatically sources the other required ALCF/*.sh scripts listed below.


Contains example model architectures for GPT3-style models.


Logic for parsing / setting up runtime options for Megatron and DeepSpeed.


Locates and activates the virtual environment to be used, and ensures that the MPI variables are set properly.
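
The "locate and activate" step might look roughly like the sketch below. This is not the helper script's actual code: the function name `activate_venv` is hypothetical, and the MPI variable checks are left out because this page does not say which variables are involved.

```shell
#!/usr/bin/env bash
# Sketch only: activate_venv is an illustrative name, not the repo's function.

# Activate the virtual environment rooted at the given directory.
# Returns non-zero (with a message on stderr) if no activate script is found.
activate_venv() {
    local venv_dir="$1"
    if [ -f "${venv_dir}/bin/activate" ]; then
        # Source the activate script into the current shell.
        . "${venv_dir}/bin/activate"
        echo "Activated ${venv_dir}"
    else
        echo "No virtual environment found at ${venv_dir}" >&2
        return 1
    fi
}
```

With the environment created in step 3, this would be called as `activate_venv venvs/polaris/2023-10-04` from the repository root.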


Identifies the available resources and builds the command to be run: it works out the number of nodes, GPUs per node, and total GPUs to pass to mpi{run,exec}, then uses these to build mpiexec <mpiexec-args> python3
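
The resource counting described above can be sketched as follows. The function names `count_hosts` and `build_launch_cmd` are hypothetical, and the sketch assumes a hostfile with one hostname per line (such as the one PBS provides in `$PBS_NODEFILE`) and an `mpiexec` that accepts `-n`, `--ppn`, and `--hostfile`. The real helper script may detect GPUs and assemble its arguments differently.

```shell
#!/usr/bin/env bash
# Sketch only: function names and argument layout are illustrative assumptions.

# Count the unique, non-empty hostnames in a hostfile (one hostname per line).
count_hosts() {
    sort -u "$1" | grep -c .
}

# Print an mpiexec command prefix for the given hostfile and GPUs per node.
#   $1: path to the hostfile
#   $2: number of GPUs per node
build_launch_cmd() {
    local hostfile="$1"
    local ngpu_per_host="$2"
    local nhosts
    local ngpus
    nhosts=$(count_hosts "${hostfile}")
    # Total GPUs = nodes * GPUs per node; one rank is launched per GPU.
    ngpus=$(( nhosts * ngpu_per_host ))
    echo "mpiexec -n ${ngpus} --ppn ${ngpu_per_host} --hostfile ${hostfile} python3"
}
```

On a two-node allocation with four GPUs per node, `build_launch_cmd "${PBS_NODEFILE}" 4` would print a command starting with `mpiexec -n 8 --ppn 4`. The training script and its arguments would then be appended after `python3`.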