# Megatron-DeepSpeed
Megatron-DeepSpeed is a scalable, highly performant library for training large language models on any GPU.

In particular, it retains the core 4D parallelism functionality of the [NVIDIA/Megatron-LM](https://github.com/NVIDIA/Megatron-LM) library, while leveraging the [microsoft/DeepSpeed](https://github.com/microsoft/DeepSpeed) library for efficient scaling and 🍋 [saforem2/ezpz](https://github.com/saforem2/ezpz) for automated device + backend selection.
## Getting Started
- Clone the [argonne-lcf/Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed) repository
- Set up your environment
- Install dependencies (a sketch of these first three steps follows at the end of this section)
- Launch training:

  ```bash
  # Before launching, `PBS_O_WORKDIR` should be set to Megatron-DeepSpeed's PATH
  # and the venv inside Megatron-DeepSpeed/venv should be activated.
  TP=2 NLAYERS=10 DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_aGPT_7B.sh
  ```
This will launch a distributed pre-training run with:

- `NLAYERS=10`: Llama-style model consisting of 10 layers
- `TP=2`: split across 2 tensor-parallel groups
- `DATA_FILE_LIST`: using the Books corpus of the Dolma dataset
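For reference, a minimal sketch of the first three steps on a generic Linux system is shown below. The repository URL and the `venv` location follow the comment in the launch command above; the dependency-install line assumes a `requirements.txt` at the repository root, so consult the repository (and `ALCF/helpers.sh`) for the exact setup on ALCF systems.

```bash
# Sketch only: clone, create/activate a venv, and install dependencies.
git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
cd Megatron-DeepSpeed

# Create and activate the virtual environment at Megatron-DeepSpeed/venv,
# matching the location referenced in the launch command above.
python3 -m venv venv
source venv/bin/activate

# Assumption: dependencies are listed in a requirements.txt at the repo root.
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt

# PBS_O_WORKDIR is normally set by the PBS scheduler; point it at this
# checkout if you are launching interactively.
export PBS_O_WORKDIR="$(pwd)"
```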
## Overridable Options
This is a small subset of the overridable options. The full list (as well as their default values) can be found in [ALCF/helpers.sh](ALCF/helpers.sh).
- `DTYPE`: Data type
- `DATA_FILE_LIST`: Data file list
- `FFN_HIDDEN_SIZE`: Feedforward Neural Network projection size
- `GRAD_ACC_STEPS`: Gradient accumulation steps
- `HEADS`: Number of attention heads
- `HIDDEN`: Hidden size
- `MICRO_BATCH`: Micro batch size
- `NO_FLASH_ATTN`: No Flash Attention
- `NLAYERS`: Number of layers
- `NUM_KV_HEAD`: Number of key-value heads
- `OPT`: Optimizer, one of:
  - `adam`
  - `adam8bit`
  - `adamw`
  - `adamwschedulefree`
  - `apex.adam`
  - `apex.sgd`
  - `ds.fusedlamb`
  - `ds.onebitlamb`
  - `galoreadamw`
  - `galoreadamw8bit`
  - `galoreadamw8bitperlayer`
  - `ipex.fusedlamb`
  - `ipex.lamb`
  - `shampoo`
  - `sgd`
  - `sgdschedulefree`
  - `sophiag`
- `PP`: Pipeline parallelism degree
- `SEQ`: Sequence length
- `SP`: Sequence parallelism (Ulysses) degree
- `TP`: Tensor parallelism degree
- `TRAIN_TOKENS`: Number of training tokens
- `TRAIN_ITERS`: Number of training iterations
- `USE_ACTIVATION_CHECKPOINTING`: Use activation checkpointing
- `WEIGHT_DECAY`: Weight decay
- `ZERO_STAGE`: ZeRO stage
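As an illustration, several of these options can be combined directly on the command line when launching `train_aGPT_7B.sh`. The variable names below are taken from the list above; the values are arbitrary examples (not recommended settings), and `DTYPE=bf16` in particular is an assumed value — check `ALCF/helpers.sh` for accepted values and defaults.

```bash
# Illustrative override of several options for a single run; values are
# examples only and should be tuned for your model size and system.
DTYPE=bf16 \
NLAYERS=32 \
HIDDEN=4096 \
FFN_HIDDEN_SIZE=11008 \
HEADS=32 \
NUM_KV_HEAD=8 \
SEQ=4096 \
MICRO_BATCH=1 \
GRAD_ACC_STEPS=8 \
TP=2 \
PP=1 \
ZERO_STAGE=1 \
OPT=adamw \
WEIGHT_DECAY=0.1 \
DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt \
bash train_aGPT_7B.sh
```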