Example Programs

Graphcore provides examples of some well-known AI applications in their repository at https://github.com/graphcore/examples.git. Clone the examples repository to your personal directory structure:

mkdir ~/graphcore
cd ~/graphcore
git clone https://github.com/graphcore/examples.git

MNIST - PopTorch

Activate PopTorch Environment

source ~/venvs/graphcore/poptorch31_env/bin/activate

Install Requirements

Change directory:

cd ~/graphcore/examples/tutorials/simple_applications/pytorch/mnist
pip install torchvision==0.14.0

Run MNIST

Execute the command:

/opt/slurm/bin/srun --ipus=1 python mnist_poptorch.py

Output

The expected output begins with the dataset download, followed by:

TrainingModelWithLoss(
  (model): Network(
    (layer1): Block(
      (conv): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (relu): ReLU()
    )
    (layer2): Block(
      (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (relu): ReLU()
    )
    (layer3): Linear(in_features=1600, out_features=128, bias=True)
    (layer3_act): ReLU()
    (layer3_dropout): Dropout(p=0.5, inplace=False)
    (layer4): Linear(in_features=128, out_features=10, bias=True)
    (softmax): Softmax(dim=1)
  )
  (loss): CrossEntropyLoss()
)
Epochs:   0%|
...
Graph compilation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:16<00:00]
Accuracy on test set: 98.04%
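For reference, mnist_poptorch.py follows the standard PopTorch training pattern: the loss is computed inside the model's forward pass and the module is compiled for the IPU with poptorch.trainingModel. The sketch below only illustrates that pattern and is not the tutorial code itself; the network, optimizer, and batch size are assumptions.

import torch
import torchvision
import poptorch

# Minimal stand-in for the tutorial's Network/Block classes.
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(28 * 28, 10)
        self.loss = torch.nn.CrossEntropyLoss()

    def forward(self, x, labels=None):
        out = self.fc(x.view(x.size(0), -1))
        if self.training:
            # PopTorch picks up the output of torch.nn loss modules as the loss to optimize.
            return out, self.loss(out, labels)
        return out

opts = poptorch.Options()  # defaults: 1 IPU, 1 device iteration
model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

train_data = torchvision.datasets.MNIST(
    "mnist_data", train=True, download=True,
    transform=torchvision.transforms.ToTensor())
# poptorch.DataLoader batches the data according to the IPU options.
train_loader = poptorch.DataLoader(opts, train_data, batch_size=8, shuffle=True)

# Compiles the model for the IPU on the first call, then runs training steps on it.
training_model = poptorch.trainingModel(model, opts, optimizer=optimizer)
for data, labels in train_loader:
    output, loss = training_model(data, labels)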

MNIST - TensorFlow2

Activate TensorFlow2 Environment

Create a TensorFlow2 environment as explained in the tensorflow-2-environment-setup section, then activate it:

source ~/venvs/graphcore/tensorflow2_31_env/bin/activate

Install Requirements

Change directory:

cd ~/graphcore/examples/tutorials/simple_applications/tensorflow2/mnist/

Run MNIST - TensorFlow

Execute the command:

/opt/slurm/bin/srun --ipus=1 python mnist.py

Output

The expected output begins with the dataset download, followed by:

2023-04-26 14:42:32.179566: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:43] Poplar version: 3.1.0 (e12d5f9f01) Poplar package: 9c103dc348
2023-04-26 14:42:34.517107: I tensorflow/compiler/plugin/poplar/driver/poplar_executor.cc:1619] TensorFlow device /device:IPU:0 attached to 1 IPU with Poplar device ID: 0
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
11501568/11490434 [==============================] - 0s 0us/step
2023-04-26 14:42:35.673768: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2023-04-26 14:42:35.947832: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-04-26 14:42:46.953720: I tensorflow/compiler/jit/xla_compilation_cache.cc:376] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
Epoch 1/4
2000/2000 [==============================] - 13s 7ms/step - loss: 0.6238
Epoch 2/4
2000/2000 [==============================] - 0s 222us/step - loss: 0.3361
Epoch 3/4
2000/2000 [==============================] - 0s 225us/step - loss: 0.2894
Epoch 4/4
2000/2000 [==============================] - 0s 226us/step - loss: 0.2601
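The TensorFlow2 example relies on the IPU integration bundled with Graphcore's TensorFlow wheel: the IPU system is configured with an IPUConfig and the Keras model is built and trained inside an IPUStrategy scope. The sketch below only illustrates that pattern and is not the tutorial script itself; the layer sizes, batch size, and steps_per_execution value are assumptions.

import tensorflow as tf
from tensorflow.python import ipu

# Attach one IPU to this TensorFlow process.
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0
y_train = y_train.astype("int32")

train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(10000)
            .batch(32, drop_remainder=True)  # IPU graphs need a fixed batch size
            .repeat())

strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
    # Everything created in this scope is placed on the IPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="sgd",
                  steps_per_execution=125)  # run 125 steps per IPU engine invocation
    model.fit(train_ds, steps_per_epoch=1875, epochs=4)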

ResNet50

Activate PopTorch Environment

Create a fresh PopTorch environment poptorch31_resnet50_env as outlined in the virtual environment section, then activate it:

source ~/venvs/graphcore/poptorch31_resnet50_env/bin/activate

Install Requirements

Change directory:

cd ~/graphcore/examples/vision/cnns/pytorch
make install 
make install-turbojpeg
pip install torch==1.13.0

Note: With SDK 3.1.0, use torch==1.13.0 for compatibility.

Update configs.yml

Change directory:

cd ~/graphcore/examples/vision/cnns/pytorch/train

Open configs.yml with your favorite editor. In the resnet50 section, find:

use_bbox_info: true

and change it to:

use_bbox_info: false

Run ResNet50

The scripts to train a ResNet50 PyTorch model on a Pod4 are located at https://github.com/graphcore/examples/tree/master/vision/cnns/pytorch/train.

Create the cache directory and set the following environment variable:

mkdir -p ~/graphcore/tmp/pt_cache/
export PYTORCH_CACHE_DIR=~/graphcore/tmp/pt_cache/

The command to run 4 replicas (a total of 4 IPUs) of the ResNet50 model is as follows:

/opt/slurm/bin/srun --ipus=4 poprun -vv --num-instances=1 --num-replicas=4 --executable-cache-path=$PYTORCH_CACHE_DIR python3 /home/$USER/graphcore/examples/vision/cnns/pytorch/train/train.py --config resnet50-pod4 --imagenet-data-path /mnt/localdata/datasets/imagenet-raw-dataset --epoch 2 --validation-mode none --dataloader-worker 14 --dataloader-rebatch-size 256

This example trains on the ImageNet dataset.
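Inside the training script, this style of data-parallel replication and executable caching maps onto PopTorch options. The snippet below is only an illustration connecting the command-line flags above to the PopTorch API; the real train.py derives its settings from configs.yml and the poprun launcher.

import os
import poptorch

opts = poptorch.Options()
# 4 data-parallel replicas; a model that fits on 1 IPU then occupies 4 IPUs in total.
opts.replicationFactor(4)
# Cache compiled executables so later runs can skip graph compilation
# (the command above points --executable-cache-path at $PYTORCH_CACHE_DIR).
opts.enableExecutableCaching(os.environ.get("PYTORCH_CACHE_DIR", "/tmp/pt_cache"))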

Output

04:22:59.948 3905692 POPRUN [I] V-IPU server address picked up from 'vipu': 10.1.3.101:8090
04:22:59.950 3905692 POPRUN [I] Using V-IPU partition slurm_2657 as it is the only one available
04:22:59.950 3905692 POPRUN [D] Connecting to 10.1.3.101:8090
04:22:59.951 3905692 POPRUN [D] Status for partition slurm_2657: OK (error 0)
04:22:59.951 3905692 POPRUN [I] Partition slurm_2657 already exists and is in state: PS_ACTIVE
04:22:59.952 3905692 POPRUN [D] The reconfigurable partition slurm_2657 is OK
 ===========================
|      poprun topology      |
|===========================|
| hosts     | gc-poplar-02  |
|-----------|---------------|
| ILDs      |       0       |
|-----------|---------------|
| instances |       0       |
|-----------|---------------|
| replicas  | 0 | 1 | 2 | 3 |
 ---------------------------
04:22:59.952 3905692 POPRUN [D] Target options from environment: {}
04:22:59.952 3905692 POPRUN [D] Target options from V-IPU partition: {"ipuLinkDomainSize":"4","ipuLinkConfiguration":"default","ipuLinkTopology":"mesh","gatewayMode":"true","instanceSize":"4"}
04:22:59.998 3905692 POPRUN [D] Found 1 devices with 4 IPUs
04:23:00.689 3905692 POPRUN [D] Attached to device 6
04:23:00.689 3905692 POPRUN [I] Preparing parent device 6
04:23:00.689 3905692 POPRUN [D] Device 6 ipuLinkDomainSize=64, ipuLinkConfiguration=Default, ipuLinkTopology=Mesh, gatewayMode=true, instanceSize=4
[1,0]<stdout>:[INFO] Total replicas: 4
[1,0]<stdout>:[INFO] Global batch size: 16416
[1,0]<stdout>:[INFO] Number of IPUs required: 4
[1,0]<stdout>:[INFO] Loading the data
Graph compilation: 100%|██████████| 100/100 [06:26<00:00][1,0]<stderr>:WARNING: The compile time engine option debug.branchRecordTile is set to "5887" when creating the Engine. (At compile-tile it was set to 1471)
[1,0]<stderr>:2023-04-27T04:30:33.475912Z PO:ENGINE   3906481.3906481 W: WARNING: The compile time engine option debug.branchRecordTile is set to "5887" when creating the Engine. (At compile-tile it was set to 1471)
[1,0]<stderr>:2023-04-27T04:30:36.928499Z popart:session 3906481.3906481 W: Rng state buffer was not serialized.You did not load poplar Engine.Remember that if you would like to run the model using the model runtime then you have to create your own buffer and callback in your model runtime application for rngStateTensor.
[1,0]<stderr>:
Loss:6.7615 | Accuracy:0.57%:  96%|█████████▌| 75/78 [11:07<00:10,  3.62s/it][1,0[1,0]<stdout>:[INFO] Epoch 1
[1,0]<stdout>:[INFO] loss: 6.7508,
[1,0]<stdout>:[INFO] accuracy: 0.61 %
[1,0]<stdout>:[INFO] throughput: 1886.4 samples/sec
[1,0]<stdout>:[INFO] Epoch 2/2
Loss:6.7508 | Accuracy:0.61%: 100%|██████████| 78/78 [11:18<00:00,  8.70s/it][1,0]<stderr>:
Loss:6.2860 | Accuracy:2.41%:  96%|█████████▌| 75/7[1,0]<stdout>:[INFO] Epoch 2,0]<stderr>:
[1,0]<stdout>:[INFO] loss: 6.2747,
[1,0]<stdout>:[INFO] accuracy: 2.48 %
[1,0]<stdout>:[INFO] throughput: 4476.7 samples/sec
[1,0]<stdout>:[INFO] Finished training. Time: 2023-04-27 04:40:05.821555. It took: 0:16:04.818638
Loss:6.2747 | Accuracy:2.48%: 100%|██████████| 78/78 [04:46<00:00,  3.67s/it][1,0]<stderr>:

GPT-2 PyTorch - POD16 run

The scripts to train a GPT-2 PyTorch model on a POD16 are located at https://github.com/graphcore/examples/tree/master/nlp/gpt2/pytorch.

To run the GPT-2 PyTorch model, create a new PopTorch virtual environment poptorch31_gpt2 as described in the virtual environment section and activate it:

source ~/venvs/graphcore/poptorch31_gpt2/bin/activate

Install Requirements

Change directory:

cd ~/graphcore/examples/nlp/gpt2/pytorch
pip3 install -r requirements.txt

Run GPT2 on 16 IPUs

The command to run the GPT-2 model on 16 IPUs is as follows:

/opt/slurm/bin/srun --ipus=16 python /home/$USER/graphcore/examples/nlp/gpt2/pytorch/train_gpt2.py --model gpt2 --ipus-per-replica 4 --replication-factor 4 --gradient-accumulation 2048 --device-iterations 8 --batch-size 1 --layers-per-ipu 0 4 4 4 --matmul-proportion 0.15 0.15 0.15 0.15 --max-len 1024 --optimizer AdamW --learning-rate 0.00015 --lr-schedule cosine --lr-warmup 0.01 --remap-logit True --enable-sequence-serialized True --embedding-serialization-factor 4 --recompute-checkpoint-every-layer True --enable-half-partials True --replicated-tensor-sharding True --dataset 'generated' --epochs 1

This runs a GPT-2 model that fits on 4 IPUs, as indicated by --ipus-per-replica. The --replication-factor indicates how many times the model is replicated in a data-parallel manner (4 in the above example), so the total number of IPUs used in this example is 16. The --layers-per-ipu 0 4 4 4 argument pipelines the model so that the embedding and LM head sit on IPU 0 and four transformer layers sit on each of IPUs 1 through 3, matching the device allocation shown in the output below.

The effective global batch size in this example is micro-batch-size * gradient-accumulation * replication-factor = 1 x 2048 x 4 = 8192. The --device-iterations value determines the total number of samples loaded per training step: global batch size * device iterations = 8192 x 8 = 65536. To learn more about these parameters and batching on IPUs in general, refer to the IPU batching documentation.
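As a quick sanity check, these relationships can be computed directly from the flags in the command above:

# Values taken from the srun command above.
micro_batch_size      = 1      # --batch-size
gradient_accumulation = 2048   # --gradient-accumulation
replication_factor    = 4      # --replication-factor
device_iterations     = 8      # --device-iterations
ipus_per_replica      = 4      # --ipus-per-replica

global_batch_size = micro_batch_size * gradient_accumulation * replication_factor  # 8192
samples_per_step  = global_batch_size * device_iterations                          # 65536
total_ipus        = ipus_per_replica * replication_factor                          # 16
print(global_batch_size, samples_per_step, total_ipus)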

The above example runs with generated (synthetic) data. To use the same example with a real-world dataset, refer to the data setup section.

Output

Building (if necessary) and loading remap_tensor_ce.
Failed to find compiled extension; rebuilding.
Building (if necessary) and loading residual_add_inplace_pattern.
Model initializing
-------------------- Device Allocation --------------------
Embedding  --> IPU 0
Layer 0  --> IPU 1
Layer 1  --> IPU 1
Layer 2  --> IPU 1
Layer 3  --> IPU 1
Layer 4  --> IPU 2
Layer 5  --> IPU 2
Layer 6  --> IPU 2
Layer 7  --> IPU 2
Layer 8  --> IPU 3
Layer 9  --> IPU 3
Layer 10 --> IPU 3
Layer 11 --> IPU 3
LM_head --> IPU 0
Arguments: Namespace(async_dataloader=False, auto_loss_scaling=False, batch_size=1, checkpoint_input_dir='', checkpoint_output_dir=None, compile_only=False, custom_ops=True, dataset='generated', device_iterations=8, embedding_serialization_factor=4, enable_half_partials=True, enable_sequence_serialized=True, epochs=1, executable_cache_dir=None, gradient_accumulation=2048, input_files=None, ipus_per_replica=4, layers_per_ipu=[0, 4, 4, 4], learning_rate=0.00015, log_steps=1, loss_scaling=50000.0, lr_decay_steps=None, lr_schedule='cosine', lr_warmup=0.01, lr_warmup_steps=None, matmul_proportion=[0.15, 0.15, 0.15, 0.15], max_len=1024, model='gpt2', num_workers=4, optimizer='AdamW', optimizer_state_offchip=True, recompute_checkpoint_every_layer=True, recompute_checkpoint_layers=None, remap_logit=True, replicated_tensor_sharding=True, replication_factor=4, resume_training_from_checkpoint=False, save_per_epochs=1, save_per_steps=None, seed=1234, serialized_seq_len=128, stride=128, training_steps=10000, use_popdist=False, use_wandb=False, val_num=0, weight_decay=0.0)
Model config: GPT2Config {
  "activation_function": "gelu",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50272,
  "embd_pdrop": 0.1,
  "eos_token_id": 50272,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "output_past": true,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 400
    }
  },
  "transformers_version": "4.26.1",
  "use_cache": true,
  "vocab_size": 50272
}

------------------- Data Loading Started ------------------
loading training dataset and validating dataset
Samples per epoch: 262144
Steps per epoch: 4
Data loaded in 2.358953586081043 secs
-----------------------------------------------------------
--------------------- Training Started --------------------
Graph compilation:   4%|▍         | 4/100 [00:29<11:57]2023-04-27T03:39:53.291853Z PL:POPLIN    3888383.3888383 W: poplin::preplanMatMuls() is deprecated! Use poplin::preplan() instead
2023-04-27T03:39:55.159194Z PL:POPLIN    3888383.3888383 W: poplin::preplanMatMuls() is deprecated! Use poplin::preplan() instead
2023-04-27T03:39:56.958834Z PL:POPLIN    3888383.3888383 W: poplin::preplanMatMuls() is deprecated! Use poplin::preplan() instead
2023-04-27T03:39:58.748727Z PL:POPLIN    3888383.3888383 W: poplin::preplanMatMuls() is deprecated! Use poplin::preplan() instead

Graph compilation: 100%|██████████| 100/100 [28:04<00:00]WARNING: The compile time engine option debug.branchRecordTile is set to "23551" when creating the Engine. (At compile-tile it was set to 5887)
2023-04-27T04:07:29.993259Z PO:ENGINE   3888383.3888383 W: WARNING: The compile time engine option debug.branchRecordTile is set to "23551" when creating the Engine. (At compile-tile it was set to 5887)
2023-04-27T04:07:42.941039Z popart:session 3888383.3888383 W: Rng state buffer was not serialized.You did not load poplar Engine.Remember that if you would like to run the model using the model runtime then you have to create your own buffer and callback in your model runtime application for rngStateTensor.

[04:09:02.177] [poptorch::python] [warning] Ignoring unexpected optimizer attribute in ADAMW_NO_BIAS optimizer: ['step', '_step_count']
Ignoring unexpected optimizer attribute in ADAMW_NO_BIAS optimizer: ['step', '_step_count']
[04:09:02.179] [poptorch::python] [warning] Ignoring unexpected group 0 attribute in ADAMW_NO_BIAS optimizer: ['initial_lr']
Ignoring unexpected group 0 attribute in ADAMW_NO_BIAS optimizer: ['initial_lr']
[04:09:02.179] [poptorch::python] [warning] Ignoring unexpected group 1 attribute in ADAMW_NO_BIAS optimizer: ['initial_lr']
Ignoring unexpected group 1 attribute in ADAMW_NO_BIAS optimizer: ['initial_lr']
step 0 of epoch 0, loss: 10.913212776184082, acc: 2.0116567611694336e-05, lr: 0.00012803300858899104, throughput: 36.69187444207895 samples/sec
step 1 of epoch 0, loss: 10.836352348327637, acc: 1.9758939743041992e-05, lr: 7.5e-05, throughput: 1064.3232077940409 samples/sec
step 2 of epoch 0, loss: 10.83123779296875, acc: 2.0459294319152832e-05, lr: 2.1966991411008938e-05, throughput: 1064.3064018230857 samples/sec
step 3 of epoch 0, loss: 10.829036712646484, acc: 1.9878149032592773e-05, lr: 0.0, throughput: 1064.4397806661352 samples/sec

Note: The graph compilation for a large model like GPT-2 takes about half an hour.