Example Programs
Graphcore provides examples of some well-known AI applications in their repository at https://github.com/graphcore/examples.git. Clone the examples repository to your personal directory structure:
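For example, to clone it under ~/graphcore (the location assumed by the commands later in this section):
mkdir -p ~/graphcore
cd ~/graphcore
git clone https://github.com/graphcore/examples.git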
MNIST - PopTorch
Activate PopTorch Environment
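For example, assuming the PopTorch environment from the virtual environment section was created as poptorch33_env under ~/venvs/graphcore (adjust the path and name to your setup):
# Path and environment name are illustrative; use the ones from your setup
source ~/venvs/graphcore/poptorch33_env/bin/activate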
Install Requirements
Change directory:
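For example (the location of the PopTorch MNIST example inside the repository may differ between releases):
# Path within the examples repository is assumed; adjust if it differs in your checkout
cd ~/graphcore/examples/tutorials/simple_applications/pytorch/mnist
pip install -r requirements.txt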
Run MNIST
Execute the command:
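A minimal sketch, assuming a single-IPU allocation via the --ipus flag used elsewhere in this guide and the script name mnist_poptorch.py:
# Script name and IPU count are assumptions; adjust as needed
/opt/slurm/bin/srun --ipus=1 python mnist_poptorch.py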
Output
The expected output will resemble the following:
srun: job 10671 queued and waiting for resources
srun: job 10671 has been allocated resources
TrainingModelWithLoss(
(model): Network(
(layer1): Block(
(conv): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
(pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(relu): ReLU()
)
(layer2): Block(
(conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
(pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(relu): ReLU()
)
(layer3): Linear(in_features=1600, out_features=128, bias=True)
(layer3_act): ReLU()
(layer3_dropout): Dropout(p=0.5, inplace=False)
(layer4): Linear(in_features=128, out_features=10, bias=True)
(softmax): Softmax(dim=1)
)
(loss): CrossEntropyLoss()
)
Epochs: 0%| | 0/10 [00:00<?,[23:27:06.753] [poptorch:cpp] [warning] [DISPATCHER] Type coerced from Long to Int for tensor id 10
Graph compilation: 100%|██████████| 100/100 [00:00<00:00]
Epochs: 100%|██████████| 10/10 [01:17<00:00, 7.71s/it]
Graph compilation: 100%|██████████| 100/100 [00:00<00:00]
Accuracy on test set: 96.85%██████| 100/100 [00:00<00:00]
MNIST - TensorFlow2
Activate TensorFlow2 Environment
Create a TensorFlow2 environment as explained in the tensorflow-2-environment-setup section and activate it.
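For example, assuming the environment was created as tensorflow2_33_env under ~/venvs/graphcore (substitute the name and path you used):
# Environment name and path are illustrative
source ~/venvs/graphcore/tensorflow2_33_env/bin/activate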
Install Requirements
Change directory:
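For example (the location of the TensorFlow 2 MNIST example inside the repository may differ between releases):
# Path within the examples repository is assumed
cd ~/graphcore/examples/tutorials/simple_applications/tensorflow2/mnist
pip install -r requirements.txt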
Run MNIST - TensorFlow
Execute the command:
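A minimal sketch, assuming the script name mnist.py and a single-IPU allocation:
# Script name and IPU count are assumptions
/opt/slurm/bin/srun --ipus=1 python mnist.py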
Output
The expected output will resemble the following:
srun: job 10672 queued and waiting for resources
srun: job 10672 has been allocated resources
2023-08-22 23:35:02.925033: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:43] Poplar version: 3.3.0 (de1f8de2a7) Poplar package: b67b751185
2023-08-22 23:35:06.119772: I tensorflow/compiler/plugin/poplar/driver/poplar_executor.cc:1619] TensorFlow device /device:IPU:0 attached to 1 IPU with Poplar device ID: 0
2023-08-22 23:35:07.087287: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2023-08-22 23:35:07.351132: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-08-22T23:35:09.469066Z PL:POPOPS 3545299.3545299 W: createOutputForElementWiseOp 'while/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits/fusion.3/Op/Equal/Out' ({32,10}): No suitable input found, creating new variable with linear tile mapping
2023-08-22 23:35:18.532415: I tensorflow/compiler/jit/xla_compilation_cache.cc:376] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
Epoch 1/4
2000/2000 [==============================] - 13s 6ms/step - loss: 0.6220
Epoch 2/4
2000/2000 [==============================] - 1s 262us/step - loss: 0.3265
Epoch 3/4
2000/2000 [==============================] - 1s 273us/step - loss: 0.2781
Epoch 4/4
2000/2000 [==============================] - 1s 289us/step - loss: 0.2482
ResNet50
Activate PopTorch Environment
Create a fresh PopTorch environment poptorch33_resnet50_env as outlined in the virtual environment section, then activate it.
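For example, assuming the environment lives under ~/venvs/graphcore:
# Path is illustrative; use the location from your environment setup
source ~/venvs/graphcore/poptorch33_resnet50_env/bin/activate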
Install Requirements
Change directory
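For example (the requirements.txt location follows the upstream repository layout and may vary between releases):
# Path within the examples repository is assumed
cd ~/graphcore/examples/vision/cnns/pytorch
pip install -r requirements.txt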
Update configs.yml
Change directory:
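For example:
cd ~/graphcore/examples/vision/cnns/pytorch/train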
Open configs.yml with your favorite editor. Find the relevant entry in the resnet50 section and change it as needed.
Run ResNet50
The scripts to train a ResNet50 PyTorch model on Pod4 are located at https://github.com/graphcore/examples/tree/master/vision/cnns/pytorch/train
Set the following environment variables.
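The poprun script below reads $PYTORCH_CACHE_DIR for its executable cache; the value shown here is only a placeholder, so point it at any writable directory:
# Placeholder cache location; choose a writable path on your system
export PYTORCH_CACHE_DIR=/home/$USER/pytorch_cache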
To run 4 replicas (a total of 4 IPUs) of the ResNet50 model, create a script named poprun_unet.sh with the following contents. This script tells poprun to use the partition ID of the partition created for the Slurm job used to run the script.
#!/bin/bash
poprun -vv --vipu-partition=slurm_${SLURM_JOBID} --num-instances=1 --num-replicas=4 \
    --executable-cache-path=$PYTORCH_CACHE_DIR \
    python3 /home/$USER/graphcore/examples/vision/cnns/pytorch/train/train.py \
    --config resnet50-pod4 \
    --imagenet-data-path /mnt/localdata/datasets/imagenet-raw-dataset \
    --epoch 2 --validation-mode none \
    --dataloader-worker 14 --dataloader-rebatch-size 256
This model is run with the ImageNet dataset.
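A minimal sketch of submitting the script through Slurm, assuming the --ipus flag used elsewhere in this guide to request the four IPUs consumed by the four replicas:
# The --ipus value matches the four replicas requested in poprun_unet.sh
chmod +x poprun_unet.sh
/opt/slurm/bin/srun --ipus=4 ./poprun_unet.sh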
Output
The expected output starts with this:
srun: job 10675 queued and waiting for resources
srun: job 10675 has been allocated resources
23:48:29.160 3555537 POPRUN [I] V-IPU server address picked up from 'vipu': 10.1.3.101:8090
23:48:29.160 3555537 POPRUN [D] Connecting to 10.1.3.101:8090
23:48:29.162 3555537 POPRUN [D] Status for partition slurm_10673: OK (error 0)
23:48:29.162 3555537 POPRUN [I] Partition slurm_10673 already exists and is in state: PS_ACTIVE
23:48:29.163 3555537 POPRUN [D] The reconfigurable partition slurm_10673 is OK
 ===========================
|      poprun topology      |
|===========================|
| hosts     |  gc-poplar-02 |
|-----------|---------------|
| ILDs      |       0       |
|-----------|---------------|
| instances |       0       |
|-----------|---------------|
| replicas  | 0 | 1 | 2 | 3 |
 ---------------------------
23:48:29.163 3555537 POPRUN [D] Target options from environment: {}
23:48:29.163 3555537 POPRUN [D] Target options from V-IPU partition: {"ipuLinkDomainSize":"4","ipuLinkConfiguration":"default","ipuLinkTopology":"mesh","gatewayMode":"true","instanceSize":"4"}
23:48:29.207 3555537 POPRUN [D] Found 1 devices with 4 IPUs
23:48:29.777 3555537 POPRUN [D] Attached to device 6
23:48:29.777 3555537 POPRUN [I] Preparing parent device 6
23:48:29.777 3555537 POPRUN [D] Device 6 ipuLinkDomainSize=64, ipuLinkConfiguration=Default, ipuLinkTopology=Mesh, gatewayMode=true, instanceSize=4
23:48:33.631 3555537 POPRUN [D] Target options from Poplar device: {"ipuLinkDomainSize":"64","ipuLinkConfiguration":"default","ipuLinkTopology":"mesh","gatewayMode":"true","instanceSize":"4"}
23:48:33.631 3555537 POPRUN [D] Using target options: {"ipuLinkDomainSize":"4","ipuLinkConfiguration":"default","ipuLinkTopology":"mesh","gatewayMode":"true","instanceSize":"4"}
Graph compilation: 100%|██████████| 100/100 [00:04<00:00][1,0]<stderr>:2023-08-22T23:49:40.103248Z PO:ENGINE 3556102.3556102 W: WARNING: The compile time engine option debug.branchRecordTile is set to "5887" when creating the Engine. (At compile time it was set to 1471)
[1,0]<stderr>:
Loss:6.7539 [1,0]<stdout>:[INFO] Epoch 1████▌| 75/78 [02:42<00:06, 2.05s/it][1,0]<stderr>:
[1,0]<stdout>:[INFO] loss: 6.7462,
[1,0]<stdout>:[INFO] accuracy: 0.62 %
[1,0]<stdout>:[INFO] throughput: 7599.7 samples/sec
[1,0]<stdout>:[INFO] Epoch 2/2
Loss:6.7462 | Accuracy:0.62%: 100%|██████████| 78/78 [02:48<00:00, 2.16s/it][1,0]<stderr>:
Loss:6.2821 | Accuracy:2.42%: 96%|█████████▌| 75/7[1,0]<stdout>:[INFO] Epoch 2,0]<stderr>:
[1,0]<stdout>:[INFO] loss: 6.2720,
[1,0]<stdout>:[INFO] accuracy: 2.48 %
[1,0]<stdout>:[INFO] throughput: 8125.8 samples/sec
[1,0]<stdout>:[INFO] Finished training. Time: 2023-08-22 23:54:57.853508. It took: 0:05:26.090631
Loss:6.2720 | Accuracy:2.48%: 100%|██████████| 78/78 [02:37<00:00, 2.02s/it][1,0]<stderr>:
[1,0]<stderr>:/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 14 leaked semaphore objects to clean up at shutdown
[1,0]<stderr>: warnings.warn('resource_tracker: There appear to be %d '
23:55:02.722 3555537 POPRUN [I] mpirun (PID 3556098) terminated with exit code 0
GPT-2 PyTorch - POD16 run
The scripts to train a GPT-2 PyTorch model on the POD16 are located at https://github.com/graphcore/examples/tree/master/nlp/gpt2/pytorch
To run the GPT-2 PyTorch model, create a new PopTorch virtual environment poptorch33_gpt2 as described in the virtual environment section and activate it.
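For example, assuming the environment lives under ~/venvs/graphcore:
# Path is illustrative; use the location from your environment setup
source ~/venvs/graphcore/poptorch33_gpt2/bin/activate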
Install Requirements
Change directory:
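For example:
# GPT-2 example directory in the cloned repository
cd ~/graphcore/examples/nlp/gpt2/pytorch
pip install -r requirements.txt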
Run GPT2 on 16 IPUs
The command for the GPT-2 model is as follows.
/opt/slurm/bin/srun --ipus=16 python /home/$USER/graphcore/examples/nlp/gpt2/pytorch/train_gpt2.py \
    --model gpt2 --ipus-per-replica 4 --replication-factor 4 \
    --gradient-accumulation 2048 --device-iterations 8 --batch-size 1 \
    --layers-per-ipu 0 4 4 4 --matmul-proportion 0.15 0.15 0.15 0.15 \
    --max-len 1024 --optimizer AdamW --learning-rate 0.00015 \
    --lr-schedule cosine --lr-warmup 0.01 --remap-logit True \
    --enable-sequence-serialized True --embedding-serialization-factor 4 \
    --recompute-checkpoint-every-layer True --enable-half-partials True \
    --replicated-tensor-sharding True --dataset 'generated' --epochs 1
This runs the gpt2 model, which fits on 4 IPUs as indicated by --ipus-per-replica. The --replication-factor indicates how many times the model is replicated in a data-parallel manner (4 in the above example). Hence the total number of IPUs used in this example is 16.
The effective global batch size in this example is (micro) batch size * gradient accumulation * replication factor = 1 x 2048 x 4 = 8192. The --device-iterations value indicates the total number of samples loaded per training step: global batch size * device iterations = 8192 * 8 = 65536. To learn more about these parameters and about batching on IPUs in general, refer to IPU batching.
The above example runs with generated (synthetic) data. To use the same example with a real-world dataset, refer to data setup.
Output
Expected output starts with the following:
srun: job 10697 queued and waiting for resources
srun: job 10697 has been allocated resources
Building (if necessary) and loading remap_tensor_ce.
Failed to find compiled extension; rebuilding.
Building (if necessary) and loading residual_add_inplace_pattern.
Model initializing
-------------------- Device Allocation --------------------
Embedding --> IPU 0
Layer 0 --> IPU 1
Layer 1 --> IPU 1
Layer 2 --> IPU 1
Layer 3 --> IPU 1
Layer 4 --> IPU 2
Layer 5 --> IPU 2
Layer 6 --> IPU 2
Layer 7 --> IPU 2
Layer 8 --> IPU 3
Layer 9 --> IPU 3
Layer 10 --> IPU 3
Layer 11 --> IPU 3
LM_head --> IPU 0
step 0 of epoch 0, loss: 10.913220405578613, acc: 2.0071864128112793e-05, lr: 0.00012803300858899104, throughput: 646.8439205981404 samples/sec
step 1 of epoch 0, loss: 10.836345672607422, acc: 1.9788742065429688e-05, lr: 7.5e-05, throughput: 1058.0979097185766 samples/sec
step 2 of epoch 0, loss: 10.831247329711914, acc: 2.0518898963928223e-05, lr: 2.1966991411008938e-05, throughput: 1058.7595523807183 samples/sec
step 3 of epoch 0, loss: 10.829034805297852, acc: 1.990795135498047e-05, lr: 0.0, throughput: 1059.6762623043378 samples/sec
Note: The graph compilation for a large model like GPT-2 takes about half an hour.