Steps to Run a Model/Program
Note: Please be mindful of how you are using the system. For example, consider running larger jobs in the evening or on weekends.
Running any model or application involves compiling the model into a graph, which is then deployed on the IPUs. The steps below describe training a neural network for classification on the MNIST dataset using PopTorch (the PyTorch framework optimized for the IPU).
Graphcore provides examples of some well-known AI applications in their repository at https://github.com/graphcore/examples.git.
Clone the examples repository to your personal directory structure, and check out the v3.3.0 release:
Activate PopTorch Environment
Follow the steps at PopTorch environment setup to enable the Poplar SDK.
Change directory and install packages specific to the MNIST model:
Execute the command:
All models are run using Slurm, with the --ipus flag indicating how many IPUs need to be allocated for the model being run. This example uses a batch size of 8 and runs for 10 epochs. It also sets the device iterations to 50, which is the number of iterations the device runs over the data before returning to the user. The dataset used in the example is derived from TorchVision, and the PopTorch dataloader is used to load the data required for the 50 device iterations from the host to the device in a single step.
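The amount of data moved per host-to-device step follows from the numbers above. The short sketch below works through that arithmetic; it assumes the standard MNIST training split of 60,000 images, a single replica, and no gradient accumulation, none of which is stated explicitly in this guide.

```python
# Values taken from the example description.
batch_size = 8
device_iterations = 50
epochs = 10

# Assumption: the standard MNIST training split (60,000 images),
# one replica, and no gradient accumulation.
train_samples = 60_000

# Samples the dataloader hands to the device in one host-to-device step.
samples_per_step = batch_size * device_iterations

# Host-to-device steps needed to cover the training split once.
steps_per_epoch = train_samples // samples_per_step

print(f"samples per step: {samples_per_step}")
print(f"steps per epoch:  {steps_per_epoch}")
print(f"total steps:      {steps_per_epoch * epochs}")
```

Raising the device iterations reduces how often control returns to the host, at the cost of a larger transfer per step.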
The model used here is a simple CNN-based model whose output comes from a classifier (softmax layer).
A simple PyTorch model is translated to a PopTorch model using poptorch.trainingModel, the model-wrapping function applied to the PyTorch model. The first call to the wrapped model compiles it for the IPU. You can observe the compilation process as part of the output of the above command.
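As a rough illustration of the wrapping step, the sketch below shows one way a PyTorch module might be passed to poptorch.trainingModel. It is not the code from the example script: the helper name build_training_model, the choice of SGD, and the learning rate are hypothetical, and it assumes the module's forward method already returns the loss.

```python
def build_training_model(model, device_iterations=50, learning_rate=0.05):
    """Wrap a PyTorch module for IPU training (illustrative sketch).

    Assumes `model.forward` returns the loss so PopTorch can train on it.
    The SGD optimizer and the learning rate are placeholder choices.
    """
    # Imported lazily: poptorch is only importable once the Poplar SDK
    # environment has been activated.
    import poptorch

    opts = poptorch.Options()
    # Number of iterations the device runs before returning to the host.
    opts.deviceIterations(device_iterations)

    optimizer = poptorch.optim.SGD(model.parameters(), lr=learning_rate)

    # Graph compilation is triggered the first time the returned
    # wrapper is called with data.
    return poptorch.trainingModel(model, options=opts, optimizer=optimizer)
```

The returned object is called like the original module, for example once per batch group delivered by the dataloader.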
The artifacts from the graph compilation are cached in the location set by the environment variable POPTORCH_CACHE_DIR, where the .popef file corresponding to the model under consideration is stored.
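To check whether a compiled executable has been cached, one option is to list the .popef files under the cache location. The helper below is a minimal sketch: the function name cached_executables and the fallback directory ./poptorch_cache are hypothetical, and it searches subdirectories as well because the exact layout of the cache is not described here.

```python
import os
from pathlib import Path


def cached_executables(cache_dir):
    """Return the relative paths of all .popef files under cache_dir."""
    root = Path(cache_dir)
    if not root.is_dir():
        return []
    # Search recursively, since the cache layout is an assumption here.
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.popef"))


# Hypothetical fallback path, used only if the variable is not set.
cache_dir = os.environ.get("POPTORCH_CACHE_DIR", "./poptorch_cache")
print(cached_executables(cache_dir))
```

An empty list before the first run and a non-empty one afterwards would suggest that later runs of the same model can reuse the cached executable.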
The expected output starts with the dataset downloads, followed by a printout of the model being trained, the progress bar of the compilation process, and the training progress bar.
srun: job 10671 queued and waiting for resources
srun: job 10671 has been allocated resources
TrainingModelWithLoss(
  (model): Network(
    (layer1): Block(
      (conv): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (relu): ReLU()
    )
    (layer2): Block(
      (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (relu): ReLU()
    )
    (layer3): Linear(in_features=1600, out_features=128, bias=True)
    (layer3_act): ReLU()
    (layer3_dropout): Dropout(p=0.5, inplace=False)
    (layer4): Linear(in_features=128, out_features=10, bias=True)
    (softmax): Softmax(dim=1)
  )
  (loss): CrossEntropyLoss()
)
Epochs:   0%|          | 0/10 [00:00<?,[23:27:06.753] [poptorch:cpp] [warning] [DISPATCHER] Type coerced from Long to Int for tensor id 10
Graph compilation: 100%|██████████| 100/100 [00:00<00:00]
Epochs: 100%|██████████| 10/10 [01:17<00:00,  7.71s/it]
Graph compilation: 100%|██████████| 100/100 [00:00<00:00]
Accuracy on test set: 96.85%██████| 100/100 [00:00<00:00]
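The in_features=1600 value of layer3 in the printed model can be checked by tracing the spatial size through the two convolution and pooling blocks. The sketch below does this with plain arithmetic; it assumes 28×28 single-channel MNIST inputs and zero padding on the convolutions (the printout shows no padding argument, which is the Conv2d default). The helper name window_out is hypothetical.

```python
def window_out(size, kernel, stride=1, padding=0):
    """Output length of a conv or pool window along one dimension
    (floor mode, matching ceil_mode=False in the printout)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28  # assumption: MNIST images are 28x28

# layer1: Conv2d(kernel 3, stride 1) then MaxPool2d(kernel 2, stride 2)
size = window_out(size, kernel=3)            # 28 -> 26
size = window_out(size, kernel=2, stride=2)  # 26 -> 13

# layer2: same block structure, 64 output channels
size = window_out(size, kernel=3)            # 13 -> 11
size = window_out(size, kernel=2, stride=2)  # 11 -> 5

channels = 64
flattened = channels * size * size
print(f"final feature map: {channels} x {size} x {size}")
print(f"flattened features: {flattened}")
```

The flattened size matches the input width of the first Linear layer, which is a quick consistency check when adapting the network to other input sizes.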
Refer to the script to learn more about this example.
Example Programs lists the different example applications with corresponding commands for each of the above steps.