
Argonne Leadership Computing Facility

Running a Model/Program

Getting Started

Job submission and queuing

Cerebras jobs are initiated and tracked automatically by the Python frameworks in modelzoo.common.pytorch.run_utils and modelzoo.common.tf.run_utils, which interact with the Cerebras cluster management node.

Login nodes

Jobs are launched from login nodes. For long-running jobs, or if you expect your internet connection to drop for any reason, we suggest logging into a specific login node and using either screen or tmux to create a persistent command-line session. For details use:

man screen
# or
man tmux

Execution modes

The CS-2 system supports two modes of execution.
1. Pipeline mode.
This mode is used for smaller models (fewer than 1 billion parameters).
2. Weight streaming mode.
This mode uses the host memory of the Cerebras cluster's MemoryX nodes to store and broadcast model weights, and supports larger models than pipeline mode.
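The execution mode is selected by the second positional argument to run.py, as in the sample training commands later on this page (pipeline appears there verbatim; the weight_streaming spelling is an assumption and should be checked against your model zoo release):

```
python run.py CSX pipeline ...          # pipeline mode
python run.py CSX weight_streaming ...  # weight streaming mode
```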

Running jobs on the wafer

Follow these instructions to compile and train the fc_mnist TensorFlow and PyTorch samples. These models consist of a couple of fully connected layers plus dropout and ReLU.

Cerebras virtual environments

First, create Cerebras virtual environments for PyTorch and/or TensorFlow. See Customizing Environments for the procedure for making custom PyTorch and/or TensorFlow virtual environments for Cerebras. If the environments are created in ~/R_1.8.0/, they are activated as follows:

source ~/R_1.8.0/venv_pt/bin/activate
# or
source ~/R_1.8.0/venv_tf/bin/activate

Clone the Cerebras modelzoo

mkdir ~/R_1.8.0
cd ~/R_1.8.0
git clone https://github.com/Cerebras/modelzoo.git
cd modelzoo
git tag   # list the available release tags
git checkout Release_1.8.0

Running a PyTorch sample

Activate your PyTorch virtual environment and change to the working directory:

source ~/R_1.8.0/venv_pt/bin/activate
cd ~/R_1.8.0/modelzoo/modelzoo/fc_mnist/pytorch

Next, edit configs/params.yaml, making the following changes:

 train_input:
-    data_dir: "./data/mnist/train"
+    data_dir: "/software/cerebras/dataset/fc_mnist/data/mnist/train"

and

 eval_input:
-    data_dir: "./data/mnist/val"
+    data_dir: "/software/cerebras/dataset/fc_mnist/data/mnist/val"

If you prefer to have the sample download the dataset itself, you will still need to specify absolute paths for the "data_dir" values.
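Both data_dir edits follow the same pattern, so they can also be applied from the command line. A minimal sketch of the substitution, demonstrated on a sample line (in practice, run the sed expression with -i against configs/params.yaml from the working directory above):

```shell
# Rewrite the relative MNIST path to the shared ALCF dataset location
# shown in the diff above. The echo stands in for a line of params.yaml.
echo '    data_dir: "./data/mnist/train"' \
  | sed 's|\./data/mnist|/software/cerebras/dataset/fc_mnist/data/mnist|'
```

The same expression rewrites both the train and val entries, since only the common path prefix is substituted.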

Running a sample PyTorch training job

To run the sample:

export MODEL_DIR=model_dir
# deleting the model_dir is only needed if the sample has been run previously
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX pipeline --job_labels name=pt_smoketest --params configs/params.yaml --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_1.8.0/modelzoo --compile_dir /$(whoami) |& tee mytest.log

A successful fc_mnist PyTorch training run should finish with output resembling the following:

2023-05-15 16:05:54,510 INFO:   | Train Device=xla:0, Step=9950, Loss=2.30234, Rate=157300.30 samples/sec, GlobalRate=26805.42 samples/sec
2023-05-15 16:05:54,571 INFO:   | Train Device=xla:0, Step=10000, Loss=2.29427, Rate=125599.14 samples/sec, GlobalRate=26905.42 samples/sec
2023-05-15 16:05:54,572 INFO:   Saving checkpoint at global step 10000
2023-05-15 16:05:59,734 INFO:   Saving step 10000 in dataloader checkpoint
2023-05-15 16:06:00,117 INFO:   Saved checkpoint at global step: 10000
2023-05-15 16:06:00,117 INFO:   Training Complete. Completed 1280000 sample(s) in 53.11996841430664 seconds.
2023-05-15 16:06:04,356 INFO:   Monitoring returned
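Because the run command above tees its output to mytest.log, the reported loss can be pulled out of the log with standard tools. A minimal sketch, shown here on sample log lines matching the output format above (in practice, point grep at mytest.log instead of the here-document):

```shell
# Extract the most recent Loss= value from training log lines.
grep -o 'Loss=[0-9.]*' <<'EOF' | tail -n 1 | cut -d= -f2
2023-05-15 16:05:54,510 INFO:   | Train Device=xla:0, Step=9950, Loss=2.30234, Rate=157300.30 samples/sec, GlobalRate=26805.42 samples/sec
2023-05-15 16:05:54,571 INFO:   | Train Device=xla:0, Step=10000, Loss=2.29427, Rate=125599.14 samples/sec, GlobalRate=26905.42 samples/sec
EOF
```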

Running a TensorFlow sample

Activate your TensorFlow virtual environment and change to the working directory:

source ~/R_1.8.0/venv_tf/bin/activate
cd ~/R_1.8.0/modelzoo/modelzoo/fc_mnist/tf/

Next, edit configs/params.yaml, making the following change. Cerebras requires that the data_dir be an absolute path.

--- a/modelzoo/fc_mnist/tf/configs/params.yaml
+++ b/modelzoo/fc_mnist/tf/configs/params.yaml
@@ -17,7 +17,7 @@ description: "FC-MNIST base model params"

 train_input:
     shuffle: True
-    data_dir: './tfds' # Place to store data
+    data_dir: '/software/cerebras/dataset/fc_mnist/tfds/' # Place to store data
     batch_size: 256
     num_parallel_calls: 0   # 0 means AUTOTUNE

Running a sample TensorFlow training job

export MODEL_DIR=model_dir
# deleting the model_dir is only needed if the sample has been run previously
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX pipeline --job_labels name=tf_fc_mnist --params configs/params.yaml --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_1.8.0/modelzoo/ --compile_dir /$(whoami) |& tee mytest.log

A successful fc_mnist TensorFlow training run should finish with output resembling the following:

INFO:tensorflow:global step 99900: loss = 0.10198974609375 (915.74 steps/sec)
INFO:tensorflow:global step 100000: loss = 0.0 (915.96 steps/sec)
INFO:root:Training complete. Completed 25600000 sample(s) in 109.17504906654358 seconds
INFO:root:Taking final checkpoint at step: 100000
INFO:root:Saving step 99999 in dataloader checkpoint
INFO:tensorflow:Saved checkpoint for global step 100000 in 3.9300642013549805 seconds: model_dir/model.ckpt-100000
INFO:root:Monitoring returned