Argonne Leadership Computing Facility

Steps to Run a Model/Program

Getting Started

[This subsection is an adaptation of]


Slurm is installed and running on all the CPU nodes. It coordinates between a Cerebras system and the nodes in its Cerebras cluster. See section Job Queuing and Submission for more details.

Worker hostnames:

The worker nodes (see the first diagram in System Overview) for the cs2-01 cluster are cs2-01-med[2-9].
The worker nodes (see the first diagram in System Overview) for the cs2-02 cluster are cs2-02-med[1-7].
You may occasionally need to log into a specific worker node for debugging purposes.
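
As an illustration of the hostname ranges above, here is a minimal sketch that expands them into one hostname per line. The function name worker_nodes is hypothetical and not part of the Cerebras or ALCF tooling; the ranges are copied from the two lists above.

```shell
#!/bin/sh
# Print the worker hostnames for a cluster, one per line.
# worker_nodes is a hypothetical helper; the ranges mirror the lists above.
worker_nodes() {
    case "$1" in
        cs2-01) first=2; last=9 ;;
        cs2-02) first=1; last=7 ;;
        *) echo "unknown cluster: $1" >&2; return 1 ;;
    esac
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "$1-med$i"
        i=$((i + 1))
    done
}

# Example: list the eight worker nodes of cs2-01
worker_nodes cs2-01
```

The output can be used to pick a specific node to log into when debugging.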

CS_IP address of the Cerebras system:

The CS-2 systems can be accessed using the CS_IP environment variable. This is set automatically on login.
The CS_IP for cs2-01 is
The CS_IP for cs2-02 is
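
Because the wafer commands in this section rely on CS_IP, it can help to confirm that the variable is set before submitting a job. The following is a small sketch; the function name require_cs_ip and its messages are assumptions made for illustration.

```shell
#!/bin/sh
# Return 0 if CS_IP is set and non-empty; otherwise print a hint and return 1.
# require_cs_ip is a hypothetical helper, not a Cerebras-provided command.
require_cs_ip() {
    if [ -z "${CS_IP:-}" ]; then
        echo "CS_IP is not set; log in again or export it manually" >&2
        return 1
    fi
    echo "Using CS system at $CS_IP"
}

# Typical use at the top of a job script:
#   require_cs_ip || exit 1
```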

Running Slurm jobs:

Cerebras includes two scripts for running Slurm jobs.
csrun_cpu is for running a Cerebras compilation. By default it reserves a single entire worker node.
csrun_wse is for running a job on the wafer-scale engine. By default it reserves five entire worker nodes, which are used to feed the dataset to the CS-2 wafer.
csrun_cpu --help and csrun_wse --help list the available options.
See section Job Queuing and Submission for more details.

Execution mode:

The CS-2 system supports two modes of execution.
1. Pipelined mode (default mode)
Both cs2-01 and cs2-02 are currently configured for pipelined mode. This mode has more mature software support than weight streaming mode.
2. Weight streaming mode (see the Weight Streaming Quickstart)
Weight streaming mode uses the host memory of one or more dedicated worker nodes to store model weights, and supports larger models than pipelined mode.
Weight streaming mode was introduced in Rel 1.5 and supports only a limited number of model layers.

Running a training job on the wafer

Follow these instructions to compile and train the fc_mnist TensorFlow estimator example. This model consists of a couple of fully connected layers plus dropout and ReLU.

cd ~/
mkdir ~/R1.5/
cp -r /software/cerebras/model_zoo/modelzoo ~/R1.5/modelzoo
cd ~/R1.5/modelzoo/fc_mnist/tf
csrun_wse python --mode train --cs_ip $CS_IP --max_steps 100000

You should see a training rate of about 1870 steps per second, and output that finishes with something similar to this:

INFO:tensorflow:Training finished with 25600000 samples in 53.424 seconds, 479188.55 samples/second.
INFO:tensorflow:Loss for final step: 0.0.
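
The steps-per-second figure can be recovered from the final log line. The sketch below assumes a batch size of 256, which is inferred from 25600000 samples over 100000 steps rather than stated in the source; the function name training_rate is hypothetical.

```shell
#!/bin/sh
# Derive steps per second from the "Training finished" log line.
# batch_size=256 is an assumption: 25600000 samples / 100000 steps.
log_line='INFO:tensorflow:Training finished with 25600000 samples in 53.424 seconds, 479188.55 samples/second.'
batch_size=256

# $1 = log line, $2 = batch size; prints whole steps per second
training_rate() {
    echo "$1" | awk -v bs="$2" '{
        for (i = 1; i <= NF; i++) {
            # locate "<N> samples in <T> seconds,"
            if ($i == "samples" && $(i + 1) == "in") {
                samples = $(i - 1)
                seconds = $(i + 2)
            }
        }
        printf "%d\n", samples / seconds / bs
    }'
}

training_rate "$log_line" "$batch_size"   # prints 1871
```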

To compile and train separately:

# delete any existing compile artifacts and checkpoints
rm -r model_dir
csrun_cpu python --mode train --compile_only --cs_ip $CS_IP
csrun_wse python --mode train --cs_ip $CS_IP --max_steps 100000

The training will reuse an existing compilation if no changes were made that force a recompile, and will start from the newest checkpoint file if any. Compiles may be done while another job is using the wafer.
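
Since existing artifacts in model_dir are reused, deleting that directory is only needed when a fresh start is wanted. Here is a minimal sketch of a guarded cleanup; the function name reset_model_dir and the FRESH_START variable are hypothetical, and only the model_dir name comes from the commands above.

```shell
#!/bin/sh
# Remove compile artifacts and checkpoints only when a fresh start is requested.
# reset_model_dir and FRESH_START are hypothetical names used for illustration.
reset_model_dir() {
    dir="${1:-model_dir}"
    if [ "${FRESH_START:-0}" = "1" ] && [ -d "$dir" ]; then
        rm -r "$dir"
        echo "removed $dir"
    else
        # Existing compile output and checkpoints stay available for reuse
        echo "leaving $dir untouched"
    fi
}

# Typical use before a training run:
#   FRESH_START=1 reset_model_dir model_dir
```

Unlike a bare rm -r model_dir, this does not report an error when the directory does not exist yet.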

See also the current Cerebras quickstart documentation, which uses a clone of Cerebras's abbreviated public "reference implementations" GitHub repo rather than the full modelzoo.

Running a training job on the wafer in weight streaming mode

No CS-2 nodes are currently configured for weight streaming mode. This section is currently a placeholder.

If not already done, copy the modelzoo tree:

cd ~/
mkdir ~/R1.5/
cp -r /software/cerebras/model_zoo/modelzoo ~/R1.5/modelzoo

Then change to the TensorFlow GPT2 directory:

cd ~/R1.5/modelzoo/transformers/tf/gpt2

Then edit the two instances of data_dir in configs/params_gpt2_small_ws.yaml (or in a copy of that file) as follows:

<     data_dir: "./input/pile_pretraining_gpt/train_msl2048/"
>     data_dir: "/software/cerebras/dataset/transformers/owt/openwebtext/owt_tfrecords_gpt2_msl2048/train/"
<     data_dir: "./input/pile_pretraining_gpt/val_msl2048/"
>     data_dir: "/software/cerebras/dataset/transformers/owt/openwebtext/owt_tfrecords_gpt2_msl2048/val/"

Then start the training run:

csrun_wse --cyclic --total-nodes=4 --single-task-nodes=2 python-ws -p configs/params_gpt2_small.yaml -m train --model_dir gpt2_small_owt_2048 --cs_ip $CS_IP
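
The two data_dir edits described above can also be applied with sed while writing to a copy of the params file. This is a sketch only; the function name set_data_dirs is hypothetical, and the input and output file names are left to the caller.

```shell
#!/bin/sh
# Write a copy of a params file with both data_dir entries pointed at the shared dataset.
# set_data_dirs is a hypothetical helper. Usage: set_data_dirs <input.yaml> <output.yaml>
OWT=/software/cerebras/dataset/transformers/owt/openwebtext/owt_tfrecords_gpt2_msl2048

set_data_dirs() {
    # "|" is used as the sed delimiter because the paths contain "/"
    sed -e "s|\./input/pile_pretraining_gpt/train_msl2048/|$OWT/train/|" \
        -e "s|\./input/pile_pretraining_gpt/val_msl2048/|$OWT/val/|" \
        "$1" > "$2"
}
```

Writing to a separate output file keeps the original config available for comparison with diff.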

Running a training job on the CPU

The examples in the modelzoo will run in CPU mode, either using the csrun_cpu script or in a Singularity shell, as shown below.

Using csrun_cpu

To compile and train separately:

# delete any existing compile artifacts and checkpoints
rm -r model_dir
csrun_cpu python --mode train --compile_only
csrun_cpu python --mode train --max_steps 400

Note: If no cs_ip is specified, a training run will be in CPU mode.

Change the max steps on the training command line to something smaller than the default so that the training completes in a reasonable amount of time. (CPU mode is more than two orders of magnitude slower for many examples.)

Using a Singularity shell

This illustrates how to start a shell in a Singularity container. The -B /opt:/opt option is an illustrative example of how to bind a directory into the container. (By default, Singularity containers bind both one's home directory and /tmp, read/write.)

On cs2-02:

cd ~/R1.5/modelzoo/fc_mnist/tf
singularity shell -B /opt:/opt /software/cerebras/cs2-02/container/cbcore_latest.sif

Or, on cs2-01:

cd ~/R1.5/modelzoo/fc_mnist/tf
singularity shell -B /opt:/opt /software/cerebras/cs2-01/container/cbcore_latest.sif
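
The two container paths differ only in the cluster name, so the path can be built from it. A minimal sketch follows, assuming the path layout shown in the commands above; the function name container_image is hypothetical.

```shell
#!/bin/sh
# Print the container image path for a cluster (cs2-01 or cs2-02).
# container_image is a hypothetical helper; the layout is taken from the commands above.
container_image() {
    case "$1" in
        cs2-01|cs2-02) echo "/software/cerebras/$1/container/cbcore_latest.sif" ;;
        *) echo "unknown cluster: $1" >&2; return 1 ;;
    esac
}

# Example use:
#   singularity shell -B /opt:/opt "$(container_image cs2-02)"
container_image cs2-01
```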

At the shell prompt for the container,

#rm -r model_dir
# compile and train on the CPUs
python --mode train --max_steps 1000
python --mode eval --eval_steps 1000
# validate_only is the first portion of a compile
python --mode train --validate_only
# remove the existing compile and training artifacts
rm -r model_dir
# compile_only does a compile but no training
python --mode train --compile_only

Type exit at the shell prompt to exit the container.