Running a Model/Program
Job submission and queuing
Cerebras jobs are initiated and tracked automatically by the Python frameworks in modelzoo.common.pytorch.run_utils and modelzoo.common.tf.run_utils, which interact with the Cerebras cluster management node.
Jobs are launched from login nodes. If you expect to lose your internet connection for any reason, we suggest that for long-running jobs you log in to a specific login node and use either screen or tmux to create a persistent command-line session. For details, consult the screen or tmux documentation.
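As a sketch (assuming tmux is available on the login node; the session name is arbitrary), a persistent session might look like:

```shell
# Start a named tmux session on the login node.
tmux new -s train
# ...launch the long-running job inside the session, then detach with Ctrl-b d.
# Later, from any SSH connection to the same login node, reattach:
tmux attach -t train
```

Because the job runs inside the tmux session rather than your SSH shell, it survives a dropped connection.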
Running jobs on the wafer
Follow these instructions to compile and train the fc_mnist TensorFlow and PyTorch samples. These models are a couple of fully connected layers plus dropout and ReLU.
Cerebras virtual environments
First, create Cerebras virtual environments for PyTorch and/or TensorFlow.
See Customizing Environments for the procedures for creating PyTorch and/or TensorFlow virtual environments for Cerebras.
If the environments are created in ~/R_1.9.1/, they are activated as follows:
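For example, if the PyTorch environment were named venv_cerebras_pt (a hypothetical name; substitute whatever name you chose when creating it), activation would look like:

```shell
# Activate the PyTorch virtual environment (path and name are assumptions):
source ~/R_1.9.1/venv_cerebras_pt/bin/activate
# The shell prompt changes and $VIRTUAL_ENV points at the environment.
```

Run `deactivate` to leave the environment when you are done.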
Clone the Cerebras modelzoo
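A minimal sketch, assuming the public Cerebras modelzoo repository on GitHub and that you clone into the R_1.9.1 directory referenced above:

```shell
# Clone the Cerebras modelzoo into the release directory used by later steps.
cd ~/R_1.9.1/
git clone https://github.com/Cerebras/modelzoo.git
```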
Running a PyTorch sample
Activate your PyTorch virtual environment and change to the working directory.
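Assuming the clone location above, this step would look something like the following (the environment name and the exact path inside the repository are assumptions and may differ between releases):

```shell
# Activate the environment (hypothetical name), then enter the sample directory.
source ~/R_1.9.1/venv_cerebras_pt/bin/activate
cd ~/R_1.9.1/modelzoo/modelzoo/fc_mnist/pytorch   # path is an assumption
```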
Next, edit configs/params.yaml and make the following changes:
If you want the sample to download the dataset for you, you will need to specify absolute paths for the data_dir fields.
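A hypothetical excerpt of configs/params.yaml with absolute data_dir paths filled in (the section names and paths here are placeholders; match them to the fields in your actual file):

```yaml
train_input:
    data_dir: "/home/user/data/mnist/train"   # absolute path (placeholder)
eval_input:
    data_dir: "/home/user/data/mnist/eval"    # absolute path (placeholder)
```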
Running a sample PyTorch training job
To run the sample:
export MODEL_DIR=model_dir
# Deleting model_dir is only needed if the sample has been run previously.
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX \
    --job_labels name=pt_smoketest \
    --params configs/params.yaml \
    --num_csx=1 \
    --mode train \
    --model_dir $MODEL_DIR \
    --mount_dirs /home/ /software \
    --python_paths /home/$(whoami)/R_1.9.1/modelzoo \
    --compile_dir /$(whoami) |& tee mytest.log
A successful fc_mnist PyTorch training run should finish with output resembling the following:
2023-05-15 16:05:54,510 INFO: | Train Device=xla:0, Step=9950, Loss=2.30234, Rate=157300.30 samples/sec, GlobalRate=26805.42 samples/sec
2023-05-15 16:05:54,571 INFO: | Train Device=xla:0, Step=10000, Loss=2.29427, Rate=125599.14 samples/sec, GlobalRate=26905.42 samples/sec
2023-05-15 16:05:54,572 INFO: Saving checkpoint at global step 10000
2023-05-15 16:05:59,734 INFO: Saving step 10000 in dataloader checkpoint
2023-05-15 16:06:00,117 INFO: Saved checkpoint at global step: 10000
2023-05-15 16:06:00,117 INFO: Training Complete. Completed 1280000 sample(s) in 53.11996841430664 seconds.
2023-05-15 16:06:04,356 INFO: Monitoring returned