Balsam on Aurora
Balsam is a toolkit for managing large computational campaigns on HPC systems. It helps users execute large numbers of jobs with inter-job dependencies, track job outcomes, and manage postprocessing analysis. The command line interface and Python API make Balsam easy to adopt: after wrapping an application's command line in a few lines of Python code, users can describe jobs with accompanying options. These jobs are stored persistently in the Balsam database. Balsam is especially well suited to executing large ensembles of MPI tasks of varying sizes.
A user's Balsam service consists of a Balsam Site process, which runs on a login node and orchestrates the execution of work, and a Balsam Site directory, where job and workflow results are stored. When the user submits a batch job to PBS through Balsam, the Site process pulls Balsam jobs from the database and executes them within the PBS batch job, achieving high throughput while waiting in the queue only once.
The full Balsam documentation covers all functionality for users, including additional examples, and describes the Balsam architecture for potential developers.
Setup and installation
Balsam requires Python 3.7+. To install Balsam on Aurora, first set up a virtual Python environment:
Alternatively, Balsam can be installed with pip inside a conda environment.
The Balsam command line tool will now be in your path. For information on how to use it, type balsam --help in your shell.
To use Balsam, users need an account on the Balsam server. Users can get an account by contacting the ALCF Help Desk. Once you have an account, you can log in and make a new site. A Balsam site is a project space for your workflow. When creating a new site, you will be prompted to select which machine (Aurora) you are working on:
Aurora specific notes
In the Balsam configuration for Aurora, a Balsam gpu refers to a single GPU tile on an Aurora node. Setting the Balsam job option gpus_per_rank = 1 will place one rank per GPU tile. Setting gpus_per_rank = 2 will place one rank per GPU.
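The arithmetic behind this mapping can be sketched in a few lines of Python. This is an illustration only, not part of Balsam: the helper name is hypothetical, and the sketch assumes an Aurora node with 6 GPUs of 2 tiles each (12 tiles in total).

```python
# Assumption: an Aurora node has 6 GPUs, each with 2 tiles.
GPUS_PER_NODE = 6
TILES_PER_GPU = 2


def max_ranks_per_node(gpus_per_rank: int) -> int:
    """Return how many ranks fit on one node for a given gpus_per_rank.

    A Balsam "gpu" on Aurora is one GPU tile, so gpus_per_rank counts tiles.
    (Hypothetical helper for illustration; not a Balsam function.)
    """
    tiles_per_node = GPUS_PER_NODE * TILES_PER_GPU
    if gpus_per_rank < 1 or gpus_per_rank > tiles_per_node:
        raise ValueError("gpus_per_rank must be between 1 and the tile count")
    return tiles_per_node // gpus_per_rank


# gpus_per_rank = 1: one rank per tile; gpus_per_rank = 2: one rank per GPU
ranks_per_tile_mode = max_ranks_per_node(1)
ranks_per_gpu_mode = max_ranks_per_node(2)
```

Under these assumptions, one rank per tile fills a node with 12 ranks, and one rank per GPU fills it with 6.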
Simple MPI ensemble on Aurora with Balsam
Here is an example that runs the hello_affinity application from our getting started guide in mpi mode, which executes the application with mpiexec. We also show an example of executing an echo command that takes an argument and runs on a single GPU tile.
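A script of this kind might look like the following sketch. The app class names, workdir names, job counts, and node and rank counts are placeholders chosen for illustration, and the site name and executable path are passed in by the caller. The Balsam calls used here (ApplicationDefinition, sync, submit) reflect our understanding of the Balsam Python API and should be checked against the Balsam version installed in your environment. The Balsam import is placed inside the function so that the planning helper can be used without a Balsam installation.

```python
def plan_jobs(n_mpi_jobs=4, n_echo_jobs=4):
    """Describe the jobs to create as plain dicts (no Balsam required).

    All values below are illustrative placeholders.
    """
    mpi_jobs = [
        {
            "workdir": f"demo/hello_affinity{i}",
            "num_nodes": 2,
            "ranks_per_node": 12,
            "gpus_per_rank": 1,  # one rank per GPU tile
        }
        for i in range(n_mpi_jobs)
    ]
    echo_jobs = [
        {
            "workdir": f"demo/echo{i}",
            "say_hello_to": f"world {i}",  # fills the command template argument
            "gpus_per_rank": 1,  # a single GPU tile
        }
        for i in range(n_echo_jobs)
    ]
    return mpi_jobs, echo_jobs


def register_apps_and_jobs(site_name, hello_affinity_path):
    """Register two apps in the site and create Balsam jobs for them.

    Assumes Balsam is installed and the user is logged in.
    """
    from balsam.api import ApplicationDefinition

    class HelloAffinity(ApplicationDefinition):
        site = site_name
        command_template = hello_affinity_path

    class EchoHello(ApplicationDefinition):
        site = site_name
        command_template = "echo Hello {{ say_hello_to }}"

    # Register both apps with the site
    HelloAffinity.sync()
    EchoHello.sync()

    # Create one Balsam job per planned entry
    mpi_jobs, echo_jobs = plan_jobs()
    for spec in mpi_jobs:
        HelloAffinity.submit(**spec)
    for spec in echo_jobs:
        EchoHello.submit(**spec)
```

With Balsam installed, calling register_apps_and_jobs with your site name and the path to your hello_affinity executable would register the apps and create the jobs; nothing runs until a batch job is submitted.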
Warning
Ensembles of tasks launched with mpiexec on multiple nodes are currently limited to 1000 total tasks run per batch job. This means that when mpiexec calls return, the nodes they used can be refilled only a limited number of times, rather than an arbitrary number of times as on Polaris. This is due to a known issue with Slingshot and will be fixed in the future. Users running MPI application ensembles on Aurora with Balsam should take this into account when configuring their workflows.
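One way to account for this limit when planning a campaign is to work out how many batch jobs an ensemble needs and to split the task list accordingly. The helpers below are a hypothetical planning sketch in plain Python, not Balsam functionality; how each chunk is then directed to its own batch job is left to the workflow.

```python
import math

# Limit stated in the warning above: total mpiexec tasks per batch job
MAX_TASKS_PER_BATCH_JOB = 1000


def batch_jobs_needed(n_tasks, limit=MAX_TASKS_PER_BATCH_JOB):
    """Return the minimum number of batch jobs needed to run n_tasks."""
    if n_tasks < 0 or limit < 1:
        raise ValueError("n_tasks must be >= 0 and limit must be >= 1")
    return math.ceil(n_tasks / limit)


def split_into_chunks(tasks, limit=MAX_TASKS_PER_BATCH_JOB):
    """Split a list of tasks into consecutive chunks of at most `limit`."""
    return [tasks[i:i + limit] for i in range(0, len(tasks), limit)]


# Example: an ensemble of 2500 tasks needs three batch jobs
chunks = split_into_chunks(list(range(2500)))
```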
After execution of this script, your site will have two registered apps and several Balsam jobs. Use the Balsam CLI tool to query them:
To check apps registered in a site:
To check the status of jobs in the site:
To submit a batch job to PBS that executes the Balsam jobs in your site, run the following at the command line from within your site directory:
This command sets the number of nodes with the -n option and submits the batch job to the debug-scaling queue in mpi mode (-j option). Mpi-mode batch jobs like this one execute applications with mpiexec. The -t option sets the time limit for the batch job to 10 minutes.
You can also submit jobs with the Python API:
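A minimal sketch of such a submission follows. The helper names are hypothetical, and the Balsam calls (Site.objects.get, BatchJob.objects.create) and field names (num_nodes, wall_time_min, queue, project, job_mode) reflect our understanding of the Balsam Python API; verify them against the Balsam documentation before relying on them. The request is assembled as a plain dict first so that it can be inspected without contacting the Balsam server.

```python
def batch_job_request(num_nodes, wall_time_min, queue, project, job_mode="mpi"):
    """Assemble the settings for a Balsam batch job as a plain dict."""
    if job_mode not in ("mpi", "serial"):
        raise ValueError("job_mode must be 'mpi' or 'serial'")
    if num_nodes < 1 or wall_time_min < 1:
        raise ValueError("num_nodes and wall_time_min must be positive")
    return {
        "num_nodes": num_nodes,
        "wall_time_min": wall_time_min,
        "queue": queue,
        "project": project,
        "job_mode": job_mode,
    }


def submit_batch_job(site_name, request):
    """Submit the request through Balsam (requires Balsam and a login)."""
    from balsam.api import BatchJob, Site

    # Look up the site by name, then create the batch job in that site
    site = Site.objects.get(name=site_name)
    return BatchJob.objects.create(site_id=site.id, **request)


# Placeholder values: replace the project name with your own allocation
request = batch_job_request(
    num_nodes=2, wall_time_min=10, queue="debug-scaling", project="MyProject"
)
```

Calling submit_batch_job with your site name and this request would then ask Balsam to queue the batch job with PBS.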
To check the status of batch jobs that Balsam is tracking:
The standard output (stdout) of each job is written to a file called job.out in the job's workdir, inside the site's data directory, and can be accessed like this:
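As one possible approach, the file can be located and read with plain Python, given the layout described above (data directory, then the job's workdir, then job.out). The helper names and the example paths are hypothetical.

```python
from pathlib import Path


def job_stdout_path(site_dir, workdir):
    """Build the path to a job's stdout file inside a Balsam site directory."""
    return Path(site_dir) / "data" / workdir / "job.out"


def read_job_stdout(site_dir, workdir):
    """Return the contents of job.out for the given job workdir."""
    return job_stdout_path(site_dir, workdir).read_text()


# Example path for a hypothetical site directory and workdir
example_path = job_stdout_path("new-site", "demo/echo0")
```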
Batch jobs created by Balsam will have a name beginning with qlaunch when queried with the PBS command qstat.
Balsam has additional features, including elastic submission of work to PBS, a special app type for native Python code, and a serial job mode for executing single-core or single-GPU tasks that do not require MPI launching. More information can be found in the Balsam documentation.
Troubleshooting
If Balsam is failing to submit batch jobs to PBS, check the settings.yml file in the Balsam site directory and look for the allowed_queues section. The queue you are submitting to must appear in this section of the settings. If it does not, add it and restart the site process with:
If the queue does appear, get more information about the batch jobs Balsam is submitting to PBS with: