# Running Jobs on Polaris

## Queues

There are five production queues you can target in your `qsub` (`-q <queue name>`):
Queue Name | Node Min | Node Max | Time Min | Time Max | Notes |
---|---|---|---|---|---|
debug | 1 | 2 | 5 min | 1 hr | max 24 nodes in use by this queue at any given time; only 8 nodes are exclusive (see Note below) |
debug-scaling | 1 | 10 | 5 min | 1 hr | max 1 job running/accruing/queued per-user |
prod | 10 | 496 | 5 min | 24 hrs | Routing queue; See below |
preemptable | 1 | 10 | 5 min | 72 hrs | Jobs may be killed at any time to make room for jobs in the demand queue (see Note below); max 20 jobs running/accruing/queued per-project |
demand | 1 | 56 | 5 min | 1 hr | By request only; max 100 jobs running/accruing/queued per-project |
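
For example, a minimal `qsub` sketch targeting the debug-scaling queue (the filesystems list and project name are placeholders; adjust them to your allocation):

```bash
qsub -q debug-scaling -l select=4 -l walltime=1:00:00 -l filesystems=home:eagle -A <project> ./job_script.sh
```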
Note: Jobs in the demand queue take priority over jobs in the preemptable queue. This means jobs in the preemptable queue may be preempted (killed without any warning) whenever jobs are submitted to the demand queue. There is always an inherent risk of losing work when using the preemptable queue.
Please use the following command to view details of a queue: `qstat -Qf <queuename>`

To make your job rerunnable, add the following PBS directive: `#PBS -r y`
This will ensure your job will restart once the demand job is complete.
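
For illustration, a minimal sketch of a rerunnable preemptable job script (node count, walltime, filesystems list, application, and project name are all placeholders):

```bash
#!/bin/bash -l
#PBS -q preemptable
#PBS -r y                       # Rerunnable: restart after the demand job completes
#PBS -l select=2
#PBS -l walltime=12:00:00
#PBS -l filesystems=home:eagle  # Placeholder filesystems list
#PBS -A <project>               # Replace with your project

cd $PBS_O_WORKDIR
mpiexec -n 2 --ppn 1 ./my_app   # my_app is a placeholder binary
```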
Note: The debug queue has 8 exclusively dedicated nodes. If there are free nodes in production, then debug jobs can take another 16 nodes for a total of 24.
`prod` is a routing queue and routes your job to one of the following six execution queues:
Queue Name | Node Min | Node Max | Time Min | Time Max | Notes |
---|---|---|---|---|---|
small | 10 | 24 | 5 min | 3 hrs | |
medium | 25 | 99 | 5 min | 6 hrs | |
large | 100 | 496 | 5 min | 24 hrs | |
backfill-small | 10 | 24 | 5 min | 3 hrs | low priority, negative project balance |
backfill-medium | 25 | 99 | 5 min | 6 hrs | low priority, negative project balance |
backfill-large | 100 | 496 | 5 min | 24 hrs | low priority, negative project balance |
- Note 1: You cannot submit to these queues directly; you can only submit to the routing queue `prod`.
- Note 2: All of these queues have a limit of ten (10) jobs running/accruing per-project.
- Note 3: All of these queues have a limit of one hundred (100) jobs queued (not accruing score) per-project.
- Note 4: As of January 2023, it is recommended to submit jobs with a maximum node count of 476-486 nodes given current rates of downed nodes (larger jobs may sit in the queue indefinitely).
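
Jobs are submitted to `prod` and routed by node count and walltime. For example, a hedged sketch that would route to the medium execution queue (filesystems list and project name are placeholders):

```bash
qsub -q prod -l select=64 -l walltime=6:00:00 -l filesystems=home:eagle -A <project> ./job_script.sh
```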
## Running MPI+OpenMP Applications

Once a submitted job is running, calculations can be launched on the compute nodes using `mpiexec` to start an MPI application. Documentation is accessible via `man mpiexec`, and some helpful options follow.

- `-n` total number of MPI ranks
- `-ppn` number of MPI ranks per node
- `--cpu-bind` CPU binding for application
- `--depth` number of CPUs per rank (useful with `--cpu-bind`)
- `--env` set environment variables (`--env OMP_NUM_THREADS=2`)
- `--hostfile` indicate file with hostnames (the default is `--hostfile $PBS_NODEFILE`)
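
For instance, a hedged one-line invocation combining several of these options on a 2-node allocation (the binary name is a placeholder):

```bash
# 2 nodes x 4 ranks/node = 8 total ranks, ranks spaced 16 CPUs apart,
# 16 OpenMP threads per rank
mpiexec -n 8 -ppn 4 --depth=16 --cpu-bind depth --env OMP_NUM_THREADS=16 ./a.out
```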
A sample submission script with directives is below for a 4-node job with 8 MPI ranks on each node and 8 OpenMP threads per rank. Each hardware thread runs a single OpenMP thread since there are 64 hardware threads on the CPU (2 per core).
You can download and compile `hello_affinity` from this link.
```bash
#!/bin/bash -l
#PBS -N AFFINITY
#PBS -l select=4:ncpus=256
#PBS -l walltime=0:10:00
#PBS -q debug-scaling
#PBS -A Catalyst   # Replace with your project

NNODES=`wc -l < $PBS_NODEFILE`
NRANKS=8           # Number of MPI ranks to spawn per node
NDEPTH=8           # Number of hardware threads per rank (i.e. spacing between MPI ranks)
NTHREADS=8         # Number of software threads per rank to launch (i.e. OMP_NUM_THREADS)

NTOTRANKS=$(( NNODES * NRANKS ))

echo "NUM_OF_NODES= ${NNODES} TOTAL_NUM_RANKS= ${NTOTRANKS} RANKS_PER_NODE= ${NRANKS} THREADS_PER_RANK= ${NTHREADS}"

# Change to the working directory (the directory from which the job was submitted)
cd $PBS_O_WORKDIR

mpiexec --np ${NTOTRANKS} -ppn ${NRANKS} -d ${NDEPTH} --cpu-bind depth -env OMP_NUM_THREADS=${NTHREADS} ./hello_affinity
```
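
Assuming the script above is saved as `submit.sh` (the name is arbitrary), it can be submitted with:

```bash
qsub submit.sh
```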
## Running GPU-enabled Applications

GPU-enabled applications will similarly run on the compute nodes using the above example script.

- The environment variable `MPICH_GPU_SUPPORT_ENABLED=1` needs to be set if your application requires MPI-GPU support, whereby the MPI library sends and receives data directly from GPU buffers. In this case, it is important to have the `craype-accel-nvidia80` module loaded both when compiling your application and at runtime to correctly link against a GPU Transport Layer (GTL) MPI library. Otherwise, you'll likely see `GPU_SUPPORT_ENABLED is requested, but GTL library is not linked` errors at runtime (a minimal sketch follows this list).
- If running on a specific GPU or subset of GPUs is desired, the `CUDA_VISIBLE_DEVICES` environment variable can be used. For example, if one only wanted an application to access the first two GPUs on a node, setting `CUDA_VISIBLE_DEVICES=0,1` could be used.
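
A minimal runtime sketch combining these settings (the launch parameters and `./my_gpu_app` binary are placeholders):

```bash
# Needed at both compile time and runtime so the GTL MPI library is linked/found
module load craype-accel-nvidia80
# Allow MPI to send and receive directly from GPU buffers
export MPICH_GPU_SUPPORT_ENABLED=1

mpiexec -n 16 --ppn 4 ./my_gpu_app
```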
### Binding MPI ranks to GPUs

The Cray MPI on Polaris does not currently support binding MPI ranks to GPUs. For applications that need this support, it can instead be handled by a small helper script that appropriately sets `CUDA_VISIBLE_DEVICES` for each MPI rank. One example is available here, where each MPI rank is bound to a single GPU with round-robin assignment.

An example `set_affinity_gpu_polaris.sh` script follows, where GPUs are assigned round-robin to MPI ranks.
```bash
#!/bin/bash -l
num_gpus=4
# Need to assign GPUs in reverse order due to topology
# See Polaris Device Affinity Information:
# https://www.alcf.anl.gov/support/user-guides/polaris/hardware-overview/machine-overview/index.html
gpu=$((${num_gpus} - 1 - ${PMI_LOCAL_RANK} % ${num_gpus}))
export CUDA_VISIBLE_DEVICES=$gpu
echo "RANK= ${PMI_RANK} LOCAL_RANK= ${PMI_LOCAL_RANK} gpu= ${gpu}"
exec "$@"
```
The script is then placed just before the executable in the `mpiexec` command, like so:
```bash
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth ./set_affinity_gpu_polaris.sh ./hello_affinity
```
## Interactive Jobs on Compute Nodes
Here is how to submit an interactive job to, for example, edit/build/test an application on Polaris compute nodes:
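
For example (replace the project name with your own):

```bash
qsub -I -l select=1 -l walltime=1:00:00 -l filesystems=home:eagle -q debug -A <project>
```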
This command requests 1 node for a period of 1 hour in the debug queue, requiring access to the /home and eagle filesystems. After waiting in the queue for a node to become available, a shell prompt on a compute node will appear. You may then start building applications and testing GPU affinity scripts on the compute node.
NOTE: If you want to `ssh` or `scp` to one of your assigned compute nodes, you will need to make sure your `$HOME` directory and your `$HOME/.ssh` directory permissions are both set to `700`.
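
For example:

```bash
chmod 700 $HOME
chmod 700 $HOME/.ssh
```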
## Running Multiple MPI Applications on a Node

Multiple applications can be run simultaneously on a node by launching several `mpiexec` commands and backgrounding them. For performance, it will likely be necessary to ensure that each application runs on a distinct set of CPU resources and/or targets specific GPUs. One can provide a list of CPUs using the `--cpu-bind` option, which, when combined with `CUDA_VISIBLE_DEVICES`, lets the user specify exactly which CPU and GPU resources each application runs on. In the example below, four instances of the application run simultaneously on a single node. The first instance spawns MPI ranks 0-7 on CPUs 24-31 and uses GPU 0. This mapping is based on output from the `nvidia-smi topo -m` command and pairs CPUs with the closest GPU.
```bash
export CUDA_VISIBLE_DEVICES=0
mpiexec -n 8 --ppn 8 --cpu-bind list:24:25:26:27:28:29:30:31 ./hello_affinity &

export CUDA_VISIBLE_DEVICES=1
mpiexec -n 8 --ppn 8 --cpu-bind list:16:17:18:19:20:21:22:23 ./hello_affinity &

export CUDA_VISIBLE_DEVICES=2
mpiexec -n 8 --ppn 8 --cpu-bind list:8:9:10:11:12:13:14:15 ./hello_affinity &

export CUDA_VISIBLE_DEVICES=3
mpiexec -n 8 --ppn 8 --cpu-bind list:0:1:2:3:4:5:6:7 ./hello_affinity &

wait
```
## Compute Node Access to the Internet

Currently, the only access to the internet is via a proxy. Here are the proxy environment variables for Polaris:
```bash
export http_proxy="http://proxy.alcf.anl.gov:3128"
export https_proxy="http://proxy.alcf.anl.gov:3128"
export ftp_proxy="http://proxy.alcf.anl.gov:3128"
```
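
These can be exported in an interactive session or near the top of a batch script before any command that needs outbound access; a quick hedged check might look like the following (the URL is purely illustrative):

```bash
export https_proxy="http://proxy.alcf.anl.gov:3128"
curl -sI https://www.alcf.anl.gov
```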
In the future, we intend to make public IP addresses a schedulable resource, though we don't have a timeline on this because it depends on future features in Slingshot and internal software development. For instance, if only your head node needed public access, your select statement might look something like: `-l select=1:pubnet=True+63`.
## Controlling Where Your Job Runs

If you wish to have your job run on specific nodes, form your select like this: `-l select=1:vnode=<node name1>+1:vnode=<node name2>...`. Obviously, that gets tedious for large jobs.

If you want to control the location of a few nodes, for example 2 out of 64, but the rest don't matter, you can do something like this: `-l select=1:vnode=<node name1>+1:vnode=<node name2>+62:system=foo`.
Every node has a PBS resource called `tier0` with a rack identifier and `tier1` with a dragonfly group identifier. If you want all your nodes grouped in a rack, you can add the group specifier `-l select=8:system=foo,place=scatter:group=tier0`. If you want everything in the same dragonfly group, replace `tier0` with `tier1`. Note that you also have to explicitly specify the place when you use group. If you want a specific rack or dragonfly group instead of any of them, you are back to the select: `-l select=10:tier0=x3001-g0`.
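
Putting this together, a hedged example requesting 10 nodes from a specific rack (walltime, filesystems list, and project name are placeholders):

```bash
qsub -l select=10:tier0=x3001-g0 -l walltime=1:00:00 -l filesystems=home:eagle -q prod -A <project> ./job_script.sh
```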
## Network: Rack and Dragonfly Group Mappings

- Racks contain (7) 6U chassis; each chassis has 2 nodes, for 14 nodes per rack.
- The hostnames are of the form `xRRPPc0sUUb[0|1]n0`, where:
  - RR is the row {30, 31, 32}
  - PP is the position in the row {30 goes 1-16, 31 and 32 go 1-12}
  - c is chassis and is always 0
  - s stands for slot, but in this case is the RU in the rack; values are {1, 7, 13, 19, 25, 31, 37}
  - b is the BMC controller and is 0 or 1 (each node has its own BMC)
  - n is the node, but is always 0 since there is only one node per BMC
  - For example, `x3001c0s7b0n0` is row 30, position 1, chassis 0, slot (RU) 7, BMC 0, node 0.
- So, 16 + 12 + 12 = 40 racks * 14 nodes per rack = 560 nodes.
- Note that in production, group 9 (the last 4 racks) will be the designated on-demand racks.
- The management racks are x3000 and x3100 and are dragonfly group 10.
- The TDS rack is x3200 and is dragonfly group 11.
- Each compute node will have a PBS resource named `tier0`, which will be equal to the values in the table below. This allows you to group your jobs within a rack if you wish. There is also a resource called `tier1`, which will be equal to the column headings. This allows you to group your jobs within a dragonfly group if you wish.
g0 | g1 | g2 | g3 | g4 | g5 | g6 | g7 | g8 | g9 |
---|---|---|---|---|---|---|---|---|---|
x3001-g0 | x3005-g1 | x3009-g2 | x3013-g3 | x3101-g4 | x3105-g5 | x3109-g6 | x3201-g7 | x3205-g8 | x3209-g9 |
x3002-g0 | x3006-g1 | x3010-g2 | x3014-g3 | x3102-g4 | x3106-g5 | x3110-g6 | x3202-g7 | x3206-g8 | x3210-g9 |
x3003-g0 | x3007-g1 | x3011-g2 | x3015-g3 | x3103-g4 | x3107-g5 | x3111-g6 | x3203-g7 | x3207-g8 | x3211-g9 |
x3004-g0 | x3008-g1 | x3012-g2 | x3016-g3 | x3104-g4 | x3108-g5 | x3112-g6 | x3204-g7 | x3208-g8 | x3212-g9 |