Running Jobs on Crux
Queues
There are two production queues you can target in your qsub (-q <queue name>):
Queue Name | Node Min | Node Max | Time Min | Time Max | Notes |
---|---|---|---|---|---|
debug | 1 | 4 | 5 min | 2 hr | max 8 nodes in use by this queue at any given time; Only 8 nodes are exclusive (see Note below) |
workq-route | 1 | 512 | 5 min | 24 hrs | Routing queue; 100 jobs max per project; See below |
Note: The debug queue has 8 exclusively dedicated nodes.
workq-route is a routing queue and routes your job to one of the following execution queues (currently just one):
Queue Name | Node Min | Node Max | Time Min | Time Max | Notes |
---|---|---|---|---|---|
workq | 1 | 512 | 5 min | 24 hrs | 20 jobs queued or running / 10 jobs running per project |
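The limits in the tables above can be checked before submitting. The sketch below uses a hypothetical helper, fits_debug_queue (not part of PBS or any ALCF tooling), to test whether a request fits the debug queue limits listed above: 1 to 4 nodes and 5 to 120 minutes of walltime. Requests that do not fit fall back to workq-route.

```shell
# Hypothetical helper, not part of PBS: succeeds when a request fits the
# debug queue limits from the table above (1-4 nodes, 5 min to 2 hr).
fits_debug_queue() {
    nodes=$1
    minutes=$2
    [ "$nodes" -ge 1 ] && [ "$nodes" -le 4 ] &&
        [ "$minutes" -ge 5 ] && [ "$minutes" -le 120 ]
}

# Pick a queue for a 4-node, 10-minute request.
if fits_debug_queue 4 10; then
    QUEUE=debug
else
    QUEUE=workq-route
fi
echo "Suggested queue: ${QUEUE}"
```

The chosen name can then be passed to qsub with -q, as described above.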
Running MPI+OpenMP Applications
Note: For OpenMP-enabled applications, it is extremely important to set the number of OpenMP threads to an appropriate value. As on most systems, the default value for OMP_NUM_THREADS is set to the maximum possible, which is 256 on the Crux compute nodes.
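As a minimal sketch of choosing an appropriate value, the snippet below derives OMP_NUM_THREADS from the number of MPI ranks per node so that each thread gets its own core. The figure of 128 cores per node is an assumption inferred from the 256 hardware threads quoted above and the 2 hardware threads per core stated elsewhere in this section; the variable names are illustrative.

```shell
CORES_PER_NODE=128      # Assumption: 256 hardware threads / 2 per core
NRANKS_PER_NODE=64      # MPI ranks you plan to place on each node

# One OpenMP thread per core: divide the cores evenly among the ranks.
OMP_NUM_THREADS=$(( CORES_PER_NODE / NRANKS_PER_NODE ))
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

With fewer ranks per node the same arithmetic yields more threads per rank, for example 16 threads for 8 ranks.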
Once a submitted job is running, calculations can be launched on the compute nodes using mpiexec to start an MPI application. Documentation is accessible via man mpiexec, and some helpful options follow.
-n: total number of MPI ranks
-ppn: number of MPI ranks per node
--cpu-bind: CPU binding for application
--depth: number of CPUs per rank (useful with --cpu-bind)
--env: set environment variables (--env OMP_NUM_THREADS=2)
--hostfile: indicate file with hostnames (the default is --hostfile $PBS_NODEFILE)
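To show how these options combine, the sketch below assembles an mpiexec command line for a hypothetical 2-node run and prints it instead of executing it, so it can be tried on any machine. The node count, rank and thread counts, and the executable name ./my_app are illustrative assumptions.

```shell
NNODES=2                 # Assumed node count for illustration
NRANKS_PER_NODE=32       # MPI ranks per node (--ppn)
NDEPTH=4                 # CPUs reserved per rank (--depth)
NTHREADS=4               # OpenMP threads per rank
NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))

CMD="mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth --env OMP_NUM_THREADS=${NTHREADS} ./my_app"

# Print the command as a dry run; inside a PBS job you would run it directly.
echo "${CMD}"
```

Here the depth equals the thread count, so each rank reserves one CPU per OpenMP thread.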
A sample submission script with directives is below for a 4-node job with 64 MPI ranks on each node and 2 OpenMP threads per rank. Each OpenMP thread is bound to its own core, so the 128 threads on each node occupy all 128 cores (each core provides 2 hardware threads, for 256 per node).
You can download and compile hello_affinity from this link.
#!/bin/bash -l
#PBS -N AFFINITY
#PBS -l select=4:system=crux
#PBS -l place=scatter
#PBS -l walltime=0:10:00
#PBS -q debug
#PBS -A Catalyst # Replace with your project
#PBS -l filesystems=home:eagle
# MPI+OpenMP example w/ 64 MPI ranks per node and threads spread evenly across cores
# There are two 64-core CPUs on each node. This will run 32 MPI ranks per CPU, 2 OpenMP threads per rank, and each thread bound to a single core.
NNODES=$(wc -l < "${PBS_NODEFILE}")
NRANKS_PER_NODE=64 # Number of MPI ranks to spawn per node
NDEPTH=2 # Number of hardware threads per rank (i.e. spacing between MPI ranks)
NTHREADS=2 # Number of software threads per rank to launch (i.e. OMP_NUM_THREADS)
NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))
echo "NUM_OF_NODES= ${NNODES} TOTAL_NUM_RANKS= ${NTOTRANKS} RANKS_PER_NODE= ${NRANKS_PER_NODE} THREADS_PER_RANK= ${NTHREADS}"
# Change to the working directory, i.e. the directory from which the job was submitted.
cd "${PBS_O_WORKDIR}"
MPI_ARGS="-n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth "
OMP_ARGS="--env OMP_NUM_THREADS=${NTHREADS} --env OMP_PROC_BIND=true --env OMP_PLACES=cores "
mpiexec ${MPI_ARGS} ${OMP_ARGS} ./hello_affinity
The hello_affinity program is a compiled C++ code, which is built via make clean ; make in the linked directory after cloning the Getting Started repository.
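To make the binding in the script above concrete, the sketch below prints which CPUs a few ranks on a node would receive, assuming that --cpu-bind depth gives local rank r the NDEPTH consecutive CPUs starting at r * NDEPTH. That layout is an assumption for illustration, and rank_cpus is a hypothetical helper; the output of hello_affinity on the system is the authoritative check.

```shell
NDEPTH=2   # CPUs per rank, as in the script above

# Assumed layout: local rank r is bound to CPUs r*NDEPTH .. r*NDEPTH+NDEPTH-1.
rank_cpus() {
    first=$(( $1 * NDEPTH ))
    last=$(( first + NDEPTH - 1 ))
    echo "${first}-${last}"
}

# Show the first few local ranks and the last of the 64 ranks on a node.
for rank in 0 1 2 63; do
    echo "local rank ${rank} -> CPUs $(rank_cpus "${rank}")"
done
```

Under this assumption the 64 ranks cover CPUs 0 through 127, one OpenMP thread per CPU.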
Running Multiple MPI Applications on a Single Node
Multiple applications can be run simultaneously on a node by launching several mpiexec
commands and backgrounding them. For performance, it will likely be necessary to ensure that each application runs on a distinct set of CPU resources. One can provide a list of CPUs using the --cpu-bind
option to explicitly assign CPU resources on a node to each application. Output from the numactl --hardware
command is useful for understanding how to localize applications within NUMA domains on the two CPUs of each node.
In the example below, eight instances of the application are simultaneously running on a single node, with each application localized to a single NUMA domain. Each application here is bound to 16 CPU cores with a single process running on each core (i.e. no hyperthreads). In the first instance, the application is spawning 16 MPI ranks on cores 0-15 in the first CPU.
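# NTOTRANKS, NRANKS_PER_NODE, and NTHREADS must be set beforehand.
# For the single-node case described above, NTOTRANKS and NRANKS_PER_NODE are both 16.
MPI_ARG="-n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE}"
OMP_ARG="--env OMP_NUM_THREADS=${NTHREADS} "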
# Socket 0
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:0:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:48:49:50:51:52:53:54:55:56:57:58:59:60:61:62:63 ./hello_affinity &
# Socket 1
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:64:65:66:67:68:69:70:71:72:73:74:75:76:77:78:79 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:80:81:82:83:84:85:86:87:88:89:90:91:92:93:94:95 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:96:97:98:99:100:101:102:103:104:105:106:107:108:109:110:111 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:112:113:114:115:116:117:118:119:120:121:122:123:124:125:126:127 ./hello_affinity &
wait
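The CPU lists above follow a regular pattern, so they can be generated rather than typed. The sketch below defines a hypothetical helper, cpu_list (not an mpiexec feature), and prints the eight mpiexec commands as a dry run. It assumes 16 ranks per instance, matching the description above.

```shell
# Hypothetical helper: build "list:a:b:..." for COUNT consecutive CPUs from START.
cpu_list() {
    start=$1
    count=$2
    cpu=$start
    end=$(( start + count ))
    list=""
    while [ "$cpu" -lt "$end" ]; do
        list="${list:+${list}:}${cpu}"
        cpu=$(( cpu + 1 ))
    done
    echo "list:${list}"
}

CORES_PER_APP=16   # One NUMA domain per application instance
NUM_APPS=8         # 8 x 16 = 128 cores on the node

app=0
while [ "$app" -lt "$NUM_APPS" ]; do
    bind=$(cpu_list $(( app * CORES_PER_APP )) "$CORES_PER_APP")
    # Dry run: print each command instead of launching it in the background.
    echo "mpiexec -n 16 --ppn 16 --cpu-bind ${bind} ./hello_affinity &"
    app=$(( app + 1 ))
done
```

In a real job script the echo would be replaced by the backgrounded mpiexec call, followed by a single wait after the loop.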
Running Multiple MPI Applications on Multiple Nodes
An important detail missing from the prior example was specifying the hostfile. When not specified, the default hostfile ${PBS_NODEFILE} is used for all invocations of mpiexec, meaning all applications will include identical sets of nodes. This is fine for single-node jobs, but appropriate hostfiles need to be created and passed to mpiexec when running applications across subsets of nodes in a large job.
The following example first splits the hostfile ${PBS_NODEFILE} into separate hostfiles each containing the requested number of nodes (in this case just 1 per file). The separate hostfiles are then used in each batch of mpiexec calls to launch applications on different compute nodes.
# MPI example w/ multiple runs per batch job
NNODES=$(wc -l < "${PBS_NODEFILE}")
# Settings for each run: 8 runs each with 16 MPI ranks per node spread evenly across specified subset of cores
NUM_NODES_PER_MPI=1
NRANKS_PER_NODE=16
NTHREADS=1
NTOTRANKS=$(( NUM_NODES_PER_MPI * NRANKS_PER_NODE ))
echo "NUM_OF_NODES= ${NNODES} NUM_NODES_PER_MPI= ${NUM_NODES_PER_MPI} TOTAL_NUM_RANKS= ${NTOTRANKS} RANKS_PER_NODE= ${NRANKS_PER_NODE} THREADS_PER_RANK= ${NTHREADS}"
# Increase the value of suffix-length if more than 99 hostfiles are needed
split --lines=${NUM_NODES_PER_MPI} --numeric-suffixes=1 --suffix-length=2 $PBS_NODEFILE local_hostfile.
for lh in local_hostfile*
do
echo "Launching mpiexec w/ ${lh}"
MPI_ARG="-n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --hostfile ${lh} "
OMP_ARG="--env OMP_NUM_THREADS=${NTHREADS} "
# Socket 0
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:0:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:48:49:50:51:52:53:54:55:56:57:58:59:60:61:62:63 ./hello_affinity &
# Socket 1
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:64:65:66:67:68:69:70:71:72:73:74:75:76:77:78:79 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:80:81:82:83:84:85:86:87:88:89:90:91:92:93:94:95 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:96:97:98:99:100:101:102:103:104:105:106:107:108:109:110:111 ./hello_affinity &
mpiexec ${MPI_ARG} ${OMP_ARG} --cpu-bind list:112:113:114:115:116:117:118:119:120:121:122:123:124:125:126:127 ./hello_affinity &
sleep 1s
done
wait
rm -f local_hostfile.*
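To see what the split step produces without a running job, the sketch below applies the same split options to a stand-in node file. The four hostnames and the temporary directory are made up for illustration, and 2 nodes per hostfile is used instead of 1 so the grouping is visible. Like the example above, it relies on GNU split.

```shell
# Stand-in for $PBS_NODEFILE with four made-up hostnames.
workdir=$(mktemp -d)
nodefile="${workdir}/nodefile"
printf '%s\n' node-a node-b node-c node-d > "${nodefile}"

NUM_NODES_PER_MPI=2
split --lines=${NUM_NODES_PER_MPI} --numeric-suffixes=1 --suffix-length=2 \
    "${nodefile}" "${workdir}/local_hostfile."

# Show which nodes ended up in each hostfile.
NUM_HOSTFILES=0
for lh in "${workdir}"/local_hostfile.*; do
    echo "$(basename "${lh}"): $(paste -s -d ' ' "${lh}")"
    NUM_HOSTFILES=$(( NUM_HOSTFILES + 1 ))
done
SECOND_BATCH=$(paste -s -d ' ' "${workdir}/local_hostfile.02")

# Remove the scratch files.
rm -rf "${workdir}"
```

Each resulting hostfile can be passed to mpiexec with --hostfile so that a batch of runs stays on its own subset of nodes.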
Ensemble examples for several cases are provided to help users with crafting job submission scripts.
Compute Node Access to the Internet
Currently, the only access to the internet is via a proxy. Here are the proxy environment variables for Crux: