Parsl on Polaris
Parsl is a flexible and scalable parallel programming library for Python.
-- Parsl Documentation
For many applications, managing an ensemble of jobs as a workflow is a critical step that can easily become a performance bottleneck. Many tools exist to address this, of which parsl is just one. On this page, we highlight some of the key pieces of information about parsl that are relevant to Polaris. Parsl is also extensively documented, has a dedicated Slack channel, and has a large community of users and developers beyond ALCF. We encourage you to engage with the parsl community for parsl-specific questions; for Polaris-specific questions or problems, please contact support@alcf.anl.gov.
Getting Parsl on Polaris
Polaris is newer than parsl, and some changes to the parsl source code were required for it to run correctly on Polaris. For that reason, parsl version 1.3.0-dev or higher is required on Polaris.
You can install parsl by building off of the conda modules. You have some flexibility in how you extend the conda module to include parsl, but here is one way to do it:
# Load the conda module (needed every time you use parsl)
module load conda
conda activate
# Create a virtual env that uses the conda env for its system packages.
# Only run the next line on initial setup:
python -m venv --system-site-packages /path/to/your/virtualenv
# Load the virtual env (every time):
source /path/to/your/virtualenv/bin/activate
# Install parsl (only once)
pip install parsl==1.3.0.dev0
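As a quick sanity check (parsl exposes its version as parsl.__version__), you can confirm the installation from the activated environment:
# Verify that the expected parsl version is importable:
python -c "import parsl; print(parsl.__version__)"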
Using Parsl on Polaris
Parsl ships with tested configurations for a variety of systems already, though as a new system, Polaris has not been included as of Fall 2022. As a starting point, we provide the configuration below:
from parsl.config import Config
# PBSPro is the right provider for Polaris:
from parsl.providers import PBSProProvider
# The high throughput executor is for scaling to HPC systems:
from parsl.executors import HighThroughputExecutor
# You can use the MpiExecLauncher, but may want the GnuParallelLauncher; see the note below
from parsl.launchers import MpiExecLauncher, GnuParallelLauncher
# address_by_hostname is best on Polaris:
from parsl.addresses import address_by_hostname
# For checkpointing:
from parsl.utils import get_all_checkpoints
# Adjust your user-specific options here:
user_opts = {
    "worker_init": "module load conda; conda activate; module load cray-hdf5; source /path/to/your/virtualenv/bin/activate",
    "scheduler_options": "",
    "account": "YOURACCOUNT",
    "queue": "debug-scaling",
    "walltime": "1:00:00",
    "run_dir": "/lus/grand/projects/yourproject/yourrundir/",
    "nodes_per_block": 3,  # Think of a block as one job on Polaris; to run on the main queues, set this >= 10
    "cpus_per_node": 32,  # Up to 64 with multithreading
    "strategy": "simple",
}
checkpoints = get_all_checkpoints(user_opts["run_dir"])
print("Found the following checkpoints: ", checkpoints)
config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex",
            heartbeat_period=15,
            heartbeat_threshold=120,
            worker_debug=True,
            max_workers=user_opts["cpus_per_node"],
            cores_per_worker=1,  # How many cores per worker dictates the total workers per node
            address=address_by_hostname(),
            cpu_affinity="alternating",
            prefetch_capacity=0,
            provider=PBSProProvider(
                # Which launcher to use? Check out the note below for some details. Try MPI first!
                launcher=MpiExecLauncher(),
                # launcher=GnuParallelLauncher(),
                account=user_opts["account"],
                queue=user_opts["queue"],
                select_options="ngpus=4",
                # PBS directives (header lines): for array jobs pass '-J' option
                scheduler_options=user_opts["scheduler_options"],
                # Command to be run before starting a worker, such as loading modules and activating the env:
                worker_init=user_opts["worker_init"],
                # Number of compute nodes allocated for each block
                nodes_per_block=user_opts["nodes_per_block"],
                init_blocks=1,
                min_blocks=0,
                max_blocks=1,  # Can increase to have more parallel batch jobs
                cpus_per_node=user_opts["cpus_per_node"],
                walltime=user_opts["walltime"],
            ),
        ),
    ],
    checkpoint_files=checkpoints,
    run_dir=user_opts["run_dir"],
    checkpoint_mode='task_exit',
    strategy=user_opts["strategy"],
    retries=2,
    app_cache=True,
)
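Once the configuration is defined, a short script can exercise it end to end. The sketch below is illustrative and continues from the script above; the app name hello and its body are our own, not part of the configuration. parsl.load() registers the config, and a function decorated with python_app returns a future whose result() blocks until the task completes on a worker.
import parsl
from parsl import python_app

# Register the config defined above; do this once per run:
parsl.load(config)

# A hypothetical app for demonstration purposes:
@python_app
def hello(name):
    return f"Hello from {name}"

# Apps return futures immediately; result() blocks until the task finishes
# on a worker launched through the PBSProProvider block above.
future = hello("Polaris")
print(future.result())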
Special notes for Polaris
On Polaris, we are currently investigating a hang that occurs in some parallel applications. As far as we can tell at this time, the hang occurs under these circumstances:
- The application is launched on multiple nodes using mpi.
- The application uses fork on each node to spawn processes.
- (Likely, but unconfirmed) the application has some aspect of locking involved, perhaps not visible to the user.
Under these circumstances, we have observed that rank 0 will proceed as far as possible, but ranks 1 to N will hang. For parsl, which uses Python's multiprocessing module (which calls fork), hangs can occur for some workloads on remote ranks. In this case, you can install GNU Parallel and use it as the launcher in parsl instead; you can install it into the virtual env just as parsl was installed above (see the sketch below). This will circumvent the hang issue if your parsl workloads require fork calls.
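As one possible way to do this (the download URL and the --prefix choice are our assumptions; any install that puts the parallel binary on the workers' PATH works), you can build GNU Parallel from source with the virtual env as the install prefix, then switch to the commented-out launcher=GnuParallelLauncher() line in the configuration above:
# A sketch of installing GNU Parallel into the active virtual env
# (assumes $VIRTUAL_ENV is set by the 'source .../bin/activate' step above):
wget https://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
tar -xjf parallel-latest.tar.bz2
cd parallel-*/
./configure --prefix=$VIRTUAL_ENV
make install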