This is a collection of known issues that have been encountered during Polaris's early user phase. Documentation will be updated as issues are resolved.
The nsys profiler packaged with nvhpc/21.9 in some cases appears to present broken timelines in which start times are not lined up. The issue does not appear to be present when cudatoolkit-standalone/11.2.2 is used. We expect this to no longer be an issue once nvhpc/22.5 is made available as the default version.
With PrgEnv-nvhpc/8.3.3, if you are using nvcc to indirectly invoke nvc++ and compiling C++17 code (as, for example, in building Kokkos via nvcc_wrapper), you will get compilation errors with C++17 constructs. See our documentation on NVIDIA Compilers for a workaround.
PrgEnv-nvhpc/8.3.3 currently loads the nvhpc/21.9 module, which erroneously has the following lines:

    setenv("CC","/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvc")
    setenv("CXX","/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvc++")
    setenv("FC","/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvfortran")
    setenv("F90","/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvfortran")
    setenv("F77","/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvfortran")
    setenv("CC","cpp")

In particular, the final line can cause issues for C-based projects (e.g. CMake may complain because the cpp C preprocessor is not a compiler). We recommend running the following in such cases:
Cray MPICH may exhibit issues when MPI ranks call fork() and are distributed across multiple nodes. The process may hang or throw a segmentation fault.
In particular, this can manifest as hangs with PyTorch+Horovod when a DataLoader with multithreaded workers is combined with distributed data parallel training on multiple nodes. We have built a module, conda/2022-09-08-hvd-nccl, which includes a Horovod built without support for MPI. It uses NCCL for GPU-GPU communication and Gloo for coordination across nodes.
export IBV_FORK_SAFE=1 may be a workaround for some manifestations of this bug; however, it will incur memory registration overheads. It does not fix the hanging experienced with multithreaded dataloading in PyTorch+Horovod across multiple nodes with conda/2022-09-08; in that case it prompts a segfault instead.
This incompatibility may also affect Parsl; see details in the Special notes for Polaris section of the Parsl page.
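
As a minimal sketch of how the IBV_FORK_SAFE workaround might be applied inside a job script, the snippet below sets the variable only when the job spans more than one node. The helper name enable_fork_safe_if_multinode is hypothetical and not part of any Polaris module, and the sketch assumes the standard PBS variable PBS_NODEFILE points at a file listing the allocated hosts. As noted above, this is not expected to fix the PyTorch+Horovod DataLoader hang.

```shell
# Hypothetical helper (an illustration, not a Polaris-provided tool):
# export IBV_FORK_SAFE=1 only when the node file names more than one host.
enable_fork_safe_if_multinode() {
    nodefile="$1"
    # Count distinct hostnames; a host may be listed once per slot.
    num_nodes=$(sort -u "$nodefile" | wc -l | tr -d '[:space:]')
    if [ "$num_nodes" -gt 1 ]; then
        export IBV_FORK_SAFE=1   # workaround; incurs memory registration overheads
    fi
}

# Inside a PBS job, PBS_NODEFILE is set by the scheduler; outside one, do nothing.
if [ -n "${PBS_NODEFILE:-}" ]; then
    enable_fork_safe_if_multinode "$PBS_NODEFILE"
fi
```

Restricting the setting to multi-node jobs keeps single-node runs free of the extra overhead, since the issue is described only for ranks distributed across multiple nodes.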
For batch job submissions, if the parameters within your submission script do not meet the parameters of any of the execution queues (for example, small or backfill-large), you might not receive the "Job submission" error on the command line at all, and the job will never appear in the history shown by qstat -xu <username> (current bug in PBS). For example, if a user submits a script to the prod routing queue requesting 10 nodes for 24 hours, exceeding the "Time Max" of 6 hrs of the small execution queue (which handles jobs with 10-24 nodes), then it may behave as if the job was never submitted.
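
Because this failure is silent, a check before submission can help. The sketch below compares a request against the limits of a single queue. The function name fits_queue and its argument order are hypothetical, and the only limits taken from the text are those of the small queue (10-24 nodes, "Time Max" of 6 hrs); limits for any other queue are not given here and would need to be looked up.

```shell
# Hypothetical helper: exit status 0 only if a request fits one queue's limits.
# Arguments: requested nodes, requested hours, queue min nodes,
# queue max nodes, queue max hours. All values are whole numbers.
fits_queue() {
    req_nodes="$1"; req_hours="$2"
    min_nodes="$3"; max_nodes="$4"; max_hours="$5"
    [ "$req_nodes" -ge "$min_nodes" ] &&
    [ "$req_nodes" -le "$max_nodes" ] &&
    [ "$req_hours" -le "$max_hours" ]
}

# The example from the text: 10 nodes for 24 hours against the small queue.
if fits_queue 10 24 10 24 6; then
    echo "request fits the small queue"
else
    echo "request exceeds the small queue limits; the job may silently disappear"
fi
```

After submitting, confirming that the job is listed by qstat -xu <username> is a complementary check.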
Job scripts are copied to temporary locations after qsub, and any changes to the original script while the job is queued will not be reflected in the copied script. Furthermore, qalter requires -A <allocation name> when changing job properties. Currently, there is a request for a qalter-like command to trigger a re-copy of the original script to the temporary location.
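
Since edits made after qsub do not reach the queued copy, it can be useful to know whether a script has drifted from what was submitted. The following is a sketch under assumptions, not an ALCF-provided tool: the helper names record_submitted and changed_since_submission and the .submitted.cksum sidecar file are hypothetical, and the POSIX cksum utility is used only to fingerprint the file.

```shell
# Hypothetical helper: store a checksum of the script as it looked at submission.
record_submitted() {
    script="$1"
    cksum < "$script" > "$script.submitted.cksum"
}

# Hypothetical helper: exit status 0 if the script differs from the recorded copy.
changed_since_submission() {
    script="$1"
    current=$(cksum < "$script")
    [ "$current" != "$(cat "$script.submitted.cksum")" ]
}

# Intended use (job.sh is a placeholder name):
#   record_submitted job.sh && qsub job.sh
#   ...later, while the job is still queued...
#   changed_since_submission job.sh && echo "job.sh was edited; the queued copy is stale"
```

If the check reports a change, the queued job still runs the old copy, so the edited script would have to be submitted again to take effect.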