Known Issues
This is a collection of known issues that have been encountered during Polaris's early user phase. Documentation will be updated as issues are resolved.
- The `nsys` profiler packaged with `nvhpc/21.9` in some cases appears to produce broken timelines with misaligned start times. The issue does not appear to be present when `nsys` from `cudatoolkit-standalone/11.2.2` is used. We expect this to no longer be an issue once `nvhpc/22.5` is made available as the default version.
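
  A minimal sketch of this workaround, assuming a hypothetical application `./my_app`; the `nsys profile` options are illustrative only:

  ```bash
  # Pick up nsys from the standalone CUDA toolkit instead of the copy bundled with nvhpc/21.9
  module load cudatoolkit-standalone/11.2.2
  which nsys   # confirm it resolves to the cudatoolkit-standalone install

  # Profile a (hypothetical) application; the report is written to my_report.* in the current directory
  nsys profile -o my_report ./my_app
  ```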
- With `PrgEnv-nvhpc/8.3.3`, if you are using `nvcc` to indirectly invoke `nvc++` and compiling C++17 code (as, for example, in building Kokkos via `nvcc_wrapper`), you will get compilation errors with C++17 constructs. See our documentation on NVIDIA Compilers for a workaround.
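
  For illustration only, a sketch of the kind of invocation that triggers the errors, assuming a hypothetical `saxpy.cu` that uses C++17 constructs (a Kokkos build reaches the same pattern through `nvcc_wrapper`):

  ```bash
  # With PrgEnv-nvhpc/8.3.3 loaded, nvcc forwarding host compilation to nvc++
  # fails on C++17 constructs in the host code
  nvcc -ccbin nvc++ -std=c++17 -c saxpy.cu -o saxpy.o
  ```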
- `PrgEnv-nvhpc/8.3.3` currently loads the `nvhpc/21.9` module, which erroneously has the following lines:

  ```
  setenv("CC","/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvc")
  setenv("CXX","/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvc++")
  setenv("FC","/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvfortran")
  setenv("F90","/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvfortran")
  setenv("F77","/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvfortran")
  setenv("CC","cpp")
  ```

  In particular, the final line can cause issues for C-based projects (e.g. CMake may complain because the `cpp` C preprocessor is not a compiler). We recommend overriding `CC` in such cases; a sketch is given below.
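
  A minimal sketch of such an override, assuming the Cray compiler wrapper `cc` is the intended C compiler (adjust to whichever compiler your project actually expects):

  ```bash
  # Replace the erroneous CC=cpp exported by the nvhpc/21.9 module
  export CC=cc
  ```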
- Cray MPICH may exhibit issues when MPI ranks call `fork()` and are distributed across multiple nodes. The process may hang or throw a segmentation fault. In particular, this can manifest as hangs in PyTorch+Horovod when a `DataLoader` with multithreaded workers is combined with distributed data parallel training on multiple nodes. We have built a module `conda/2022-09-08-hvd-nccl` which includes a Horovod built without MPI support; it uses NCCL for GPU-GPU communication and Gloo for coordination across nodes. `export IBV_FORK_SAFE=1` may be a workaround for some manifestations of this bug, but it incurs memory registration overheads; it does not fix the hang seen with multithreaded dataloading in PyTorch+Horovod across multiple nodes with `conda/2022-09-08` (instead it prompts a segfault). This incompatibility may also affect Parsl; see the Special notes for Polaris section of the Parsl page for details.
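
  A minimal sketch of the two mitigations mentioned above; the module-load and activation steps follow the usual conda workflow on Polaris and may need adjustment for your environment:

  ```bash
  # Option 1: use the Horovod build without MPI support (NCCL for GPU-GPU, Gloo across nodes)
  module load conda/2022-09-08-hvd-nccl
  conda activate

  # Option 2: tolerate fork() alongside Cray MPICH, at the cost of memory registration overheads.
  # Note: this does not fix the multithreaded DataLoader hang described above.
  export IBV_FORK_SAFE=1
  ```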
- For batch job submissions, if the parameters within your submission script do not meet the parameters of any of the execution queues (`small`, ..., `backfill-large`), you might not receive a job submission error on the command line at all, and the job will never appear in the job history shown by `qstat -xu <username>` (a current bug in PBS). For example, if a user submits a script to the `prod` routing queue requesting 10 nodes for 24 hours, exceeding the 6-hour "Time Max" of the `small` execution queue (which handles jobs with 10-24 nodes), then it may behave as if the job was never submitted.
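
  As a hedged example, a request that stays within the `small` execution queue limits quoted above (10-24 nodes, 6-hour maximum); the script name and exact resource flags are placeholders and may differ from what your job needs:

  ```bash
  # 10 nodes for 6 hours fits the small queue, so the prod routing queue can place the job
  qsub -q prod -A <allocation name> -l select=10:system=polaris -l walltime=06:00:00 job_script.sh

  # Verify the job appears; with the PBS bug above, an out-of-bounds request may never show up here
  qstat -xu <username>
  ```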
- Job scripts are copied to temporary locations after `qsub`, and any changes to the original script while the job is queued will not be reflected in the copied script. Furthermore, `qalter` requires `-A <allocation name>` when changing job properties. Currently, there is a request for a `qalter`-like command to trigger a re-copy of the original script to the temporary location.
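
  A hedged sketch of altering a queued job; `<jobid>` and the new walltime are placeholders, and note that the allocation must be restated with `-A`:

  ```bash
  # qalter currently requires the allocation name even when only changing job properties
  qalter -A <allocation name> -l walltime=02:00:00 <jobid>
  ```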