This is a collection of known issues that have been encountered during Aurora's early user phase. Documentation will be updated as issues are resolved. Users are encouraged to email email@example.com to report issues.
A known issues page can be found in the JLSE Wiki space used for NDA content. Note that this page requires a JLSE Aurora early hw/sw resource account for access.
A `Cassini Event Queue overflow detected.` error may occur for certain MPI communications, for a variety of reasons spanning software and hardware: job placement, job routing, and the state of the machine. Simply speaking, it means one of the network interfaces is receiving messages faster than it can process them.
As a workaround, the following environment variables can be set to try to alleviate the problem. The value of `FI_CXI_DEFAULT_CQ_SIZE` can be set to something larger if issues persist; it is directly impacted by the number of unexpected messages sent, and so may need to be increased as the scale of the job increases.
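For example, in a job script (the values are illustrative starting points rather than tuned recommendations; `FI_CXI_OVFLOW_BUF_SIZE` and `FI_CXI_CQ_FILL_PERCENT` are related libfabric CXI provider settings added here as assumptions and may not be required):

```bash
# Enlarge the CXI provider's default completion queue so the NIC can buffer
# more in-flight events before overflowing.
export FI_CXI_DEFAULT_CQ_SIZE=131072

# Assumed companions (libfabric CXI provider settings): larger overflow
# buffers and an earlier fill threshold can further reduce overflow at scale.
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20
```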
Jobs may fail to start at times (particularly at higher node counts). If no error message is apparent, one thing to check is the comment field in the full job information for the job, using the command `qstat -xfw [JOBID] | grep comment`. Some example scenarios follow; a usage sketch appears after the list.
- The user has placed the job on hold; `qrls` the job when ready for it to be queued again.
- The user has submitted to a queue that is not currently running; the user should `qmove` the job to an appropriate queue.
- The job tried and failed to start. In this scenario, the user should find that their job was placed on hold. This does not indicate a problem with the user's job script; it indicates that PBS made several attempts to find a set of nodes to run the job and was not able to. Users can `qdel` the job and resubmit, or `qrls` the job to try running it again.
- An insufficient number of nodes are online and free for the job to start.
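As referenced above, a quick check might look like the following (the job id and the returned comment are illustrative placeholders, not real output):

```bash
# Query the full record for the job and extract the scheduler's comment
# field. 123456 is a placeholder job id; substitute your own.
qstat -xfw 123456 | grep comment
#     comment = Not Running: Insufficient amount of resource: ncpus
```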
In the event of a node going down during a job, users may encounter messages such as `ping failed on x4616c0s4b0n0: Application 047a3c9f-fb41-4595-a2ad-4a4d0ec1b6c1 not found`. The node will likely have started a reboot and won't be included in jobs again until checks pass.
To increase the chances that a large job does not terminate due to a node failure, you may choose to interactively route your MPI job around nodes that fail during your run. See this page on Working Around Node Failures for more information.
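A minimal sketch of the idea, assuming a PALS-style `mpiexec` that accepts `--hostfile` and `--ppn` (the node name comes from the example error above; the application name and rank counts are placeholders to adapt):

```bash
# Filter the failed node out of the PBS-provided node list, then relaunch the
# application on the remaining nodes. Adjust RPN for your application.
grep -v x4616c0s4b0n0 "$PBS_NODEFILE" > good_hosts
NNODES=$(wc -l < good_hosts)
RPN=12   # ranks per node (placeholder)
mpiexec -n $((NNODES * RPN)) --ppn $RPN --hostfile good_hosts ./my_app
```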
- Interim Filesystem: The early access filesystem is not highly performant. Intermittent hangs or pauses should be expected; IO should eventually complete without failure, so waiting for it is recommended. Jobs requiring significant filesystem performance should be avoided at this time.
- A large number of Machine Check Events from the PVC can cause nodes to panic and reboot.
- HBM mode is not automatically validated. Jobs requiring flat memory mode should verify it by checking that `numactl -H` reports 4 NUMA memory nodes instead of 16 on the nodes (see the sketch after this list).
- Application failures at large node counts are being tracked in the CNDA Slack workspace. See this canvas table for more information and to document your case. ESP and ECP project members with access to Aurora should have access to the CNDA Slack workspace. Contact firstname.lastname@example.org if you have access to Aurora and belong to an ESP or ECP project, but are not in the CNDA Slack workspace.
- Application failures at single-node scale are tracked on the JLSE wiki/confluence page.
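As noted in the HBM item above, a minimal check for flat memory mode on a compute node might look like this (the expected node counts come from that item; the exact `numactl` output format may vary):

```bash
# numactl -H begins with a summary line such as "available: 4 nodes (0-3)".
# Seeing 16 memory nodes instead of 4 means the node is not in flat mode.
numactl -H | head -n 1
```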