Skip to content

Early User Notes and Known Issues

Last Updated: 2025-01-09 13:00 CST

Early User Notes

Please check back here often for updates (indicated by the "Last Updated" timestamp above). As early production users encounter new issues and find solutions or workarounds, we will upate these notes. Aurora matures and becomes more stable over the first year of production, this section should become obsolete.

Outages and downtime – expectations

Expect unplanned outages and downtime. Stability of the Aurora system has improved significantly over the months leading up to production availability, but stability issues remain. Early users need to be proactive about verifying correctness, watching for hangs, and otherwise adopting work methods that are mindful of and resilient to instability.

Scheduling

The current queue policy for Aurora is setup based on experiences to-date to help maximize productive use of the machine by projects.

  • Initial goal for teams is to start testing at small scales, ensure correct results (and performance), and ramp up to generating scientific results in production campaigns.

  • Focus initially on making good use of the system with <=2048 nodes per job; key is to validate code+runs and start generating science results. Initially, the prod queue will only allow 2048 nodes max. Except for ESP projects, project teams will be required to email support to request running on more than 2048 nodes (with evidence that they are likely to succeed at larger scales).

Storage

Flare (Lustre file system)

This is the primary and most stable storage filesystem for now. It is still possible that heavy use may trigger significant lags and performance degradations, and possibly lead to compute nodes crashing. We will continue to monitor filesystem stability as production use ramps up. We encourage teams to start out easy on I/O (both amount and job size), if possible, and report issues.

DAOS (object store)

Performance of DAOS has been impressive, but we continue to experience crashes with large jobs, including loss of data. Projects may use it, but should not consider it stable or safe for long-term storage.

Grand/Eagle

These won’t be mounted on Aurora initially, but they might be mounted around May 2025, depending on feasibility. Similarly, Flare will not initially be mounted on Polaris. DTNs and Globus are the best means to transfer data between Polaris and Aurora.

Checkpointing

Checkpointing is absolutely essential. The mean time between application interrupts caused by system instability may be as short as an hour for larger jobs. The frequency of checkpointing is something that needs to be decided for each individual application based on scale of runs.

  • If checkpointing has minimal overhead, consider checkpointing once every 15 minutes.

  • If checkpointing has substantial overhead, then consider checkpointing every 30-60 minutes

  • It may be the case that the highest throughput initially will be with creating job dependency chains where scripts are able to 1) automatically restart from the latest available checkpoint file and 2) confirm that the prior run generated reasonable/correct results

Troubleshooting Common Issues

As always, INCITE and ALCC projects should Report all issues to Catalyst point of contact.

Ping Failures

Network instabilities may lead to some nodes or MPI ranks losing communication with others. If your job output shows an error message like these:

ping failed on x4707c6s4b0n0: Couldn't forward RPC ping(24c93b8c-3434-4fb5-a8f0-53cff4cbbe42) to child x4707c7s6b0n0.hostmgmt2707.cm.aurora.alcf.anl.gov: Resource temporarily unavailable

ping RPC timeout from x4212c7s0b0n0.hostmgmt2212.cm.aurora.alcf.anl.gov after 120s
then most likely your application will crash or hang. If you see these, the best action is to kill the job and re-run it (from the last checkpoint, if there has been one).

Hangs

There are multiple failure modes that can lead to jobs hanging. For known hardware or low-level software issues such as ping failures as discussed above, just restart the job.

To avoid a hung job running out all the requested wallclock time on all its nodes, we suggest devising ways to monitor job progress. For example, if your application regularly writes small output to logfile, then you could launch a “watcher” script that looks for that expected output and collect a stack trace and kill the job if it's been too long since progress was made. Please engage your Catalyst POC if you are interested in evaluating this for your application.

GPU Segfaults (a.k.a. "Page Faults")

Memory errors on the GPUs are caught when illegal accesses exceed a page boundary. When you see an error message indicating Unexpected page fault from GPU at <address>

The best tools for debugging these are gdb-oneapi and DDT, both of which allow debugging into GPU kernel threads and looking at GPU data structures. You may also dump and step through the PVC assembly code using the debuggers if helpful. It is possible that there remain bugs in the IGC compiler that produces invalid assembly code, though as always the most likely cause of seg faults is memory errors in application code. To use the debuggers effectively in GPU kernels, you should compile and link your application with -g -O0. Keep in mind that the IGC compilation of GPU kernels takes place during the link phase if you're using AoT compilation.

Known Issues

This is a collection of known issues that have been encountered during Aurora's early user phase. Documentation will be updated as issues are resolved. Users are encouraged to email [email protected] to report issues.

A known issues page can be found in the CELS Wiki space used for NDA content. Note that this page requires a JLSE Aurora early hw/sw resource account for access.

Runtime Errors

1. Cassini Event Queue overflow detected

Cassini Event Queue overflow detected errors may occur for certain MPI communications and may happen for a variety of reasons - software and hardware, job placement, job routing, and the state of the machine. Simply speaking, it means one of the network interfaces is getting messages too fast and cannot keep up with processing them.

libfabric:16642:1701636928::cxi:core:cxip_cq_eq_progress():531<warn> x4204c1s3b0n0: Cassini Event Queue overflow detected.

As a workaround, the following environment variables can be set to try alleviating the problem.

export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

The value of FI_CXI_DEFAULT_CQ_SIZE can be set to something larger if issues persist. This is directly impacted by the number of unexpected messages sent and so may need to be increased as the scale of the job increases.

It maybe be useful to use other libfabric environment settings. In particular, the setting below may be useful to try. These are what what Cray MPI sets by default Cray MPI libfabric Settings.

2. failed to convert GOTPCREL relocation

If you see

_libm_template.c:(.text+0x7): failed to convert GOTPCREL relocation against '__libm_acos_chosen_core_func_x'; relink with --no-relax
try linking with -flink-huge-device-code

3. SYCL Device Free Memory Query Error

Note that if you are querying the free memory on a device with the Intel SYCL extension "get_info();", you will need to set export ZES_ENABLE_SYSMAN=1. Otherwise you may see an error like:

x1921c1s4b0n0.hostmgmt2000.cm.americas.sgi.com 0: The device does not have the ext_intel_free_memory aspect -33 (PI_ERROR_INVALID_DEVICE)
x1921c1s4b0n0.hostmgmt2000.cm.americas.sgi.com 0: terminate called after throwing an instance of 'sycl::_V1::invalid_object_error'
  what():  The device does not have the ext_intel_free_memory aspect -33 (PI_ERROR_INVALID_DEVICE)

4. No VNIs available in internal allocator.

If you see an error like start failed on x4102c5s2b0n0: No VNIs available in internal allocator, pass --no-vni to mpiexec

5. PMIX ERROR: PMIX_ERR_NOT_FOUND and PMIX ERROR: PMIX_ERROR

When running on single node, you may observe this error message:

PMIX ERROR: PMIX_ERR_NOT_FOUND in file dstore_base.c at line 1567 
PMIX ERROR: PMIX_ERROR in file dstore_base.c at line 2334
These errors can be safely ignored.

Submitting Jobs

Jobs may fail to successfully start at times (particularly at higher node counts). If no error message is apparent, then one thing to check is the comment field in the full job information for the job using the command qstat -xfw [JOBID] | grep comment. Some example comments follow.

comment = Job held by [USER] on Tue Feb 6 05:20:00 2024 and terminated
The user has placed the job on hold; user can qrls the job when ready for it to be queued again.

comment = Not Running: Queue not started. and terminated

User has submitted to a queue that is not currently running; user should qmove the job to an appropriate queue.

comment = job held, too many failed attempts to run

The job tried and failed to start. In this scenario, the user should find that their job was placed on hold. This does not indicate a problem the users' job script, but indicates PBS made several attempts to find a set of nodes to run the job and was not able too. Users can qdel the job and resubmit or qrls the job to try running it again.

comment = Not Running: Node is in an ineligible state: down and terminated

There are an insufficient number of nodes are online and free for the job to start

In the event of a node going down during a job, users may encounter messages such as ping failed on x4616c0s4b0n0: Application 047a3c9f-fb41-4595-a2ad-4a4d0ec1b6c1 not found. The node will likely have started a reboot and won't be included in jobs again until checks pass.

Use of the qsub -V flag (note: upper-case) is discouraged, as it can lead to startup failures. The following message (found via pbsnodes -l):

failed to acquire job resources; job startup aborted (jobid: [YOUR JOBID])
indicates such a failure. It is recommended to instead use -v (note: lower-case) and explicitly export any environment variables that your job may require.

To increase the chances that a large job does not terminate due to a node failure, you may choose to interactively route your MPI job around nodes that fail during your run. See this page on Working Around Node Failures for more information.

Other issues

  • Large number of Machine Check Events from the PVC, that causes nodes to panic and reboot.
  • HBM mode is not automatically validated. Jobs requiring flat memory mode should test by looking at numactl -H for 4 NUMA memory nodes instead of 16 on the nodes.
  • Application failures at single-node are tracked in the JLSE wiki/confluence page