The TAU (Tuning and Analysis Utilities) Performance System is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, and Python. TAU gathers performance information while a program executes through instrumentation of functions, methods, basic blocks, and statements. The instrumentation consists of calls to TAU library routines, which can be incorporated into a program in several ways:
- Automatic instrumentation of the code at the source level using the Program Database Toolkit (PDT)
- Automatic instrumentation of the code using the compiler
- Manual instrumentation using the instrumentation API
- At runtime using library call interception
- Dynamically using DyninstAPI
- At runtime in the Java virtual machine
For more information on TAU instrumentation options, see: http://www.cs.uoregon.edu/Research/tau/docs/newguide/bk01ch01.html
- TAU Project Site
- TAU Instrumentation Methods
- TAU Compilation Options
- TAU Fortran Instrumentation FAQ
- TAU Performance Workshop 2018 Presentation
Compiling Your Application with TAU
While there are several methods of incorporating TAU instrumentation into a program, the two most common are automatic insertion (using the PDT source instrumentation method) and compiler instrumentation insertion. With either of these methods users must compile an application in a specific way so as to insert the TAU instrumentation that enables data collection. This involves invoking wrapper scripts that manage the compiling and linking process.
Build Time Module
Start by loading the TAU module into your environment.
Two additional environment variables will also need to be set, but the settings will depend on the performance data to be collected and the method used for collection. Examples of settings for these values are:
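A minimal sketch of such settings follows. The directory held in TAU_LIB_DIR is a hypothetical placeholder, not a path from this guide; the configuration file name and the two options are taken from the lists in this section.

```shell
# TAU_LIB_DIR is a hypothetical placeholder; substitute the lib directory of your TAU installation.
TAU_LIB_DIR="${TAU_LIB_DIR:-/path/to/tau/lib}"

# Choose a configuration: Intel compilers, PAPI counters, MPI, PDT source instrumentation.
export TAU_MAKEFILE="$TAU_LIB_DIR/Makefile.tau-intel-papi-mpi-pdt"

# Verbose wrapper output, and fail hard instead of reverting to an uninstrumented compile.
export TAU_OPTIONS="-optVerbose -optNoRevert"

echo "TAU_MAKEFILE=$TAU_MAKEFILE"
echo "TAU_OPTIONS=$TAU_OPTIONS"
```

Setting both variables in the same shell session (or job script) that runs the build keeps the wrapper scripts and the chosen configuration consistent.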
The TAU_MAKEFILE option largely determines what type of information TAU collects during program execution. This variable must specify the name of the TAU configuration file. Some of the options are:
- Makefile.tau-gnu-papi-mpi-pdt
- Makefile.tau-gnu-papi-mpi-pthread-pdt
- Makefile.tau-gnu-papi-pthread-pdt
- Makefile.tau-intel-184.108.40.206-papi-mpi-pthread-pdt
- Makefile.tau-intel-220.127.116.11-papi-ompt-v5-mpi-pdt-openmp
- Makefile.tau-intel-datascience_tensorflow_113-papi-mpi-pthread-python-pdt
- Makefile.tau-intel-mpi-pdt
- Makefile.tau-intel-papi-mpi-pdt
- Makefile.tau-intel-papi-ompt-tr6-mpi-pdt-openmp
- Makefile.tau-intel-papi-pdt
...
The TAU_OPTIONS variable affects how TAU inserts the instrumentation. Commonly used TAU_OPTIONS are:
- -optVerbose: enables verbose output
- -optNoRevert: causes a hard failure when there is a TAU error; the default behavior is to revert to an uninstrumented compile
- -optKeepFiles: keeps your source file as output after it has been processed by the PDT parser
- -optPreProcess: use if preprocessor directives are present in Fortran code
- -optPdtCOpts: passes special options to the PDT parser
- -optCompInst: enables compiler-based instrumentation
- -optTauSelectFile: enables selective instrumentation; cannot be used with compiler-based instrumentation
- -optShared: links against shared libraries; not recommended unless you know what you are doing
Once the TAU environment has been fully specified, an application may be compiled with TAU instrumentation by replacing the standard compiler names in the application's makefile with the TAU compiler wrapper scripts (tau_cc.sh, tau_cxx.sh, and tau_f90.sh).
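The substitution described above can be sketched as a make invocation that overrides the compiler variables. The variable names CC, CXX, and F90 are assumptions about how a makefile names its compilers; adjust them to match your own. The sketch only assembles and prints the command, so it can be reviewed before it is run in an environment where TAU is loaded.

```shell
# TAU wrapper scripts stand in for the C, C++, and Fortran compilers.
CC=tau_cc.sh
CXX=tau_cxx.sh
F90=tau_f90.sh

# Assemble the build command; CC, CXX, and F90 are assumed makefile variable names.
BUILD_CMD="make CC=$CC CXX=$CXX F90=$F90"

# Print the command for review.
echo "$BUILD_CMD"
```

Overriding variables on the command line leaves the makefile itself unchanged, which makes it easy to switch between instrumented and pristine builds.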
If the PDT parser rejects valid C99 source, add -optPdtCOpts=-c99 to TAU_OPTIONS.
- Append -optPreProcess to TAU_OPTIONS if pre-process directives are present.
- Source-based instrumentation will not work for ENTRY points; use one of these workarounds:
  - Identify all relevant ENTRY points and exclude the parent function with a selective instrumentation file.
  - Use compiler-based instrumentation instead.
- For Fortran77 codes, tau_f90.sh calls the Fortran90 compiler, so it will be necessary to add "-qfixed" to the Fortran compiler flags in your Makefile. If compilation fails with syntax errors reported on lines that are comments, this indicates that "-qfixed" is needed. Comments using "C" in the first column are one instance in which "-qfixed" is required.
Running with TAU
Once an application has been built with TAU instrumentation, nothing special is needed to run it. Simply execute the application as usual and TAU will collect data and write it to one or more files. Note, however, that in many cases TAU collects a large amount of performance data, which can have a significant impact on your application's wall-clock time. It is always a good idea to compare the wall-clock time of an instrumented binary to that of the pristine (uninstrumented) binary. If your application makes a large number of function calls, chances are there will be significant overhead.
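As an illustration of that comparison, the sketch below computes the relative overhead from two wall-clock times. Both timings are hypothetical numbers chosen for the example, not measurements; replace them with the times from your own pristine and instrumented runs.

```shell
# Hypothetical wall-clock times in seconds; replace with your own measurements.
PRISTINE_SECONDS=120
INSTRUMENTED_SECONDS=138

# Relative overhead in percent, using integer arithmetic (the result is truncated).
OVERHEAD_PERCENT=$(( (INSTRUMENTED_SECONDS - PRISTINE_SECONDS) * 100 / PRISTINE_SECONDS ))

echo "TAU overhead: ${OVERHEAD_PERCENT}%"
```

With these example values the script reports an overhead of 15%.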
Several runtime environment variables are available that can influence TAU's runtime behavior and limit the imposed overhead:
These environment variables are passed to TAU when your job is submitted with Cobalt using the --env flag. Defaults are shown in parentheses.
- TAU_VERBOSE=(0) or 1: writes TAU debugging information to stderr.
- TAU_THROTTLE=0 or (1): attempts to reduce TAU overhead by turning off instrumentation for frequently called routines.
- TAU_COMPENSATE=(0) or 1: attempts to approximate and subtract out the instrumentation overhead from the reported metrics.
- TAU_COMM_MATRIX=(0) or 1: collects detailed information on point-to-point communication for MPI ranks.
- TAU_TRACE=(0) or 1: collects tracing information instead of profile information.
- TAU_CALLPATH=(0) or 1: generates call path information for profiles.
- TAU_CALLPATH_DEPTH=N (2): where N is a positive integer.
- TAU_PROFILE_FORMAT=merged: merges all data into a single file in snapshot format, tauprofile.xml. Recommended when using more than 10,000 cores.
- TAU_TRACK_HEAP=(0) or 1: measures heap usage on function entry and exit.
- TAU_TRACK_MESSAGE=(0) or 1: collects detailed information about message sizes.
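A rough sketch of assembling the --env argument from a set of runtime settings is shown below. The particular settings are chosen only for illustration, and the colon-separated NAME=value format is an assumption about what Cobalt's --env flag expects; confirm it against the scheduler documentation for your system.

```shell
# Runtime settings chosen for illustration only.
TAU_SETTINGS="TAU_THROTTLE=1 TAU_CALLPATH=1 TAU_CALLPATH_DEPTH=2 TAU_PROFILE_FORMAT=merged"

# Join the settings with ':' (assumed to be the separator that --env expects).
ENV_ARG=$(printf '%s' "$TAU_SETTINGS" | tr ' ' ':')

# Print the flag so it can be pasted into a job submission command.
echo "--env $ENV_ARG"
```

Keeping the settings in one variable makes it simple to record exactly which TAU configuration produced a given set of profiles.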
TAU_THROTTLE=0 is needed for a full profile of your application. However, it is possible that you will incur a significant amount of overhead. If your application spends a significant percentage of its runtime calling small routines repeatedly (e.g., 10 microseconds per call and more than 1e5 calls), use either TAU_THROTTLE=1 or selective instrumentation to obtain a flat profile with manageable overhead (<25%).
The TAU_COMPENSATE setting will approximate the instrumentation overhead and subtract it from the reported metrics. Check the result against timings from a pristine binary. For example, if the total exclusive time reported by TAU with compensation enabled is very different from the wall-clock time of the pristine binary, compensation is not working effectively for your application.
TAU_COMM_MATRIX collects and writes the communication matrix (columns, actually) for each rank. Different values of TAU_CALLPATH_DEPTH produce different types of information:
- 0 - communication matrix for the application as a whole
- 1 - communication matrix broken down by function
- 2 - same as TAU_CALLPATH_DEPTH=1 but also includes the parent function
Analyzing Your Data
The data collected by TAU will be written to one or more files, by default in the application execution directory. This data may be viewed with the TAU command line tool pprof, or the GUI tools ParaProf and PerfExplorer. The GUI tools may be run from the login nodes if you have an X-Windows environment on your local machine and X11 forwarding was enabled by logging in with "ssh -X". Alternatively, the GUI tools may be installed on your local machine.
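A quick text summary can be produced along the following lines. This is a sketch that assumes the default profile format, in which TAU writes files named profile.* into the execution directory, and that TAU's pprof is the pprof found on your PATH; the 40-line limit is an arbitrary choice for readability.

```shell
# Expand profile.* into the positional parameters; if nothing matches, the pattern stays literal.
set -- profile.*

if [ -e "$1" ] && command -v pprof >/dev/null 2>&1; then
  # Print the first 40 lines of the pprof text summary for the profiles in this directory.
  pprof | head -n 40
  PROFILE_STATUS=shown
else
  echo "No profile.* files found in $(pwd), or pprof is not on PATH"
  PROFILE_STATUS=missing
fi
```

Run it from the directory in which the instrumented application was executed.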
ParaProf is primarily for viewing a handful of profiles, while PerfExplorer is for analyzing a larger collection of performance data (e.g., for weak scaling, strong scaling, etc.). PerfExplorer is highly recommended when large volumes of performance data are to be collected. In order to use PerfExplorer, first create a PerfDMF database.
To analyze a trace file (from TAU_TRACE=1), see the Jumpshot instructions at http://www.cs.uoregon.edu/Research/tau/docs/newguide/bk01ch04s03.html
Selective instrumentation is enabled by appending -optTauSelectFile=<file> to TAU_OPTIONS, where <file> is the path to a selective instrumentation file.
It is also possible to generate a selective instrumentation file using ParaProf and a full flat profile obtained through automatic instrumentation.
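As a minimal sketch, a selective instrumentation file that excludes a single routine could be written by hand and added to TAU_OPTIONS as follows. The file name select.tau and the routine small_helper are hypothetical; in practice the excluded entries should match the routine signatures that TAU reports for your application.

```shell
# Write a selective instrumentation file; the file name and routine are hypothetical examples.
cat > select.tau <<'EOF'
BEGIN_EXCLUDE_LIST
void small_helper(int)
END_EXCLUDE_LIST
EOF

# Append the option to any TAU_OPTIONS already set, separated by a space.
export TAU_OPTIONS="${TAU_OPTIONS:+$TAU_OPTIONS }-optTauSelectFile=$PWD/select.tau"

echo "TAU_OPTIONS=$TAU_OPTIONS"
```

Because selective instrumentation takes effect at build time, the application must be recompiled with the TAU wrapper scripts after the file changes.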
If automatic instrumentation incurs too much overhead or produces overly fine-grained detail, an option to consider is lightweight, coarse-grained instrumentation: adding calls to the TAU timers directly in the source code. See the TAU API documentation.
The most useful TAU API routines are those for starting and stopping timers. Each timer is identified by a text token that must be provided when the timer is started and stopped.
Note: External libraries such as BLAS or LAPACK are not automatically instrumented. They must be instrumented manually, or a TAU wrapper library must be created for them.