Profiling Deep Learning Applications
We can use both a framework-specific native profiler (for example, the PyTorch profiler) and the vendor-specific NVIDIA Nsight Systems (`nsys`) profiler to get high-level profiling information and a timeline of execution for an application. For kernel-level information, we may use the Nsight Compute (`ncu`) profiler. Refer to the respective documentation for more details:
- Nsight Systems User Guide
- Nsight Compute Documentation
- Nsight Compute CLI
- PyTorch Profiler
Example Usage
Both the `nsys` and `ncu` profiler commands take the following generic structure:
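A minimal sketch of that structure (option names elided; see the documentation pages listed above for the full lists):

```bash
# Generic structure: the profiler command wraps the application command
nsys profile [nsys options] <application> [application arguments]
ncu [ncu options] <application> [application arguments]
```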
To profile an application running with multiple MPI ranks, `mpiexec` must be used:
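A sketch of the launcher-first structure on Polaris (the rank-count variables and the application name are placeholders):

```bash
# The profiler is placed after the MPI launcher, so each rank
# launches its own profiler instance
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} nsys profile [nsys options] python application.py
```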
These two commands show the basic command-line structure for deploying the profilers. Below we discuss important use cases that are relevant to large-scale distributed profiling.
We can use `nsys` to trace an application running on multiple ranks and multiple nodes. A simple example is below, where we use a wrapper script to trace rank 0 on each node of a 2-node job running a PyTorch application:
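A minimal sketch of such a wrapper (here called `nsys_wrapper.sh`; the modulo 4 assumes 4 ranks per node, and the option set matches the variables described below):

```bash
#!/bin/bash
# nsys_wrapper.sh -- trace selected ranks with nsys; run all other ranks unprofiled.
# %q{PMI_RANK} makes nsys write one report per traced rank.
NSYS_OPTS="--stats=true --show-output=true -t cuda,nvtx -o nsys_profile_%q{PMI_RANK}"
PROFRANK=0      # local rank to trace (0 => rank 0 on each node)
RANKCUTOFF=8    # ranks with a global label >= this number are never traced

if [ $((PMI_RANK % 4)) -eq ${PROFRANK} ] && [ ${PMI_RANK} -lt ${RANKCUTOFF} ]
then
    nsys profile ${NSYS_OPTS} "$@"
else
    "$@"
fi
```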
- `NSYS_OPTS`: These are the options that `nsys` uses to trace data at different levels. An exhaustive list of options can be found in the nsys user guide. Note that `%q{PMI_RANK}` is essential to get a per-rank profile.
- `PROFRANK`: As implemented, this variable is set by the user to trace the rank of choice. For example, this wrapper will trace rank 0 on each node.
- `RANKCUTOFF`: This variable is Polaris-specific. As we can run as many as 4 ranks per node (without using MPS), the first 2 nodes of a job will have 8 ranks running. This variable provides the upper cutoff on the rank label (a number) beyond which `nsys` will not trace any rank. A user can change this number, according to the maximum number of ranks running per node, to set how many ranks are traced.

`nsys` will produce a profile (an `nsys-rep` file, by default) per traced rank.
To view the produced trace files, we need NVIDIA's Nsight Systems on the local machine: Getting Started, Download Nsys.
Deployment
The wrapper above can be deployed using the following PBS job script:
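A sketch of such a job script (the project name, filesystems, and application are placeholders; `TEMPORARY_DIR` should point to a writable scratch location):

```bash
#!/bin/bash -l
#PBS -l select=2:system=polaris
#PBS -l walltime=00:30:00
#PBS -q debug
#PBS -A <project>                # placeholder project allocation
#PBS -l filesystems=home:eagle

cd ${PBS_O_WORKDIR}

NNODES=$(wc -l < ${PBS_NODEFILE})
NRANKS_PER_NODE=4
NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))
TEMPORARY_DIR=/tmp               # a writable node-local location (placeholder)

# TMPDIR must be forwarded to the ranks for nsys to function correctly
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} \
        --env TMPDIR=${TEMPORARY_DIR} \
        ./nsys_wrapper.sh python application.py
```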
Note that `--env TMPDIR=${TEMPORARY_DIR}` is essential for `nsys` to function correctly.
We can get kernel-level information (for example, roofline analysis or Tensor Core usage) using NVIDIA's Nsight Compute profiler. Below is a simple wrapper script to show the usage.
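A minimal sketch of such a wrapper (here called `ncu_wrapper.sh`; the kernel-name placeholder and the chosen `--set` are examples):

```bash
#!/bin/bash
# ncu_wrapper.sh -- run Nsight Compute on one rank; run all other ranks unprofiled.
PROFRANK=0
KERNEL_NAME="<full GEMM kernel name taken from the nsys summary>"  # placeholder

if [ ${PMI_RANK} -eq ${PROFRANK} ]
then
    # --set controls how much data is collected (see `ncu --list-sets`);
    # larger sets collect more data but take longer to run
    ncu --set detailed -k "${KERNEL_NAME}" -o ncu_profile_rank_${PMI_RANK} "$@"
else
    "$@"
fi
```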
This wrapper can be deployed in the same way as the `nsys` example above. In the `ncu` wrapper we explicitly set the name of the kernel that we want to analyze (a GEMM kernel in this case). The exhaustive list of options controlling the amount of data collected can be found in the command line section of the documentation. Here we only show the standard options; any of the three could be chosen. Note that each option leads to a different profiler runtime, which is important when setting the requested walltime for your batch job.
`ncu` will generate an `ncu-rep` file for each traced rank, and we will need NVIDIA's Nsight Compute on the local machine to view them.
The next step is to load the `nsys-rep` files into the Nsight Systems GUI and the `ncu-rep` files into the Nsight Compute GUI.
Single-rank run
`nsys` profiles
In the single-rank case, we go to the top left, click `File --> Open`, and select the file that we want to look at. For this particular example, we have focused on the GPU activities. These are shown in the second column from the left, named `CUDA HW ...`. If we expand the `CUDA HW ...` tab, we find an `NCCL` tab, which shows the communication library calls.
`ncu` profiles
The primary qualitative distinction between the `nsys-rep` and `ncu-rep` files is that the `nsys-rep` file presents data for the overall execution of the application, whereas the `ncu-rep` file presents data for the execution of one particular kernel. Our setup here traces only one kernel; multiple kernels could be traced at a time, but that can become a time-consuming process.
We use the `--stats=true --show-output=true` options (see `nsys_wrapper.sh`) while collecting the `nsys` data. As a result, we get a system-wide summary in our `.OU` files (if run with a job submission script; otherwise on the terminal), and find the names of the kernels that have been called for compute and communication. Often we would start by investigating the kernels that have been called the most times, or the ones where we spent the most time executing. In this particular instance we chose to analyze the `gemm` kernels, which are related to matrix multiplication. The full name of this kernel is passed to the `ncu` profiler with the option `-k` (see `ncu_wrapper.sh`).
Loading the `ncu-rep` files works similarly to the `nsys-rep` files. Here, the important tab is the `Details` tab, found in the 3rd row from the top. Under that tab we have the `GPU Speed Of Light Throughput` section, where we can find plots showing GPU compute and memory usage. On the right-hand side of the tab, a menu bar gives us the option to select which plot to display: either the roofline plot or the compute-memory throughput chart.
For a multi-rank run
`nsys` profiles
In the case where we have traced multiple ranks, whether from a single node or multiple nodes, the `nsys` GUI allows us to view the reports in a combined fashion on a single timeline (the same time axis for all reports). This is done through the "multi-report view": either `File --> New multi-report view`, or `File --> Open` followed by selecting however many reports we would like to see in a combined timeline, after which `nsys` prompts the user to allow a "multi-report view". The reports can also be viewed separately.
Profiler Options
In both cases, `nsys` and `ncu`, we have used the standard option sets to generate the profiles. The exhaustive lists can be found in the respective documentation pages:
- Nsight Systems User Guide
- Nsight Compute Documentation
- Nsight Compute CLI
There is much more information provided in these reports; here we have discussed how to view the high-level information.
PyTorch Profiler
Using the PyTorch profiler requires changes in the application source code. A simple example is the following:
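A minimal sketch (the tensor sizes and the `matmul_region` label are illustrative, not taken from any real application):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# A toy workload standing in for a model's forward pass
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
w = torch.randn(1024, 1024, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    # record_function labels a region so it is easy to find in the trace
    with record_function("matmul_region"):
        y = x @ w

# Print a summary table; sort_by can be "cuda_time_total" on GPU runs
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Optionally export a Chrome trace viewable in chrome://tracing or Perfetto
prof.export_chrome_trace("pytorch_trace.json")
```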