# Profiling Deep Learning Applications
We can use both a framework-specific native profiler (for example, the PyTorch profiler) and the vendor-specific NVIDIA Nsight Systems (`nsys`) profiler to get high-level profiling information and a timeline of execution for an application. For kernel-level information, we can use the Nsight Compute (`ncu`) profiler. Refer to the respective documentation for more details.
## Example Usage
Both the `nsys` and `ncu` profiler commands take the following generic structure:
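A minimal sketch, where the bracketed options and the application command are placeholders:

```bash
# Nsight Systems: whole-application timeline
nsys profile [nsys options] <application> [application arguments]

# Nsight Compute: kernel-level metrics
ncu [ncu options] <application> [application arguments]
```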
If we want to launch the profiled application with MPI, then `mpiexec` must be used:
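A sketch of the MPI case, using `nsys` as the example profiler; the profiler command sits between the MPI launcher and the application, so each launched process is wrapped by it:

```bash
mpiexec -n <num_ranks> [mpiexec options] nsys profile [nsys options] <application> [application arguments]
```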
These two commands show the basic command-line structure for deploying the profilers. Below we discuss important use cases that are relevant to large-scale distributed profiling.
We can use `nsys` to trace an application running on multiple ranks and multiple nodes. A simple example, where we use a wrapper script to trace rank 0 on each node of a 2-node job running a PyTorch application, is below:
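A sketch of such a wrapper (`nsys_wrapper.sh`); this reconstruction assumes 4 ranks per node and that the MPI launcher sets the `PMI_RANK` environment variable:

```bash
#!/bin/bash
# nsys_wrapper.sh -- trace selected ranks with nsys; run all other ranks unprofiled.

# Options passed to nsys; %q{PMI_RANK} embeds the rank ID in the report name,
# which is essential to get one profile per traced rank.
NSYS_OPTS="--trace=cuda,nvtx,osrt --stats=true --show-output=true -o profile_rank_%q{PMI_RANK}"

PROFRANK=0    # local rank to trace on each node
RANKCUTOFF=8  # ranks numbered >= this value are never traced (here: 2 nodes x 4 ranks)

if [ $((PMI_RANK % 4)) -eq ${PROFRANK} ] && [ ${PMI_RANK} -lt ${RANKCUTOFF} ]; then
    nsys profile ${NSYS_OPTS} "$@"
else
    "$@"
fi
```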
There are a few important things to notice in the wrapper.
- `NSYS_OPTS`: These are the options that `nsys` uses to trace data at different levels. An exhaustive list of options can be found in the nsys user guide. Note that `%q{PMI_RANK}` is essential to get a per-rank profile.
- `PROFRANK`: As implemented, this variable is set by the user to trace the rank of choice. For example, this wrapper will trace rank 0 on each node.
- `RANKCUTOFF`: This variable is Polaris-specific. Since we can run as many as 4 ranks per node (without using MPS), the first 2 nodes of a job will have 8 ranks running. This variable provides an upper cutoff on the rank number, beyond which `nsys` will not trace any rank. A user can change this number according to the maximum number of ranks running per node to control how many ranks are traced. `nsys` will produce one profile (an `nsys-rep` file, by default) per traced rank.
To view the produced trace files, we need to use NVIDIA's Nsight Systems on the local machine.
See NVIDIA's Getting Started and Download Nsys pages.
## Deployment
The wrapper above can be deployed using the following PBS job script:
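A sketch of such a job script; the project name, queue, filesystems, and application command are placeholders, and the temporary directory should point to a node-local writable path:

```bash
#!/bin/bash -l
#PBS -l select=2:system=polaris
#PBS -l walltime=00:30:00
#PBS -l filesystems=home:eagle
#PBS -q debug
#PBS -A <your_project>

cd ${PBS_O_WORKDIR}

# Node-local directory for nsys intermediate files (placeholder path)
export TEMPORARY_DIR=/tmp

NNODES=$(wc -l < ${PBS_NODEFILE})
NRANKS_PER_NODE=4
NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))

mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} \
        --env TMPDIR=${TEMPORARY_DIR} \
        ./nsys_wrapper.sh python application.py
```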
Note that `--env TMPDIR=${TEMPORARY_DIR}` is essential for `nsys` to function correctly.
We can get kernel-level information (for example, roofline, Tensor Core usage) using NVIDIA's Nsight Compute profiler. Below is a simple wrapper script to show the usage.
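A sketch of such a wrapper (`ncu_wrapper.sh`), following the same rank-selection logic as the `nsys` wrapper; the kernel name is a placeholder to be replaced with the full name reported in the `nsys` summary:

```bash
#!/bin/bash
# ncu_wrapper.sh -- analyze one kernel with ncu on selected ranks.

# Full name of the kernel to analyze (placeholder; taken from the nsys summary)
KERNEL_NAME="<full_gemm_kernel_name>"

# Data-collection level; larger sets (e.g. --set full) take longer to collect
NCU_OPTS="--set detailed -k ${KERNEL_NAME} -o profile_rank_${PMI_RANK}"

PROFRANK=0    # local rank to trace on each node
RANKCUTOFF=8  # ranks numbered >= this value are never traced

if [ $((PMI_RANK % 4)) -eq ${PROFRANK} ] && [ ${PMI_RANK} -lt ${RANKCUTOFF} ]; then
    ncu ${NCU_OPTS} "$@"
else
    "$@"
fi
```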
This wrapper can be deployed in the same way as the `nsys` example above; see the launch line after this paragraph. In the `ncu` wrapper, we explicitly set the name of the kernel that we want to analyze (a GEMM kernel in this case). The exhaustive list of options controlling the amount of data collected can be found in the command-line section of the documentation. Here we only show standard options; any of the three could be chosen. Note that each option leads to a different amount of time the profiler needs to run, which is important when setting the requested walltime for your batch job.
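For example, reusing the job script above, the launch line would simply swap in the `ncu` wrapper (hypothetical command, matching the variables defined there):

```bash
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} ./ncu_wrapper.sh python application.py
```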
`ncu` will generate `ncu-rep` files for each traced rank, and we will need NVIDIA's Nsight Compute on the local machine to view them.
The next step is to load the `nsys-rep` files in the Nsight Systems GUI and the `ncu-rep` files in the Nsight Compute GUI.
## Single Rank Run
### `nsys` profiles
In the single-rank case, we go to the top left and select `File` --> `Open`, then choose the file that we want to look at. For this particular example, we have focused on the GPU activities. This activity is shown in the second column from the left, named `CUDA HW ...`. If we expand the `CUDA HW ...` tab, we find an `NCCL` tab, which shows the communication library calls.
### `ncu` profiles
The primary qualitative distinction between the `nsys-rep` and `ncu-rep` files is that the `nsys-rep` file presents data for the overall execution of the application, whereas the `ncu-rep` file presents data for the execution of one particular kernel. Our setup here traces only one kernel; multiple kernels could be traced at a time, but that can become a time-consuming process.
We use the `--stats=true --show-output=true` options (see `nsys_wrapper.sh`) while collecting the `nsys` data. As a result, we get a system-wide summary in our `.OU` files (if run with a job submission script, otherwise on the terminal) and find the names of the kernels that have been called/used for compute and communication. Often we would start by investigating the kernels that have been called the most times, or the ones in which the most execution time is spent. In this particular instance, we chose to analyze the `gemm` kernels, which are related to matrix multiplication. The full name of this kernel is passed to the `ncu` profiler with the option `-k` (see `ncu_wrapper.sh`).
Loading the `ncu-rep` files works similarly to the `nsys-rep` files. Here, the important tab is the `Details` tab, found in the third row from the top. Under that tab, we have the `GPU Speed of Light Throughput` section, where we can find plots showing GPU compute and memory usage. On the right-hand side of the tab, there is a menu bar that gives us the option to select which plot to display: either the roofline plot or the compute-memory throughput chart.
## Multi-Rank Run
### `nsys` profiles
In the case where we have traced multiple ranks, whether on a single node or multiple nodes, the `nsys` GUI allows us to view the reports in a combined fashion on a single timeline (the same time axis for all reports). This is done through the "multi-report view": go to `File` --> `New multi-report view`, or use `File` --> `Open` and select however many reports we would like to see on a combined timeline, after which `nsys` prompts the user to allow a multi-report view. The reports can also be viewed separately.
## Profiler Options
In both cases, `nsys` and `ncu`, we have used the standard option sets to generate the profiles. The exhaustive lists can be found in the respective documentation pages.
These reports provide much more information; here we have only discussed how to view the high-level information.
## PyTorch Profiler
Using the PyTorch profiler requires changes in the application source code. A simple example is the following:
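A minimal sketch using the `torch.profiler` API; the model, inputs, and step count below are placeholders for illustration:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# Placeholder model and data for illustration
model = torch.nn.Linear(1024, 1024).cuda()
inputs = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log"),  # trace viewable in TensorBoard
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step in range(6):
        loss = model(inputs).sum()
        loss.backward()
        prof.step()  # inform the profiler that a step boundary was reached

# Print an aggregate table of the most expensive CUDA operations
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```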
The procedure described above works for both single and multi-rank deployments.