General Purpose Timing Library (GPTL): A Tool for Characterizing
Performance of Parallel and Serial Applications
Jim Rosinski
NOAA/ESRL
303 497-6028
james.rosinski@noaa.gov
www.burningserver.net/rosinski
In addition to this manual approach, regions can also be defined automatically at function granularity if the compiler supports it. GNU, Intel, and PathScale compilers provide this functionality with the flag "-finstrument-functions". PGI compilers provide an equivalent flag, "-Minstrument:functions". Manually defined regions (i.e. "GPTLstart" and "GPTLstop" calls) can be seamlessly added to a code which is also auto-instrumented.
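For illustration, a manually defined region in C might look like the following minimal sketch; the routine and region name here are hypothetical, and only GPTLstart/GPTLstop come from the library:

#include <gptl.h>

void compute (void)                       /* hypothetical application routine */
{
  int ret;
  ret = GPTLstart ("compute_region");     /* begin manually defined region */
  /* ... work to be timed ... */
  ret = GPTLstop ("compute_region");      /* end manually defined region */
}

Compiling such a file with, e.g., "gcc -finstrument-functions" adds automatic function-level regions on top of the manual one.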
Another form of auto-profiling is available for parallel codes linked with an MPI library which supports the PMPI profiling layer. GPTL makes use of this profiling layer to intercept MPI primitives and gather statistics on number of calls, time taken, and bytes transferred. MPI_Init and MPI_Finalize calls are intercepted to automatically call GPTL initialization and print functions, respectively. This allows application codes to be instrumented and timing statistics printed with no modifications required at the source code level. The MPI auto-profiling feature can be turned on or off when the GPTL library is built, in order to avoid unnecessary overhead if auto-profiling of MPI routines is not needed.
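Conceptually, the PMPI layer works by having the profiling library supply its own definition of each MPI routine, which forwards to the corresponding PMPI_ entry point. The following C sketch illustrates the idea for MPI_Init, MPI_Barrier, and MPI_Finalize; it is illustrative only, not GPTL's actual source, and the GPTLpr() call with the rank follows the usage described later in this document:

#include <mpi.h>
#include <gptl.h>

int MPI_Init (int *argc, char ***argv)
{
  int ret = PMPI_Init (argc, argv);   /* real MPI initialization */
  (void) GPTLinitialize ();           /* start profiling with no source changes */
  return ret;
}

int MPI_Barrier (MPI_Comm comm)
{
  int ret;
  (void) GPTLstart ("MPI_Barrier");   /* time the intercepted call */
  ret = PMPI_Barrier (comm);
  (void) GPTLstop ("MPI_Barrier");
  return ret;
}

int MPI_Finalize (void)
{
  int rank;
  (void) PMPI_Comm_rank (MPI_COMM_WORLD, &rank);
  (void) GPTLpr (rank);               /* write timing.<rank> before shutdown */
  return PMPI_Finalize ();
}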
If the PAPI [2] library is available, GPTL can access all available PAPI counters; the required PAPI function calls are made by GPTL itself. A set of PAPI-based derived events is also available. Examples of derived events include computational intensity and instructions per cycle.
Example output from auto-instrumenting the HPCC benchmark suite is shown below. Both floating point operation count (FP_OPS), and computational intensity (CI) are included in the report.
Function names on the left of the output are indented to indicate their parent, and depth in the call tree. An asterisk next to an entry means it has more than one parent. Other entries in this output show the number of invocations, number of recursive invocations, wallclock timing statistics, and PAPI-based information. In this example, HPL_daxpy produced 1.34e6 floating point operations, 177.06 MFlops/sec, and had a computational intensity (floating point ops per memory reference) of 0.440.

Stats for thread 0:
                       Called  Recurse  Wallclock     max      min    FP_OPS  e6_/_sec        CI
  total                     1     -        64.021   64.021   64.021  3.50e+08      5.47  7.20e-02
  HPCC_Init                11    10         0.157    0.157    0.000     95799      0.61  8.90e-02
* HPL_pdinfo              120   118         0.019    0.018    0.000     96996      4.99  8.56e-02
* HPL_all_reduce            7     -         0.043    0.036    0.000       448      0.01  1.03e-02
* HPL_broadcast            21     -         0.041    0.036    0.000       126      0.00  6.72e-03
  HPL_pdlamch               2     -         0.004    0.004    0.000     94248     21.21  1.13e-01
* HPL_fprintf             240   120         0.001    0.000    0.000      1200      0.93  6.67e-03
  HPCC_InputFileInit       41    40         0.001    0.001    0.000       194      0.27  8.45e-03
  ReadInts                  2     -         0.000    0.000    0.000        12      3.00  1.61e-02
  PTRANS                   21    20        22.667   22.667    0.000  4.19e+07      1.85  3.19e-02
  MaxMem                    5     4         0.000    0.000    0.000       796      2.70  1.79e-02
* iceil_                  132     -         0.000    0.000    0.000       792      2.88  1.75e-02
* ilcm_                    14     -         0.000    0.000    0.000        84      2.71  1.71e-02
  param_dump               18    12         0.000    0.000    0.000        84      0.82  7.05e-03
  Cblacs_get                5     -         0.000    0.000    0.000        30      1.43  1.67e-02
  Cblacs_gridmap           35    30         0.005    0.001    0.000       225      0.05  1.79e-03
* Cblacs_pinfo              7     1         0.000    0.000    0.000        40      3.08  1.54e-02
* Cblacs_gridinfo          60    50         0.000    0.000    0.000       260      2.28  2.10e-02
  Cigsum2d                  5     -         0.088    0.047    0.000       165      0.00  6.37e-03
  pdmatgen                 20     -        21.497    1.213    0.942  4.00e+07      1.86  3.08e-02
* numroc_                  96     -         0.000    0.000    0.000       576      2.87  1.69e-02
* setran_                  25     -         0.000    0.000    0.000       150      2.94  1.72e-02
* pdrand              3.7e+06 2e+06        15.509    0.041    0.000  1.72e+07      1.11  2.24e-02
  xjumpm_               57506 57326         0.219    0.030    0.000    230384      1.05  2.66e-02
  jumpit_               60180 40120         0.214    0.021    0.000    280840      1.32  2.18e-02
  slboot_                   5     -         0.000    0.000    0.000        30      1.30  1.01e-02
  Cblacs_barrier           10     5         0.481    0.167    0.000        50      0.00  3.26e-03
  sltimer_                 10     -         0.000    0.000    0.000       614      3.05  1.90e-02
* dwalltime00              15     -         0.000    0.000    0.000       150      2.54  2.57e-02
* dcputime00               15     -         0.000    0.000    0.000       373      3.06  1.91e-02
* HPL_ptimer_cputime       17     -         0.000    0.000    0.000       170      2.66  2.29e-02
  pdtrans                  14     9         0.124    0.045    0.000    573505      4.61  1.36e-01
  Cblacs_dSendrecv         12     8         0.115    0.042    0.000        56      0.00  2.24e-03
  pdmatcmp                  5     -         0.448    0.295    0.003  1.29e+06      2.87  2.94e-01
* HPL_daxpy              2596     -         0.008    0.000    0.000  1.34e+06    177.06  4.40e-01
* HPL_idamax             2966     -         0.007    0.000    0.000    767291    104.75  4.15e-01
  ...
GPTL is thread-safe. When regions are invoked by multiple threads, per-thread statistics are automatically generated. The results are presented both in the hierarchical "call tree" format mentioned above, and also summarized across threads on a per-region basis.
The following simple Fortran code illustrates basic usage of the GPTL library. It is a manually-instrumented code, with an OpenMP section.
program omptest
  implicit none
  include 'gptl.inc'                            ! Fortran GPTL include file

  integer :: ret, iter
  integer, parameter :: nompiter = 2            ! Number of OMP threads

  ret = gptlsetoption (gptlabort_on_error, 1)   ! Abort on GPTL error
  ret = gptlsetoption (gptloverhead, 0)         ! Turn off overhead estimate
  ret = gptlsetutr (gptlnanotime)               ! Set underlying timer
  ret = gptlinitialize ()                       ! Initialize GPTL
  ret = gptlstart ('total')                     ! Start a timer

!$OMP PARALLEL DO PRIVATE (iter)                ! Threaded loop
  do iter=1,nompiter
    ret = gptlstart ('A')                       ! Start a timer
    ret = gptlstart ('B')                       ! Start another timer
    ret = gptlstart ('C')
    call sleep (iter)                           ! Sleep for "iter" seconds
    ret = gptlstop ('C')                        ! Stop a timer
    ret = gptlstart ('CC')
    ret = gptlstop ('CC')
    ret = gptlstop ('A')
    ret = gptlstop ('B')
  end do

  ret = gptlstop ('total')
  ret = gptlpr (0)                              ! Print timer stats
  ret = gptlfinalize ()                         ! Clean up
end program omptest
All calls to gptlsetoption() and gptlsetutr() are optional. They change the behavior of the library from its default values. An arbitrary number of these calls can occur prior to initializing GPTL. In this example, the first gptlsetoption() call instructs GPTL to abort whenever an error occurs. The default behavior is to return an error code. The second call turns off printing of statistics which estimate the overhead incurred by the library itself. The call to gptlsetutr() changes which underlying function to use when gathering wallclock timing statistics. The default is gettimeofday(). This routine is ubiquitous, but generally does not have very good granularity and can be expensive to call. On x86-based systems, a time stamp counter can be read via an instruction (rdtsc). It has far better granularity and overhead characteristics than gettimeofday(). Passing gptlnanotime to gptlsetutr() instructs GPTL to use rdtsc to gather wallclock timing statistics.
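For reference, the equivalent setup in C would look roughly like the sketch below. It assumes the C API uses the same option and timer names as the Fortran interface, with a "GPTL" prefix (as the C example later in this document suggests); the helper function itself is hypothetical.

#include <gptl.h>

void setup_timing (void)   /* hypothetical helper, called before any timed regions */
{
  int ret;
  ret = GPTLsetoption (GPTLabort_on_error, 1);  /* abort on GPTL error                     */
  ret = GPTLsetoption (GPTLoverhead, 0);        /* omit the overhead estimate              */
  ret = GPTLsetutr (GPTLnanotime);              /* use rdtsc for wallclock timing          */
  ret = GPTLinitialize ();                      /* must follow all setoption/setutr calls  */
}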
After all gptlsetoption() and gptlsetutr() calls, one call to gptlinitialize() is required. This initializes the library, in particular the aspects which guarantee thread safety. Generally it is best to build GPTL with pthreads support instead of OpenMP, since the PAPI library has a distinct preference for pthreads. OpenMP application codes can still be profiled in this case, since OpenMP is usually built on top of pthreads.
After gptlinitialize() has been called, an arbitrary sequence of gptlstart() and gptlstop() pairs can be coded by the user. In the above example, some of these occur within a threaded loop. Prior to program termination, a call to gptlpr() causes the library to print the current state of all regions that have been defined by these pairs of calls to gptlstart() and gptlstop(). In an MPI code, one would normally pass the MPI rank of the calling process to gptlpr(). The output file will be named timing.number, where "number" is the MPI rank. This provides a simple way to avoid namespace conflicts. Here's the output from running the simple example code from above:
Stats for thread 0:
            Called  Recurse  Wallclock    max     min
  total          1     -        2.000    2.000   2.000
  A              1     -        1.000    1.000   1.000
  B              1     -        1.000    1.000   1.000
  C              1     -        1.000    1.000   1.000
  CC             1     -        0.000    0.000   0.000

Total calls           = 5
Total recursive calls = 0

Stats for thread 1:
            Called  Recurse  Wallclock    max     min
  A              1     -        2.000    2.000   2.000
  B              1     -        2.000    2.000   2.000
  C              1     -        2.000    2.000   2.000
  CC             1     -        0.000    0.000   0.000

Total calls           = 4
Total recursive calls = 0

Same stats sorted by timer for threaded regions:
Thd         Called  Recurse  Wallclock    max     min
000 A            1     -        1.000    1.000   1.000
001 A            1     -        2.000    2.000   2.000
SUM A            2     -        3.000    2.000   1.000
000 B            1     -        1.000    1.000   1.000
001 B            1     -        2.000    2.000   2.000
SUM B            2     -        3.000    2.000   1.000
000 C            1     -        1.000    1.000   1.000
001 C            1     -        2.000    2.000   2.000
SUM C            2     -        3.000    2.000   1.000
000 CC           1     -        0.000    0.000   0.000
001 CC           1     -        0.000    0.000   0.000
SUM CC           2     -        0.000    0.000   0.000
If the PAPI library [2] is installed (http://icl.cs.utk.edu/papi), GPTL also provides a convenient mechanism to access all available PAPI events. An event is enabled by calling gptlsetoption() with the appropriate counter. For example, to measure floating point operations, one would code the following:
ret = gptlsetoption (PAPI_FP_OPS, 1)

The second argument "1" passed to gptlsetoption() in this example enables the option. Passing a zero would disable it.
The GPTL library handles details of invoking appropriate PAPI calls properly. This includes tasks such as initializing PAPI, calling appropriate thread initialization routines, setting up an event list, and reading the counters.
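For a sense of what that involves, the following standalone sketch shows the kind of PAPI calls a library must issue on the user's behalf for a single event. It is illustrative only and is not GPTL source:

#include <stdio.h>
#include <papi.h>

int main ()
{
  int es = PAPI_NULL;
  long long val;

  if (PAPI_library_init (PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
    return 1;                          /* header/library version mismatch */
  (void) PAPI_create_eventset (&es);
  (void) PAPI_add_event (es, PAPI_FP_OPS);
  (void) PAPI_start (es);
  /* ... code region being measured ... */
  (void) PAPI_read (es, &val);         /* val now holds the FP_OPS count */
  printf ("FP_OPS = %lld\n", val);
  return 0;
}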
In addition to PAPI preset and native events, GPTL defines a set of derived events which are based on PAPI counters. The file gptl.h contains a list of available derived events. An example is computational intensity, defined as floating point operations per memory reference. Derived events are enabled in the same way as PAPI events. For example:
ret = gptlsetoption (GPTL_CI, 1)

Of course these events can only be enabled if the PAPI counters they require are available on the target architecture. On the XT4 and XT5 machines at ORNL, GPTL_CI is defined as PAPI_FP_OPS / PAPI_L1_DCA. GPTL determines which PAPI counters need to be enabled to accommodate the requested set of PAPI and/or derived counters. For example, if GPTL_CI and PAPI_FP_OPS are requested, only the PAPI_FP_OPS and PAPI_L1_DCA counters are enabled, since PAPI_FP_OPS is used in both calculations. In addition, GPTL will enable PAPI multiplexing if the set of requested counters requires it. Note, however, that multiplexing reduces the accuracy of PAPI results.
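As a sketch of the enabling calls for the combination just described, using the C API as in the example later in this document (the helper function name is hypothetical):

#include <gptl.h>
#include <papi.h>

void enable_ci_and_flops (void)            /* hypothetical helper */
{
  int ret;
  ret = GPTLsetoption (GPTL_CI, 1);        /* derived: FP ops per memory reference   */
  ret = GPTLsetoption (PAPI_FP_OPS, 1);    /* preset: floating point operation count */
  /* GPTL programs only PAPI_FP_OPS and PAPI_L1_DCA to satisfy both requests */
}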
Using these hooks, one can automatically generate GPTL start/stop pairs at function entry and exit points. In this way, only the main program needs to contain GPTL function calls. All other application routines will be automatically instrumented. In addition to providing function-level profiling, this is an easy way to generate a dynamic call tree for an entire application.
When auto-instrumentation is enabled, the compiler inserts a call to

__cyg_profile_func_enter (void *this_fn, void *call_site)

at function start, and a call to

__cyg_profile_func_exit (void *this_fn, void *call_site)

at function exit. GPTL uses these hooks to start and stop timers (and PAPI counters if enabled) for these regions. When the regions are printed, the names will be addresses rather than human-readable function names. A post-processing Perl script, hex2name.pl, is provided to do the translation.
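To make the mechanism concrete, the following sketch shows roughly how a timing library can consume these hooks. The entry points GPTLstart_instr/GPTLstop_instr, which key timers by function address, are an assumption about GPTL's internals and are not taken from this document:

#include <gptl.h>

/* The compiler emits calls to these two functions around every instrumented
   function body */
void __cyg_profile_func_enter (void *this_fn, void *call_site)
{
  (void) GPTLstart_instr (this_fn);   /* start a timer keyed by the function address */
}

void __cyg_profile_func_exit (void *this_fn, void *call_site)
{
  (void) GPTLstop_instr (this_fn);    /* stop the timer keyed by the function address */
}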
The following example shows how to enable auto-profiling in a simple C code. It uses PAPI to count total instructions, and the derived event instructions per cycle is also enabled. Note that function B has multiple parents, and GPTL reports the multiple parent information in the output produced by the call to GPTLpr_file().
main.c:
#include <gptl.h>
#include <papi.h>

int main ()
{
  void do_work (void);
  int i, ret;

  ret = GPTLsetoption (GPTL_IPC, 1);      // Count instructions per cycle
  ret = GPTLsetoption (PAPI_TOT_INS, 1);  // Print total instructions
  ret = GPTLsetoption (GPTLoverhead, 0);  // Don't print overhead estimate
  ret = GPTLinitialize ();                // Initialize GPTL
  ret = GPTLstart ("main");               // Start a manual timer
  do_work ();                             // Do some work
  ret = GPTLstop ("main");                // Stop the manual timer
  ret = GPTLpr_file ("outfile");          // Write output to "outfile"
}

subs.c:

#include <unistd.h>

extern void A(void);
extern void AA(void);
extern void B(void);

void do_work ()
{
  A ();
  AA ();
  B ();
}

void A ()
{
  B ();
}

void AA ()
{
}

void B ()
{
  sleep (1);
}
% gcc -c main.c
% gcc -finstrument-functions subs.c main.o -lgptl -lpapi
% ./a.out

Next, convert the auto-instrumented output to human-readable form:
% hex2name.pl a.out outfile > outfile.converted

Output file outfile.converted looks like this:
Stats for thread 0:
                Called  Recurse  Wallclock    max     min      IPC     TOT_INS  e6_/_sec
  main               1     -        2.000    2.000   2.000  2.81e-01     18060      0.01
    do_work          1     -        2.000    2.000   2.000  2.61e-01     12547      0.01
      A              1     -        1.000    1.000   1.000  3.01e-01      4958      0.00
*       B            2     -        2.000    1.000   1.000  1.09e-01      2812      0.00
      AA             1     -        0.000    0.000   0.000  7.77e-01       488    244.00

Total calls           = 6
Total recursive calls = 0

Multiple parent info (if any) for thread 0:
Columns are count and name for the listed child
Rows are each parent, with their common child being the last entry, which is indented
Count next to each parent is the number of times it called the child
Count next to child is total number of times it was called by the listed parents

       1 A
       1 do_work
         2 B
The derived event GPTL_IPC reported in the IPC column above is defined as:

GPTL_IPC = PAPI_TOT_INS / PAPI_TOT_CYC
Note the asterisk in front of region "B". This indicates that region "B" had multiple parents. It is presented as a child of region "A" because that is the first region that invoked it. Information about other parents is presented after the main call tree. It shows that region "B" had two parents, "A" and "do_work", each of which invoked "B" once, for a total of 2 calls.
Example output from an MPI code auto-profiled through the PMPI interface looks like this:

Stats for thread 0:
                   Called  Recurse  Wallclock    max     min    AVG_MPI_BYTES
  total                 1     -        0.199    0.199   0.199        -
  MPI_Send              1     -        0.001    0.001   0.001    4.000e+05
  sync_Recv             2     -        0.010    0.010   0.000        -
  MPI_Recv              2     -        0.041    0.041   0.000    4.000e+05
  MPI_Ssend             1     -        0.022    0.022   0.022    4.000e+05
  MPI_Sendrecv          1     -        0.029    0.029   0.029    8.000e+05
  MPI_Irecv             2     -        0.000    0.000   0.000    4.000e+05
  MPI_Iprobe            1     -        0.000    0.000   0.000        -
  MPI_Test              1     -        0.000    0.000   0.000        -
  MPI_Isend             2     -        0.000    0.000   0.000    4.000e+05
  MPI_Wait              2     -        0.000    0.000   0.000        -
  MPI_Waitall            2     -       0.000    0.000   0.000        -
  MPI_Barrier           1     -        0.000    0.000   0.000        -
  sync_Bcast            1     -        0.000    0.000   0.000        -
  MPI_Bcast             1     -        0.000    0.000   0.000    4.000e+05
  sync_Allreduce        1     -        0.000    0.000   0.000        -
  MPI_Allreduce         1     -        0.004    0.004   0.004    8.000e+05
  sync_Gather           1     -        0.000    0.000   0.000        -
MPI process synchronization was enabled in this run, so the time spent synchronizing prior to the collectives is reported with the name of the primitive prepended with "sync_". In addition to the standard GPTL output of number of calls and wallclock stats, auto-profiled MPI routines have a column labeled AVG_MPI_BYTES. This is the average number of bytes of data transferred by the process per invocation.
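The synchronization measurement can be pictured as a barrier timed separately from the collective itself. The following C sketch illustrates the idea for MPI_Bcast; it is illustrative only, not GPTL's actual source:

#include <mpi.h>
#include <gptl.h>

int MPI_Bcast (void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
  int ret;
  (void) GPTLstart ("sync_Bcast");    /* time spent waiting for the other ranks */
  (void) PMPI_Barrier (comm);
  (void) GPTLstop ("sync_Bcast");
  (void) GPTLstart ("MPI_Bcast");     /* time spent in the collective itself */
  ret = PMPI_Bcast (buf, count, type, root, comm);
  (void) GPTLstop ("MPI_Bcast");
  return ret;
}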
High on the list of desired library enhancements is an option to output region information in XML format instead of only ASCII text. This would allow one to leverage the power of XML readers to expand or contract nested regions with the click of a mouse button. Such an ability would be useful for large application codes, particularly when navigating the call tree generated with auto-profiling enabled.
Currently GPTL only keeps track of the number of times a given region was invoked by each of its parents. It does not retain equivalent "as-called-by" statistics for wallclock time or PAPI-based counter information. There are situations in which it would be beneficial to know, for example, that function "B" accumulated most of its time when it was called by function "A" rather than when it was called by function "C". Adding such functionality to GPTL should be straightforward.
The list of MPI primitives supported by GPTL's PMPI layer needs to be completed. Currently only the most commonly used MPI routines are included. Completing the list is a simple task, though somewhat time-consuming owing to the sheer number of MPI routines.