General Purpose Timing Library (GPTL): A Tool for Characterizing Performance of Parallel and Serial Applications

Jim Rosinski
NOAA/ESRL
303 497-6028
james.rosinski@noaa.gov
www.burningserver.net/rosinski

ABSTRACT:
GPTL is an open source profiling library that reports a variety of performance statistics. Target codes may be parallel via threads and/or MPI. The code regions to be profiled can be hand-specified by the user, or GPTL can define them automatically at function-level granularity if the target application is built with an appropriate compiler flag. If the MPI library supports the PMPI profiling layer, GPTL can auto-profile various MPI primitives. Output is presented in a hierarchical fashion that preserves parent-child relationships of the profiled regions. If the PAPI library is available, GPTL can utilize it to gather hardware performance counter data. PAPI-based derived events such as computational intensity and instructions per cycle are also available. GPTL built with PAPI support is installed on the jaguar machines at ORNL.
KEYWORDS:
gptl call tree profile timing performance analysis


1 Introduction

The process of optimizing the performance of multi-threaded, multi-tasked, computationally intensive applications benefits from simple, easy-to-use tools which highlight hot spots and performance bottlenecks. The General Purpose Timing Library (GPTL) was originally designed as part of the Community Atmosphere Model (CAM) at NCAR [1]. Its most basic functionality is the ability to manually instrument Fortran, C, or C++ application codes. This is done through user definition of regions via sequences of "GPTLstart" and "GPTLstop" calls. Timing statistics for these regions are reported when the user calls "GPTLpr", generally at the end of the run. Estimates of the overhead incurred by the library itself, and of GPTL memory usage, are also printed. Regions can be nested, with the output presented in a hierarchical "call tree" form.

In addition to this manual approach, regions can also be defined automatically at function granularity if the compiler supports it. GNU, Intel, and Pathscale compilers provide this functionality with the flag "-finstrument-functions". PGI compilers provide the equivalent flag "-Minstrument:functions". Manually defined regions (i.e. "GPTLstart" and "GPTLstop" calls) can be seamlessly added to a code which is also auto-instrumented.

Another form of auto-profiling is available for parallel codes linked with an MPI library which supports the PMPI profiling layer. GPTL makes use of this profiling layer to intercept MPI primitives and gather statistics on number of calls, time taken, and bytes transferred. MPI_Init and MPI_Finalize calls are intercepted to automatically call GPTL initialization and print functions, respectively. This allows application codes to be instrumented and timing statistics printed with no modifications required at the source code level. The MPI auto-profiling feature can be turned on or off when the GPTL library is built, in order to avoid unnecessary overhead if auto-profiling of MPI routines is not needed.

If the PAPI [2] library is available, GPTL can access all available PAPI counters. The required PAPI function calls are made by GPTL. A set of PAPI-based derived events is also available. Examples of derived events include computational intensity and instructions per cycle.

Example output from auto-instrumenting the HPCC benchmark suite is shown below. Both floating point operation count (FP_OPS), and computational intensity (CI) are included in the report.

Stats for thread 0:
                         Called Recurse Wallclock      max      min   FP_OPS e6_/_sec       CI
  total                       1      -     64.021   64.021   64.021 3.50e+08     5.47 7.20e-02
  HPCC_Init                  11     10      0.157    0.157    0.000    95799     0.61 8.90e-02
* HPL_pdinfo                120    118      0.019    0.018    0.000    96996     4.99 8.56e-02
* HPL_all_reduce              7      -      0.043    0.036    0.000      448     0.01 1.03e-02
* HPL_broadcast              21      -      0.041    0.036    0.000      126     0.00 6.72e-03
  HPL_pdlamch                 2      -      0.004    0.004    0.000    94248    21.21 1.13e-01
* HPL_fprintf               240    120      0.001    0.000    0.000     1200     0.93 6.67e-03
  HPCC_InputFileInit         41     40      0.001    0.001    0.000      194     0.27 8.45e-03
  ReadInts                    2      -      0.000    0.000    0.000       12     3.00 1.61e-02
  PTRANS                     21     20     22.667   22.667    0.000 4.19e+07     1.85 3.19e-02
  MaxMem                      5      4      0.000    0.000    0.000      796     2.70 1.79e-02
* iceil_                    132      -      0.000    0.000    0.000      792     2.88 1.75e-02
* ilcm_                      14      -      0.000    0.000    0.000       84     2.71 1.71e-02
  param_dump                 18     12      0.000    0.000    0.000       84     0.82 7.05e-03
  Cblacs_get                  5      -      0.000    0.000    0.000       30     1.43 1.67e-02
  Cblacs_gridmap             35     30      0.005    0.001    0.000      225     0.05 1.79e-03
* Cblacs_pinfo                7      1      0.000    0.000    0.000       40     3.08 1.54e-02
* Cblacs_gridinfo            60     50      0.000    0.000    0.000      260     2.28 2.10e-02
  Cigsum2d                    5      -      0.088    0.047    0.000      165     0.00 6.37e-03
  pdmatgen                   20      -     21.497    1.213    0.942 4.00e+07     1.86 3.08e-02
* numroc_                    96      -      0.000    0.000    0.000      576     2.87 1.69e-02
* setran_                    25      -      0.000    0.000    0.000      150     2.94 1.72e-02
* pdrand                3.7e+06  2e+06     15.509    0.041    0.000 1.72e+07     1.11 2.24e-02
  xjumpm_                 57506  57326      0.219    0.030    0.000   230384     1.05 2.66e-02
  jumpit_                 60180  40120      0.214    0.021    0.000   280840     1.32 2.18e-02
  slboot_                     5      -      0.000    0.000    0.000       30     1.30 1.01e-02
  Cblacs_barrier             10      5      0.481    0.167    0.000       50     0.00 3.26e-03
  sltimer_                   10      -      0.000    0.000    0.000      614     3.05 1.90e-02
* dwalltime00                15      -      0.000    0.000    0.000      150     2.54 2.57e-02
* dcputime00                 15      -      0.000    0.000    0.000      373     3.06 1.91e-02
* HPL_ptimer_cputime         17      -      0.000    0.000    0.000      170     2.66 2.29e-02
  pdtrans                    14      9      0.124    0.045    0.000   573505     4.61 1.36e-01
  Cblacs_dSendrecv           12      8      0.115    0.042    0.000       56     0.00 2.24e-03
  pdmatcmp                    5      -      0.448    0.295    0.003 1.29e+06     2.87 2.94e-01
* HPL_daxpy                2596      -      0.008    0.000    0.000 1.34e+06   177.06 4.40e-01
* HPL_idamax               2966      -      0.007    0.000    0.000   767291   104.75 4.15e-01
  ...
Function names on the left of the output are indented to indicate their parent, and depth in the call tree. An asterisk next to an entry means it has more than one parent. Other entries in this output show the number of invocations, number of recursive invocations, wallclock timing statistics, and PAPI-based information. In this example, HPL_daxpy produced 1.34e6 floating point operations, 177.06 MFlops/sec, and had a computational intensity (floating point ops per memory reference) of 0.440.

GPTL is thread-safe. When regions are invoked by multiple threads, per-thread statistics are automatically generated. The results are presented both in the hierarchical "call tree" format mentioned above, and also summarized across threads on a per-region basis.

2 Basic usage

The following simple Fortran code illustrates basic usage of the GPTL library. It is a manually-instrumented code, with an OpenMP section.

program omptest
  implicit none
  include 'gptl.inc'                 ! Fortran GPTL include file
  integer :: ret, iter
  integer, parameter :: nompiter = 2 ! Number of OMP threads

  ret = gptlsetoption (gptlabort_on_error, 1) ! Abort on GPTL error
  ret = gptlsetoption (gptloverhead, 0)       ! Turn off overhead estimate
  ret = gptlsetutr (gptlnanotime)             ! Set underlying timer
  ret = gptlinitialize ()                     ! Initialize GPTL
  ret = gptlstart ('total')                   ! Start a timer
!$OMP PARALLEL DO PRIVATE (iter)              ! Threaded loop
  do iter=1,nompiter
    ret = gptlstart ('A')  ! Start a timer
    ret = gptlstart ('B')  ! Start another timer
    ret = gptlstart ('C')
    call sleep (iter)      ! Sleep for "iter" seconds
    ret = gptlstop ('C')   ! Stop a timer
    ret = gptlstart ('CC')
    ret = gptlstop ('CC')
    ret = gptlstop ('B')
    ret = gptlstop ('A')
  end do
  ret = gptlstop ('total')
  ret = gptlpr (0)         ! Print timer stats
  ret = gptlfinalize ()    ! Clean up
end program omptest

All calls to gptlsetoption() and gptlsetutr() are optional. They change the behavior of the library from its default values. An arbitrary number of these calls can occur prior to initializing GPTL. In this example, the first gptlsetoption() call instructs GPTL to abort whenever an error occurs. The default behavior is to return an error code. The second call turns off printing of statistics which estimate the overhead incurred by the library itself. The call to gptlsetutr() changes which underlying function to use when gathering wallclock timing statistics. The default is gettimeofday(). This routine is ubiquitous, but generally does not have very good granularity and can be expensive to call. On x86-based systems, a time stamp counter can be read via an instruction (rdtsc). It has far better granularity and overhead characteristics than gettimeofday(). Passing gptlnanotime to gptlsetutr() instructs GPTL to use rdtsc to gather wallclock timing statistics.

After all gptlsetoption() and gptlsetutr() calls, one call to gptlinitialize() is required. This initializes the library, in particular the aspects which guarantee thread safety. Generally it is best to build GPTL with pthreads support instead of OpenMP, since the PAPI library has a distinct preference for pthreads. OpenMP application codes can still be profiled in this case, since OpenMP is usually built on top of pthreads.

After gptlinitialize() has been called, an arbitrary sequence of gptlstart() and gptlstop() pairs can be coded by the user. In the above example, some of these occur within a threaded loop. Prior to program termination, a call to gptlpr() causes the library to print the current state of all regions that have been defined by these pairs of calls to gptlstart() and gptlstop(). In an MPI code, one would normally pass the MPI rank of the calling process to gptlpr(). The output file will be named timing.number, where "number" is the MPI rank. This provides a simple way to avoid namespace conflicts. Here's the output from running the simple example code from above:

Stats for thread 0:
             Called Recurse Wallclock     max     min
  total           1      -      2.000   2.000   2.000
    A             1      -      1.000   1.000   1.000
      B           1      -      1.000   1.000   1.000
        C         1      -      1.000   1.000   1.000
        CC        1      -      0.000   0.000   0.000
Total calls = 5
Total recursive calls = 0

Stats for thread 1:
             Called Recurse Wallclock     max     min
  A               1      -      2.000   2.000   2.000
    B             1      -      2.000   2.000   2.000
      C           1      -      2.000   2.000   2.000
      CC          1      -      0.000   0.000   0.000
Total calls = 4
Total recursive calls = 0

Same stats sorted by timer for threaded regions:
Thd          Called Recurse Wallclock     max     min
000 A             1      -      1.000   1.000   1.000
001 A             1      -      2.000   2.000   2.000
SUM A             2      -      3.000   2.000   1.000
000 B             1      -      1.000   1.000   1.000
001 B             1      -      2.000   2.000   2.000
SUM B             2      -      3.000   2.000   1.000
000 C             1      -      1.000   1.000   1.000
001 C             1      -      2.000   2.000   2.000
SUM C             2      -      3.000   2.000   1.000
000 CC            1      -      0.000   0.000   0.000
001 CC            1      -      0.000   0.000   0.000
SUM CC            2      -      0.000   0.000   0.000

2.1 Explanation of results

Region names are listed on the far left. A "region" is defined in the application by calling GPTLstart(), then GPTLstop() for the same input (character string) argument. Parent-child relationships between the regions are depicted by indenting the regions appropriately. In the example, we see that region "A" was contained in "total", "B" contained in "A", and regions "C" and "CC" both contained in "B".

Reading across the output from left to right, the next column is labelled "Called". This is the number of times the region was invoked. If any regions were called recursively, that information is printed next. In this case there were no recursive calls, so just a "-" is printed. Total wallclock time for each region is printed next, followed by the max and min values for any single invocation. In this simple example each region was called only once, so "Wallclock", "max", and "min" are all the same.

Since this is a threaded code run with OMP_NUM_THREADS=2, statistics for the second thread are also printed. This output starts at "Stats for thread 1:" The output shows that thread 1 participated in the computations for regions "A", "B", "C", and "CC", but not "total". This is reflected in the code itself, since only the master thread was active when start and stop calls were made for region "total".

After the per-thread statistics section, the same information is repeated, sorted by region name if more than one thread was active. This section is delimited by the string "Same stats sorted by timer for threaded regions:". This region presentation order makes it easier to inspect for load balance across threads. The leftmost column is thread number, and the region names are not indented. A sum across threads for each region is also printed, and labeled "SUM".

3 PAPI interface and PAPI-based derived counters

If the PAPI library is installed [2] (http://icl.cs.utk.edu/papi), GPTL also provides a convenient mechanism to access all available PAPI events. An event is enabled by calling gptlsetoption() with the appropriate counter. For example, to measure floating point operations, one would code the following:

ret = gptlsetoption (PAPI_FP_OPS, 1)
The second argument "1" passed to gptlsetoption() in this example means to enable the option. Passing a zero would mean to disable the option.

The GPTL library handles details of invoking appropriate PAPI calls properly. This includes tasks such as initializing PAPI, calling appropriate thread initialization routines, setting up an event list, and reading the counters.

In addition to PAPI preset and native events, GPTL defines a set of derived events which are based on PAPI counters. The file gptl.h contains a list of available derived events. An example is computational intensity, defined as floating point operations per memory reference. Derived events are enabled in the same way as PAPI events. For example:

ret = gptlsetoption (GPTL_CI, 1)
Of course, these events can only be enabled if the PAPI counters they require are available on the target architecture. On the XT4 and XT5 machines at ORNL, GPTL_CI is defined as PAPI_FP_OPS / PAPI_L1_DCA. GPTL figures out which PAPI counters need to be enabled to accommodate the requested set of PAPI and/or derived events. For example, if GPTL_CI and PAPI_FP_OPS are both requested, only the PAPI_FP_OPS and PAPI_L1_DCA counters are enabled, since PAPI_FP_OPS is used in both calculations. In addition, GPTL will enable PAPI multiplexing if the requested set of counters requires it. Note, however, that multiplexing reduces the accuracy of PAPI results.

4 Auto-profiling

Hand-instrumenting many regions of a large application code can be tedious, and inserting ifdefs around "start" and "stop" calls to enable or disable timing of regions can unduly complicate the application source code. An alternative is to use the auto-profiling hooks provided by Intel, GNU, Pathscale, and PGI compilers. The compiler flag which enables auto-profiling with the GNU, Intel, and Pathscale Fortran, C, and C++ compilers is -finstrument-functions. The equivalent PGI flag is -Minstrument:functions.

Using these hooks, one can automatically generate GPTL start/stop pairs at function entry and exit points. In this way, only the main program needs to contain GPTL function calls. All other application routines will be automatically instrumented. In addition to providing function-level profiling, this is an easy way to generate a dynamic call tree for an entire application.

4.1 Converting addresses to names

Building the target application with the appropriate auto-profiling flag causes the compiler to generate calls to
__cyg_profile_func_enter (void *this_fn, void *call_site)

at function start, and
__cyg_profile_func_exit (void *this_fn, void *call_site)

at function exit. GPTL uses these hooks to start and stop timers (and PAPI counters if enabled) for these regions. When the regions are printed, the names will be addresses rather than human-readable function names. A post-processing perl script is provided to do the translation. It is named hex2name.pl.

The following example shows how to enable auto-profiling in a simple C code. It uses PAPI to count total instructions, and the derived event instructions per cycle is also enabled. Note that function B has multiple parents, and GPTL reports the multiple parent information in the output produced by the call to GPTLpr_file().

main.c:

#include <gptl.h>
#include <papi.h>

int main ()
{
  void do_work (void);
  int ret;

  ret = GPTLsetoption (GPTL_IPC, 1);     // Count instructions per cycle
  ret = GPTLsetoption (PAPI_TOT_INS, 1); // Print total instructions
  ret = GPTLsetoption (GPTLoverhead, 0); // Don't print overhead estimate
  ret = GPTLinitialize ();               // Initialize GPTL
  ret = GPTLstart ("main");              // Start a manual timer
  do_work ();                            // Do some work
  ret = GPTLstop ("main");               // Stop the manual timer
  ret = GPTLpr_file ("outfile");         // Write output to "outfile"
}
subs.c:
#include <unistd.h>

extern void A(void);
extern void AA(void);
extern void B(void);

void do_work ()
{
  A ();
  AA ();
  B ();
}

void A ()
{
  B ();
}

void AA ()
{
}

void B ()
{
  sleep (1);
}
Compile all but main.c with auto-instrumentation, then link and run. Useful auto-instrumentation of the main program is not possible, because the call to GPTLinitialize() must be done manually and needs to precede all calls to GPTLstart() and GPTLstop().
% gcc -c main.c
% gcc -finstrument-functions subs.c main.o -lgptl -lpapi
% ./a.out
Next, convert the auto-instrumented output to human-readable form:
% hex2name.pl a.out outfile > outfile.converted
Output file outfile.converted looks like this:
Stats for thread 0:
             Called Recurse Wallclock     max     min      IPC  TOT_INS e6_/_sec
  main            1      -      2.000   2.000   2.000 2.81e-01    18060     0.01
    do_work       1      -      2.000   2.000   2.000 2.61e-01    12547     0.01
      A           1      -      1.000   1.000   1.000 3.01e-01     4958     0.00
*       B         2      -      2.000   1.000   1.000 1.09e-01     2812     0.00
      AA          1      -      0.000   0.000   0.000 7.77e-01      488   244.00
Total calls = 6
Total recursive calls = 0

Multiple parent info (if any) for thread 0:
Columns are count and name for the listed child
Rows are each parent, with their common child being the last entry, which is indented
Count next to each parent is the number of times it called the child
Count next to child is total number of times it was called by the listed parents

       1 A
       1 do_work
       2   B

4.1.1 Explanation of the above output

PAPI event "Total instructions executed" (PAPI_TOT_INS) and derived event "Instructions per cycle" (GPTL_IPC) were enabled. To compute instructions per cycle, GPTL made the PAPI library call to count total cycles (PAPI_TOT_CYC) in addition to the already-enabled event PAPI_TOT_INS. When GPTLpr_file() was called, it computed:
      GPTL_IPC = PAPI_TOT_INS / PAPI_TOT_CYC;

Note the asterisk in front of region "B". This indicates that region "B" had multiple parents. It is presented as a child of region "A" because that is the first region that invoked it. Information about other parents is presented after the main call tree. It shows that region "B" had two parents, "A", and "do_work". Each parent invoked "B" once, for a total of 2 calls.

4.2 Auto-profiling MPI calls

When the underlying MPI library supports the PMPI profiling layer (e.g. mpich allows this), GPTL can be configured to intercept MPI calls and use its own infrastructure to gather data about number of invocations, time spent in the routine, and total bytes transferred. When the MPI calls are auto-instrumented, MPI_Init() is configured to call GPTLinitialize(), and MPI_Finalize() is configured to call GPTLpr(). In this way, application codes may be instrumented and results reported without any modifications to user source.

Finally, GPTL has an option to synchronize MPI collectives prior to invocation. Passing the argument GPTLsync_mpi to GPTLsetoption() instructs GPTL to call and time MPI_Barrier() prior to all collectives. In this way the actual time reported transferring data can be segregated from the time spent waiting on MPI process synchronization. Example GPTL output from a small code with various MPI collectives auto-instrumented is shown below.

Stats for thread 0:
                 Called Recurse Wallclock     max     min AVG_MPI_BYTES
  total               1      -      0.199   0.199   0.199        -
    MPI_Send          1      -      0.001   0.001   0.001    4.000e+05
    sync_Recv         2      -      0.010   0.010   0.000        -
    MPI_Recv          2      -      0.041   0.041   0.000    4.000e+05
    MPI_Ssend         1      -      0.022   0.022   0.022    4.000e+05
    MPI_Sendrecv      1      -      0.029   0.029   0.029    8.000e+05
    MPI_Irecv         2      -      0.000   0.000   0.000    4.000e+05
    MPI_Iprobe        1      -      0.000   0.000   0.000        -
    MPI_Test          1      -      0.000   0.000   0.000        -
    MPI_Isend         2      -      0.000   0.000   0.000    4.000e+05
    MPI_Wait          2      -      0.000   0.000   0.000        -
    MPI_Waitall       2      -      0.000   0.000   0.000        -
    MPI_Barrier       1      -      0.000   0.000   0.000        -
    sync_Bcast        1      -      0.000   0.000   0.000        -
    MPI_Bcast         1      -      0.000   0.000   0.000    4.000e+05
    sync_Allreduce    1      -      0.000   0.000   0.000        -
    MPI_Allreduce     1      -      0.004   0.004   0.004    8.000e+05
    sync_Gather       1      -      0.000   0.000   0.000        -

MPI process synchronization was enabled in this run, so the time spent synchronizing prior to the collectives is reported with the name of the primitive prepended with "sync_". In addition to the standard GPTL output of number of calls and wallclock stats, auto-profiled MPI routines have a column labeled AVG_MPI_BYTES. This is the average number of bytes of data transferred by the process per invocation.

5 Internals

The basic data structure internal to GPTL is a linked list. Contents of each entry in the linked list are defined below. Most of the entries should be self-explanatory. GPTL needs to keep track of such quantities as "current accumulated wallclock time", as well as "last wallclock timestamp", values of PAPI counters, number of calls to the timer, etc.

typedef struct TIMER {
  char name[MAX_CHARS+1];    /* timer name (user input) */
#ifdef HAVE_PAPI
  Papistats aux;             /* PAPI stats */
#endif
  Wallstats wall;            /* wallclock stats */
  Cpustats cpu;              /* cpu stats */
  unsigned long count;       /* number of start/stop calls */
  unsigned long nrecurse;    /* number of recursive start/stop calls */
  void *address;             /* address of timer: used only by _instr routines */
  struct TIMER *next;        /* next timer in linked list */
  struct TIMER **parent;     /* array of parents */
  struct TIMER **children;   /* array of children */
  int *parent_count;         /* array of call counts, one for each parent */
  unsigned int recurselvl;   /* recursion level */
  unsigned int nchildren;    /* number of children */
  unsigned int nparent;      /* number of parents */
  unsigned int norphan;      /* number of times this timer was an orphan */
  int num_desc;              /* number of descendants */
  bool onflg;                /* timer currently on or off */
} Timer;

Every time a new timer is started (i.e. the first GPTLstart() call with a given name), a new entry is added to the end of the list. This approach allows specification of an arbitrary number of user-defined (or compiler-defined) timers. To determine whether a timer is "new", a simple hash function is used to generate an index into an array of pointers to existing timers. This dramatically improves the performance of GPTL itself by avoiding a traversal of the linked list on every call to GPTLstart() or GPTLstop().

Thread-safety is attained by maintaining separate linked lists and hash tables for each thread. Separate per-thread information is then reported when timing information is written to the output file.

Hash collisions are handled by making each entry in the hash table an array. Entries which hash to the same value share a multi-element, dynamically allocated array of pointers to the timers with that hash value. This array is searched linearly until the correct entry is found. This approach allows GPTL to handle an arbitrary number of collisions, though an added search expense is incurred whenever a collision occurs.

GPTL maintains an internally defined "call stack" to keep track of the current active call tree whenever a new region (i.e. "timer") is entered. This provides an efficient means to record the region's parent in the call tree. A pointer to that parent is defined (or updated) every time the region is entered. This allows GPTL to learn which regions have multiple parents, who the parents are, and how many times the region is invoked by each parent. When timer contents are printed, the parent information embedded in each timer allows a tree data structure to be defined, with pointers from parents to each of their children. This tree is then traversed from the root to all the leaves (children) in order to print the contents of each timer. Depth in the call tree is recorded by indenting the region at print time to depict who the parent is.

When there are multiple parents, the default behavior is to print the timer only once, with its parent listed as the one who calls it the most frequently. Optionally, the user can request that regions with multiple parents be listed for each parent. Recursive effects can cause this option to generate lots of output when regions with many parents also have many children. Other printing options allow regions to be listed only once according to who their first parent is, or who their last parent is.

6 Future Work

High on the list of desired library enhancements is an option to output region information in XML format instead of only ASCII text. This would allow one to leverage the power of XML readers to expand or contract nested regions with the click of a mouse button. Such an ability would be useful for large application codes, particularly when navigating the call tree generated with auto-profiling enabled.

Currently GPTL only keeps track of the number of times a given region was invoked by each of its parents. It does not retain equivalent "as-called-by" statistics for wallclock time, or PAPI-based counter information. There are situations in which it would be beneficial to know, for example, that function "B" accumulated most of its time when it was called by function "A" rather than when it was called by function "C". Adding such functionality to GPTL should be straightforward.

The list of MPI primitives supported by the GPTL PMPI layer needs to be completed. Currently only the most commonly used MPI routines are included. Completing the list is a simple task, though somewhat time-consuming owing to the sheer number of MPI routines.

7 Download location

GPTL can be downloaded at www.burningserver.net/rosinski/gptl. It is open source code.


References

[1] W.D. Collins, et al. Description of the NCAR Community Atmosphere Model. http://www.ccsm.ucar.edu/models/atm-cam/docs/description/node1.html
[2] Performance Application Programming Interface. http://icl.cs.utk.edu/papi/
[3] Phil Mucci. PAPIEX Home Page. http://icl.cs.utk.edu/~mucci/papiex/