DoD HPCMP System Utilization Metrics

This document was copied from (Last Modified February 19, 1999):
http://www.hpcmo.hpc.mil/Htdocs/HPCMETRIC/index.html
Refer to source for current copy.

DoD High Performance Computing Modernization Program

HPC System Utilization Metrics

Introduction and General Policy

The HPC system utilization metrics detailed in this document are an important subset of overall program performance metrics collected, aggregated, and analyzed by the HPC Modernization Office. This document does not address cost and schedule metrics, nor does it address other performance metrics, such as user help desk statistics and user satisfaction metrics. It also does not address detailed utilization reports produced by the shared resource centers for their user organizations.

HPC system utilization and job turnaround times are key indicators of the usefulness of HPC Modernization Program resources to DoD's science and technology (S&T) and developmental test and evaluation (DT&E) communities. We plan to use utilization metrics to make resource allocation decisions, plan future resource acquisitions, and validate user requirements. Although maximum throughput is not always the major goal for HPC resources, accurate utilization data is essential for HPC Modernization Program strategic planning.

One principal concern of most users is how quickly a computation can be run in a multi-user environment. For this reason, the HPC Modernization Office has been collecting utilization and turnaround time data on its HPC systems at the user organization level for the past several years. By evaluating the turnaround time metrics defined in this document, we are able to measure how effectively we are supporting our customer base.

The HPC Modernization Program began to collect utilization data for each HPC system at each of its shared resource centers by computational project in FY98. The purpose of this paper is to define the set of utilization metrics that will be collected and how they will be reported in FY99. A common set of utilization metrics will be obtained from each MSRC and all distributed center systems that address non-real-time needs. For HPC systems that address both real-time and non-real-time needs, a separate set of metrics will be kept for each of these two activities. Metrics for the non-real-time workload on such a system will be identical to those for non-real-time systems at the MSRCs and distributed centers. For real-time activities, in addition to a subset of the aforementioned metrics, test or simulation activities performed will be reported as discrete events within a computational project.

Specific Utilization Metrics

CPU Utilization

The primary goal is to track CPU utilization for each computational project. CPU utilization for each job (or job segment) will be determined by multiplying the number of processors dedicated to a job by the number of hours that job occupies those processors. For this paper, a job segment is defined as any part of a job that utilizes a constant number of processors. Thus, utilization will be computed as wall clock hours dedicated to a particular job (or job segment) while that job is in execution. Total utilization for a computational project will be obtained by summing the utilization charged to each job (or job segment) for that project. This total wall clock utilization will be reported for each computational project for each reporting period. Total wall clock utilization for each project is the quantity that will be compared with that project's allocation, since it represents CPU resources unavailable to all other projects using that HPC system. Utilization to be charged to a project's allocation and utilization resulting from use of the background queues will be reported separately for each project.

For non-real-time systems that allow non-exclusive use of processors in a time-sharing mode, CPU time, rather than wall clock time, will be compared with a project's allocation. Utilization on real-time systems will always be reported as dedicated wall clock utilization, since this is the normal mode of operation for those systems.

Figure 1 plots CPU utilization for a non-real-time system vs. time for a typical job; the area under the solid line represents wall clock (dedicated) utilization for that job. The area enclosed by the solid line and the line representing the maximum number of processors available on that system represents CPU resources available to all other projects using that HPC system. Note that some systems are not capable of changing the number of dedicated nodes during a job, in which case the dedicated number of processors remains constant.
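The dedicated utilization calculation described above can be sketched as follows. This is a minimal illustration only: the record layout (project, dedicated processors, wall clock hours) and the project names are assumptions made for the example and are not a reporting format defined by this document.

```python
from collections import defaultdict

def dedicated_cpu_hours(segments):
    """Sum (dedicated processors * wall clock hours) per project.

    Each segment is a tuple (project, processors, hours), where the
    processor count is constant for the whole segment.
    """
    totals = defaultdict(float)
    for project, processors, hours in segments:
        totals[project] += processors * hours
    return dict(totals)

# Hypothetical job segments; project names "A" and "B" are placeholders.
segments = [
    ("A", 16, 2.0),   # 16 processors held for 2.0 hours -> 32 CPU-hours
    ("A", 64, 0.5),   # same project, later segment on 64 processors
    ("B", 8, 3.0),
]
usage = dedicated_cpu_hours(segments)
```

In practice, allocation-charged and background-queue utilization would be accumulated in separate totals, since they are reported separately.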

CPU utilization as defined above will be reported for each user computational project as well as for any system or special accounts, such as 99xx accounts. Note that no utilization should accrue against reserve (9000-9199) accounts, since these accounts are provided to each organization as a mechanism to reallocate HPC resources and not for actual utilization. In addition, each center will report down time in CPU-hours in the following categories: scheduled maintenance, unscheduled down time, and time unavailable for other reasons. The list of system and special accounts to be used to report special types of utilization and down time is given in Attachment 1.

For systems that have system software capable of reporting actual CPU utilization (in terms of total number of hours a CPU is actually engaged during a job or job segment) as well as dedicated wall clock utilization, that quantity, for some representative subset of jobs run on that system, may also be determined and reported. Comparison of actual CPU utilization to dedicated CPU utilization as measured by wall clock times will allow assessments of the effectiveness of system software in scheduling blocks of processors to maximize overall throughput in combination with the effectiveness of applications software in terms of load balancing. Referring again to Figure 1, the area under the dashed line represents actual CPU utilization for that job. As hardware performance monitors become available on various systems at each shared resource center, that shared resource center may work with the HPC Modernization Office to define sets of user jobs for which hardware performance statistics, such as actual numbers of FLOPS performed, are gathered and reported. Comparison of actual system performance for a job to peak performance of the block of processors being used by that job will measure the effectiveness of application software in its utilization of that HPC system. The triangle symbols in Figure 1 represent actual performance on the given job in terms of FLOPS performed.

Figure 1. Individual job utilization
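One possible way to express the comparison of actual to dedicated CPU utilization is as a simple ratio. This document does not prescribe a formula for that comparison, so the function below, its name, and the sample values are assumptions offered only as a sketch.

```python
def cpu_efficiency(actual_cpu_hours, dedicated_cpu_hours):
    """Fraction of dedicated processor time in which CPUs were engaged.

    Corresponds to (area under the dashed line) / (area under the
    solid line) in a plot such as Figure 1.
    """
    if dedicated_cpu_hours <= 0:
        raise ValueError("dedicated CPU-hours must be positive")
    return actual_cpu_hours / dedicated_cpu_hours

# Hypothetical job: 64 dedicated CPU-hours, 48 of them actually engaged.
ratio = cpu_efficiency(48.0, 64.0)
```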

Each shared resource center is asked to assess progress toward multiprocessor utilization by preparing histograms, on a quarterly basis, profiling CPU assignments for all jobs on each of its HPC systems capable of producing the necessary data. The histograms will plot total numbers of jobs and total numbers of CPU hours for each number of CPUs assigned over all jobs run on that system for each quarter. The resultant profile, over time, will document progress toward multiprocessor utilization and capability utilized on each system, as opposed to simply measuring total throughput. Each shared resource center manager may propose additional metrics for documenting multiprocessor utilization and required capability.
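The quarterly profile described above could be tabulated along the following lines before plotting. This is a sketch under assumed inputs: each job is represented as a hypothetical (processors, wall clock hours) pair, and CPU hours are taken to be the dedicated quantity, processors times hours.

```python
from collections import Counter

def cpu_assignment_profile(jobs):
    """Return (job count, CPU-hours) keyed by number of CPUs assigned."""
    job_counts = Counter()
    cpu_hours = Counter()
    for processors, hours in jobs:
        job_counts[processors] += 1
        cpu_hours[processors] += processors * hours
    return dict(job_counts), dict(cpu_hours)

# Hypothetical quarter of jobs: (processors assigned, wall clock hours).
jobs = [(1, 4.0), (1, 2.0), (16, 1.5), (64, 0.5), (16, 0.5)]
job_counts, cpu_hours = cpu_assignment_profile(jobs)
```

The two dictionaries correspond to the two histograms requested: numbers of jobs and numbers of CPU hours for each number of CPUs assigned.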

Number of Users

For non-real-time activities, the number of HPC users on a particular HPC system will be provided in two ways. The number of active users will be reported as the number of active accounts that show a total of one or more hours of CPU utilization for the reported month. The cumulative number of users for a fiscal year will be reported as the number of active accounts that had one or more hours of CPU utilization for any month within the current fiscal year. The number of users (both active and cumulative) will be reported for each computational project for each reporting period. The number of users for a real-time activity will include a count of all personnel directly involved in the operation of the HPC system during the test or simulation activity. These users will also be reported for each month and cumulatively for the fiscal year.
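The two user counts for non-real-time activities can be illustrated with a short sketch. The data layout (one dictionary of account to CPU hours per month, in fiscal-year order) and the account names are assumptions for the example; the one-hour threshold is taken from the definition above.

```python
def active_users(month_usage, threshold=1.0):
    """Accounts with at least `threshold` CPU hours in one month."""
    return {acct for acct, hours in month_usage.items() if hours >= threshold}

def cumulative_users(months_to_date, threshold=1.0):
    """Accounts that were active in any month of the fiscal year so far."""
    users = set()
    for month_usage in months_to_date:
        users |= active_users(month_usage, threshold)
    return users

# Hypothetical fiscal year to date: one dict per month, account -> CPU hours.
fy_months = [
    {"u1": 5.0, "u2": 0.5},
    {"u2": 3.0, "u3": 0.2},
    {"u1": 1.0, "u4": 12.0},
]
active = active_users(fy_months[-1])      # users active in the reported month
cumulative = cumulative_users(fy_months)  # users active in any month so far
```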

Expansion Factors

Job turnaround time for non-real-time systems may be captured in terms of the expansion factor, which is defined for an individual job (or job segment) as the total wall clock time from job submission to job completion, divided by the total time that job (or job segment) is actually executing (Equation 1).

EF_i = (QWT_i + WCT_i) / ET_i    (1)

where
EF_i = expansion factor for job i
QWT_i = queue wait time for job i
WCT_i = execution wall clock time (time from beginning to end of execution for job i)
ET_i = system execution time (total time job i is actually being executed)

Note that for interactive jobs on non-real-time systems, the queue wait time is zero, so that the expansion factor simplifies to execution wall clock time divided by system execution time. Since CPU utilization for the job (or job segment) is defined as the number of processors dedicated to that job times the time those processors are dedicated (system execution time), the denominator of the expression for the expansion factor can be replaced with CPU utilization for that job, provided the numerator is also multiplied by the number of processors (Equation 2).

EF_i = n_i (QWT_i + WCT_i) / CPU_i    (2)

where
n_i = number of dedicated processors for job i
CPU_i = CPU utilization for job i

The expansion factor has the property that its minimum (optimum) value is always one, independent of the number of processors.
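A short worked example may help show that Equations 1 and 2 give the same value. The function names and the sample times below are assumptions chosen for illustration; all times are in hours.

```python
def expansion_factor(queue_wait, wall_clock, exec_time):
    """Equation 1: (QWT + WCT) / ET."""
    return (queue_wait + wall_clock) / exec_time

def expansion_factor_from_cpu(processors, queue_wait, wall_clock, cpu_util):
    """Equation 2: n * (QWT + WCT) / CPU, with CPU = n * ET."""
    return processors * (queue_wait + wall_clock) / cpu_util

# Hypothetical job: 16 processors, 1 h in queue, 2 h executing.
n, qwt, wct, et = 16, 1.0, 2.0, 2.0
cpu = n * et                                            # 32 CPU-hours
ef_eq1 = expansion_factor(qwt, wct, et)                 # (1 + 2) / 2
ef_eq2 = expansion_factor_from_cpu(n, qwt, wct, cpu)    # 16 * 3 / 32

# With no queue wait and WCT equal to ET, the factor reaches its minimum.
ef_min = expansion_factor(0.0, 2.0, 2.0)
```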

If the number of processors n_i is removed from Equation 2, the resultant expansion factor is unnormalized and thus loses the property that its minimum value is one. The minimum of this unnormalized expansion factor is the reciprocal of the number of processors dedicated to that job. The reciprocal of the unnormalized expansion factor, termed the effective number of processors, can be interpreted as the average number of processors utilized during the entire job time, from job submission to job completion.
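The unnormalized expansion factor and the effective number of processors can be sketched in the same way. The function names and sample values are assumptions for illustration only.

```python
def unnormalized_expansion_factor(queue_wait, wall_clock, cpu_util):
    """Equation 2 with the processor count n removed: (QWT + WCT) / CPU."""
    return (queue_wait + wall_clock) / cpu_util

def effective_processors(queue_wait, wall_clock, cpu_util):
    """Reciprocal of the unnormalized expansion factor."""
    return cpu_util / (queue_wait + wall_clock)

# Hypothetical job: 16 processors for 2 h (32 CPU-hours), 1 h in queue.
uef = unnormalized_expansion_factor(1.0, 2.0, 32.0)   # 3 / 32
n_eff = effective_processors(1.0, 2.0, 32.0)          # 32 / 3, about 10.7

# With no queue wait, the unnormalized factor reaches its minimum, 1/n.
uef_min = unnormalized_expansion_factor(0.0, 2.0, 32.0)
```

In this example the job holds 16 processors while executing, but averaged over the three hours from submission to completion it uses about 10.7 processors.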

Calculation of an expansion factor for a group of jobs must be done by first calculating expansion factors for all individual job segments and then performing a weighted average over all job segments in the group, using the CPU utilization for each individual job segment as its weighting factor. Use of the CPU utilization as the weighting factor in calculating the average ensures that each job segment's contribution to the expansion factor will depend on its length. Equation 3 illustrates the calculation of this weighted average of the expansion factor for a group of segments.

<EF> = Σ_i (CPU_i * EF_i) / Σ_i CPU_i    (3)

where
<EF> = CPU utilization-weighted average of the normalized expansion factor
Σ_i = summation over all job segments in the set to be averaged
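The CPU utilization-weighted average described above can be sketched as follows. The input layout (a list of expansion factor and CPU utilization pairs) and the sample values are assumptions for the example.

```python
def weighted_expansion_factor(segments):
    """CPU utilization-weighted average of per-segment expansion factors.

    `segments` is a list of (expansion_factor, cpu_utilization) pairs.
    """
    total_cpu = sum(cpu for _, cpu in segments)
    if total_cpu <= 0:
        raise ValueError("total CPU utilization must be positive")
    return sum(ef * cpu for ef, cpu in segments) / total_cpu

# Hypothetical group: a long segment with EF 1.5 and a short one with EF 3.0.
segments = [(1.5, 32.0), (3.0, 8.0)]
avg_ef = weighted_expansion_factor(segments)   # (48 + 24) / 40
```

Because the longer segment carries four times the weight, the result lies much closer to 1.5 than an unweighted mean of the two factors would.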

The expansion factor for a job that utilizes only a single processor has the same value as the unnormalized expansion factor, since the number of processors in Equation 2 is one. Algebraically, it can be shown that the CPU utilization-weighted average of this unnormalized expansion factor can be calculated from the averages of CPU utilization, queue wait time, and execution wall clock time, according to Equation 4.

<EF_u> = (<QWT> + <WCT>) / <CPU>    (4)

where
<EF_u> = CPU utilization-weighted average of the unnormalized expansion factor
<QWT> = average queue wait time
<WCT> = average execution wall clock time
<CPU> = average CPU utilization

Averages of CPU utilization, queue wait time, and execution wall clock time will be reported for each queue on each system. The HPCMO will use this data to compute the unnormalized expansion factor for each system. If available from system accounting tools, the average number of processors utilized by all jobs run from each queue will also be reported.
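The algebraic shortcut can be checked numerically. In a CPU utilization-weighted average of (QWT_i + WCT_i) / CPU_i, each weight cancels its own denominator, so the result reduces to the sum of the averages of queue wait time and execution wall clock time divided by the average CPU utilization. The sketch below compares the two routes; the function names and job records are assumptions for illustration.

```python
import math

def unnormalized_ef_weighted(jobs):
    """Weighted average computed job by job, weights = CPU utilization."""
    total_cpu = sum(cpu for _, _, cpu in jobs)
    weighted = sum(cpu * ((qwt + wct) / cpu) for qwt, wct, cpu in jobs)
    return weighted / total_cpu

def unnormalized_ef_from_averages(jobs):
    """Same quantity computed only from the three reported averages."""
    count = len(jobs)
    avg_qwt = sum(qwt for qwt, _, _ in jobs) / count
    avg_wct = sum(wct for _, wct, _ in jobs) / count
    avg_cpu = sum(cpu for _, _, cpu in jobs) / count
    return (avg_qwt + avg_wct) / avg_cpu

# Hypothetical queue: (queue wait hours, wall clock hours, CPU-hours).
jobs = [(1.0, 2.0, 32.0), (0.5, 1.0, 4.0), (3.0, 1.0, 1.0)]
direct = unnormalized_ef_weighted(jobs)
from_averages = unnormalized_ef_from_averages(jobs)
```

This is why reporting only the three averages per queue is sufficient for the unnormalized expansion factor to be computed centrally.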

Commercial-Off-the-Shelf (COTS) Software Utilization Metrics

Utilization of COTS software packages is an important metric to monitor to ensure that the COTS software provided by a center meets users' needs. Each center will report utilization of each COTS software package it provides as both the dedicated CPU-hours used by that package and the number of accesses of that package. This software utilization will be reported for all computational work on each system in FY99.
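A per-package roll-up of the two COTS quantities might look like the sketch below. It assumes that each record represents one access of a package, which this document does not define, and the package names and record layout are placeholders.

```python
from collections import defaultdict

def cots_usage(records):
    """Total dedicated CPU-hours and access count per software package.

    Each record is (package, processors, wall clock hours) and is
    counted as one access (an assumption of this sketch).
    """
    summary = defaultdict(lambda: {"cpu_hours": 0.0, "accesses": 0})
    for package, processors, hours in records:
        entry = summary[package]
        entry["cpu_hours"] += processors * hours
        entry["accesses"] += 1
    return dict(summary)

# Hypothetical records; "pkg_a" and "pkg_b" are placeholder package names.
records = [("pkg_a", 8, 2.0), ("pkg_b", 1, 0.5), ("pkg_a", 4, 1.0)]
report = cots_usage(records)
```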

Other Future Utilization Metrics

The HPC Modernization Office realizes that CPU utilization is not the only metric indicative of efficient and effective HPC system use. In particular, efficient memory utilization is critical to the overall effectiveness of an HPC system to address important applications. Each shared resource center manager may propose potential memory utilization metrics for HPC systems at the shared resource center and a suggested time frame for implementation. In addition, combined metrics (such as CPU utilization coupled with memory utilization) may also be proposed. Input/output operations are also critical to effective operation of an HPC system. Each shared resource center manager may propose potential measures of input/output operations for HPC systems at the shared resource center and a suggested time frame for implementation.

Additional Real-Time Metrics

Each shared resource center with mixed real-time and non-real-time systems will compute all of the non-real-time metrics discussed above for the non-real-time portion of that system's operation. Of the previously discussed utilization metrics, only dedicated wall clock CPU time and numbers of users will be reported for the real-time portion of each system's operation, including those systems operated exclusively in real-time mode. In addition, a listing of the real-time test and simulation activities performed, together with their durations, will be reported for each computational project and test project supported by the real-time operations of each HPC system. These real-time activities include dedicated simulation events and demonstrations run on MSRC systems. For mixed systems, all applicable utilization metrics will be independently tracked for each activity (non-real time and real time) and reported separately.

Tracking and Reporting of Utilization Metrics

All utilization metrics, unless otherwise specified, will be reported monthly by each shared resource center to the HPC Modernization Office as part of its monthly reporting requirements. The HPC Modernization Office will maintain utilization databases and spreadsheets capable of rolling up total utilization for each computational project, user organization, and Service/Agency by system type or total across-the-board utilization in normalized utilization units of gigaFLOPS-years. The HPC Modernization Office will issue quarterly utilization reports to all of its shared resource centers and user organizations. Attachment 2 gives a summary list of utilization metrics discussed in this paper. Attachments 3 and 4 give example utilization metrics report spreadsheets for CPU utilization and COTS software utilization, respectively.

Conclusion

The HPC Modernization Office will work with its shared resource centers to complete the development and implementation of utilization metrics that are vital to the effective monitoring of the success of the HPC Modernization Program. These metrics can provide effective measures of progress in hardware, system software, applications software, and the overall ability of DoD HPC users to take full advantage of the tremendous capabilities and capacity provided by the program.

Attachment 1

Special Accounts
Type	Account
Support	9999
S/AAA	9998
PET	9997
Training	9996
Outreach	9995
Meta-Center Projects	9994
Scheduled Maintenance	9993
Unscheduled Down Time	9992
Time Unavailable for Other Reasons	9991

CHSSI Test Accounts
Project	Account
CSM-1	9901
CSM-2	9902
CSM-3	9903
CSM-4	9904
CFD-1	9905
CFD-2	9906
CFD-3	9907
CFD-4	9908
CFD-6	9910
CFD-7	9911
CCM-1	9912
CCM-2	9913
CCM-3	9914
CCM-4	9915
CEA-1	9916
CEA-2	9917
CEA-3	9918
CEA-4	9919
CEA-5	9920
CEA-6	9921
CEA-7	9922
CWO-1	9923
CWO-2	9924
CWO-3	9925
SIP-1	9926
SIP-2	9927
SIP-3	9928
SIP-4	9929
SIP-5	9930
SIP-6	9931
FMS-2	9933
FMS-3	9934
FMS-4	9935
FMS-5	9936
EQM-1	9937
EQM-2	9938
EQM-3	9939
CEN-1	9940
CEN-2	9941
CEN-3	9942
CEN-4	9943
IMT-1	9944
IMT-2	9945
IMT-3	9946
IMT-4	9947

Attachment 2

Specific Utilization Metrics to Be Reported

Monthly Metrics Required for FY99

1. Project-level utilization for each computational project, including DoD Challenge Projects¹.
2. Number of active and cumulative users for each computational project¹.
3. Average queue wait time, average execution wall clock time, average CPU utilization, average number of processors, and expansion factor for each queue on each HPC system².
4. Expansion factor for each DoD Challenge Project on each HPC system.
5. System time in CPU-hours for each system, categorized into scheduled maintenance, unscheduled down time, and unavailable for other reasons (see Attachment 3).
6. Separation of all utilization metrics between real-time and non-real-time activities for systems that have both kinds of workloads.
7. A listing of each discrete test or simulation performed in a real-time mode.
8. COTS software utilization, as both dedicated CPU-hours used by and number of accesses of each COTS software package, for the entire system (see Attachment 4).

Quarterly Metrics Required for FY99

Histograms showing distribution of workload by number of processors, including number of jobs and total CPU time.

Occasional Metrics for FY99

Actual CPU time, as compared with dedicated wall clock utilization.
Hardware performance monitoring.

Additional Future Metrics to Be Defined

Additional metrics for documenting multiprocessor utilization.
Memory utilization metrics.
Input/output and/or storage metrics.
Combined metrics.

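Metrics for real-time systems and real-time activities on mixed systems will include all metrics listed above except the third and fourth monthly metrics (the per-queue averages and expansion factors, and the expansion factor for each DoD Challenge Project). Metrics for non-real-time systems and non-real-time activities on mixed systems will include all metrics listed above.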

Attachment 3

Example of a Utilization Spreadsheet
Feb. 1998

Attachment 4

Example of a Software Spreadsheet

Footnotes:
¹ Project-level data to be electronically reported separately by utilization counted against allocations and background utilization not counted against allocations (with a "B" suffix on the project number) in an Excel spreadsheet (see Attachment 3).

² To be included in overall utilization reporting to user organizations as defined by the SRCAP.
