DoD High Performance Computing Modernization Program
HPC System Utilization Metrics
Introduction and General Policy
The HPC system utilization metrics detailed in this document are an important subset of overall program performance metrics collected, aggregated, and analyzed by the HPC Modernization Office. This document does not address cost and schedule metrics, nor does it address other performance metrics, such as user help desk statistics and user satisfaction metrics. It also does not address detailed utilization reports produced by the shared resource centers for their user organizations.
HPC system utilization and job turnaround times are key indicators of the usefulness of HPC Modernization Program resources to DoD's science and technology (S&T) and developmental test and evaluation (DT&E) communities. We plan to use utilization metrics to make resource allocation decisions, plan future resource acquisitions, and validate user requirements. Although maximum throughput is not always the major goal for HPC resources, accurate utilization data is essential for HPC Modernization Program strategic planning.
One principal concern of most users is how quickly a computation can be run in a multi-user environment. For these reasons, the HPC Modernization Office has been collecting utilization and turnaround time data on its HPC systems for the past several years at the user organization level. By evaluating turnaround time metrics defined in this document, we are able to measure how effectively we are supporting our customer base.
The HPC Modernization Program began to collect utilization data for each HPC system at each of its shared resource centers by computational project in FY98. The purpose of this paper is to define the set of utilization metrics that will be collected and how they will be reported in FY99. A common set of utilization metrics will be obtained from each MSRC and all distributed center systems that address non-real-time needs. For HPC systems that address both real-time and non-real-time needs, a separate set of metrics will be kept for each of these two activities. Metrics for the non-real-time workload on such a system will be identical to those for non-real-time systems at the MSRCs and distributed centers. For real-time activities, in addition to a subset of the aforementioned metrics, test or simulation activities performed will be reported as discrete events within a computational project.
Specific Utilization Metrics
CPU Utilization
The primary goal is to track CPU utilization for each computational project. CPU utilization for each job (or job segment) will be determined by multiplying the number of processors dedicated to the job by the number of hours the job occupies those processors. For this paper, a job segment is defined as any part of a job that utilizes a constant number of processors. Thus, utilization will be computed as the wall clock hours dedicated to a particular job (or job segment) while that job is in execution. Total utilization for a computational project will be obtained by summing the utilization charged to each job (or job segment) for that project. This total wall clock utilization will be reported for each computational project for each reporting period. It is this total wall clock utilization that will be compared with the project's allocation, since this quantity represents CPU resources unavailable to all other projects using that HPC system. Utilization charged to a project's allocation and utilization resulting from use of the background queues will be reported separately for each project.
For non-real-time systems that allow non-exclusive use of processors in a time-sharing mode, CPU time, rather than wall clock time, will be compared with a project's allocation. Utilization on real-time systems will always be reported as dedicated wall clock utilization, since this is the normal mode of operation for those systems.
Figure 1 plots CPU utilization for a non-real-time system vs. time for a typical job; the area under the solid line represents wall clock (dedicated) utilization for that job. The area enclosed by the solid line and the line representing the maximum number of processors available on that system represents CPU resources available to all other projects using that HPC system.
Note that some systems are not capable of changing the number of dedicated nodes during a job, in which case the dedicated number of processors remains constant.
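The roll-up described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (project, processors, hours, background flag), the function name `project_utilization`, and the sample values are hypothetical and are not taken from any center's accounting system.

```python
from collections import defaultdict

def project_utilization(segments):
    """Sum dedicated wall clock CPU-hours per (project, background) pair.

    Each segment is a tuple (project, processors, hours, background):
    `processors` is the constant number of dedicated processors for the
    segment, `hours` is the wall clock time the segment occupied them,
    and `background` marks use of a background queue (reported separately).
    """
    totals = defaultdict(float)
    for project, processors, hours, background in segments:
        # Utilization of one segment = dedicated processors x wall clock hours.
        totals[(project, background)] += processors * hours
    return dict(totals)

# Hypothetical job segments for one reporting period.
segments = [
    ("P1", 16, 2.0, False),
    ("P1", 64, 0.5, False),  # same project, different processor count
    ("P1", 8, 4.0, True),    # background queue, not charged to allocation
    ("P2", 32, 1.0, False),
]
usage = project_utilization(segments)
```

Keeping the background flag in the key mirrors the requirement that allocated and background utilization be reported separately for each project.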
CPU utilization as defined above will be reported for each user computational project as well as for any system or special accounts, such as 99xx accounts. Note that no utilization should accrue against reserve (9000-9199) accounts, since these accounts are provided to each organization as a mechanism to reallocate HPC resources and not for actual utilization. In addition, each center will report down time in CPU-hours in the following categories: scheduled maintenance, unscheduled down time, and time unavailable for other reasons. The list of system and special accounts to be used to report special types of utilization and down time is given in Attachment 1.
For systems that have system software capable of reporting actual CPU utilization (in terms of total number of hours a CPU is actually engaged during a job or job segment) as well as dedicated wall clock utilization, that quantity, for some representative subset of jobs run on that system, may also be determined and reported. Comparison of actual CPU utilization to dedicated CPU utilization as measured by wall clock times will allow assessments of the effectiveness of system software in scheduling blocks of processors to maximize overall throughput in combination with the effectiveness of applications software in terms of load balancing. Referring again to Figure 1, the area under the dashed line represents actual CPU utilization for that job. As hardware performance monitors become available on various systems at each shared resource center, that shared resource center may work with the HPC Modernization Office to define sets of user jobs for which hardware performance statistics, such as actual numbers of FLOPS performed, are gathered and reported. Comparison of actual system performance for a job to peak performance of the block of processors being used by that job will measure the effectiveness of application software in its utilization of that HPC system. The triangle symbols on Figure 1 represent actual performance on the given job in terms of FLOPS performed.
Figure 1. Individual job utilization
Each shared resource center is asked to assess progress toward multiprocessor utilization by preparing histograms, on a quarterly basis, profiling CPU assignments for all jobs on each of its HPC systems capable of producing the necessary data. The histograms will plot total numbers of jobs and total numbers of CPU hours for each number of CPUs assigned over all jobs run on that system for each quarter. The resultant profile, over time, will document progress toward multiprocessor utilization and capability utilized on each system, as opposed to simply measuring total throughput. Each shared resource center manager may propose additional metrics for documenting multiprocessor utilization and required capability.
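One possible way to build the quarterly profile is sketched below. It assumes each job is available as a (CPUs assigned, wall clock hours) pair; the function name `cpu_assignment_profile` and the sample jobs are hypothetical.

```python
from collections import Counter, defaultdict

def cpu_assignment_profile(jobs):
    """Return {cpus: (number_of_jobs, total_cpu_hours)} for one quarter.

    Each job is a tuple (cpus, hours), where `cpus` is the number of CPUs
    assigned and `hours` is the wall clock time they were dedicated.
    """
    job_counts = Counter()
    cpu_hours = defaultdict(float)
    for cpus, hours in jobs:
        job_counts[cpus] += 1
        cpu_hours[cpus] += cpus * hours
    # Sort by CPU count so the result reads like histogram bins.
    return {cpus: (job_counts[cpus], cpu_hours[cpus]) for cpus in sorted(job_counts)}

# Hypothetical jobs run during one quarter.
jobs = [(1, 3.0), (1, 1.0), (16, 2.0), (64, 0.5), (16, 1.0)]
profile = cpu_assignment_profile(jobs)
```

Each key is one histogram bin; the two values are the bar heights for the job-count histogram and the CPU-hour histogram, respectively.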
Number of Users
For non-real-time activities, the number of HPC users on a particular HPC system will be provided in two ways. The number of active users will be reported as the number of active accounts that show a total of one or more hours of CPU utilization for the reported month. The cumulative number of users for a fiscal year will be reported as the number of active accounts that had one or more hours of CPU utilization for any month within the current fiscal year. The number of users (both active and cumulative) will be reported for each computational project for each reporting period. The number of users for a real-time activity will include a count of all personnel directly involved in the operation of the HPC system during the test or simulation activity. These users will also be reported for each month and cumulatively for the fiscal year.
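The two non-real-time user counts can be sketched as follows. The sketch assumes accounting records of the form (account, month, CPU-hours) that cover only the current fiscal year to date; the function name `user_counts` and the sample records are hypothetical.

```python
from collections import defaultdict

def user_counts(records, report_month):
    """Return (active_users, cumulative_users) for a reporting month.

    `records` holds (account, month, cpu_hours) tuples and is assumed to
    cover only the current fiscal year to date.
    """
    monthly = defaultdict(float)
    for account, month, cpu_hours in records:
        monthly[(account, month)] += cpu_hours
    # Active: one or more CPU-hours in the reported month.
    active = {a for (a, m), h in monthly.items() if m == report_month and h >= 1.0}
    # Cumulative: one or more CPU-hours in any single month of the fiscal year.
    cumulative = {a for (a, m), h in monthly.items() if h >= 1.0}
    return len(active), len(cumulative)

# Hypothetical records for the first two months of a fiscal year.
records = [
    ("alice", "1998-10", 0.6),
    ("alice", "1998-10", 0.7),
    ("bob", "1998-10", 0.5),
    ("bob", "1998-11", 2.0),
    ("carol", "1998-11", 0.9),
]
active_oct, cumulative = user_counts(records, "1998-10")
```

Note that an account whose usage never reaches one hour within a single month is not counted, even if its usage summed across months exceeds one hour; this follows the per-month wording of the definition above.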
Expansion Factors
Job turnaround time for non-real-time systems may be captured in terms of the expansion factor, which is defined for an individual job (or job segment) as the total wall clock time from job submission to job completion, divided by the total time that job (or job segment) is actually executing (Equation 1).
$$EF_i = \frac{Q_i + W_i}{T_i} \qquad (1)$$
where
$EF_i$ = expansion factor for job $i$
$Q_i$ = queue wait time for job $i$
$W_i$ = execution wall clock time for job $i$
$T_i$ = system execution time for job $i$
Note that for interactive jobs on non-real-time systems, the queue wait time is zero, so that the expansion factor simplifies to execution wall clock time divided by system execution time. Since CPU utilization for the job (or job segment) is defined as the number of processors dedicated to that job times the time those processors are dedicated (system execution time), the denominator of the expression for the expansion factor can be replaced with CPU utilization for that job, provided the numerator is also multiplied by the number of processors (Equation 2).
$$EF_i = \frac{n_i \, (Q_i + W_i)}{U_i} \qquad (2)$$
where
$n_i$ = number of dedicated processors for job $i$
$U_i = n_i T_i$ = CPU utilization for job $i$
The expansion factor has the property that its minimum (optimum) value is always one, independent of the number of processors.
If the number of processors $n_i$ is removed from Equation 2, the resultant expansion factor is unnormalized and thus loses the property that its minimum value is one. The minimum of this unnormalized expansion factor is the reciprocal of the number of processors dedicated to that job. The reciprocal of the unnormalized expansion factor, termed the effective number of processors, can be interpreted as the average number of processors utilized during the entire job time, from job submission to job completion.
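The per-job quantities defined above can be sketched in Python as a sanity check. The argument names (`queue_wait`, `wall_clock`, `exec_time`, `processors`) and the sample values are hypothetical; all times are assumed to be in hours.

```python
def expansion_factor(queue_wait, wall_clock, exec_time):
    """Time from submission to completion divided by time actually executing."""
    return (queue_wait + wall_clock) / exec_time

def unnormalized_expansion_factor(queue_wait, wall_clock, exec_time, processors):
    """Same numerator, divided by CPU utilization (processors x exec_time)."""
    return (queue_wait + wall_clock) / (processors * exec_time)

def effective_processors(queue_wait, wall_clock, exec_time, processors):
    """Reciprocal of the unnormalized expansion factor."""
    return 1.0 / unnormalized_expansion_factor(
        queue_wait, wall_clock, exec_time, processors
    )

# Hypothetical job: waits 3 h in the queue, then runs 1 h on 8 processors.
ef = expansion_factor(3.0, 1.0, 1.0)
ef_unnormalized = unnormalized_expansion_factor(3.0, 1.0, 1.0, 8)
n_effective = effective_processors(3.0, 1.0, 1.0, 8)

# A job with no queue wait reaches the optimum values.
ef_best = expansion_factor(0.0, 1.0, 1.0)
ef_unnormalized_best = unnormalized_expansion_factor(0.0, 1.0, 1.0, 8)
```

In this example the job holds 8 processors for 1 hour out of the 4 hours between submission and completion, so it uses 2 processors on average over that span, which is the effective number of processors.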
Calculation of an expansion factor for a group of jobs must be done by first calculating expansion factors for all individual job segments and then performing a weighted average over all job segments in the group, using the CPU utilization for each individual job segment as its weighting factor. Use of the CPU utilization as the weighting factor in calculating the average ensures that each job segment's contribution to the expansion factor will depend on its length. Equation 3 illustrates the calculation of this weighted average of the expansion factor for a group of segments.
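$$\overline{EF} = \frac{\sum_i U_i \, EF_i}{\sum_i U_i} \qquad (3)$$
where
$EF_i$ = expansion factor for job segment $i$
$U_i$ = CPU utilization for job segment $i$
$\sum_i$ = summation over all job segments in the set to be averaged.
The expansion factor for a job that utilizes only a single processor has the same value as the unnormalized expansion factor, since the number of processors in Equation 2 is one. Algebraically, it can be shown that the CPU utilization-weighted average of this unnormalized expansion factor can be calculated from the averages of CPU utilization, queue wait time, and execution wall clock time, according to Equation 4.
$$\overline{EF}_{u} = \frac{\overline{Q} + \overline{W}}{\overline{U}} \qquad (4)$$
where
$\overline{EF}_{u}$ = CPU utilization-weighted average of the unnormalized expansion factor
$\overline{Q}$ = average queue wait time
$\overline{W}$ = average execution wall clock time
$\overline{U}$ = average CPU utilization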
Averages of CPU utilization, queue wait time, and execution wall clock time will be reported for each queue on each system. The HPCMO will use this data to compute the unnormalized expansion factor for each system. If available from system accounting tools, the average number of processors utilized by all jobs run from each queue will also be reported.
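The sketch below illustrates why reporting only the three averages is sufficient: the utilization-weighted average of the unnormalized expansion factor computed segment by segment agrees with the value obtained from the averages alone. The segment layout (queue wait, execution wall clock time, system execution time, processors), the function names, and the sample values are hypothetical.

```python
from statistics import mean

def weighted_expansion_factor(segments):
    """CPU utilization-weighted average of the per-segment expansion factors."""
    total_utilization = sum(n * t for _, _, t, n in segments)
    weighted_sum = sum((n * t) * ((q + w) / t) for q, w, t, n in segments)
    return weighted_sum / total_utilization

def weighted_unnormalized(segments):
    """Weighted average of the unnormalized factor, segment by segment."""
    total_utilization = sum(n * t for _, _, t, n in segments)
    weighted_sum = sum((n * t) * ((q + w) / (n * t)) for q, w, t, n in segments)
    return weighted_sum / total_utilization

def unnormalized_from_averages(segments):
    """Same quantity computed only from the three reported averages."""
    avg_queue = mean(q for q, _, _, _ in segments)
    avg_wall = mean(w for _, w, _, _ in segments)
    avg_utilization = mean(n * t for _, _, t, n in segments)
    return (avg_queue + avg_wall) / avg_utilization

# Hypothetical segments: (queue wait h, wall clock h, execution h, processors).
segments = [
    (3.0, 1.0, 1.0, 8),
    (0.0, 2.0, 2.0, 4),
    (1.0, 1.0, 1.0, 16),
]
ef_group = weighted_expansion_factor(segments)
ef_u_direct = weighted_unnormalized(segments)
ef_u_averages = unnormalized_from_averages(segments)
```

The agreement follows because each segment's weight cancels the CPU utilization in its own denominator, leaving the sum of the wait and wall clock times divided by the sum of the utilizations.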
Commercial-Off-the-Shelf (COTS) Software Utilization Metrics
Utilization of COTS software packages is an important metric to monitor to ensure that COTS software being provided by a center meets users' needs. Each center will report utilization of each COTS software package it provides, both in dedicated CPU-hours used and in the number of accesses of that package. This software utilization will be reported for all computational work on each system in FY99.
Other Future Utilization Metrics
The HPC Modernization Office realizes that CPU utilization is not the only metric indicative of efficient and effective HPC system use. In particular, efficient memory utilization is critical to the overall effectiveness of an HPC system to address important applications. Each shared resource center manager may propose potential memory utilization metrics for HPC systems at the shared resource center and a suggested time frame for implementation. In addition, combined metrics (such as CPU utilization coupled with memory utilization) may also be proposed. Input/output operations are also critical to effective operation of an HPC system. Each shared resource center manager may propose potential measures of input/output operations for HPC systems at the shared resource center and a suggested time frame for implementation.
Additional Real-Time Metrics
Each shared resource center with mixed real-time and non-real-time systems will compute all of the non-real-time metrics discussed above for the non-real-time portion of that system's operation. Of the previously discussed utilization metrics, only dedicated wall clock CPU time and numbers of users will be reported for the real-time portion of each system's operation, including those systems operated exclusively in real-time mode. In addition, a listing of the real-time test and simulation activities performed, with their durations, will be reported for each computational project and test project supported by the real-time operations of each HPC system. These real-time activities include dedicated simulation events and demonstrations run on MSRC systems. For mixed systems, all applicable utilization metrics will be independently tracked for each activity (non-real-time and real-time) and reported separately.
Tracking and Reporting of Utilization Metrics
All utilization metrics, unless otherwise specified, will be reported monthly by each shared resource center to the HPC Modernization Office as part of its monthly reporting requirements. The HPC Modernization Office will maintain utilization databases and spreadsheets capable of rolling up total utilization for each computational project, user organization, and Service/Agency by system type or total across-the-board utilization in normalized utilization units of gigaFLOPS-years. The HPC Modernization Office will issue quarterly utilization reports to all of its shared resource centers and user organizations. Attachment 2 gives a summary list of utilization metrics discussed in this paper. Attachments 3 and 4 give example utilization metrics report spreadsheets for CPU utilization and COTS software utilization, respectively.
Conclusion
The HPC Modernization Office will work with its shared resource centers to complete the development and implementation of utilization metrics that are vital to the effective monitoring of the success of the HPC Modernization Program. These metrics can provide effective measures of progress in hardware, system software, applications software, and the overall ability of DoD HPC users to take full advantage of the tremendous capabilities and capacity provided by the program.
Attachment 1
CHSSI Test Accounts
| Project | Account |
|---|---|
| CSM-1 | 9901 |
| CSM-2 | 9902 |
| CSM-3 | 9903 |
| CSM-4 | 9904 |
| CFD-1 | 9905 |
| CFD-2 | 9906 |
| CFD-3 | 9907 |
| CFD-4 | 9908 |
| CFD-6 | 9910 |
| CFD-7 | 9911 |
| CCM-1 | 9912 |
| CCM-2 | 9913 |
| CCM-3 | 9914 |
| CCM-4 | 9915 |
| CEA-1 | 9916 |
| CEA-2 | 9917 |
| CEA-3 | 9918 |
| CEA-4 | 9919 |
| CEA-5 | 9920 |
| CEA-6 | 9921 |
| CEA-7 | 9922 |
| CWO-1 | 9923 |
| CWO-2 | 9924 |
| CWO-3 | 9925 |
| SIP-1 | 9926 |
| SIP-2 | 9927 |
| SIP-3 | 9928 |
| SIP-4 | 9929 |
| SIP-5 | 9930 |
| SIP-6 | 9931 |
| FMS-2 | 9933 |
| FMS-3 | 9934 |
| FMS-4 | 9935 |
| FMS-5 | 9936 |
| EQM-1 | 9937 |
| EQM-2 | 9938 |
| EQM-3 | 9939 |
| CEN-1 | 9940 |
| CEN-2 | 9941 |
| CEN-3 | 9942 |
| CEN-4 | 9943 |
| IMT-1 | 9944 |
| IMT-2 | 9945 |
| IMT-3 | 9946 |
| IMT-4 | 9947 |
Special Accounts
| Type | Account |
|---|---|
| Support | 9999 |
| S/AAA | 9998 |
| PET | 9997 |
| Training | 9996 |
| Outreach | 9995 |
| Meta-Center Projects | 9994 |
| Scheduled Maintenance | 9993 |
| Unscheduled Down Time | 9992 |
| Time Unavailable for Other Reasons | 9991 |
Attachment 2
Specific Utilization Metrics to Be Reported
Monthly Metrics Required for FY99
Quarterly Metrics Required for FY99
Occasional Metrics for FY99
Additional Future Metrics to Be Defined
Metrics for real-time systems and real-time activities on mixed systems will include all metrics except 3 and 4 listed above. Metrics for non-real-time systems and non-real-time activities on mixed systems will include all metrics listed above.
Attachment 3
Attachment 4
Footnotes:
1. Project-level data to be electronically reported separately by utilization counted against allocations and background utilization not counted against allocations (with a "B" suffix on the project number) in an Excel spreadsheet (see Attachment 3).
2. To be included in overall utilization reporting to user organizations as defined by the SRCAP.