This document was copied from (Last Modified February 19, 1999):
http://www.hpcmo.hpc.mil/Htdocs/HPCMETRIC/index.html
Refer to source for current copy.

DoD High Performance Computing Modernization Program

HPC System Utilization Metrics

 

Introduction and General Policy

The HPC system utilization metrics detailed in this document are an important subset of overall program performance metrics collected, aggregated, and analyzed by the HPC Modernization Office. This document does not address cost and schedule metrics, nor does it address other performance metrics, such as user help desk statistics and user satisfaction metrics. It also does not address detailed utilization reports produced by the shared resource centers for their user organizations.

HPC system utilization and job turnaround times are key indicators of the usefulness of HPC Modernization Program resources to DoD's science and technology (S&T) and developmental test and evaluation (DT&E) communities. We plan to use utilization metrics to make resource allocation decisions, plan future resource acquisitions, and validate user requirements. Although maximum throughput is not always the major goal for HPC resources, accurate utilization data is essential for HPC Modernization Program strategic planning.

One principal concern of most users is how quickly a computation can be run in a multi-user environment. For this reason, the HPC Modernization Office has been collecting utilization and turnaround time data on its HPC systems at the user organization level for the past several years. By evaluating the turnaround time metrics defined in this document, we are able to measure how effectively we are supporting our customer base.

The HPC Modernization Program began to collect utilization data for each HPC system at each of its shared resource centers by computational project in FY98. The purpose of this paper is to define the set of utilization metrics that will be collected and how they will be reported in FY99. A common set of utilization metrics will be obtained from each MSRC and all distributed center systems that address non-real-time needs. For HPC systems that address both real-time and non-real-time needs, a separate set of metrics will be kept for each of these two activities. Metrics for the non-real-time workload on such a system will be identical to those for non-real-time systems at the MSRCs and distributed centers. For real-time activities, in addition to a subset of the aforementioned metrics, test or simulation activities performed will be reported as discrete events within a computational project.

Specific Utilization Metrics

CPU Utilization

The primary goal is to track CPU utilization for each computational project. CPU utilization for each job (or job segment) will be determined by multiplying the number of processors dedicated to a job by the number of hours that job occupies those processors. For this paper, a job segment is defined as any part of a job that utilizes a constant number of processors. Thus, utilization will be computed as wall clock hours dedicated to a particular job (or job segment) while that job is in execution. Total utilization for a computational project will be obtained by summing the utilization charged to each job (or job segment) for that project. This total wall clock utilization will be reported for each computational project for each reporting period. It is this total wall clock utilization that will be compared with the project's allocation, since this quantity represents CPU resources unavailable to all other projects using that HPC system. Utilization to be charged to a project's allocation and utilization resulting from use of the background queues will be reported separately for each project. For non-real-time systems that allow non-exclusive use of processors in a time-sharing mode, CPU time, rather than wall clock time, will be compared with a project's allocation. Utilization on real-time systems will always be reported as dedicated wall clock utilization, since this is the normal mode of operation for those systems.

Figure 1 plots CPU utilization for a non-real-time system vs. time for a typical job; the area under the solid line represents wall clock (dedicated) utilization for that job. The area enclosed by the solid line and the line representing the maximum number of processors available on that system represents CPU resources available to all other projects using that HPC system.
Note that some systems are not capable of changing the number of dedicated nodes during a job, in which case the dedicated number of processors remains constant.
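
The per-segment and per-project calculation described above can be sketched in Python. This is a minimal illustration only: the record layout, the `JobSegment` and `project_utilization` names, and the project identifiers are assumptions made for the example, not part of any center's accounting software.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JobSegment:
    """One job segment: a part of a job with a constant processor count."""
    project: str               # computational project (hypothetical ID format)
    processors: int            # processors dedicated to the segment
    wall_hours: float          # hours the segment occupied those processors
    background: bool = False   # True if run from a background queue


def segment_utilization(seg: JobSegment) -> float:
    """Dedicated utilization: processors multiplied by wall clock hours."""
    return seg.processors * seg.wall_hours


def project_utilization(segments):
    """Sum utilization per project, keeping allocated and background use apart."""
    totals = defaultdict(lambda: {"allocated": 0.0, "background": 0.0})
    for seg in segments:
        kind = "background" if seg.background else "allocated"
        totals[seg.project][kind] += segment_utilization(seg)
    return dict(totals)


segments = [
    JobSegment("P1", processors=16, wall_hours=2.0),
    JobSegment("P1", processors=32, wall_hours=0.5),  # same job, more processors
    JobSegment("P1", processors=8, wall_hours=1.0, background=True),
    JobSegment("P2", processors=64, wall_hours=3.0),
]
report = project_utilization(segments)
```

In this sketch a job that changes its processor count is simply recorded as several segments, so a system that cannot change the count mid-job produces one segment per job.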

CPU utilization as defined above will be reported for each user computational project as well as for any system or special accounts, such as 99xx accounts. Note that no utilization should accrue against reserve (9000-9199) accounts, since these accounts are provided to each organization as a mechanism to reallocate HPC resources and not for actual utilization. In addition, each center will report down time in CPU-hours in the following categories: scheduled maintenance, unscheduled down time, and time unavailable for other reasons. The list of system and special accounts to be used to report special types of utilization and down time is given in Attachment 1.
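
As an illustration of the account ranges mentioned above, the following Python sketch classifies an account number and flags reserve accounts that show utilization. The function names and the category labels are assumptions for the example; only the 9000-9199 and 99xx ranges come from the text.

```python
def classify_account(account: int) -> str:
    """Classify an account number using the ranges named in the text."""
    if 9000 <= account <= 9199:
        return "reserve"          # reallocation mechanism, no utilization expected
    if 9900 <= account <= 9999:
        return "system/special"   # 99xx accounts (see Attachment 1)
    return "project"


def reserve_accounts_with_usage(usage_by_account):
    """Return reserve accounts that show nonzero utilization (ideally none)."""
    return sorted(
        account
        for account, hours in usage_by_account.items()
        if classify_account(account) == "reserve" and hours > 0
    )


usage = {1234: 500.0, 9050: 0.0, 9100: 12.5, 9993: 40.0}
flagged = reserve_accounts_with_usage(usage)
```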

For systems that have system software capable of reporting actual CPU utilization (in terms of total number of hours a CPU is actually engaged during a job or job segment) as well as dedicated wall clock utilization, that quantity, for some representative subset of jobs run on that system, may also be determined and reported. Comparison of actual CPU utilization to dedicated CPU utilization as measured by wall clock times will allow assessments of the effectiveness of system software in scheduling blocks of processors to maximize overall throughput in combination with the effectiveness of applications software in terms of load balancing. Referring again to Figure 1, the area under the dashed line represents actual CPU utilization for that job. As hardware performance monitors become available on various systems at each shared resource center, that shared resource center may work with the HPC Modernization Office to define sets of user jobs for which hardware performance statistics, such as actual numbers of FLOPS performed, are gathered and reported. Comparison of actual system performance for a job to peak performance of the block of processors being used by that job will measure the effectiveness of application software in its utilization of that HPC system. The triangle symbols in Figure 1 represent actual performance on the given job in terms of FLOPS performed.
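
The two comparisons described above can be expressed as simple ratios. The Python sketch below is illustrative only; the function names and the example figures are assumptions, and the paper itself does not prescribe a formula for either ratio.

```python
def cpu_efficiency(actual_cpu_hours: float, processors: int, wall_hours: float) -> float:
    """Ratio of actual CPU utilization to dedicated wall clock utilization."""
    dedicated = processors * wall_hours
    if dedicated <= 0:
        raise ValueError("dedicated utilization must be positive")
    return actual_cpu_hours / dedicated


def fraction_of_peak(achieved_flops: float, processors: int,
                     peak_flops_per_processor: float) -> float:
    """Ratio of measured performance to the peak of the processors in use."""
    peak = processors * peak_flops_per_processor
    if peak <= 0:
        raise ValueError("peak performance must be positive")
    return achieved_flops / peak


# Hypothetical job: 16 processors held for 2 hours, 24 CPU-hours actually used.
efficiency = cpu_efficiency(24.0, processors=16, wall_hours=2.0)
# Hypothetical rates: 4 GFLOPS achieved on 16 processors of 1 GFLOPS peak each.
peak_fraction = fraction_of_peak(4.0e9, processors=16, peak_flops_per_processor=1.0e9)
```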

 

Figure 1. Individual job utilization

 

Each shared resource center is asked to assess progress toward multiprocessor utilization by preparing histograms, on a quarterly basis, profiling CPU assignments for all jobs on each of its HPC systems capable of producing the necessary data. The histograms will plot total numbers of jobs and total numbers of CPU hours for each number of CPUs assigned over all jobs run on that system for each quarter. The resultant profile, over time, will document progress toward multiprocessor utilization and capability utilized on each system, as opposed to simply measuring total throughput. Each shared resource center manager may propose additional metrics for documenting multiprocessor utilization and required capability.
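
A minimal Python sketch of the two quarterly histograms follows. The input format (one `(processors, cpu_hours)` pair per job) and the function name are assumptions for illustration; plotting is left out.

```python
from collections import Counter


def cpu_histograms(jobs):
    """Build both histograms from (processors, cpu_hours) pairs.

    Returns two dicts keyed by number of CPUs assigned:
    the number of jobs and the total CPU hours.
    """
    job_counts = Counter()
    cpu_hours = Counter()
    for processors, hours in jobs:
        job_counts[processors] += 1
        cpu_hours[processors] += hours
    return dict(sorted(job_counts.items())), dict(sorted(cpu_hours.items()))


# Hypothetical quarter of jobs on one system.
jobs = [(1, 10.0), (1, 5.0), (16, 64.0), (64, 256.0), (16, 32.0)]
counts, hours = cpu_histograms(jobs)
```

Comparing the two histograms shows why both are requested: many small jobs can dominate the job count while a few large jobs dominate the CPU hours.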

Number of Users

For non-real-time activities, the number of HPC users on a particular HPC system will be provided in two ways. The number of active users will be reported as the number of active accounts that show a total of one or more hours of CPU utilization for the reported month. The cumulative number of users for a fiscal year will be reported as the number of active accounts that had one or more hours of CPU utilization for any month within the current fiscal year. The number of users (both active and cumulative) will be reported for each computational project for each reporting period. The number of users for a real-time activity will include a count of all personnel directly involved in the operation of the HPC system during the test or simulation activity. These users will also be reported for each month and cumulatively for the fiscal year.
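
The active and cumulative counts for non-real-time activities can be sketched as follows. The data layout (CPU hours per account per month, holding only the fiscal year to date) and the names used are assumptions for the example.

```python
def user_counts(monthly_hours, month, threshold=1.0):
    """Return (active, cumulative) user counts for one project.

    monthly_hours maps a month label to {account: cpu_hours} and is assumed
    to contain only months of the current fiscal year up to the report month.
    """
    active = {
        account
        for account, hours in monthly_hours.get(month, {}).items()
        if hours >= threshold
    }
    cumulative = {
        account
        for per_month in monthly_hours.values()
        for account, hours in per_month.items()
        if hours >= threshold
    }
    return len(active), len(cumulative)


# Hypothetical accounts and hours.
monthly_hours = {
    "1998-10": {"alice": 5.0, "bob": 0.5},
    "1998-11": {"bob": 2.0, "carol": 0.2},
}
active, cumulative = user_counts(monthly_hours, "1998-11")
```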

Expansion Factors

Job turnaround time for non-real-time systems may be captured in terms of the expansion factor, which is defined for an individual job (or job segment) as the total wall clock time from job submission to job completion, divided by the total time that job (or job segment) is actually executing (Equation 1).

$$EF_i = \frac{QWT_i + WCT_i}{ET_i} \qquad (1)$$

$EF_i$ = expansion factor for job i
$QWT_i$ = queue wait time for job i
$WCT_i$ = execution wall clock time: time from beginning to end of execution for job i
$ET_i$ = system execution time: total time job i is actually being executed

Note that for interactive jobs on non-real-time systems, the queue wait time is zero, so that the expansion factor simplifies to execution wall clock time divided by system execution time. Since CPU utilization for the job (or job segment) is defined as the number of processors dedicated to that job times the time those processors are dedicated (system execution time), the denominator of the expression for the expansion factor can be replaced with CPU utilization for that job, provided the numerator is also multiplied by the number of processors (Equation 2).

$$EF_i = \frac{n_i \, (QWT_i + WCT_i)}{CPU_i} \qquad (2)$$

$n_i$ = number of dedicated processors for job i
$CPU_i$ = CPU utilization for job i

The expansion factor has the property that its minimum (optimum) value is always one, independent of the number of processors.
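
As a quick check that Equations 1 and 2 agree, the following Python sketch computes the expansion factor both ways for one hypothetical job. The function names and the numbers are illustrative assumptions.

```python
def expansion_factor(queue_wait: float, wall_clock: float, execution_time: float) -> float:
    """Equation 1: (queue wait + execution wall clock) / system execution time."""
    return (queue_wait + wall_clock) / execution_time


def expansion_factor_from_cpu(processors: int, queue_wait: float,
                              wall_clock: float, cpu_utilization: float) -> float:
    """Equation 2: processors * (queue wait + wall clock) / CPU utilization."""
    return processors * (queue_wait + wall_clock) / cpu_utilization


# Hypothetical job: 8 processors, 1 hour queued, 2 hours of execution.
n, qwt, wct, et = 8, 1.0, 2.0, 2.0
cpu = n * et                      # CPU utilization as defined in the text
ef_eq1 = expansion_factor(qwt, wct, et)
ef_eq2 = expansion_factor_from_cpu(n, qwt, wct, cpu)
```

With no queue wait and an execution wall clock time equal to the system execution time, both functions return one, the optimum value.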

If the number of processors ni is removed from Equation 2, the resultant expansion factor is unnormalized and thus loses the property that its minimum value is one. The minimum of this unnormalized expansion factor is the reciprocal of the number of processors dedicated to that job. The reciprocal of the unnormalized expansion factor, termed the effective number of processors, can be interpreted as the average number of processors utilized during the entire job time, from job submission to job completion.
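
The unnormalized expansion factor and its reciprocal can be sketched the same way. The function names and the example job are assumptions for illustration.

```python
def unnormalized_expansion_factor(queue_wait: float, wall_clock: float,
                                  cpu_utilization: float) -> float:
    """Equation 2 with the processor count removed."""
    return (queue_wait + wall_clock) / cpu_utilization


def effective_processors(queue_wait: float, wall_clock: float,
                         cpu_utilization: float) -> float:
    """Reciprocal of the unnormalized factor: the average number of
    processors utilized from job submission to job completion."""
    return cpu_utilization / (queue_wait + wall_clock)


# Hypothetical job: 8 processors for 2 hours (16 processor-hours), 1 hour queued.
ef_u = unnormalized_expansion_factor(1.0, 2.0, 16.0)
n_eff = effective_processors(1.0, 2.0, 16.0)

# With no queue wait, the factor reaches its minimum of 1/8 for this job.
ef_u_min = unnormalized_expansion_factor(0.0, 2.0, 16.0)
```

Here the hour spent in the queue lowers the effective number of processors from 8 to about 5.3.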

Calculation of an expansion factor for a group of jobs must be done by first calculating expansion factors for all individual job segments and then performing a weighted average over all job segments in the group, using the CPU utilization for each individual job segment as its weighting factor. Use of the CPU utilization as the weighting factor in calculating the average ensures that each job segment's contribution to the expansion factor will depend on its length. Equation 3 illustrates the calculation of this weighted average of the expansion factor for a group of segments.

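$$\langle EF \rangle = \frac{\sum_i CPU_i \, EF_i}{\sum_i CPU_i} \qquad (3)$$

$\langle EF \rangle$ = CPU utilization-weighted average of the normalized expansion factor
$\sum_i$ = summation over all job segments in the set to be averaged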
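
The CPU utilization-weighted average described above can be sketched in Python as follows. The input format and function name are assumptions for the example.

```python
def weighted_expansion_factor(segments):
    """CPU utilization-weighted average of per-segment expansion factors.

    segments: iterable of (cpu_utilization, expansion_factor) pairs.
    """
    segments = list(segments)
    total_cpu = sum(cpu for cpu, _ in segments)
    if total_cpu <= 0:
        raise ValueError("total CPU utilization must be positive")
    return sum(cpu * ef for cpu, ef in segments) / total_cpu


# Hypothetical group: a short segment that waited in the queue and a
# long segment that ran immediately.
group = [(16.0, 1.5), (48.0, 1.0)]
avg = weighted_expansion_factor(group)
```

The weighted result (1.125) sits closer to the long segment's value than the unweighted mean (1.25) would, which is the effect the weighting is meant to produce.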

The expansion factor for a job that utilizes only a single processor has the same value as the unnormalized expansion factor, since the number of processors in Equation 2 is one. Algebraically, it can be shown that the CPU utilization-weighted average of this unnormalized expansion factor can be calculated from the averages of CPU utilization, queue wait time, and execution wall clock time, according to Equation 4.

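$$\langle EF^{u} \rangle = \frac{\langle QWT \rangle + \langle WCT \rangle}{\langle CPU \rangle} \qquad (4)$$

$\langle EF^{u} \rangle$ = CPU utilization-weighted average of the unnormalized expansion factor
$\langle QWT \rangle$ = average queue wait time
$\langle WCT \rangle$ = average execution wall clock time
$\langle CPU \rangle$ = average CPU utilization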
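
The algebra behind Equation 4 can be sketched as follows. The sketch assumes that the three averages are simple arithmetic means taken over the same set of N job segments; the symbol $EF^{u}_i$ for the unnormalized expansion factor of segment i is introduced here for illustration.

```latex
\begin{align*}
EF^{u}_i &= \frac{QWT_i + WCT_i}{CPU_i} \\[4pt]
\langle EF^{u} \rangle
  &= \frac{\sum_i CPU_i \, EF^{u}_i}{\sum_i CPU_i}
   = \frac{\sum_i \left( QWT_i + WCT_i \right)}{\sum_i CPU_i} \\[4pt]
  &= \frac{N \langle QWT \rangle + N \langle WCT \rangle}{N \langle CPU \rangle}
   = \frac{\langle QWT \rangle + \langle WCT \rangle}{\langle CPU \rangle}
\end{align*}
```

The CPU utilization weights cancel the denominators of the individual factors, which is why only the three averages are needed.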

Averages of CPU utilization, queue wait time, and execution wall clock time will be reported for each queue on each system. The HPCMO will use this data to compute the unnormalized expansion factor for each system. If available from system accounting tools, the average number of processors utilized by all jobs run from each queue will also be reported.

 

Commercial-Off-the-Shelf (COTS) Software Utilization Metrics

Utilization of COTS software packages is an important metric to monitor to ensure that COTS software being provided by a center meets users' needs. Each center will report utilization of each of these COTS software packages in terms of the dedicated CPU-hours used on that package and the number of accesses of that package. This software utilization will be reported for all computational work on each system in FY99.
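
A minimal Python sketch of the two COTS metrics follows. It assumes one record per access, with the processors and wall clock hours dedicated to that run; the record format, function name, and package names are hypothetical.

```python
from collections import defaultdict


def cots_usage(records):
    """Aggregate dedicated CPU-hours and access counts per package.

    records: iterable of (package, processors, wall_hours), one per access.
    """
    usage = defaultdict(lambda: {"cpu_hours": 0.0, "accesses": 0})
    for package, processors, wall_hours in records:
        usage[package]["cpu_hours"] += processors * wall_hours
        usage[package]["accesses"] += 1
    return dict(usage)


# Hypothetical package names and runs.
records = [
    ("pkg_a", 8, 2.0),
    ("pkg_a", 16, 1.0),
    ("pkg_b", 1, 0.5),
]
summary = cots_usage(records)
```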

Other Future Utilization Metrics

The HPC Modernization Office realizes that CPU utilization is not the only metric indicative of efficient and effective HPC system use. In particular, efficient memory utilization is critical to the overall effectiveness of an HPC system to address important applications. Each shared resource center manager may propose potential memory utilization metrics for HPC systems at the shared resource center and a suggested time frame for implementation. In addition, combined metrics (such as CPU utilization coupled with memory utilization) may also be proposed. Input/output operations are also critical to effective operation of an HPC system. Each shared resource center manager may propose potential measures of input/output operations for HPC systems at the shared resource center and a suggested time frame for implementation.

 

Additional Real-Time Metrics

Each shared resource center with mixed real-time and non-real-time systems will compute all of the non-real-time metrics discussed above for the non-real-time portion of that system's operation. Of the previously discussed utilization metrics, only dedicated wall clock CPU time and numbers of users will be reported for the real-time portion of each system's operation, including those systems operated exclusively in real-time mode. In addition, a listing of the real-time test and simulation activities performed, with their durations, will be reported for each computational project and test project supported by the real-time operations of each HPC system. These real-time activities include dedicated simulation events and demonstrations run on MSRC systems. For mixed systems, all applicable utilization metrics will be independently tracked for each activity (non-real time and real time) and reported separately.

 

Tracking and Reporting of Utilization Metrics

All utilization metrics, unless otherwise specified, will be reported monthly by each shared resource center to the HPC Modernization Office as part of its monthly reporting requirements. The HPC Modernization Office will maintain utilization databases and spreadsheets capable of rolling up total utilization for each computational project, user organization, and Service/Agency by system type or total across-the-board utilization in normalized utilization units of gigaFLOPS-years. The HPC Modernization Office will issue quarterly utilization reports to all of its shared resource centers and user organizations. Attachment 2 gives a summary list of utilization metrics discussed in this paper. Attachments 3 and 4 give example utilization metrics report spreadsheets for CPU utilization and COTS software utilization, respectively.

 

Conclusion

The HPC Modernization Office will work with its shared resource centers to complete the development and implementation of utilization metrics that are vital to the effective monitoring of the success of the HPC Modernization Program. These metrics can provide effective measures of progress in hardware, system software, applications software, and the overall ability of DoD HPC users to take full advantage of the tremendous capabilities and capacity provided by the program.

Attachment 1

CHSSI Test Accounts

Project   Account
CSM-1     9901
CSM-2     9902
CSM-3     9903
CSM-4     9904
CFD-1     9905
CFD-2     9906
CFD-3     9907
CFD-4     9908
CFD-6     9910
CFD-7     9911
CCM-1     9912
CCM-2     9913
CCM-3     9914
CCM-4     9915
CEA-1     9916
CEA-2     9917
CEA-3     9918
CEA-4     9919
CEA-5     9920
CEA-6     9921
CEA-7     9922
CWO-1     9923
CWO-2     9924
CWO-3     9925
SIP-1     9926
SIP-2     9927
SIP-3     9928
SIP-4     9929
SIP-5     9930
SIP-6     9931
FMS-2     9933
FMS-3     9934
FMS-4     9935
FMS-5     9936
EQM-1     9937
EQM-2     9938
EQM-3     9939
CEN-1     9940
CEN-2     9941
CEN-3     9942
CEN-4     9943
IMT-1     9944
IMT-2     9945
IMT-3     9946
IMT-4     9947

Special Accounts

Type                                  Account
Support                               9999
S/AAA                                 9998
PET                                   9997
Training                              9996
Outreach                              9995
Meta-Center Projects                  9994
Scheduled Maintenance                 9993
Unscheduled Down Time                 9992
Time Unavailable for Other Reasons    9991

Attachment 2

Specific Utilization Metrics to Be Reported

Monthly Metrics Required for FY99

  1. Project-level utilization for each computational project, including DoD Challenge Projects (see footnote 1).
  2. Number of active and cumulative users for each computational project (see footnote 1).
  3. Average queue wait time, average execution wall clock time, average CPU utilization, average number of processors, and expansion factor for each queue on each HPC system (see footnote 2).
  4. Expansion factor for each DoD Challenge Project on each HPC system.
  5. System time in CPU-hours for each system, categorized into scheduled maintenance, unscheduled down time, and unavailable for other reasons (see Attachment 3).
  6. Separation of all utilization metrics between real-time and non-real-time activities for systems that have both kinds of workloads.
  7. A listing of each discrete test or simulation performed in a real-time mode.
  8. COTS software utilization, reported as both the dedicated CPU-hours used on and the number of accesses of each COTS software package, for the entire system (see Attachment 4).

Quarterly Metrics Required for FY99

  1. Histograms showing distribution of workload by number of processors, including number of jobs and total CPU time.

Occasional Metrics for FY99

  1. Actual CPU time, as compared with dedicated wall clock utilization.
  2. Hardware performance monitoring.

Additional Future Metrics to Be Defined

  1. Additional metrics for documenting multiprocessor utilization.
  2. Memory utilization metrics.
  3. Input/output and/or storage metrics.
  4. Combined metrics.

Metrics for real-time systems and real-time activities on mixed systems will include all metrics listed above except monthly metrics 3 and 4. Metrics for non-real-time systems and non-real-time activities on mixed systems will include all metrics listed above.

 

Attachment 3

Example of a Utilization Spreadsheet
Feb. 1998

 

Attachment 4

Example of a Software Spreadsheet

 

 

Footnotes:

1 Project-level data to be electronically reported separately by utilization counted against allocations and background utilization not counted against allocations (with a "B" suffix on the project number) in an Excel spreadsheet (see Attachment 3).

2 To be included in overall utilization reporting to user organizations as defined by the SRCAP.

 

 
