Expansion Factors and Fiction

Last Update 2000-05-15 by kcarlson

Expansion Factors and Expansion Fictions

Overview

Expansion factors are a measure of batch response time or job turnaround time. For a particular job and for aggregated jobs, expansion factors are usually expressed in the following forms:

 EF = n * (QW + WC) / CPU

Where:

 EF = Expansion Factor
  n = 1 or Dedicated PEs for T3E
 QW = Queue Wait Time
 WC = Wall Clock Time in execution
CPU = CPU Utilization

aEF = Sum (CPU * EF) / Sum (CPU)

Where:

aEF = aggregate expansion factor (weighted average)
      May be queue specific, project specific, or system wide
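The two formulas above can be sketched in a few lines of Python. This is only an illustration: the Job class, its field names, the use of seconds as the unit, and the sample values are assumptions made for the example and are not part of csa or NQS.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical per-job record; all times are assumed to be in seconds."""
    queue_wait: float  # QW
    wall_clock: float  # WC
    cpu: float         # CPU
    pes: int = 1       # n: 1, or the dedicated PEs for a T3E job


def expansion_factor(job):
    """EF = n * (QW + WC) / CPU"""
    return job.pes * (job.queue_wait + job.wall_clock) / job.cpu


def aggregate_expansion_factor(jobs):
    """aEF = Sum(CPU * EF) / Sum(CPU), a CPU-weighted average."""
    total_cpu = sum(job.cpu for job in jobs)
    return sum(job.cpu * expansion_factor(job) for job in jobs) / total_cpu


# Made-up sample data: one long job with no wait, one short job that waited.
jobs = [
    Job(queue_wait=0.0, wall_clock=3600.0, cpu=3600.0),
    Job(queue_wait=1800.0, wall_clock=600.0, cpu=600.0),
]
aef = aggregate_expansion_factor(jobs)
```

With these sample values the individual factors are 1.0 and 4.0, while the weighted aggregate is about 1.43 because the long job dominates the CPU weighting.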

A detailed discussion of expansion factors can be found in the 'DoD HPC System Utilization Metrics' document. Expansion factors are used to demonstrate that batch jobs are "turned around" in a reasonable manner. In the ideal case there is no queue wait time and wall clock time equals CPU time. Beyond queue wait, wall clock time itself increases if there is contention with other jobs or if there are I/O delays. Expansion factors are a useful measure of responsiveness because the divisor is the CPU time, which is generally relatively constant for a given system architecture and problem size.

The "perfect" EF is 1, although that may not really be "perfect": if jobs are executed with no delays, the system is probably underutilized. The ideal expansion factor is whatever relative number gives users a feeling of adequate turnaround. There is no real rule of thumb for this, but "2" is a good target to stay beneath for the weighted average. Higher numbers are not "bad", especially on a per-job basis, but it is appropriate to understand why they are higher.

In principle this appears to be a simple measure, but it gets complicated quickly. Modern systems have multiple or many CPUs. A "job" to a user is often not a single program execution. It may include pre-processing and post-processing steps, as well as multiple programs that execute with varying numbers of application PEs on a T3E or that use multi-tasking on a J90. One can think of the job in terms of segments, looking at each T3E MPP execution or J90 multi-tasking event separately, but where and how to apply the queue wait time then becomes difficult.

With T3E MPP jobs one multiplies by the number of PEs dedicated to the job. T3E CPU time can be measured three ways: as total CPU ticks (command + application PE), as application PE ticks, or as dedicated PE time. The command PE time is typically ignored. The "measured" CPU time for an application PE is CPU ticks. Since application PEs will "spin" (accumulate time) while waiting for communications, this will often equate to wall clock time less any I/O wait time and any time during which the application consciously chooses to sleep a PE. Since application PEs are dedicated to a job, the "CPU" time should be considered the elapsed time the PE was dedicated. For queuing systems like NQS, the PEs may be "allocated" (reserved) to a job for the entire execution, including pre- and post-processing.
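As a rough illustration of how the choice of CPU measure changes the result, the sketch below compares application PE ticks with dedicated PE time for one made-up T3E job. The function name, the io_wait value, and the assumption that every PE loses the same amount of tick time are hypothetical simplifications.

```python
def t3e_expansion_factor(pes, queue_wait, wall_clock, cpu):
    """EF = n * (QW + WC) / CPU, with n = dedicated PEs."""
    return pes * (queue_wait + wall_clock) / cpu


# Made-up job: 64 PEs, 30 minutes queued, 1 hour in execution (seconds).
pes = 64
queue_wait = 1800.0
wall_clock = 3600.0
io_wait = 600.0  # assumed time per PE in which no ticks accumulate

# Measure 1: application PE ticks (wall clock less I/O wait, per PE).
cpu_ticks = pes * (wall_clock - io_wait)
# Measure 2: dedicated PE time (the whole time the PEs were held).
cpu_dedicated = pes * wall_clock

ef_ticks = t3e_expansion_factor(pes, queue_wait, wall_clock, cpu_ticks)
ef_dedicated = t3e_expansion_factor(pes, queue_wait, wall_clock, cpu_dedicated)
```

Here ef_ticks works out to 1.8 and ef_dedicated to 1.5. With dedicated PE time the PE count cancels and the factor reduces to (QW + WC) / WC.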

With J90 multi-tasking one can use the csa equivalent of time spent on one or more CPUs to ensure the factor does not fall below the perfect "one", but the factor then ceases to be a measure based on CPU consumed (weighted problem size). Although the number cannot readily be extracted, a better measure would be to use the total CPU time with a multiplier equal to the average number of CPUs utilized for multi-tasking.

Fiction

Expansion factors can be skewed in a variety of ways.

Queuing system limitations
The discussion in this paper involves the NQS batch queue system, but all queuing systems have some vagaries. With a T3E the application PEs are usually reserved for the jobs requesting them. To schedule the workload, NQS must know how many PEs will be utilized. The requested PE count is a maximum, not the number the job will actually use. If a user requests 128 PEs but issues the mpprun with only 64, NQS will over-reserve, which may result in 64 idle PEs (note: this can be partially circumvented with NQS user exits). The accounting data reports the PEs utilized, not what was "reserved" by the NQS request. Mis-requesting can increase the expansion factor of the offending job and of other jobs.
NQS "qsub -a" intrinsic delays
The NQS "submit after" option delays a job start until after a specified time. This is an appropriate means to schedule work to begin after prime time or after some other scheduled event. Cray csa accounting measures queue wait time from the time of submittal, not from the time of requested start. The requested start time is written neither to the nqacct accounting records nor to any NQS log. Jobs scheduled this way will have inflated expansion factors.
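The size of this inflation can be seen in a small worked example. The timestamps and job sizes below are invented, and the "adjusted" figure assumes the site records the requested start time by some means of its own, since the accounting data does not contain it.

```python
from datetime import datetime


def expansion_factor(queue_wait, wall_clock, cpu, pes=1):
    """EF = n * (QW + WC) / CPU, all times in seconds."""
    return pes * (queue_wait + wall_clock) / cpu


# Hypothetical job submitted in the morning with a delayed start request.
submitted = datetime(2000, 5, 15, 9, 0)
requested_start = datetime(2000, 5, 15, 18, 0)  # the "qsub -a" time (assumed known)
started = datetime(2000, 5, 15, 18, 30)
wall_clock = 3600.0
cpu = 3600.0

# Queue wait as accounting sees it: measured from submittal.
qw_accounted = (started - submitted).total_seconds()
# Queue wait the user actually experienced: measured from requested start.
qw_adjusted = (started - requested_start).total_seconds()

ef_accounted = expansion_factor(qw_accounted, wall_clock, cpu)
ef_adjusted = expansion_factor(qw_adjusted, wall_clock, cpu)
```

For this made-up job the accounted factor is 10.5, while the factor measured from the requested start time is 1.5.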
Job chaining
Job chaining is a technique in which a job submits another job during its post-processing. Job chaining is appropriate for users who must execute work in a serial manner, or for "normal" priority users who are system-friendly and allow other users to interleave their access to resources. Chained jobs tend to have reduced expansion factors.
Mass submittal by a single user or group
Mass submittals are common for processing which has no interdependency, and are often seen as users leave for the evening or weekend and queue up their processing. The expansion factors for mass submittals will be inflated because the jobs will either be competing against each other or accumulating queue wait time while sibling jobs are executing. At ARSC users are encouraged to chain jobs, as that provides more balanced access to compute resources.
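A toy model makes the difference between mass submittal and chaining concrete. It assumes identical jobs that run one at a time with CPU time equal to run time, and it assumes a chained job starts as soon as its predecessor submits it; both functions and the requeue_wait parameter are hypothetical simplifications.

```python
def mass_submit_efs(n_jobs, run_time):
    """All jobs submitted at once; job k waits for the k jobs ahead of it."""
    return [(k * run_time + run_time) / run_time for k in range(n_jobs)]


def chained_efs(n_jobs, run_time, requeue_wait=0.0):
    """Each job is submitted by its predecessor, so it only waits requeue_wait."""
    return [(requeue_wait + run_time) / run_time for _ in range(n_jobs)]


mass = mass_submit_efs(4, 3600.0)
chained = chained_efs(4, 3600.0)

# With equal CPU per job the weighted aggregate is just the mean.
mass_aggregate = sum(mass) / len(mass)
chained_aggregate = sum(chained) / len(chained)
```

In this model four mass-submitted jobs show factors of 1, 2, 3 and 4 (aggregate 2.5), while the same work chained shows 1 for every job, even though the total elapsed time is identical.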
Use of qalter within a job
T3E jobs often have significant post-processing steps which execute on command PEs. Users who do not chain the post-processing should be encouraged to issue "qalter -l mpp_p=0" commands to "free" the PEs so that NQS can schedule other jobs. This is an effective technique for pushing more work through the system, but since the change is not logged for accounting, the job may have an inflated expansion factor, as the number of PEs is usually multiplied across the full duration of the job.
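The effect can be sketched by comparing the PE time charged by the formula with the PE time the job actually held. The phase lengths are invented, and the "adjusted" factor is only a hypothetical correction that would need data the accounting records do not provide.

```python
def pe_seconds(phases):
    """Sum of PEs held times duration over (pes, seconds) phases."""
    return sum(pes * seconds for pes, seconds in phases)


# Made-up job: 64 PEs for a 1 hour compute phase, then 30 minutes of
# post-processing after the PEs were released with qalter.
requested_pes = 64
queue_wait = 0.0
compute = 3600.0
post = 1800.0
cpu = requested_pes * compute  # dedicated PE time during the compute phase

# As usually calculated: n multiplied across the whole job duration.
ef_accounted = requested_pes * (queue_wait + compute + post) / cpu

# Hypothetical adjustment: count only the PE time actually held.
held = pe_seconds([(requested_pes, queue_wait + compute), (0, post)])
ef_adjusted = held / cpu
```

For these values the usual calculation gives 1.5, while counting only the PE time actually held gives 1.0.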
Queue, project, or user limits
Where user, project, or queue limits are enforced (as they should be), the expansion factors for multiple or mass submittals will be inflated, because while the user's first job(s) are running the other jobs must wait. If the resource management policies of a data center define that a user, project, or queue type may only run a certain number of concurrent jobs, the queued work "in violation" of those policies is accumulating queue wait time which should not be calculated into expansion factors but cannot reasonably be excluded with existing tools.
For priority users who do not have resource limits, the queued time waiting for system resources should be calculated into expansion factors. However, if it is a Tuesday afternoon in May the rules may be entirely different.
Job development
Expansion factor calculation for a system is typically done in a weighted manner. As defined above, the CPU time is used as the weight. While this is appropriate, it means that short-running jobs do not contribute much to the overall expansion factor. However, short-running jobs can be the most critical workload, since users rely on them to develop and test their work prior to production HPC executions. The best way to approach this is to measure and examine expansion factors by batch queue or by other defined classes of service.
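The weighting effect described here is easy to reproduce. In the sketch below the queue names, CPU times, and per-job factors are all invented, and the tuple layout is only an assumption about how per-job data might be held.

```python
from collections import defaultdict


def aggregate_by_queue(jobs):
    """CPU-weighted aggregate EF per queue from (queue, cpu, ef) tuples."""
    weighted = defaultdict(float)
    cpu_total = defaultdict(float)
    for queue, cpu, ef in jobs:
        weighted[queue] += cpu * ef
        cpu_total[queue] += cpu
    return {queue: weighted[queue] / cpu_total[queue] for queue in weighted}


# Made-up workload: long production jobs and short development jobs.
jobs = [
    ("production", 36000.0, 1.2),
    ("production", 72000.0, 1.5),
    ("debug", 60.0, 9.0),
    ("debug", 120.0, 6.0),
]

system_wide = sum(cpu * ef for _, cpu, ef in jobs) / sum(cpu for _, cpu, _ in jobs)
per_queue = aggregate_by_queue(jobs)
```

The system-wide aggregate here is about 1.41, nearly identical to the production queue's 1.4, while the hypothetical debug queue sits at 7.0 and is invisible unless it is examined separately.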
Job Checkpointing
Use of checkpoint/restart does accrue queue wait time while a job is checkpointed (as it should). This legitimately increases the expansion factors of checkpointed jobs. At ARSC checkpointing is used to give short jobs (like job development) and priority workload opportunities to run, which lowers the expansion factors of those jobs. Checkpointing can also be used to reschedule work to fill PE "holes" and more completely utilize application PEs, which can lower the aggregate system expansion factor.
Job restarts
Jobs which are restarted (rerun) due to unexpected system outages (e.g., T3E PE failures) are not properly measured. The csa consolidation does not include either the rerun time or the outage time as queue wait time. For jobs which are rerun, any CPU time accounted for in a pacct record for completed programs in the job or for checkpointing may be double billed.
Measurement problems (nq_wallclock)
There are a number of other factors which can affect proper measurement when trying to merge queuing system information (NQS) with accounting information (csa). An example of this is shown in 'Using csagfef' where the nq_wallclock time is mis-reported for T3E jobs not going through a pipe queue.

Conclusions

Would users (or Data Centers) try to skew expansion factor measurements?
Intentionally, no (well... probably not, but one never knows), but different techniques for job submission and work flow management are better for different circumstances. The Data Center's task is to encourage whatever methods ensure responsiveness to both users who have a defined high priority and to those who have just general access to the computing resources.

Essentially, a system-wide expansion factor is not a good measure of anything except for trend analysis and as a starting point for capacity planning. A system-wide expansion factor may also provide an alert when workload characteristics have changed. Expansion factors have more meaning when viewed on a queue or project basis, as measures of whether a class of service is getting the responsiveness it expects or needs. Comparing expansion factors from multiple systems is a very dubious practice: not only may the resource management policies and priorities vary significantly, but the data collection tools and user practices may also be very different.

Taken out of context of the workload of a system, an expansion factor is meaningless. However, expansion factors should still be monitored. A technique borrowed from transaction processing environments applies here: there one measures both the average response and the percentage of responses beneath some threshold (such as targets of average response time of 2 seconds with 95% of all transactions completing in under 1 second). Besides looking at the average expansion factor, it is useful to review the outlying jobs and to track the percentage of jobs beneath some threshold value, in order to determine whether a capacity problem exists or is developing, whether a queue structure problem exists, or whether a user assistance or education issue exists.
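A minimal sketch of this kind of threshold review follows. The job identifiers and factors are invented, and the threshold of 2.0 is borrowed from the weighted-average target mentioned in the overview purely for illustration; a site would choose its own threshold and target percentage.

```python
def fraction_below(efs, threshold):
    """Fraction of jobs whose expansion factor is beneath the threshold."""
    return sum(1 for ef in efs if ef < threshold) / len(efs)


def outliers(jobs, threshold):
    """Job identifiers whose expansion factor is at or above the threshold."""
    return [job_id for job_id, ef in jobs if ef >= threshold]


# Made-up (job id, EF) pairs for one queue.
jobs = [("a", 1.1), ("b", 1.4), ("c", 1.9), ("d", 2.5), ("e", 12.0)]
threshold = 2.0  # illustrative value only

share_ok = fraction_below([ef for _, ef in jobs], threshold)
to_review = outliers(jobs, threshold)
```

For this sample 60% of the jobs fall beneath the threshold, and jobs "d" and "e" are the ones to examine for capacity, queue structure, or user education causes.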