lforbes@arsc.edu
http://www.arsc.edu/~lforbes
Jan Julian
julian@arsc.edu
Unfortunately, the requirements for managing those resources have not changed. User jobs have to be queued, dispatched, and executed on improved but still finite resource pools. As the systems have changed, so have the tools to manage them. Process accounting has evolved (and sometimes devolved), workload management has grown more complicated in response to varying situations, and job execution has developed to fit systems that are now both parallel and vectorized.
This paper shares some of the modifications and strategies developed at ARSC to manage our 32-node X1. Installed in autumn 2003, klondike is the latest ARSC Cray to go into production. Along with it, PBS Pro has replaced NQS, more features of psched are being employed, and process accounting has reverted to standard UNIX accounting instead of Cray System Accounting (CSA).
ARSC's resource utilization reporting requirements have not changed, though, so scripts are being redesigned to incorporate input from these three loosely coupled sources (PBS accounting, psched logging, and process accounting records). Hopefully, other sites will be able to take advantage of some of the lessons we have learned, and maybe help us and Cray to improve X1 resource management.
Klondike.arsc.edu System Version Levels:
Interactive support limits are those applied to user login sessions and commands executed on the support node. Interactive application limits are those applied to aprun or mpirun commands executed from the command line. The limits database (limit_mkdb(8), /etc/acct/limits.db) contains the resource bounds for both of these scopes. We use the limits database primarily to set the interactive support limits. PBS is configured to not include 4 MSPs, thus guaranteeing that they will always be available for interactive jobs. We also limit those jobs to 15 minutes of CPU time. By using the limits database only for the interactive scopes, we have avoided conflicts with the psched and PBS configurations (queue limits for example).
Initially, we tried to identify the batch limit(s) that matched our configuration from the T3E and/or SV1. The primary limit used on the T3E is mppt. However, as Table 1 shows, mppt is the one limit not available on the X1.
Resource | Scope | Enforcer | Operating System |
---|---|---|---|
cput | Job | MOM | All |
mem | Job | MOM | All |
vmem | Job | MOM | All |
ncpus | Job | MOM | All |
walltime | Job | MOM | All |
file | Process | Kernel | All |
pcput | Process | Kernel | All |
pmem | Process | Kernel | All |
pvmem | Process | Kernel | All |
mppe | Job | MOM,Psched | UNICOS/mk,UNICOS/mp |
mppssp | Job | MOM,Psched | UNICOS/mp |
mppt | Job | MOM,Psched | UNICOS/mk |
mppfile | Process | Kernel | UNICOS/mp |
pmppt | Process | Kernel | UNICOS/mp |
pmppmem | Process | Kernel | UNICOS/mp |
pmppvmem | Process | Kernel | UNICOS/mp |
Instead, we chose to use walltime to control how long interactive jobs may execute. This configuration is only partially satisfying because a job may or may not spend all or even a majority of that time using the PEs, which is what we prefer to measure. However, until mppt time is supported, walltime is simpler to enforce than trying to calculate time based on individual PEs, i.e., pmppt.
Limiting memory and the number of PEs for the batch application scope is handled by the PBS queue definitions.
Overall, limits are still in development. As new releases of Unicos/mp, PBS, and psched are made available, the definitions and enforcement mechanisms are updated. This moving target requires re-reviewing configurations with each system and PBS upgrade. Since this is still a developing system, this is acceptable as long as it is taken into account.
As mentioned previously, the "server resources_available.mppe" global setting is set to 120 (124 application MSPs - 4) to ensure that 4 MSPs are always available for interactive work. This allows users to continue debugging or executing small post-processing interactive jobs, even when multiple large jobs are running, thus providing better throughput for both types of jobs. For a while, we also had "server resources_available.mppssp" set to 480, but we discovered that this allows the system to be oversubscribed with SSP jobs. The reason is explained later, but the solution is to unset "resources_available.mppssp".
Each queue has resources_max.mppe and resources_max.walltime set to values that depend on the size of the jobs expected to run through that queue. Each queue also has resources_max.mppssp set to 0. All three are enforced by PBS as it periodically scans the executing jobs. The PBS directives in the job submission file determine the queue chosen. However, a job may try to use more PEs on the aprun or mpirun command. When PBS detects a job using more PEs than it was allocated, it will terminate the job. Currently, both types of job execute from the same queue.
An interesting effect occurs when trying to configure a queue for both MSP and SSP jobs. Suppose a user wants to run a single MSP job. Initially, our single queue contained the following configuration settings:
Next we discovered that an SSP job executed through the single queue could use 8 SSPs instead of only 4. The resources_max.mppssp is set according to the formula specified above. The condition was much more visible when SSP jobs, which should have been quantified as large, were running out of the medium queue. The solution, also based on the above formula, is to set resources_max.mppssp to 0 in every queue.
Lastly, as the system finally loaded up, we discovered that it could be oversubscribed with SSP jobs. It turns out that the SSP formula also applies to the global server limit, resources_available.mppssp. When the system was full of work and more SSP jobs were submitted, PBS did not leave the single interactive node untouched. The solution was to unset resources_available.mppssp, thus defaulting that setting to 4 times the resources_available.mppe setting.
In both the global and queue configurations, the mppssp settings are now calculated from the mppe settings. However, there is no official documentation on this interaction, which led to a lot of trial and error as well as support from Cray.
After discovering the magic of the SSP formula, the single queue is now configured as follows:
Fairshare decisions can be very complicated and may produce unanticipated results. Rather than try to build a hierarchical structure that encompasses all the projects and users, we allocated 100 shares to the unknown group. Since no users are defined, all users are in the unknown group and receive an equal percentage of the entire group's shares.
By lumping all the users into the group unknown, a flat structure is created that requires no intervention to maintain. These are strong positives from an administrative point of view. The usage is measured based on walltime accumulated so that a user who has not been executing jobs will have priority over a user who has been, since both have the same share value. Thus when the queue is evaluated to choose the next job for dispatch, the user who has not been running will be chosen.
Since fairshare only manipulates the job order within a queue, the order in which the queues are evaluated (queue priority in our case) overrides the internal queue ordering. Thus, a user in the larger queues who has been running jobs may still have higher priority than users in the smaller queues who have not been running. Because this fits into our priority scheme, it is not a problem here (currently), but other sites may also need to consider using the job starvation settings to enforce their own policies.
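The two-level ordering described above can be illustrated with a short Python sketch. It is not PBS Pro's scheduling algorithm; it only assumes that queues are visited in priority order and that, within a queue, the user with the least accumulated walltime goes first. The `Job` class, the `dispatch_order` function, the priority scale, and the user names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    user: str
    queue_priority: int  # hypothetical scale: a higher value is evaluated first


def dispatch_order(jobs, walltime_by_user):
    """Order jobs by queue priority first, then by the owner's accumulated walltime.

    Users missing from walltime_by_user are treated as having no usage,
    which mirrors the flat structure where every user has the same share.
    """
    return sorted(
        jobs,
        key=lambda job: (-job.queue_priority, walltime_by_user.get(job.user, 0.0)),
    )


# user_a has been running in the large queue; user_b and user_c have not run at all.
jobs = [
    Job("1", "user_a", queue_priority=3),
    Job("2", "user_b", queue_priority=1),
    Job("3", "user_c", queue_priority=3),
]
usage = {"user_a": 3600.0}  # seconds of walltime already consumed

order = [job.job_id for job in dispatch_order(jobs, usage)]
print(order)
```

In this example user_c is chosen ahead of user_a inside the large queue, but user_a is still dispatched before user_b, whose job sits in a lower-priority queue even though user_b has no recorded usage.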
By enabling psched_fit, PBS uses information from psched to help determine what jobs may be dispatched next. Thus, PBS has a more accurate view of the available PEs. "Pbs_sched uses information collected by pbs_mom from PScheD to determine whether sufficient MSP and SSP resources to initiate a job exist [4]." This ties right back into the goal of making PBS the arbiter of resources, even though psched actually allocates them.
Record | Generating Daemon | Identifier Field |
---|---|---|
/var/adm/pacct# | process accounting, kernel | process ID (PID) |
/var/log/psched/PsLogYY.JJJ | political scheduling daemon, /usr/libexec/psched | application ID (APID) |
/var/spool/PBS/server_priv/accounting/CCYYMMDD | PBS server daemon, /opt/pbs/sbin/pbs_server | job ID (req) |
From these three sources, we have been able to draw most of the information necessary to report utilization to the agency that allocates X1 hours to ARSC users[5][6].
Unfortunately, unlike CSA, process accounting on the X1 does not span the entire life of a job. Within a user job, processes only exist while the application is executing on the node(s). One process is created for each PE. Trying to accumulate all this information and develop a full utilization report from just the process accounting records would require a lot of programming and calculation. Even then, information such as the time spent queued could not be generated. So it is necessary to move up one more level to see what information is available in psched.
klondike> acctcom -prtu lforbes | head -2
klondike> acctcom -prtu lforbes | grep 298073 | grep pxfd

COMMAND NAME | USER | TTYNAME | START TIME | END TIME | REAL (SECS) | CPU SYS (SECS) | CPU USER (SECS) | CPU FACTOR | SID | PID | APID | ACCT ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|
pxfd | lforbes | ? | 16:10:46 | 16:11:42 | 56.21 | 2.87 | 13.32 | 0.82 | 298073 | 298086 | 298086 | 0 |
pxfd | lforbes | ? | 16:10:50 | 16:11:42 | 52.29 | 0.62 | 12.91 | 0.95 | 298073 | 262932 | 298086 | 0 |
pxfd | lforbes | ? | 16:10:50 | 16:11:42 | 52.30 | 0.65 | 13.25 | 0.95 | 298073 | 298108 | 298086 | 0 |
pxfd | lforbes | ? | 16:10:50 | 16:11:42 | 52.42 | 0.68 | 13.32 | 0.95 | 298073 | 298097 | 298086 | 0 |
pxfd | lforbes | ? | 16:12:32 | 16:13:23 | 51.63 | 2.65 | 41.10 | 0.94 | 298073 | 298086 | 298086 | 0 |
pxfd | lforbes | ? | 16:12:32 | 16:13:23 | 51.50 | 2.66 | 41.16 | 0.94 | 298073 | 298108 | 298086 | 0 |
pxfd | lforbes | ? | 16:12:32 | 16:13:23 | 51.54 | 2.66 | 41.16 | 0.94 | 298073 | 298097 | 298086 | 0 |
pxfd | lforbes | ? | 16:12:32 | 16:13:23 | 51.45 | 2.82 | 40.96 | 0.94 | 298073 | 262932 | 298086 | 0 |
Because we are concerned about the length of time PEs are used, and the number that are used, it should be easy to identify from the above information that four PEs were in use by counting the number of processes associated with the same APID. On closer inspection, we see there are eight processes. The job was checkpointed during operation. When it resumed, a new set of processes was generated. The start and end times show the disconnect between the two sets of processes. There is nothing else in the process accounting that indicates a checkpoint occurred.
Note that the job's execution time was in the range of 103 to 107 seconds (add together the real seconds of the first and fifth processes, which share a PID, and likewise for the other pairs). Further, it is possible to tell that the job used most of that time to do actual calculation (by looking at the number of CPU seconds or the CPU factor). This gives an indication of the job's execution time, and demonstrates how well it is taking advantage of the number of PEs allocated to it. Thus accounting can be used to determine system performance.
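The counting described above can be sketched in Python using the eight records from Figure 1. This is a minimal illustration, not one of ARSC's production scripts: it assumes the 13 whitespace-separated fields shown in the figure, and it assumes that a PE keeps its PID across a checkpoint and restart, as it does in this example. The function name `summarize_by_apid` is hypothetical.

```python
from collections import defaultdict

# Data rows copied from Figure 1 (header lines omitted).
SAMPLE_RECORDS = """\
pxfd lforbes ? 16:10:46 16:11:42 56.21 2.87 13.32 0.82 298073 298086 298086 0
pxfd lforbes ? 16:10:50 16:11:42 52.29 0.62 12.91 0.95 298073 262932 298086 0
pxfd lforbes ? 16:10:50 16:11:42 52.30 0.65 13.25 0.95 298073 298108 298086 0
pxfd lforbes ? 16:10:50 16:11:42 52.42 0.68 13.32 0.95 298073 298097 298086 0
pxfd lforbes ? 16:12:32 16:13:23 51.63 2.65 41.10 0.94 298073 298086 298086 0
pxfd lforbes ? 16:12:32 16:13:23 51.50 2.66 41.16 0.94 298073 298108 298086 0
pxfd lforbes ? 16:12:32 16:13:23 51.54 2.66 41.16 0.94 298073 298097 298086 0
pxfd lforbes ? 16:12:32 16:13:23 51.45 2.82 40.96 0.94 298073 262932 298086 0
"""


def summarize_by_apid(text):
    """Group records by APID and report PE count, run segments, and real time."""
    real_by_apid = defaultdict(lambda: defaultdict(list))  # apid -> pid -> [real secs]
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 13:  # skip header or malformed lines
            continue
        real_secs, pid, apid = float(fields[5]), fields[10], fields[11]
        real_by_apid[apid][pid].append(real_secs)

    summary = {}
    for apid, by_pid in real_by_apid.items():
        totals = [sum(values) for values in by_pid.values()]
        summary[apid] = {
            "pes": len(by_pid),  # distinct PIDs, one per PE
            "segments": max(len(v) for v in by_pid.values()),  # 2 means one restart
            "min_real": round(min(totals), 2),
            "max_real": round(max(totals), 2),
        }
    return summary


result = summarize_by_apid(SAMPLE_RECORDS)
print(result["298086"])
```

For APID 298086 this reports four PEs, two run segments (one checkpoint and restart), and per-PE real time between 103.74 and 107.84 seconds, which matches the 103 to 107 second range quoted in the text.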
These calculated times will be equal to, or less than the numbers reported by either psched or PBS. They represent the finest possible utilization granularity short of hardware counters. It is also the most accurate measure of utilization. As such, they make great verification statistics, but they do not tell the whole tale of a user's job. Specifically, they cannot provide an idea of how long a user job has existed, i.e. the qsub time to when results are returned. In addition, they do not provide an idea of the complete execution time of a batch job containing multiple commands, possibly including multiple aprun or mpirun commands.
We have also observed that the memory statistics (not shown in Figure 1) are not always correct. Since memory is not a resource that we use for allocations, scheduling, or reporting, we have not yet had time to track down the problems. Currently, our evidence is anecdotal at best, but it does appear that each revision of the kernel and operating system reduces the inaccuracies.
The session ID (SID) field (added to the process accounting structure in Unicos/mp 2.4) is an indirect connection between the process accounting and the PBS accounting. The PBS job ID is not, and perhaps cannot be, stored with the process records. The SID, on the other hand, is recorded with every process. It is almost the parent PID; almost, because the SID is the PID of the login shell from which the process is initiated. For interactive jobs, this would be the user's login shell from which the aprun or mpirun command is initiated. For batch jobs, this would be the shell launched by PBS from which it executes the job script. The SID is stored in the PBS exit records for each batch job. Using the SID, one can work back to the PBS job ID via the exit record (more on PBS records later).
The application ID (APID) field is a direct connection between the process accounting and psched accounting. Once the process record is obtained, the APID is available to search out any related psched records.
Records for each psched event include posting (putting the job into psched's queue), placing (allocating PEs and memory to a job), launching (starting job execution), deleting (stopping a job for checkpointing), restarting (resuming a checkpointed job), and deleting (stopping a job upon completion). Every job record starts with a timestamp, and since each log file covers a single day, a complete history is accumulated. From these records, a history of the job, from psched's perspective, can be generated.
klondike> grep 298086 /var/log/psched/PsLog04.120
29.16:10:48 Posted apid 298086 uid 929 flags bMNAX w:d:N 4:1:0 time UNLIMITED memory UNLIMITED cmd ./pxfd
29.16:10:48 Place apid 298086 in/Domain/app gasid none modules: 0x9
29.16:10:48 Launched apid 298086
29.16:11:44 Deleted apid 298086 Connect time 00:00:00:55 dd:hh:mm:ss
29.16:12:32 Restarting apid 298086 uid 929 flags bMNFUX w:d:N 4:1:0 time UNLIMITED memory UNLIMITED cmd ./pxfd
29.16:12:32 Placed apid 298086 in /Domain/app gasid none modules: 0x8
29.16:13:25 Deleted apid 298086 Connect time 00:00:00:52 dd:hh:mm:ss
The job in Figure 2 is the same job shown in Figure 1. The APID demonstrates the connection. Again, because the job was checkpointed, there are two start (launch and restart) and two end (delete) records generated. The connect times can be added together to see that the job executed for 107 seconds (the greater of the values calculated by adding the process accounting statistics together).
Using the record timestamps, it is possible to calculate some time statistics as well. The job was posted at 16:10:48, launched at the same time, and completed at 16:13:25. Therefore it spent no time queued waiting for psched to allocate resources and ran for 157 seconds. It checkpointed at 16:11:44 and was restarted at 16:12:32. The amount of time checkpointed was 48 seconds. Subtracting the checkpoint time from the run time gives a wall execution time of 109 seconds. The difference between that and the 107 seconds of connect time appears to be the slack time needed for psched to do its work moving the job on and off the PEs. Depending on the reporting requirements, the site has to choose which time (connect time or walltime) to report.
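The time arithmetic above can be reproduced with a small Python sketch over the log lines from Figure 2. It is only an illustration under stated assumptions: the leading number in each timestamp is taken to be the day of the month, all records are assumed to fall within one month, and each checkpoint "Deleted" record is paired with the "Restarting" record that follows it. The function name `psched_times` is hypothetical.

```python
import re

# Log lines copied from Figure 2 (PsLog04.120, filtered on apid 298086).
SAMPLE_LOG = """\
29.16:10:48 Posted apid 298086 uid 929 flags bMNAX w:d:N 4:1:0 time UNLIMITED memory UNLIMITED cmd ./pxfd
29.16:10:48 Place apid 298086 in/Domain/app gasid none modules: 0x9
29.16:10:48 Launched apid 298086
29.16:11:44 Deleted apid 298086 Connect time 00:00:00:55 dd:hh:mm:ss
29.16:12:32 Restarting apid 298086 uid 929 flags bMNFUX w:d:N 4:1:0 time UNLIMITED memory UNLIMITED cmd ./pxfd
29.16:12:32 Placed apid 298086 in /Domain/app gasid none modules: 0x8
29.16:13:25 Deleted apid 298086 Connect time 00:00:00:52 dd:hh:mm:ss
"""

EVENT = re.compile(r"^(\d+)\.(\d+):(\d+):(\d+) (\w+) apid (\d+)")
CONNECT = re.compile(r"Connect time (\d+):(\d+):(\d+):(\d+)")


def to_seconds(days, hours, minutes, seconds):
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def psched_times(log_text, apid):
    """Return queued, run, checkpointed, wall-execution, and connect seconds."""
    stamps = {}   # event name -> list of timestamps in seconds
    connect = 0   # sum of the connect times reported on Deleted records
    for line in log_text.splitlines():
        match = EVENT.match(line)
        if not match or match.group(6) != apid:
            continue
        stamp = to_seconds(*(int(g) for g in match.group(1, 2, 3, 4)))
        stamps.setdefault(match.group(5), []).append(stamp)
        found = CONNECT.search(line)
        if found:
            connect += to_seconds(*(int(g) for g in found.groups()))

    posted, launched = stamps["Posted"][0], stamps["Launched"][0]
    deleted, restarts = stamps["Deleted"], stamps.get("Restarting", [])
    run = deleted[-1] - launched
    # Each checkpoint interval runs from a Deleted record to the next Restarting.
    checkpointed = sum(r - d for d, r in zip(deleted, restarts))
    return {
        "queued": launched - posted,
        "run": run,
        "checkpointed": checkpointed,
        "wall_execution": run - checkpointed,
        "connect": connect,
    }


times = psched_times(SAMPLE_LOG, "298086")
print(times)
```

For this job the sketch yields 0 seconds queued, 157 seconds of run time, 48 seconds checkpointed, 109 seconds of wall execution time, and 107 seconds of connect time, the same values derived by hand above.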
We use psched records to report interactive job resource utilization and we report the calculated wall execution time. Even though the job is not truly executing when checkpointing, those resources are still allocated to it and cannot be used by another job. Therefore the user has to be "charged" for them (the term charged is used loosely since ARSC does not receive money from users, but rather deducts time from SSP hours allocated to the user by the DOD HPC Modernization Program). As long as another user cannot access the resource, it has to be counted as utilized.
There is still information missing that psched cannot provide. In the case of batch user jobs, psched does not record when a user submits a job via qsub. Nor does it have the full history of the user's job beginning with the qsub, and ending with the returned results. Once again, another set of records must be referenced and the only source left is PBS. First, a connection between the psched records and the PBS records must be established.
Unfortunately, the SID is not stored in the psched log records, nor is the PBS job ID (in fact they probably are not available to psched without rewriting some code). The only way to tie the psched log entries to PBS accounting entries is through process accounting. Using the APID, a single process accounting entry is needed in order to establish the SID. From there, the exit record in the PBS accounting can be found, which also contains the PBS job ID. Again, the SID is useful, this time as the indirect connection between psched and PBS.
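The lookup chain can be sketched in Python as two dictionary-style searches. The record layouts below are hypothetical simplifications: each process accounting record is reduced to its APID and SID, and each PBS exit record to its job ID and session. The values come from the example job used throughout this paper. Because session IDs are process IDs and can be reused over time, a real script would presumably also restrict the search by date; that step is left out here.

```python
def job_id_for_apid(apid, process_records, pbs_exit_records):
    """Walk APID -> SID (process accounting) -> PBS job ID (exit record)."""
    sid = next((p["sid"] for p in process_records if p["apid"] == apid), None)
    if sid is None:
        return None  # no process record, e.g. accounting file already rotated
    return next(
        (e["job_id"] for e in pbs_exit_records if e["session"] == sid), None
    )


def apids_for_job_id(job_id, process_records, pbs_exit_records):
    """Walk the same tables in the other direction: job ID -> SID -> APIDs."""
    sid = next(
        (e["session"] for e in pbs_exit_records if e["job_id"] == job_id), None
    )
    return sorted({p["apid"] for p in process_records if p["sid"] == sid})


# Simplified records for the example job (hypothetical layout).
process_records = [
    {"apid": "298086", "sid": "298073", "pid": "298086"},
    {"apid": "298086", "sid": "298073", "pid": "262932"},
]
pbs_exit_records = [{"job_id": "8835.klondike", "session": "298073"}]

print(job_id_for_apid("298086", process_records, pbs_exit_records))
print(apids_for_job_id("8835.klondike", process_records, pbs_exit_records))
```

An interactive job has no PBS exit record, so the first function returns None for it; that is one way to separate interactive usage from batch usage when reading the psched log.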
Because of the conditions enumerated previously, ARSC is able to treat the PBS accounting records as complete and accurate for all batch jobs executed on the X1. We do not have to worry about calculating statistics for jobs that are suspended or displaced by other jobs; nor do we have to attempt to calculate the results of jobs sharing the same PEs. Further, because memory and disk space are not resources that we currently use to calculate allocations, we do not have to calculate the utilization either (but the data is available).
The start time can be corrupted by checkpoint though. Since the restart code in PBS is the same as the original start execution code, the start time information is overwritten with the time of the last restart. The loss of this information makes calculating a job's wait time a little more difficult, but not much. Figure 3 is an example of the records written by the user job that contained the psched job and processes shown in Figures 1 and 2.
klondike> grep 8835 /var/spool/PBS/server_priv/accounting/20040429
04/29/2004 16:10:36;Q;8835.klondike;queue=default
04/29/2004 16:10:36;Q;8835.klondike;queue=Qsmall
04/29/2004 16:10:38;S;8835.klondike;user=lforbes group=staff jobname=pxfd.q queue=Qsmall ctime=1083283836 qtime=1083283836 etime=1083283836 start=1083283838 exec_host=klondike/0 Resource_List.mppe=4 Resource_List.mppssp=0 Resource_List.walltime=00:30:00
04/29/2004 16:11:46;C;8835.klondike;
04/29/2004 16:12:31;T;8835.klondike;
04/29/2004 16:15:04;E;8835.klondike;user=lforbes group=staff jobname=pxfd.q queue=Qsmall ctime=1083283836 qtime=1083283836 etime=1083283950 start=1083283951 exec_host=klondike/0 Resource_List.mppe=4 Resource_List.mppssp=0 Resource_List.walltime=00:30:00 session=298073 end=1083284104 Exit_status=0 resources_used.cpupercent=77 resources_used.cput=00:13:18 resources_used.mem=2165440kb resources_used.mppe=4 resources_used.mppssp=0 resources_used.ncpus=1 resources_used.vmem=439229376kb resources_used.walltime=00:02:50
Given the contents of the exit record, we calculate the following usage statistics for batch jobs:
Looking at the exit record in Figure 3, the lapsed time is 268 seconds, the execution time is 170 seconds, and the wait time is 98 seconds. This leads one to wonder why there is such a large difference between the PBS statistics and the psched statistics. It is important to remember that the PBS job includes more than the application. There may be commands before and after the aprun or mpirun which are also accumulating in those time periods. Thus it is important to compare those statistics with the reserved and used times.
In this case, the reserved time is 2720 seconds and the used time is 798 seconds. The used time divided by the number of PEs (in SSPs) is 49 seconds per SSP, approximately the same as the CPU seconds per process and the connect time reported by psched. This double checking indicates that the calculations are based in reality and can be legitimately used in reporting. Of course, the remaining question is how to tie the PBS records to the process accounting and the psched logs.
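The figures quoted above can be reproduced from the exit record in Figure 3 with the Python sketch below. The formulas are inferred from those figures rather than taken from ARSC's scripts: elapsed time is assumed to be end minus ctime, execution time is resources_used.walltime, wait time is their difference, reserved time is execution time multiplied by the PE count in SSPs (4 SSPs per MSP), and used time is resources_used.cput. The function names are hypothetical.

```python
SSPS_PER_MSP = 4

# Exit (E) record copied from Figure 3.
EXIT_RECORD = (
    "04/29/2004 16:15:04;E;8835.klondike;user=lforbes group=staff "
    "jobname=pxfd.q queue=Qsmall ctime=1083283836 qtime=1083283836 "
    "etime=1083283950 start=1083283951 exec_host=klondike/0 "
    "Resource_List.mppe=4 Resource_List.mppssp=0 "
    "Resource_List.walltime=00:30:00 session=298073 end=1083284104 "
    "Exit_status=0 resources_used.cpupercent=77 resources_used.cput=00:13:18 "
    "resources_used.mem=2165440kb resources_used.mppe=4 "
    "resources_used.mppssp=0 resources_used.ncpus=1 "
    "resources_used.vmem=439229376kb resources_used.walltime=00:02:50"
)


def hms_to_seconds(text):
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return (hours * 60 + minutes) * 60 + seconds


def parse_record(record):
    """Split a PBS accounting line into its type, job ID, and key=value fields."""
    _stamp, rec_type, job_id, message = record.split(";", 3)
    fields = dict(item.split("=", 1) for item in message.split())
    return rec_type, job_id, fields


def batch_usage(record):
    rec_type, job_id, f = parse_record(record)
    if rec_type != "E":
        raise ValueError("usage is only calculated from exit (E) records")
    elapsed = int(f["end"]) - int(f["ctime"])
    execution = hms_to_seconds(f["resources_used.walltime"])
    ssps = (int(f["resources_used.mppe"]) * SSPS_PER_MSP
            + int(f["resources_used.mppssp"]))
    used = hms_to_seconds(f["resources_used.cput"])
    return {
        "job_id": job_id,
        "session": f["session"],
        "elapsed": elapsed,
        "execution": execution,
        "wait": elapsed - execution,
        "reserved": execution * ssps,
        "used": used,
        "used_per_ssp": used / ssps,
    }


usage = batch_usage(EXIT_RECORD)
print(usage)
```

The sketch gives 268, 170, and 98 seconds for the elapsed, execution, and wait times, 2720 seconds reserved, and 798 seconds used. The used time per SSP comes out as 49.875 seconds, which the text quotes as 49. The wait time is deliberately not computed from the start field, since that field is overwritten when a job restarts after a checkpoint.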
Once a process record with the APID is found, then the psched records are available as well. The APID becomes the indirect connection between PBS and psched by using the process accounting records.
By relying on three different accounting sources, there will be some inevitable "slop". The three systems do not account for the utilization in the same way, and they each have a slightly different view of the system based upon their functionality. For ARSC though, by configuring PBS to only release an amount of work less than or equal to the available physical resources, the resulting accounting information is accurate enough for us to use. By using process accounting to double check that user jobs are clicking off CPU cycles approximately equal to the values recorded by PBS, the system has enough correctness to be manageable and fair.
Of course, every site's usage calculations will be a little different, depending on what matters to its local policies and which resources are most important. However, by utilizing the accounting mechanisms of the operating system, the batch system, and the scheduling system, every job's resource utilization can be tracked. From these statistics, we are able to generate utilization reports that allow us to "charge" our users' allocations for the resources they use, and monitor the efficiency of the resource usage. By reviewing these statistics we can identify
Learning a new system like the Cray X1 can be an enjoyable challenge. Porting existing tools and requirements to the new architecture provides a method of learning the new system unavailable in other forums. However, it is necessary to have a roadmap of where everything is, or the challenge just becomes a major frustration.
This paper attempts to provide the start of a roadmap for administrators who need to learn the resource management features available on the X1 and/or the resource utilization accounting features. At ARSC, we provide daily and monthly utilization information to our funding agency, which is used to verify that users are making use of their allocations. By ensuring that the resources are made available and reporting the utilization, we are ensuring that our user population is maintained, and hopefully even grows as we grow our resources.
The unique challenges of the X1 lie in the way that three different resource management and accounting products (the operating system, PBS, and psched) have to be configured to work together to create an efficient system. There is documentation available for each of the individual products, but only time and experience allow administrators to really make everything work together to create a usable platform. Hopefully our experience will help other X1 sites reduce the time required to achieve that goal.
Jan H. Julian is a Technical Services System Analyst at the Arctic Region Supercomputing Center. He has been working with the Cray X1 for the last 9 months. Prior experience includes 12 years as a Systems Engineer with IBM Corporation and 7 years with the National Weather Service as a system administrator.