Last Update 2000-05-15 by kcarlson


Reporting Unicos and Unicos/mk System Utilization


Table of Contents:

  1. Title page
  2. Outline
  3. Body of presentation (this document)
    1. Objectives
    2. Methods
    3. Lessons
  4. Document index
CUG Summit 2000
Noordwijk, The Netherlands
Tuesday 23 May 2000 14:30
Track C: Operations
Kurt Carlson
Arctic Region Supercomputing Center
University of Alaska
ARSC

Objectives (ToC - Objectives - Methods - Lessons - References)

This presentation will demonstrate data sources, tools, techniques, and background perspectives for Unicos[/mk] system utilization reporting. Although the primary function of accounting data is resource reporting, we will also explore how UNIX accounting data can be used for other purposes. Some sample automated reports copied from the ARSC intranet are linked from 'Sample Reports'.

Daily, weekly, monthly, and historical reports on job, queue, project, and aggregate expansion factors are generated from csagcon, csagfef, and nawk post-processing of the daily session Super-record files generated by csarun. An ad hoc example:

yukon: nawk -f /usr/local/adm/sbin/csanawk.yukon 200004
#              num  exp Expan  CPUtime  MPPtime  DEDtime  PE-Rsrv Execute Q_Wait
#Queue Name   jobs jobs Fact.  (hours)  (hours)  (hours)  (hours) (hours) (hours)
#==========   ==== ==== =====  =======  =======  =======  ======= ======= =======
*minimum     99746    0   .     91.631      0.3      0.8      0.8     0.0     0.0
*root        16239    0   .    304.569      0.0      0.0      0.0     0.0     0.0
Interactive    689    0   .     80.034     85.7    113.2    113.2     0.0     0.0
Qgrand          12   12 104      0.018    261.6    308.6    380.7     1.5   153.0
Qlarge          62   62  7.79    0.039    320.2    341.1    372.8     8.7    62.8
Qmedium         23   23  3.28    0.024     48.1     49.4     58.6     3.7     8.4
Qsmall          59   55  2.13    0.575     55.8     66.3     73.8    10.5    12.0
Qxlarge         15   15  6.71    0.021    170.4    174.1    195.5     2.6    14.9
Qxxlarge        10   10  3.42    0.009    217.7    224.6    243.2     1.9     4.7
System           1    0   .      0.458      0.0      0.0      0.0     0.9     0.0
gcp_grand        1    1 12.75    0.001    883.0    890.9    891.7     3.5    40.9
gcp_xxlarge    362  359  1.54    0.469  93408.2  98651.2 102061.6  1274.8   694.8
grand            2    2  1.64    0.001    700.6    701.0    710.2     4.1     2.6
large          157  155  2.52    1.821  16174.0  16260.9  16482.1   401.5   675.1
medium         153  152  1.61    2.026  12070.5  12291.9  12504.1   704.5   437.5
single          41    0   .     23.166      0.0      0.0      0.0    70.0     0.3
small          162  139  1.60   24.845   2689.8   2722.7   2771.3   419.8   239.9
xlarge          54   53  2.89    1.048  21405.3  21731.7  21960.7   291.0   544.0
xxlarge         14   14  1.54    0.075   6110.3   6199.8   6342.4    51.5    28.5
#==========   ==== ==== =====  =======  =======  =======  ======= ======= =======
# 1188,689    1877 1052  2.15  530.830 154601.5 160728.4 165162.8  3250.5  2919.5

Monthly and on-demand month-to-date CPU and other system resource utilization reports, by project and in aggregate, are generated by post-processing the daily cacct files produced by csacon with the ARSC uacrep program. An ad hoc example:
yukon: uacrep -level 2 -h -f "`ls -1r 20000?.Sum`" -show mpp
#Host:  yukon           #Tags:  200004 200003 200002 200001
#
User   MPP
Count  Hours        200004 200003 200002 200001
----- ------        ------ ------ ------ ------
   57 579374  85.2% 151144 153628 141294 133308 DOD
    2  52035   7.7% 9874.0  14374  14568  13218 DOD/ARMY
   27 379904  55.9% 106170  94679  92594  86461 DOD/CHALLENG
   28 147436  21.7%  35100  44575  34132  33628 DOD/NAVY

   31                                           ROOT

   62 100556  14.8% 9584.4  27646  25176  38149 UA
   34  99094  14.6% 9502.5  26885  24623  38083 UA/ACADEMIC
   25  456.8           8.8  408.7   34.7    4.6 UA/ARSC
    1                                           UA/FEDERAL
    2 1005.3    .1%   73.1  352.2  518.7   61.3 UA/IARC
----- ------        ------ ------ ------ ------
  150 679930        160728 181274 166470 171457 Totals
#
Both of the report methods above utilize consolidated text data files, which can also be used for a variety of ad hoc "what if" reports as well as for any general "billing" report requirements. The process is to consolidate the raw data into a readily usable form.

Methods (ToC - Objectives - Methods - Lessons - References)

Everything starts with the raw accounting data. This is binary data defined by structures in /usr/include/sys/acct.h. Cray csa accounting differs from traditional UNIX accounting through extensions that retain job information (jobid and seqno), process information (pid and ppid), additional metrics, projects, and a direct link to the queuing system (NQS). Even ignoring the resource management and billing perspectives, simply using the pid and ppid fields in csa accounting allows reconstruction of process relationships, which is very useful in post-analysis of many system or application problems. The raw accounting data for Unicos[/mk] is in /usr/adm/acct/day/pacct and nqacct. The tool provided with Unicos[/mk] and most UNIX systems to dump and report pacct files is acctcom. Another tool, uakpacct (available from 'ftp.alaska.edu/pub/sois/'), also dumps, reports, and filters both pacct and nqacct files. See 'Why uakpacct and|or acctcom' for a comparison of these tools. One or both are necessary for filtering and interpreting raw accounting data.
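
As an illustration of the pid/ppid idea only, the sketch below runs nawk over a hypothetical whitespace-delimited dump of terminated processes; the five columns are assumptions for the example and are not the real acct.h or uakpacct layout:

# Sketch: assumed input columns are  end_time  pid  ppid  userid  command
# Pass 1 remembers each command name by pid; pass 2 prints every process with
# the command of its parent.  Reading the file twice handles parents which
# terminate (and are therefore dumped) after their children.
nawk '
    NR == FNR { cmd[$2] = $5; next }
    {
        parent = ($3 in cmd) ? cmd[$3] : "?"
        printf "%s %-8s %-12s parent=%s\n", $1, $4, $5, parent
    }
' procs.txt procs.txt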

For resource utilization reporting the raw files must be aggregated/consolidated in some manner. Under Unicos[/mk] this is done with csarun, which is summarized in 'csarun overview'. This job consolidates the raw per-process accounting records into job records on a daily basis, carrying incomplete jobs forward over successive runs (days) until they complete. The consolidated session Super-record files can then be used for general reporting.

The csa tools provided by Cray perform no data archiving, but they do provide exit points for sites to define their own. The ARCHIVE1 (csa.archive1) exit is intended for saving the raw accounting files and the ARCHIVE2 (csa.archive2) exit for saving the session Super-record files. Both the raw and the consolidated files should be saved: the raw files are useful for problem analysis and the Super-record files for reporting. These files can be large; typically the monthly archive is more than 1/2 gigabyte each for both the T3E and J90 at ARSC (over 2 GB total). To avoid frequent references to these files, post-processing in the csarun USEREXIT (csa.user) is also recommended to extract the csacon and csagcon/csagfef information. This exit can be used to reduce the Super-record to data readily usable for general and most ad hoc reports.
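
The exit itself can be quite small. The following is only a minimal sketch of a csa.archive1-style exit, not the ARSC production script; the archive directory, the date tag, and the use of compress are assumptions, while the source paths follow /usr/adm/acct/day as noted above:

#!/bin/sh
# Sketch of a csa.archive1-style exit (assumed paths and names):
# copy the day's raw accounting files to a dated archive directory and
# compress them, since a month of raw data can exceed half a gigabyte.
_day=/usr/adm/acct/day
_arc=/usr/adm/acct/archive/`date +%Y%m%d`   # csarun's own date tag could be used instead

mkdir -p ${_arc} || exit 1
for f in ${_day}/pacct* ${_day}/nqacct*; do
    [ -f "$f" ] && cp -p "$f" ${_arc}
done
compress ${_arc}/* 2>/dev/null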

By default csarun provides some general reports which are similar to the standard UNIX runacct reports. One very useful extension is the cacct files reduced from the Super-record by the csacon execution in the REPORT phase of csarun. The cacct files are never deleted by csa and have sufficient information to report resource utilization (CPU and otherwise) both in aggregate and by project. The primary resource reports at ARSC utilize post-processing of the cacct files. Since the files are never removed by csa, historical processing is possible once a site develops appropriate reduction tools.

Enhanced reporting from the cacct files is done at ARSC with a script called from the csa.user exit to consolidate a month-to-date project summary file. The script uses csacrep for extraction and the ARSC uacrep program for reporting. The data consolidated from this source is used to generate monthly billing reports and is also available for end-user on-demand reports.

The expansion factor reporting (aggregate, project, and queue) is produced by csagcon and csagfef/nawk post-processing of the session Super-record file by a script invoked from the csa.user exit. For resource utilization reports csagcon/csagfef are much more flexible than csajrep. The csagfef program provides a nawk-like syntax for writing custom reports.
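
To give the flavor of that style of post-processing (the real csanawk.yukon and csagfef report definitions ship with the uakpacct distribution), the sketch below rolls up a hypothetical per-job text file, with assumed columns of queue name, queue-wait hours, and wall-clock hours, into per-queue job counts and a simple aggregate expansion factor:

# Sketch only: the three input columns (queue, q-wait hours, wall-clock hours)
# are assumptions, not the real csagcon/csagfef record layout.
nawk '
    { n[$1]++; qw[$1] += $2; wc[$1] += $3 }
    END {
        printf "%-12s %5s %8s %8s %6s\n", "#Queue", "jobs", "Q_Wait", "WallClk", "Expan"
        for (q in n) {
            run = wc[q] - qw[q]            # execution time = wall clock - queue wait
            ef  = (run > 0) ? wc[q] / run : 0
            printf "%-12s %5d %8.1f %8.1f %6.2f\n", q, n[q], qw[q], wc[q], ef
        }
    }
' jobs.txt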

All ARSC scripts and programs referenced above are available at 'ftp.alaska.edu/pub/sois/' as part of the uakpacct distribution.


Lessons (ToC - Objectives - Methods - Lessons - References)

* * *
There are a variety of things which accounting data is good for:
  1. The poor man's audit:
    Granted, audit logs are better, but if you have nothing else to work with, accounting data can provide clues. From "ls -l" (below also using an enhanced ls named uals to get the second the file was modified) one can determine when a file was modified. Often (though certainly not always) a file is written just before the writing process terminates, so accounting suggests who likely modified nqs.before:
    chilkoot: ls -l nqs.before
    -rw-r-----   1 kcarlson staff      14394 May  4 08:01 nqs.before
    chilkoot: uals -Ee %Y-%m-%d\ %H:%M:%S nqs.before
    2000-05-04 08:01:35 nqs.before
    chilkoot: uakpacct -T 08:01:35 -at 5 -80
    #End_Date/Time Start_hh:mm:ss  Userid   Command    CPU_Sec :tty(hex) flg sig
    #-------------       --------  ------   -------    ------- :-------- --- ---
    05/04 08:01:36       08:01:21  kcarlson vi             0.8 :00000d0e 000   0
    05/04 08:01:37       08:01:37  kcarlson uals           0.1 :00000d0e 000   0
    
    The uakpacct command above looks for process terminations within 5 seconds of the file modification time. Yes, kcarlson was the one who modified nqs.before.
  2. Event sequencing:
    When looking at a problem involving multiple jobs, accounting can be used to sequence the events:
    platinum: uakpacct -event +tty ffffffff | sort | more
    ---- ----   ____ -------  ------   -----    ---      --- --- -------- --------    -------
    0503@180739 07f1:xlock   :#1561    #206    :003c0001
    0504@080356 07d5:uals    :#1561    #206    :003c0000
    0504@080356 07d5:uals    :#1561    #206    :003c0000 040   0:     0.0      0.0        0.1
    0504@080453 07d6:rcp     :#1561    #206    :003c0000
    0504@080456 07d6:rcp     :#1561    #206    :003c0000 042   0:     0.0      0.1        3.0
    0504@081815 07f1:xlock   :#1561    #206    :003c0001 040   0:     2.6      5.0    51036.2
    0504@081844 07f2:rlogin  :#1561    #206    :003c0000
    0504@081849 07f2:rlogin  :#1561    #206    :003c0000 040   2:     0.0      0.0        4.7
    0504@081856 07f3:ping    :#1561    #206    :003c0000
    0504@081859 07f3:ping    :#1561    #206    :003c0000 042   0:     0.0      0.0        3.5
    0504@082624 084f:xlock   :#1561    #206    :003c0001
    ...
    Date Time   seq# Command  userid   group    tty      flg sig ucpu_sec scpu_sec    elapsed 
    
    The above is not from a "real" problem; it just demonstrates activity from two ttys (one primarily in xlock), requesting uakpacct to write start and stop records and then sorting them.
  3. How did a process die:
    This example comes from Irix. The problem was kerberized Macintosh BetterTelnet sessions on Octanes disappearing if left idle.
    mcgrew: uaklogin -p $$ # what does a live session look like?
    05/04@172419
       _pid__ _ppid_ __user__ Started   VSZ  SZ  P CPU %CPU |__Command__
    
     0      1      0 root       Apr 27 1712 107  * 0:05   0 |/etc/init
    -1    277      1 root       Apr 27 1872 117  * 0:10   0 |/usr/etc/inetd
    -2 176405    277 root     17:12:59 3200 200  * 0:00   0 |telnetd -a valid
    -3 176048 176405 root     17:13:06 2816 176  * 0:00   0 |login -h brr.arsc.edu -p -F kcarlson
    -4 174739 176048 kcarlson 17:13:06  864  54  * 0:00   0 |-ksh
    
    mcgrew: uakpacct -u kcarlson -80 -c sh	# find the session which died
    #End_Date/Time Start_hh:mm:ss  Userid   Command    CPU_Sec :tty(hex) flg sig
    #-------------       --------  ------   -------    ------- :-------- --- ---
    05/04 14:14:34       14:14:34  kcarlson sh             0.0 :003c0003 041   0
    05/04 16:38:19 04/27 15:25:42  kcarlson sh             0.2 :003c0002 040   1
    
    kcarlson@mcgrew: uakpacct -T 16:38:19 -at 2 -80	# find parents
    #End_Date/Time Start_hh:mm:ss  Userid   Command    CPU_Sec :tty(hex) flg sig
    #-------------       --------  ------   -------    ------- :-------- --- ---
    05/04 16:38:17 04/27 15:25:40  root     telnetd        2.2 :ffffffff 042   0
    05/04 16:38:19 04/27 15:25:42  #1561    sh             0.2 :003c0002 040   1
    05/04 16:38:19 04/27 15:25:42  root     login.kr       0.0 :003c0002 042   1
    
    It appears telnetd is dying and causing a SIGHUP to the shell. From this we know where to start the debugging effort.
  4. What was running when a PE died:
    When PEs die (they do, don't they), the console messages are cryptic about which process got whacked:
    
    31 Oct 01:28:25 (PE:0x10c) WARNING: Oct 31 01:28:04(AKS) (GPM)
    		Pid 140078 "main.0" (uid 2008) killed due to pe failure
    
    You cannot use "uakpacct -pid" because individual PEs do not write accounting records, their use is summarized in the MPP records. However, it is still easy to quickly identify what happened:
    yukon: uakpacct -c \*main.0\*
        Date Time    jobid userid group  Command    flg sig ucpu_sec scpu_sec elapsed pid   ppid
        ---- ----    ----- ------ -----  -------    --- --- -------- -------- ------- ---   ----
    19991030@225025: 12781:zxy52b #1924 :main.0   Y:006   9:401847.9   2650.0 13059.5 25852 25840
    19991030@225025/ 12781:zxy52b:#1924 /MPP-Record /32 mppe
    19991031@012813: 13345:zxy52b #1924 :main.0   Y:006   0:664293.9   4388.0 20912.8 53009 52997
    19991031@012813/ 13345:zxy52b:#1924 /MPP-Record /32 mppe
    yukon: /usr/local/adm/bin/uakpacct -job 12781,13345 -f /usr/adm/acct/day/nqacct
    19991031@012808/ 12781:zxy52b/ 42903/TERM:noper rerun yukon       run214b      large
    19991031@012809/ 13345:zxy52b/ 42903/INIT:started     yukon       run214b      large
    19991031@071650/ 13345:zxy52b/ 42903/TERM:exited      yukon       run214b      large
    
  5. What was running in Oracle (non-Cray example):
    This is really a non-Cray example. In a previous life we automated a nightly "uakpacct -mgmt" report listing all processes which consumed more than 15 minutes (900 seconds) of CPU from each weekday's pacct files, both to identify possible misuse and to flag application code which might require tuning:
    glacier: uakpacct -cpu 900 -f "`ls /var/adm/pacct*`" -m
    #End_Date/Time Start_hh:mm:ss  Userid   Command CPU_Sec  Elapsed Rate
    --------------       --------  ------   ------- -------  ------- ----
    09/04 10:42:03       08:45:06  jnblb    oracle   1026.9   7017.0 14.6
    09/04 12:25:04 09/03 12:00:32  sxfinp   oracle   4428.2  87872.0  5.0
    
    From the example above, we want to determine what jnblb was doing in Oracle. Since Oracle children tend to start or end at about the same time as their parents (though Oracle servers may live longer lives), filtering for processes starting or ending within 30 seconds of the Oracle process gives a good clue as to what was being executed under Oracle:
    glacier: uakpacct -f /var/adm/pacct6 -m -r 0 -T 10:42:03 -t 08:45:06 -at 30
    #End_Date/Time Start_hh:mm:ss  Userid   Command CPU_Sec  Elapsed Rate
    --------------       --------  ------   ------- -------  ------- ----
    09/04 10:42:03       08:45:06  jnblb    oracle   1026.9   7017.0 14.6
    09/04 10:42:04       08:45:05  jnblb    RPBVLDT     3.1   7019.0  0.0
    09/04 10:42:05       08:45:00  jnblb    G0.19970    0.2   7025.0
    09/04 10:42:05       08:45:00  jnblb    ua_gur_r    0.0   7025.0
    09/04 10:42:05       08:44:57  jnblb    ksh         0.3   7028.0
    
    The user was executing the RPBVLDT program, which may need modifications to execute more efficiently.
  6. And more (use your imagination)...
Where accounting data does not help: (Lessons)
  1. Running processes have not yet written accounting records. Use ps, or use uaklogin, which combines ps with a process tree and with login information from the utmp file and the tty device files:
    chilkoot: uaklogin -u kcarlson
    05/04@144242
    __Entry Time_:__pid__:__user__:__Access___:__Modify___:_id_:__line__:__type__:_trm_ext:__login host__
    
    05/04@080039 :  46228:kcarlson:   00.00.00:   00.00.00:p014:ttyp014 :User    :   0,  0:brr.arsc.edu
    
    chilkoot: uaklogin -u kcarlson -ps
    05/04@144216
    __Entry Time_:__pid__:__user__:__Access___:__Modify___:_id_:__line__:__type__:_trm_ext:__login host__
    
           _pid__  _ppid_ __user__ Started    VSZ   SIZE   CPU_Time %CPU |__Command__
    
    05/04@080039 :  46228:kcarlson:   00.00.00:   00.00.00:p014:ttyp014 :User    :   0,  0:brr.arsc.edu
    
        3   46228   46203 kcarlson     -      592     148       3.65     | [ksh]
        4   64498   46228 kcarlson     -      496     124       0.04     | [uaklogi]
        5   64499   64498 kcarlson     -     6560    1640       0.13     | [ps]
        4   64496   46228 kcarlson     -      608     152       4.99     | [egrep]
    
    Non-terminal processes:
    
       -1   20294       1 kcarlson     -      944     236       2.91     | [xgraph]
    
  2. If the system unexpectedly dies (like a power outage), accounting records are not written.
  3. Accounting only logs 8 to 16 bytes of the command name and no path information.
  4. One must understand process interactions and apply some logical heuristics to relate processes in most flavors of UNIX. Unicos[/mk] does provide pid and ppid, which simplifies matters.
  5. The time resolution is only to seconds.
  6. Both Cray csarun and standard UNIX runacct by default dispose of the pacct files in nightly processing. Use csa.archive1 on Crays and modify runacct elsewhere to preserve the files for a finite period of time (like a backup cycle), such as:
    
    	# rm -f ${_adm}/Spacct*.${_date}
    	find ${_adm}/Spacct*.* -mtime +7 -exec rm -f {} \;
    
There are a variety of information sources on system usage besides the accounting records. The wtmp file will have a record of logins, as will syslog if tcp wrappers or some other logging has been enabled. NQS log files will have records of batch jobs. If auditing has been configured, additional information such as file opens and closes may be logged in the slogfile (see "man reduce").
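
For example, those sources can be consulted directly with standard tools (the paths below are illustrative and vary by system and configuration):

who /etc/wtmp | grep kcarlson                  # login history from the wtmp file
grep kcarlson /usr/adm/syslog                  # syslog entries, if such logging is enabled
grep 'Request 60308' /usr/spool/nqs/logfile    # NQS log entries for one batch request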

* * *
Constructing a session Super-record file of known data is useful for validating changes in post-processing reporting programs, as well as for understanding the data. An example of 'Making a Super-record' is included; that particular example demonstrates where the csagcon nq_wallclock may be mis-reported, which affects expansion factor calculations.

Other session Super-record files can be built for other reasons. On 27 April 2000 at 11:20am ARSC experienced an unexpected power outage. Needless to say, a number of jobs were interrupted during execution. This event provided a good opportunity to examine what exception records (rerun, restart, and checkpoint) look like. The full example of extracting these records will be included in the uakpacct-v3.2 distribution.

On the T3E extract and examine several jobs impacted by the outage:


yukon: S="60308,60317,60319,60334,60337"
yukon: uakpacct -f "`ls -1 2000042?.0001/Wnqacct1`" -seq $S -bin nqacct1 -out nq
yukon: sort -k1.35,1.43 -k1 nq | egrep 'INIT|TERM|New'
20000427@023321/     0:hav14a  / 60308/RECV:New       yukon    runhski    gcp
20000427@054734/ 58498:hav14a  / 60308/INIT:started   yukon    runhski    gcp_xxlarge
20000427@073950/ 58498:hav14a  / 60308/TERM:held      yukon    runhski    gcp_xxlarge
20000427@074147/ 58498:hav14a  / 60308/INIT:restarted yukon    runhski    gcp_xxlarge
20000427@163624/    19:hav14a  / 60308/INIT:rerun     yukon    runhski    gcp_xxlarge
20000427@194806/    19:hav14a  / 60308/TERM:exited    yukon    runhski    gcp_xxlarge
20000427@045911/     0:pan34x  / 60317/RECV:New       yukon    s.cmd      gcp
20000427@065504/ 58692:pan34x  / 60317/INIT:started   yukon    s.cmd      gcp_xxlarge
20000427@073822/ 58692:pan34x  / 60317/TERM:held      yukon    s.cmd      gcp_xxlarge
20000427@074158/ 58692:pan34x  / 60317/INIT:restarted yukon    s.cmd      gcp_xxlarge
20000427@163624/    20:pan34x  / 60317/INIT:rerun     yukon    s.cmd      gcp_xxlarge
20000428@000457/    20:pan34x  / 60317/TERM:exited    yukon    s.cmd      gcp_xxlarge
20000427@050836/     0:lin20c  / 60319/RECV:New       yukon    run.unst05 mpp
20000427@061037/ 58561:lin20c  / 60319/INIT:started   yukon    run.unst05 medium
20000427@073459/ 58561:lin20c  / 60319/TERM:held      yukon    run.unst05 medium
20000427@103947/ 58561:lin20c  / 60319/INIT:restarted yukon    run.unst05 medium
20000427@163658/ 58561:lin20c  / 60319/INIT:restarted yukon    run.unst05 medium
20000427@180720/ 58561:lin20c  / 60319/TERM:exited    yukon    run.unst05 medium
20000427@065327/     0:dev07d  / 60334/RECV:New       yukon    proms3d    mpp
20000427@071436/ 58749:dev07d  / 60334/INIT:started   yukon    proms3d    xlarge
20000427@073023/ 58749:dev07d  / 60334/TERM:held      yukon    proms3d    xlarge
20000427@163817/ 58749:dev07d  / 60334/INIT:restarted yukon    proms3d    xlarge
20000427@233245/ 58749:dev07d  / 60334/TERM:exited    yukon    proms3d    xlarge
20000427@071436/     0:mas26x  / 60337/RECV:New       yukon    pips128_s2 gcp
20000427@074058/ 58829:mas26x  / 60337/INIT:started   yukon    pips128_s2 gcp_xxlarge
20000427@224414/  1038:mas26x  / 60337/INIT:rerun     yukon    pips128_s2 gcp_xxlarge
20000428@051025/  1038:mas26x  / 60337/TERM:exited    yukon    pips128_s2 gcp_xxlarge
yukon: cut -c17-22 nq | sort -u | grep -v " 0" | while read F; do
>	 if [ -z "$X" ]; then X=$F; else X="$X,$F"; fi; done
yukon: uakpacct -f "`ls -1 2000042?.0001/Wpacct1`" -job $X -bin pacct1 -out /dev/null 

Both nqacct and pacct records for the five jobs were extracted to create a Super-record for reporting. The five jobs:

seqno Owner    Start End     elapse characteristics
----- -----    ----- ---     ------ ---------------
60308 hav14a   02:23 19:48   17:25  started, held, restarted, rerun
60317 pan34x   04:59 00:04+1 19:05  started, held, restarted, rerun
60319 lin20c   05:08 18:07   12:59  started, held, restarted, restarted
60334 dev07d   06:53 23:32   16:39  started, held, restarted
60337 mas26x   07:14 05:10+1 21:56  started, rerun
Four of the jobs were held (checkpointed) prior to the outage; it is common at ARSC for T3E jobs to be checkpointed to reschedule priority work or to optimize PE usage. Three of the jobs were rerun, and one of those had never been checkpointed. Of the four held jobs, one was checkpointed at the time of the outage, one restarted from its checkpoint after the outage, and two were rerun...why? From the NQS log:

04/27 16:36:36 I shepherdprocess(): Request 60308.yukon: Attempting to restart request
04/27 16:36:36 I shepherdprocess(): Request 60317.yukon: Attempting to restart request
04/27 16:36:07 I shepherdprocess(): Request 60319.yukon: Attempting to restart request
...
04/27 16:36:08 W restartr(): Unable to restart job from file=/ckpt/nqs/chkpnt/++DiJ+++++1, error=50.
04/27 16:36:08 W restartr(): Errno 50 = File changed.
04/27 16:36:08 W nqs_restart(): Request 60308.yukon: NOT restarted, errno=50, File changed.
04/27 16:36:08 W nqs_restart(): Request 60308.yukon: Unrecoverable restart error.
04/27 16:36:08 I nqs_reqexi(): Request 60308.yukon: Exiting from queue gcp_xxlarge.
04/27 16:36:08 W shoreqmsg(): Request 60308.yukon: Server reports RCM_RESTARTERR.
04/27 16:36:08 I nqs_mai(): Request 60308.yukon: Sending requeued mail to hav14a  >.
04/27 16:36:08 I nqs_mai(): premail created by fork, pid=1456, ppid=1449.
04/27 16:36:09 I nqs_mai(): Calling system function execve(2), file=/usr/bin/mailx.
04/27 16:36:09 I reqcom(): Request 60308.yukon: Exiting from queue gcp_xxlarge.
04/27 16:36:09 I reqcom(): Component shepherd exiting, pid=1449, rc=2.
...
04/27 16:36:09 W nqs_restart(): Request 60317.yukon: NOT restarted, errno=50, File changed.
04/27 16:36:09 W nqs_restart(): Request 60317.yukon: Unrecoverable restart error.
...
04/27 16:36:58 I enf_quotas(): Request 60319.yukon: Resumed.
04/27 16:36:58 I nqs_restart(): Request 60319.yukon: Restarted, ID=58561.
In the two checkpoint-rerun cases the user or the job modified a file, making restart impossible. For a restart to work, the inodes of all files must be intact.

What do the jobs look like in resource reports:


yukon: /usr/lib/acct/csagcon -S Super-record -ujac -R $SDIR/csagcon.yukon -o csagcon.out
yukon: /usr/lib/acct/csagfef -f csagcon.out -D queue=0 -D peseg=0 $SDIR/csagfef.yukon
#
#nq_quename           pm_t_pe_max pb_t_   pm_t_pe pm_t_pe nq_wall nq      T:1   nq_   nq_reqname
#                                usr:time  _time   _ctime   clock _qwtime       seqno
#                             PEs CPUtime MPPtime DEDtime WallClk Q_Wait  Expan
#Queue Name Project  Userid   max (hours) (hours) (hours) (hours) seconds Fact. SeqNo RequestName
#========== =======  ======   === ======= ======= ======= ======= ======= ===== ===== ===========
gcp_xxlarge NSWCC004 hav14a    42 207.078 207.077 211.882  17.246   11770  1.23 60308 runhski
gcp_xxlarge ONRDCC28 pan34x    50 403.066 403.064 406.366  19.096    7169  1.12 60317 s.cmd
medium      ONRDC084 lin20c    20  55.751  55.750  56.406  12.978   14809  1.46 60319 run.unst05
xlarge      NRLDC213 dev07d    64 442.698 442.697 456.202  16.655   34143  2.32 60334 proms3d
gcp_xxlarge NPSCAC43 mas26x   128 773.241 773.239 819.958  21.930    1582  1.02 60337 pips128_s2
*root       root     .          0   0.000   0.000   0.000    .       .      .
*minimum    minimum  .          6   0.002   0.000   0.000    .       .      .
#                            ==== ======= ======= ======= ======= ======= =====
#    11:    5:nqs     0:tty     5  1881.8  1881.8  1950.8    87.9   69473  1.20
Examining the above shows seqno=60337 with an expansion factor of 1.02; that cannot be right, since the T3E was down from 11:20 until 16:35. The csa consolidation does not detect the abrupt outage, so neither the 5+ hours of down time nor the rerun time is counted as queue wait (or should that even count as queue wait?). The other jobs which were running at the time of the outage reflect a similar problem: seqno=60319 had initial wait and held time of 4:07 hours, which matches the reported queue wait; the outage time and rerun time were not counted.

The job which was held through the outage, seqno=60334, has a correct expansion factor representing the time waiting to start (0:21 hours) and the time held (9:08 hours) with an actual execution of initially 16 minutes in the morning and 6:54 hours in the evening.

Unfortunately, there is no automated way around the mis-reported expansion factors for jobs executing at the time of the outage. An alternative method of expansion factor calculation can be used, defining the "execution time" as:
     DEDtime / PEs_max + (CPUtime - MPPtime)
instead of subtracting the queue wait time from the wall clock time in system. Using this method, we see expansion factors of:


yukon: /usr/lib/acct/csagfef -f csagcon.out -D queue=0 -D peseg=0 -D expan=2 $SDIR/csagfef.yukon
#
#nq_quename           pm_t_pe_max pb_t_   pm_t_pe pm_t_pe nq_wall nq      T:2   nq_   nq_reqname
#                                usr:time  _time   _ctime   clock _qwtime       seqno
#                             PEs CPUtime MPPtime DEDtime WallClk Q_Wait  Expan
#Queue Name Project  Userid   max (hours) (hours) (hours) (hours) seconds Fact. SeqNo RequestName
#========== =======  ======   === ======= ======= ======= ======= ======= ===== ===== ===========
gcp_xxlarge NSWCC004 hav14a    42 207.078 207.077 211.882  17.246   11770  3.42 60308 runhski
gcp_xxlarge ONRDCC28 pan34x    50 403.066 403.064 406.366  19.096    7169  2.35 60317 s.cmd
medium      ONRDC084 lin20c    20  55.751  55.750  56.406  12.978   14809  4.59 60319 run.unst05
xlarge      NRLDC213 dev07d    64 442.698 442.697 456.202  16.655   34143  2.33 60334 proms3d
gcp_xxlarge NPSCAC43 mas26x   128 773.241 773.239 819.958  21.930    1582  3.42 60337 pips128_s2
*root       root     .          0   0.000   0.000   0.000    .       .      .
*minimum    minimum  .          6   0.002   0.000   0.000    .       .      .
#                            ==== ======= ======= ======= ======= ======= =====
#    11:    5:nqs     0:tty     5  1881.8  1881.8  1950.8    87.9   69473  3.17
While those are more realistic expansion factors for these jobs, this method in general would exclude I/O wait time and could be grossly wrong for any job containing multiple mpprun commands. Fortunately, power outages severe enough to cause such anomalies are very rare, at least at ARSC.
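
As an arithmetic check, the two methods can be compared for seqno=60308 using the values reported above:

     T:1 (wall clock minus queue wait):  17.246 / (17.246 - 11770/3600)            = 1.23
     T:2 (alternative execution time):   17.246 / (211.882/42 + 207.078 - 207.077) = 3.42

The alternative method charges the job only for the roughly five hours its 42 PEs were actually dedicated, rather than for all of its in-system time that was not explicit queue wait.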

What do the MPP and process accounting records look like for these jobs:


yukon: uakpacct -f pacct1 -Ro -80
#End_Date/Time Start_hh:mm:ss  Userid   Command    CPU_Sec :tty(hex) flg sig
#-------------       --------  ------   -------    ------- :-------- --- ---
04/27 07:30:22       07:14:45  dev07d   proms3d    49871.5 :         006   9
20000427@071445/   58749:dev07d  :#2130   /MPP-Record      /64 mppe
04/27 07:34:59       06:15:43  lin20c   unclem     93735.9 :         006   9
20000427@061543/   58561:lin20c  :#1888   /MPP-Record      /20 mppe
04/27 07:38:22       06:57:38  pan34x   x         115135.0 :         006   9
20000427@065738/   58692:pan34x  :#2111   /MPP-Record      /50 mppe
04/27 07:39:49       05:47:40  hav14a   x         275979.9 :         006   9
20000427@054740/   58498:hav14a  :#2112   /MPP-Record      /42 mppe
04/27 18:07:05       16:36:58  lin20c   unclem    106963.3 :         006   0
20000427@163658/   58561:lin20c  :#1888   /MPP-Record      /20 mppe
04/27 19:47:30       16:36:30  hav14a   x         469498.7 :         006   0
20000427@163630/      19:hav14a  :#2112   /MPP-Record      /42 mppe
04/27 23:32:43       16:38:17  dev07d   proms3d  1543836.0 :         006   0
20000427@163817/   58749:dev07d  :#2130   /MPP-Record      /64 mppe
04/28 00:04:48 04/27 16:36:30  pan34x   x        1335895.6 :         006   0
20000427@163630/      20:pan34x  :#2111   /MPP-Record      /50 mppe
04/28 05:10:18 04/27 22:45:56  mas26x   cpop.128 2783660.8 :         004   0
20000427@224556/    1038:mas26x  :#2085   /MPP-Record      /128 mppe
Each of the four checkpoints wrote MPP and process records with sig=9, which is SIGKILL (see sys/signal.h); this is apparently a signature of checkpointing on the T3E. Note that we have "accountable" CPU time at this point. There are also process and MPP records for each job after the outage.

Are the CPU times accurate in the resource reports after consolidation? Converting the pacct record seconds to hours, we get:


                pre-   post-
seqno Owner     hold   hold     total   characteristics
----- -----     ----   ----     -----   ---------------
60308 hav14a    76.6 + 130.4  = 207.0   started, held, restarted, rerun
60317 pan34x    32.0 + 371.0  = 403.0   started, held, restarted, rerun
60319 lin20c    26.0 +  29.7  =  55.7   started, held, restarted, restarted
60334 dev07d    13.8 + 428.9  = 442.7   started, held, restarted
60337 mas26x     n/a + 773.2  = 773.2   started, rerun
This matches the csagfef output exactly. Is this correct?
Technically, yes... no CPU is charged for the work lost when the jobs were killed by the power outage. However, the restarted-rerun jobs are charged for both the initial checkpointed CPU time and the rerun re-execution CPU time.
Is this fair?
Well, the jobs were "mis-coded" if the restart cannot find an unchanged inode, so maybe. However, if a job were subsequently to exceed its requested CPU and be killed because of this double billing, maybe not. In the end, one must understand the behavior of the system in these circumstances to implement the appropriate site "billing" policies.

Above we covered the T3E examples rather exhaustively. The same outage affected the J90; without repeating the extraction detail, the affected jobs were:


seqno Owner     Start       End         elapse  characteristics
----- -----     -----       ---         ------  ---------------
58371 wtg48y    04/24@05:31 04/28@06:13 4+00:42 started, rerun
58394 yz49      04/24@08:45 04/27@19:46 3+11:01 started, rerun
58607 tom43     04/26@08:56 04/28@04:30 1+19:34 started
58685 sha40x    04/27@01:54 04/28@18:35 1+16:41 started, rerun
58690 zed51a    04/27@09:39 04/27@16:08 0+06:29 started, rerun
An obvious difference on the J90 is that jobs may run for several days (or weeks). At ARSC, J90 jobs are rarely checkpointed: only very long running jobs, and prior to system maintenance (which is held to a minimum). At the time of the power outage on April 27, the last J90 reboot had been on January 1 after the Y2K shutdown. Consequently, all executing jobs were rerun from the beginning. For really long running jobs this is an argument for preventative checkpointing. What do the resource reports look like:

  chilkoot: /usr/lib/acct/csagfef -f csagcon.out -D queue=0 $SDIR/csagfef.chilkoot
  #
  #nq_quename  pb_t_phimem_max pb_t_   pu_p_  pe_t_je nq_    pb_t    ctime cpu*m je|pb nq_   nq_reqname
  #                            us:time mutime  time*  qwtime  _ctime                   seqno
  #                      MWrds CPUtime mutime WallClk Q_Wait Connect Expansion-Factors
  #Queue Project  Userid HiMem (hours) hours  (hours) (secs) (hours) Conn.  CPU  Lapse SeqNo RequestName
  #===== =======  ====== ===== ======= ====== ======= ====== ======= ===== ===== ===== ===== ===========
  Small  NWUMCD   wtg48y  29.7   8.202    6.8  96.698      2   6.791 14.24 11.79  1.00 58371 gb
  Small  NWUMCD   yz49    18.2   6.817    3.3  83.019   1743   3.332 24.91 12.18  1.01 58394 B
  Medium NSWCC002 tom43  106.9   5.493    0.0  43.569 109201   5.409  8.06  7.93  3.29 58607 iflow
  Small  HURRICAN sha40x   4.9  10.775   10.4  40.694   6773  10.427  3.90  3.78  1.05 58685 CHILBAL28f31loo
  Small  ONRDC158 zed51a  51.9   0.651    0.0   6.483   3691   0.630 10.28  9.96  1.19 58690 ns.grid
  *root  root     .          0   0.000    0.0    .         .   0.000   .     .     .
  *min   minimum  .         10   0.001    0.0    .         .   0.000   .     .     .
  #===== =======  ====== ===== ======= ====== ======= ====== ======= ===== ===== =====
  #   15   5:nqs   0:tty     5    31.9   20.5   270.5 121410    26.6 10.17  8.47  1.14
With the J90 we calculate three expansion factors:
  1. By connect time to one or more CPU's (connect)
  2. By CPU time, may be less than 1 with multi-tasking (CPU)
  3. By execution time (lapse)
A fourth method, calculating from the average multi-tasking connect time, may eventually be implemented... this method would better preserve the problem size for weighting the job, but would still be inaccurate for I/O wait and other contention. The lapse method is inaccurate across the outage, as on the T3E, because the queue wait time reflects neither the down time nor the re-execution time.
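
The three factors in the table above are consistent (to within display rounding) with dividing the wall-clock time in system by, respectively, the connect time, the CPU time, and the wall-clock time less the queue wait; for the sha40x row, for example:

     connect:  40.694 / 10.427                 = 3.90
     CPU:      40.694 / 10.775                 = 3.78
     lapse:    40.694 / (40.694 - 6773/3600)   = 1.05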

What do the process accounting and multi-tasking records look like?


chilkoot: uakpacct -cpu 300 -f pacct1 -mut
    Date Time   jobid userid   group  Command  flg sig ucpu_sec scpu_sec    elapsed    pid   ppid
    ---- ----   ----- ------   -----  -------  --- --- -------- --------    -------    ---   ----
20000427@104128:16779:zed51a   #2148 :a.out   :000   0:   484.7      6.4     1394.0  72685  72666
20000427@151708:   41:zed51a   #2148 :a.out   :000   0:   480.9      4.1      905.8   1709   1503
20000427@153341:   41:zed51a   #2148 :a.out   :000   0:   428.5      7.6      754.1   3096   1503
20000427@151646:   39:yz49     #974  :flapw.9 :004   9: 24093.4    439.3    16179.9   1656   1521
20000427@151646/   39:yz49    :#974  /Multi-task: 4|12/ 24093-u  11990-c 1*5102 2*3221 3*2118 4*1549
20000427@151644:   48:tom43    #1708 :iflow36 :000   0: 19468.3    303.6    47640.6   1584   1500
20000427@151644:   38:wtg48y   #974  :run     :004   9: 28784.3    739.5    53818.6   1601      1
20000427@151644/   38:wtg48y  :#974  /Multi-task: 8|12/ 28782-u  24439-c 1*22764 2*677 3*328 4*196 5*146 6*161 7*139 8*29
20000427@151644:   43:sha40x   #912  :CHILBAL2:004   0: 37531.3   1249.1    98333.3   1599   1510
20000427@151644/   43:sha40x  :#912  /Multi-task: 4|12/ 37531-u  37530-c 1*37530 2*0 3*0 4*0

The zed51a job, seqno=58690, did have significant processing (in terms of its aggregate use) prior to the outage, which was effectively "double billed". Note that only processes consuming more than 300 CPU seconds were requested above; zed51a did have other, shorter processes which aggregate to the 0.630 hours, for example:


chilkoot: uakpacct -cpu 20 -f pacct1 -u zed51a -80
#End_Date/Time Start_hh:mm:ss  Userid   Command    CPU_Sec :tty(hex) flg sig
#-------------       --------  ------   -------    ------- :-------- --- ---
04/27 11:04:42       10:41:28  zed51a   a.out        491.1 :ffffffff 000   0
04/27 11:06:10       11:04:43  zed51a   cft90         35.9 :ffffffff 000   0
04/27 15:32:14       15:17:08  zed51a   a.out        485.0 :ffffffff 000   0
04/27 15:33:22       15:32:15  zed51a   cft90         35.1 :ffffffff 000   0
04/27 15:46:15       15:33:41  zed51a   a.out        436.1 :ffffffff 000   0
04/27 15:47:40       15:47:04  zed51a   cft90         20.4 :ffffffff 000   0
04/27 15:50:56       15:48:02  zed51a   cft90        109.2 :ffffffff 000   0
04/27 15:53:24       15:51:17  zed51a   cft90         72.5 :ffffffff 000   0
04/27 15:55:56       15:55:10  zed51a   cft90         25.8 :ffffffff 000   0
04/27 15:57:24       15:56:26  zed51a   cft90         34.6 :ffffffff 000   0
04/27 15:59:11       15:58:29  zed51a   cft90         26.3 :ffffffff 000   0
04/27 16:01:19       15:59:12  zed51a   cft90         71.8 :ffffffff 000   0
04/27 16:03:09       16:01:56  zed51a   cft90         39.3 :ffffffff 000   0
04/27 16:07:16       16:04:16  zed51a   cft90         90.5 :ffffffff 000   0

* * *
Expansion factors are a common and useful measure of batch throughput ("response time" or "turnaround") on an HPC system. As we saw above with the rerun records from both the J90 and T3E, expansion factors can be mis-reported. There are a variety of other problems with reporting and comparing expansion factors; for an overview and discussion see 'Expansion Factors (or Expansion Fiction)'. Expansion factors without additional knowledge of a system, its workload, and its job mix are meaningless.


References (ToC - Objectives - Methods - Lessons - References)

Both of the packages referenced above, uakpacct and uals, are available from 'ftp.alaska.edu/pub/sois/'. The uakpacct kit contains the csa post-processing tools described above. Both packages are provided "as is"; both run under Unicos, Unicos/mk, Irix, and Digital UNIX (whatever its name presently is). They are likely to be ported to Solaris and Linux eventually.