Last Update 2000-05-15 by kcarlson
Noordwijk, The Netherlands
Tuesday 23 May 2000 14:30
Track C: Operations
Kurt Carlson
Arctic Region Supercomputing Center
University of Alaska
ARSC
Daily, weekly, monthly, and historical reports on job, queue, project, and aggregate expansion factors are generated from csagcon, csagfef, and nawk post-processing of the daily session Super-record files generated by csarun. An ad hoc example:

yukon: nawk -f /usr/local/adm/sbin/csanawk.yukon 200004
#             num  exp Expan CPUtime  MPPtime  MPPtime   DEDtime PE-Rsrv Execute  Q_Wait
#Queue Name  jobs jobs Fact. (hours)  (hours)  (hours)   (hours) (hours) (hours) (hours)
#========== ==== ==== ===== ======= ======= ======= ======= ======= =======
*minimum    99746    0     .  91.631      0.3      0.8      0.8     0.0     0.0
*root       16239    0     . 304.569      0.0      0.0      0.0     0.0     0.0
Interactive   689    0     .  80.034     85.7    113.2    113.2     0.0     0.0
Qgrand         12   12   104   0.018    261.6    308.6    380.7     1.5   153.0
Qlarge         62   62  7.79   0.039    320.2    341.1    372.8     8.7    62.8
Qmedium        23   23  3.28   0.024     48.1     49.4     58.6     3.7     8.4
Qsmall         59   55  2.13   0.575     55.8     66.3     73.8    10.5    12.0
Qxlarge        15   15  6.71   0.021    170.4    174.1    195.5     2.6    14.9
Qxxlarge       10   10  3.42   0.009    217.7    224.6    243.2     1.9     4.7
System          1    0     .   0.458      0.0      0.0      0.0     0.9     0.0
gcp_grand       1    1 12.75   0.001    883.0    890.9    891.7     3.5    40.9
gcp_xxlarge   362  359  1.54   0.469  93408.2  98651.2 102061.6  1274.8   694.8
grand           2    2  1.64   0.001    700.6    701.0    710.2     4.1     2.6
large         157  155  2.52   1.821  16174.0  16260.9  16482.1   401.5   675.1
medium        153  152  1.61   2.026  12070.5  12291.9  12504.1   704.5   437.5
single         41    0     .  23.166      0.0      0.0      0.0    70.0     0.3
small         162  139  1.60  24.845   2689.8   2722.7   2771.3   419.8   239.9
xlarge         54   53  2.89   1.048  21405.3  21731.7  21960.7   291.0   544.0
xxlarge        14   14  1.54   0.075   6110.3   6199.8   6342.4    51.5    28.5
#========== ==== ==== ===== ======= ======= ======= ======= ======= =======
# 1188,689   1877 1052  2.15 530.830 154601.5 160728.4 165162.8  3250.5  2919.5

Monthly and demand month-to-date CPU and other system resource utilization reports by project and aggregate are generated by post-processing of the daily cacct files generated by csacon using the ARSC uacrep program. An ad hoc example:

yukon: uacrep -level 2 -h -f "`ls -1r 20000?.Sum`" -show mpp
#Host: yukon
#Tags:                 200004  200003  200002  200001
# User     MPP
 Count   Hours          200004  200003  200002  200001
 -----  ------  ------  ------  ------  ------
    57  579374   85.2%  151144  153628  141294  133308  DOD
     2   52035    7.7%  9874.0   14374   14568   13218  DOD/ARMY
    27  379904   55.9%  106170   94679   92594   86461  DOD/CHALLENG
    28  147436   21.7%   35100   44575   34132   33628  DOD/NAVY
    31                                                   ROOT
    62  100556   14.8%  9584.4   27646   25176   38149  UA
    34   99094   14.6%  9502.5   26885   24623   38083  UA/ACADEMIC
    25   456.8             8.8   408.7    34.7     4.6  UA/ARSC
     1                                                   UA/FEDERAL
     2  1005.3     .1%    73.1   352.2   518.7    61.3  UA/IARC
 -----  ------  ------  ------  ------  ------
   150  679930          160728  181274  166470  171457  Totals
#

Both of the report methods above utilize consolidated text data files which can be used for a variety of other ad hoc "what if" reports as well as for any general "billing" report requirements. The process is to consolidate raw data into a readily usable form.
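As a trivial illustration of the kind of ad hoc query such consolidated text allows, the following nawk one-liner totals the expansion-factor jobs and MPP hours from a saved copy of the queue report above (the file name queue.200004 is hypothetical, and the field numbers simply reflect the column positions in that report, not a documented interface):

yukon: egrep -v '^#' queue.200004 |
       nawk '{jobs += $3; mpp += $6}
             END {printf("%d exp-factor jobs, %.1f MPP hours\n", jobs, mpp)}'
1052 exp-factor jobs, 154601.5 MPP hours

The result reproduces the totals shown in the report's own summary line.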
For resource utilization reporting the raw files must be aggregated/consolidated in some manner. Under Unicos[/mk] this is done with csarun which is summarized in 'csarun overview'. This job consolidates the raw per-process accounting records into job records on a daily basis and uses carry-forward over successive runs (days) until jobs complete. The consolidated session Super-record files can then be used for general reporting.
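For reference, csarun is normally driven nightly from cron; a minimal sketch of such an entry follows (the schedule and the error-log path are illustrative, not necessarily ARSC's configuration, and the /usr/lib/acct path simply follows the other csa tools shown later in this paper):

# root crontab: consolidate the day's raw accounting into a session Super-record
0 4 * * * /usr/lib/acct/csarun 2> /usr/adm/acct/nite/fd2log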
The Cray-provided csa tools do no data archiving themselves, but they do provide exit points for sites to define their own. The ARCHIVE1 (csa.archive1) exit is intended for saving raw accounting files and the ARCHIVE2 (csa.archive2) exit is intended for saving the session Super-record files. Both the raw and consolidated files should be saved: the raw files are useful for problem analysis and the Super-record files are useful for reporting. These files can be large; typically the monthly archive is more than 1/2 gigabyte each for both the T3E and J90 at ARSC (over 2 GB total). To avoid frequent references to these large files, post-processing in the csarun USEREXIT (csa.user) is also recommended to extract csacon and csagcon/csagfef information. This exit can be used to reduce the Super-record to data readily usable for general and most ad hoc reports.
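A minimal sketch of what an ARCHIVE2 exit might look like is shown below; the archive directory is hypothetical, and the ${_adm}/Spacct naming is patterned after the csarun cleanup fragment shown later in this paper, so treat it as an illustration rather than the actual ARSC script (it assumes the exit runs in an environment where csarun's ${_adm} and ${_date} variables are set):

#!/bin/sh
# csa.archive2 -- copy the day's session Super-record file(s) to an archive area
ARCHIVE=/archive/acct/`uname -n`        # hypothetical archive location
mkdir -p $ARCHIVE
cp -p ${_adm}/Spacct*.${_date} $ARCHIVE/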
By default csarun provides some general reports which are similar to the standard UNIX runacct reports. One very useful extension is the cacct files reduced from the Super-record, generated by the csacon execution in the REPORT phase of csarun. The cacct files are never deleted by csa and have sufficient information to report resource utilization (CPU and otherwise), both aggregated and by project. The primary resource reports at ARSC utilize post-processing of the cacct files. Since the files are never removed by csa, historical processing is possible once a site develops appropriate reduction tools.
Enhanced reporting from the cacct files is done at ARSC with a script called from the csa.user exit to consolidate a month-to-date project summary file. The script uses csacrep for extraction and the ARSC uacrep program for reporting. The data consolidated from this source is used to generate monthly billing reports and is also available for end-user on-demand reports.
The expansion factor reporting (aggregate, project, and queue) is produced by csagcon and csagfef/nawk post-processing of the session Super-record file by a script invoked from the csa.user exit. For resource utilization reports csagcon/csagfef are much more flexible than csajrep. The csagfef program provides a nawk-like syntax for writing custom reports.
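A minimal sketch of that csa.user post-processing, using the same csagcon/csagfef invocations shown in the outage examples later in this paper (the $SDIR location and the output file naming are assumptions for illustration, not the actual ARSC script):

#!/bin/sh
# csa.user -- reduce the session Super-record to queue/project expansion-factor data
SDIR=/usr/local/adm/sbin                # assumed location of the site report definitions
/usr/lib/acct/csagcon -S Super-record -ujac -R $SDIR/csagcon.yukon -o csagcon.out
/usr/lib/acct/csagfef -f csagcon.out -D queue=0 -D peseg=0 $SDIR/csagfef.yukon \
        > expfac.`date +%Y%m%d`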
All ARSC scripts and programs referenced above are available at 'ftp.alaska.edu/pub/sois/' as part of the uakpacct distribution.
Granted, audit logs are better, but if you have nothing else to work with, accounting data can provide clues. From "ls -l" (below also using an enhanced ls named uals to get the second the file was modified) one can determine when a file was modified. Often (though certainly not always) a file write is followed closely by the termination of the process which likely modified it, in this case nqs.before:

chilkoot: ls -l nqs.before
-rw-r-----   1 kcarlson  staff   14394 May  4 08:01 nqs.before
chilkoot: uals -Ee %Y-%m-%d\ %H:%M:%S nqs.before
2000-05-04 08:01:35  nqs.before
chilkoot: uakpacct -T 08:01:35 -at 5 -80
#End_Date/Time Start_hh:mm:ss   Userid  Command  CPU_Sec :tty(hex) flg sig
#------------- --------         ------  -------  ------- :-------- --- ---
05/04 08:01:36 08:01:21       kcarlson  vi           0.8 :00000d0e 000   0
05/04 08:01:37 08:01:37       kcarlson  uals         0.1 :00000d0e 000   0

The uakpacct command is looking for process terminations within 5 seconds of the file modify time. Yes, kcarlson was the one who modified nqs.before.
When looking at a problem involving multiple jobs, accounting can be used to sequence the events:

platinum: uakpacct -event +tty ffffffff | sort | more
---- ---- ____ ------- ------ ----- --- --- --- -------- -------- -------
0503@180739 07f1:xlock   :#1561 #206 :003c0001
0504@080356 07d5:uals    :#1561 #206 :003c0000
0504@080356 07d5:uals    :#1561 #206 :003c0000 040  0:   0.0   0.0     0.1
0504@080453 07d6:rcp     :#1561 #206 :003c0000
0504@080456 07d6:rcp     :#1561 #206 :003c0000 042  0:   0.0   0.1     3.0
0504@081815 07f1:xlock   :#1561 #206 :003c0001 040  0:   2.6   5.0 51036.2
0504@081844 07f2:rlogin  :#1561 #206 :003c0000
0504@081849 07f2:rlogin  :#1561 #206 :003c0000 040  2:   0.0   0.0     4.7
0504@081856 07f3:ping    :#1561 #206 :003c0000
0504@081859 07f3:ping    :#1561 #206 :003c0000 042  0:   0.0   0.0     3.5
0504@082624 084f:xlock   :#1561 #206 :003c0001
...
Date Time seq# Command userid group tty flg sig ucpu_sec scpu_sec elapsed

The above is not from a "real" problem; it just demonstrates activity from two ttys (one primarily in xlock), requesting uakpacct to write start and stop records and then sorting.
This example comes from Irix. The problem is Macintosh kerberized BetterTelnet sessions on Octanes disappearing if left idle.

mcgrew: uaklogin -p $$    # what does a live session look like?
05/04@172419 _pid__ _ppid_ __user__  Started   VSZ   SZ P  CPU %CPU |__Command__
 0        1      0 root     Apr 27  1712  107 * 0:05    0 |/etc/init
-1      277      1 root     Apr 27  1872  117 * 0:10    0 |/usr/etc/inetd
-2   176405    277 root   17:12:59  3200  200 * 0:00    0 |telnetd -a valid
-3   176048 176405 root   17:13:06  2816  176 * 0:00    0 |login -h brr.arsc.edu -p -F kcarlson
-4   174739 176048 kcarlson 17:13:06 864    54 * 0:00    0 |-ksh
mcgrew: uakpacct -u kcarlson -80 -c sh    # find the session which died
#End_Date/Time Start_hh:mm:ss   Userid  Command  CPU_Sec :tty(hex) flg sig
#------------- --------         ------  -------  ------- :-------- --- ---
05/04 14:14:34 14:14:34       kcarlson  sh           0.0 :003c0003 041   0
05/04 16:38:19 04/27 15:25:42 kcarlson  sh           0.2 :003c0002 040   1
kcarlson@mcgrew: uakpacct -T 16:38:19 -at 2 -80    # find parents
#End_Date/Time Start_hh:mm:ss   Userid  Command  CPU_Sec :tty(hex) flg sig
#------------- --------         ------  -------  ------- :-------- --- ---
05/04 16:38:17 04/27 15:25:40   root    telnetd      2.2 :ffffffff 042   0
05/04 16:38:19 04/27 15:25:42   #1561   sh           0.2 :003c0002 040   1
05/04 16:38:19 04/27 15:25:42   root    login.kr     0.0 :003c0002 042   1

It appears telnetd is dying and causing a SIGHUP to the shell. From this we know where to start the debugging effort.
When PEs die (they do, don't they), the console messages are cryptic on which process got whacked:

31 Oct 01:28:25 (PE:0x10c) WARNING: Oct 31 01:28:04(AKS) (GPM) Pid 140078 "main.0" (uid 2008) killed due to pe failure

You cannot use "uakpacct -pid" because individual PEs do not write accounting records; their use is summarized in the MPP records. However, it is still easy to quickly identify what happened:

yukon: uakpacct -c \*main.0\*
Date Time        jobid userid group  Command  flg sig  ucpu_sec scpu_sec elapsed   pid  ppid
---- ----        ----- ------ -----  -------  --- ---  -------- -------- -------   ---  ----
19991030@225025: 12781:zxy52b #1924 :main.0   Y:006   9:401847.9   2650.0 13059.5 25852 25840
19991030@225025/ 12781:zxy52b:#1924 /MPP-Record /32 mppe
19991031@012813: 13345:zxy52b #1924 :main.0   Y:006   0:664293.9   4388.0 20912.8 53009 52997
19991031@012813/ 13345:zxy52b:#1924 /MPP-Record /32 mppe
yukon: /usr/local/adm/bin/uakpacct -job 12781,13345 -f /usr/adm/acct/day/nqacct
19991031@012808/ 12781:zxy52b/ 42903/TERM:noper rerun    yukon run214b large
19991031@012809/ 13345:zxy52b/ 42903/INIT:started        yukon run214b large
19991031@071650/ 13345:zxy52b/ 42903/TERM:exited         yukon run214b large
This is really a non-Cray example. In a previous life we automated a nightly "uakpacct -mgmt" report requesting all processes which consumed more than 15 minutes (900 seconds) of CPU from each week day's pacct files, to identify possible misuse and application code which may require tuning:

glacier: uakpacct -cpu 900 -f "`ls /var/adm/pacct*`" -m
#End_Date/Time Start_hh:mm:ss   Userid  Command  CPU_Sec Elapsed  Rate
-------------- --------         ------  -------  ------- ------- ----
09/04 10:42:03 08:45:06          jnblb  oracle    1026.9  7017.0 14.6
09/04 12:25:04 09/03 12:00:32   sxfinp  oracle    4428.2 87872.0  5.0

From the example above, determine what jnblb was doing in Oracle. Since Oracle children tend to start or end at the same time as their parents, by filtering for processes within 30 seconds (Oracle servers may live longer lives) we can get a good clue as to what was being executed under Oracle:

glacier: uakpacct -f /var/adm/pacct6 -m -r 0 -T 10:42:03 -t 08:45:06 -at 30
#End_Date/Time Start_hh:mm:ss   Userid  Command  CPU_Sec Elapsed  Rate
-------------- --------         ------  -------  ------- ------- ----
09/04 10:42:03 08:45:06          jnblb  oracle    1026.9  7017.0 14.6
09/04 10:42:04 08:45:05          jnblb  RPBVLDT      3.1  7019.0  0.0
09/04 10:42:05 08:45:00          jnblb  G0.19970     0.2  7025.0
09/04 10:42:05 08:45:00          jnblb  ua_gur_r     0.0  7025.0
09/04 10:42:05 08:44:57          jnblb  ksh          0.3  7028.0

The user was executing the RPBVLDT program, which may need modification to execute more efficiently.
chilkoot: uaklogin -u kcarlson
05/04@144242 __Entry Time_:__pid__:__user__:__Access___:__Modify___:_id_:__line__:__type__:_trm_ext:__login host__
05/04@080039 : 46228:kcarlson:   00.00.00:   00.00.00:p014:ttyp014 :User    :  0,  0:brr.arsc.edu
chilkoot: uaklogin -u kcarlson -ps
05/04@144216 __Entry Time_:__pid__:__user__:__Access___:__Modify___:_id_:__line__:__type__:_trm_ext:__login host__
             _pid__ _ppid_ __user__  Started   VSZ SIZE CPU_Time %CPU |__Command__
05/04@080039 : 46228:kcarlson:   00.00.00:   00.00.00:p014:ttyp014 :User    :  0,  0:brr.arsc.edu
 3  46228  46203 kcarlson     -   592  148     3.65      | [ksh]
 4  64498  46228 kcarlson     -   496  124     0.04      | [uaklogi]
 5  64499  64498 kcarlson     -  6560 1640     0.13      | [ps]
 4  64496  46228 kcarlson     -   608  152     4.99      | [egrep]
Non-terminal processes:
-1  20294      1 kcarlson     -   944  236     2.91      | [xgraph]
# rm -f ${_adm}/Spacct*.${_date}
# The removal of the day's Super-record files (above) is commented out;
# instead, the Spacct files are retained for 7 days:
find ${_adm}/Spacct*.* -mtime +7 -exec rm -f {} \;
Other session Super-record files can be built for other reasons. On 20 April 2000 at 11:20am ARSC experienced an unexpected power outage. Needless to say, a number of jobs were interrupted during execution. This event provided a good opportunity to examine what exception records (rerun, restart, and checkpoint) look like. The full example of extracting these records will be included in the uakpacct-v3.2 distribution.
On the T3E extract and examine several jobs impacted by the outage:
yukon: S="60308,60317,60319,60334,60337"
yukon: uakpacct -f "`ls -1 2000042?.0001/Wnqacct1`" -seq $S -bin nqacct1 -out nq
yukon: sort -k1.35,1.43 -k1 nq | egrep 'INIT|TERM|New'
20000427@023321/     0:hav14a / 60308/RECV:New           yukon runhski    gcp
20000427@054734/ 58498:hav14a / 60308/INIT:started       yukon runhski    gcp_xxlarge
20000427@073950/ 58498:hav14a / 60308/TERM:held          yukon runhski    gcp_xxlarge
20000427@074147/ 58498:hav14a / 60308/INIT:restarted     yukon runhski    gcp_xxlarge
20000427@163624/    19:hav14a / 60308/INIT:rerun         yukon runhski    gcp_xxlarge
20000427@194806/    19:hav14a / 60308/TERM:exited        yukon runhski    gcp_xxlarge
20000427@045911/     0:pan34x / 60317/RECV:New           yukon s.cmd      gcp
20000427@065504/ 58692:pan34x / 60317/INIT:started       yukon s.cmd      gcp_xxlarge
20000427@073822/ 58692:pan34x / 60317/TERM:held          yukon s.cmd      gcp_xxlarge
20000427@074158/ 58692:pan34x / 60317/INIT:restarted     yukon s.cmd      gcp_xxlarge
20000427@163624/    20:pan34x / 60317/INIT:rerun         yukon s.cmd      gcp_xxlarge
20000428@000457/    20:pan34x / 60317/TERM:exited        yukon s.cmd      gcp_xxlarge
20000427@050836/     0:lin20c / 60319/RECV:New           yukon run.unst05 mpp
20000427@061037/ 58561:lin20c / 60319/INIT:started       yukon run.unst05 medium
20000427@073459/ 58561:lin20c / 60319/TERM:held          yukon run.unst05 medium
20000427@103947/ 58561:lin20c / 60319/INIT:restarted     yukon run.unst05 medium
20000427@163658/ 58561:lin20c / 60319/INIT:restarted     yukon run.unst05 medium
20000427@180720/ 58561:lin20c / 60319/TERM:exited        yukon run.unst05 medium
20000427@065327/     0:dev07d / 60334/RECV:New           yukon proms3d    mpp
20000427@071436/ 58749:dev07d / 60334/INIT:started       yukon proms3d    xlarge
20000427@073023/ 58749:dev07d / 60334/TERM:held          yukon proms3d    xlarge
20000427@163817/ 58749:dev07d / 60334/INIT:restarted     yukon proms3d    xlarge
20000427@233245/ 58749:dev07d / 60334/TERM:exited        yukon proms3d    xlarge
20000427@071436/     0:mas26x / 60337/RECV:New           yukon pips128_s2 gcp
20000427@074058/ 58829:mas26x / 60337/INIT:started       yukon pips128_s2 gcp_xxlarge
20000427@224414/  1038:mas26x / 60337/INIT:rerun         yukon pips128_s2 gcp_xxlarge
20000428@051025/  1038:mas26x / 60337/TERM:exited        yukon pips128_s2 gcp_xxlarge
yukon: cut -c17-22 nq | sort -u | grep -v " 0" | while read F; do
>   if [ -z "$X" ]; then X=$F; else X="$X,$F"; fi; done
yukon: uakpacct -f "`ls -1 2000042?.0001/Wpacct1`" -job $X -bin pacct1 -out /dev/null

Both nqacct and pacct records for five jobs were extracted to create a Super-record for reporting, the five jobs:
seqno Owner  Start End     elapse  characteristics
----- -----  ----- ---     ------  ---------------
60308 hav14a 02:23 19:48    17:25  started, held, restarted, rerun
60317 pan34x 04:59 00:04+1  19:05  started, held, restarted, rerun
60319 lin20c 05:08 18:07    12:59  started, held, restarted, restarted
60334 dev07d 06:53 23:32    16:39  started, held, restarted
60337 mas26x 07:14 05:10+1  21:56  started, rerun

Four of the jobs were held (checkpointed) prior to the outage. It is common at ARSC for T3E jobs to be checkpointed to reschedule priority work or to optimize PE usage. Three of the jobs were rerun; one of those had never been checkpointed. One job was checkpointed at the time of the outage, one restarted from the checkpoint after the outage, and two were rerun... why? From the nqs log:
04/27 16:36:36 I shepherdprocess(): Request 60308.yukon: Attempting to restart request
04/27 16:36:36 I shepherdprocess(): Request 60317.yukon: Attempting to restart request
04/27 16:36:07 I shepherdprocess(): Request 60319.yukon: Attempting to restart request
...
04/27 16:36:08 W restartr(): Unable to restart job from file=/ckpt/nqs/chkpnt/++DiJ+++++1, error=50.
04/27 16:36:08 W restartr(): Errno 50 = File changed.
04/27 16:36:08 W nqs_restart(): Request 60308.yukon: NOT restarted, errno=50, File changed.
04/27 16:36:08 W nqs_restart(): Request 60308.yukon: Unrecoverable restart error.
04/27 16:36:08 I nqs_reqexi(): Request 60308.yukon: Exiting from queue gcp_xxlarge.
04/27 16:36:08 W shoreqmsg(): Request 60308.yukon: Server reports RCM_RESTARTERR.
04/27 16:36:08 I nqs_mai(): Request 60308.yukon: Sending requeued mail to hav14a >.
04/27 16:36:08 I nqs_mai(): premail created by fork, pid=1456, ppid=1449.
04/27 16:36:09 I nqs_mai(): Calling system function execve(2), file=/usr/bin/mailx.
04/27 16:36:09 I reqcom(): Request 60308.yukon: Exiting from queue gcp_xxlarge.
04/27 16:36:09 I reqcom(): Component shepherd exiting, pid=1449, rc=2.
...
04/27 16:36:09 W nqs_restart(): Request 60317.yukon: NOT restarted, errno=50, File changed.
04/27 16:36:09 W nqs_restart(): Request 60317.yukon: Unrecoverable restart error.
...
04/27 16:36:58 I enf_quotas(): Request 60319.yukon: Resumed.
04/27 16:36:58 I nqs_restart(): Request 60319.yukon: Restarted, ID=58561.

In the two checkpoint-rerun cases the user or job modified a file, making restart impossible. For restart to work, the inode of all files must be intact.
What do the jobs look like in resource reports:
yukon: /usr/lib/acct/csagcon -S Super-record -ujac -R $SDIR/csagcon.yukon -o csagcon.out
yukon: /usr/lib/acct/csagfef -f csagcon.out -D queue=0 -D peseg=0 $SDIR/csagfef.yukon
#
#nq_quename          pm_t_pe_max pb_t_   pm_t_pe pm_t_pe nq_wall nq      T:1   nq_   nq_reqname
#                        usr:time _time   _ctime   clock _qwtime               seqno
#                    PEs CPUtime MPPtime DEDtime WallClk  Q_Wait Expan
#Queue Name Project  Userid max (hours) (hours) (hours) (hours) seconds Fact. SeqNo RequestName
#========== ======= ====== === ======= ======= ======= ======= ======= ===== ===== ===========
gcp_xxlarge NSWCC004 hav14a  42 207.078 207.077 211.882  17.246   11770  1.23 60308 runhski
gcp_xxlarge ONRDCC28 pan34x  50 403.066 403.064 406.366  19.096    7169  1.12 60317 s.cmd
medium      ONRDC084 lin20c  20  55.751  55.750  56.406  12.978   14809  1.46 60319 run.unst05
xlarge      NRLDC213 dev07d  64 442.698 442.697 456.202  16.655   34143  2.32 60334 proms3d
gcp_xxlarge NPSCAC43 mas26x 128 773.241 773.239 819.958  21.930    1582  1.02 60337 pips128_s2
*root       root         .    0   0.000   0.000   0.000       .       .     .
*minimum    minimum      .    6   0.002   0.000   0.000       .       .     .
#                           ==== ======= ======= ======= ======= ======= =====
# 11: 5:nqs 0:tty             5  1881.8  1881.8  1950.8    87.9   69473  1.20

Examining the above shows seqno=60337 with an expansion factor of 1.02; that cannot be right since the T3E was down from 11:20 until 16:35. The csa consolidation does not detect the abrupt outage, so it cannot count the 5+ hours of down time or the rerun time as queue wait (or is that even appropriately queue wait?). The other jobs which were running at the time of the outage reflect a similar problem: seqno=60319 had initial wait and held time of 4:07 hours, which matches the reported queue wait; the outage time and rerun time were not counted.
The job which was held through the outage, seqno=60334, has a correct expansion factor representing the time waiting to start (0:21 hours) and the time held (9:08 hours), with an actual execution time of 16 minutes in the morning and then 6:54 hours in the evening.
Unfortunately, there is no automated way to correct the mis-reported expansion factors for jobs executing at the time of the outage. An alternative method of expansion factor calculation can be used, defining the "execution time" as:
DEDtime / PEs_max + (CPUtime - MPPtime)
instead of subtracting the queue wait time from the wall clock time in system. Using this method, we see expansion factors of:
yukon: /usr/lib/acct/csagfef -f csagcon.out -D queue=0 -D peseg=0 -D expan=2 $SDIR/csagfef.yukon
#
#nq_quename          pm_t_pe_max pb_t_   pm_t_pe pm_t_pe nq_wall nq      T:2   nq_   nq_reqname
#                        usr:time _time   _ctime   clock _qwtime               seqno
#                    PEs CPUtime MPPtime DEDtime WallClk  Q_Wait Expan
#Queue Name Project  Userid max (hours) (hours) (hours) (hours) seconds Fact. SeqNo RequestName
#========== ======= ====== === ======= ======= ======= ======= ======= ===== ===== ===========
gcp_xxlarge NSWCC004 hav14a  42 207.078 207.077 211.882  17.246   11770  3.42 60308 runhski
gcp_xxlarge ONRDCC28 pan34x  50 403.066 403.064 406.366  19.096    7169  2.35 60317 s.cmd
medium      ONRDC084 lin20c  20  55.751  55.750  56.406  12.978   14809  4.59 60319 run.unst05
xlarge      NRLDC213 dev07d  64 442.698 442.697 456.202  16.655   34143  2.33 60334 proms3d
gcp_xxlarge NPSCAC43 mas26x 128 773.241 773.239 819.958  21.930    1582  3.42 60337 pips128_s2
*root       root         .    0   0.000   0.000   0.000       .       .     .
*minimum    minimum      .    6   0.002   0.000   0.000       .       .     .
#                           ==== ======= ======= ======= ======= ======= =====
# 11: 5:nqs 0:tty             5  1881.8  1881.8  1950.8    87.9   69473  3.17

While those are more realistic expansion factors for these jobs, this method in general would exclude I/O wait time and could be grossly wrong for any job containing multiple mpprun commands. Fortunately, power outages causing such anomalies are very rare, at least at ARSC.
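As a cross-check of the alternative calculation (the expansion factor here being the wall clock time in system divided by the defined execution time), take the seqno=60337 row above:

    execution time   = 819.958/128 + (773.241 - 773.239) = 6.406 + 0.002 = 6.408 hours
    expansion factor = 21.930 / 6.408 = 3.42

which is the value reported.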
What do the MPP and process accounting records look like for these jobs:
yukon: uakpacct -f pacct1 -Ro -80
#End_Date/Time Start_hh:mm:ss   Userid  Command    CPU_Sec :tty(hex) flg sig
#------------- --------         ------  -------    ------- :-------- --- ---
04/27 07:30:22 07:14:45         dev07d  proms3d    49871.5 :          006   9
20000427@071445/ 58749:dev07d :#2130 /MPP-Record /64 mppe
04/27 07:34:59 06:15:43         lin20c  unclem     93735.9 :          006   9
20000427@061543/ 58561:lin20c :#1888 /MPP-Record /20 mppe
04/27 07:38:22 06:57:38         pan34x  x         115135.0 :          006   9
20000427@065738/ 58692:pan34x :#2111 /MPP-Record /50 mppe
04/27 07:39:49 05:47:40         hav14a  x         275979.9 :          006   9
20000427@054740/ 58498:hav14a :#2112 /MPP-Record /42 mppe
04/27 18:07:05 16:36:58         lin20c  unclem    106963.3 :          006   0
20000427@163658/ 58561:lin20c :#1888 /MPP-Record /20 mppe
04/27 19:47:30 16:36:30         hav14a  x         469498.7 :          006   0
20000427@163630/    19:hav14a :#2112 /MPP-Record /42 mppe
04/27 23:32:43 16:38:17         dev07d  proms3d  1543836.0 :          006   0
20000427@163817/ 58749:dev07d :#2130 /MPP-Record /64 mppe
04/28 00:04:48 04/27 16:36:30   pan34x  x        1335895.6 :          006   0
20000427@163630/    20:pan34x :#2111 /MPP-Record /50 mppe
04/28 05:10:18 04/27 22:45:56   mas26x  cpop.128 2783660.8 :          004   0
20000427@224556/  1038:mas26x :#2085 /MPP-Record /128 mppe

Each of the four checkpoints wrote MPP and process records with sig=9, which is SIGKILL (see sys/signal.h); this is apparently a signature of checkpointing on the T3E. Note that we have "accountable" CPU time at this point. There are also process and MPP records for each job after the outage.
Are the CPU times accurate in the resource reports after consolidation? From the pacct records, converting seconds to hours, we get:
                 pre-    post-
seqno Owner      hold    hold    total  characteristics
----- -----      ----    ----    -----  ---------------
60308 hav14a     76.6 + 130.4  = 207.0  started, held, restarted, rerun
60317 pan34x     32.0 + 371.0  = 403.0  started, held, restarted, rerun
60319 lin20c     26.0 +  29.7  =  55.7  started, held, restarted, restarted
60334 dev07d     13.8 + 428.9  = 442.7  started, held, restarted
60337 mas26x      n/a + 773.2  = 773.2  started, rerun

This matches the csagfef output exactly. Is this correct?
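For instance, the two hav14a records above convert as 275979.9/3600 and 469498.7/3600, i.e. roughly 76.6 + 130.4 = 207.0 hours, matching the 207.078 hours of CPUtime csagfef reports for seqno=60308.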
Above we covered some T3E examples rather exhaustively. The same outage affected the J90, without the extraction detail:
seqno Owner  Start        End          elapse   characteristics
----- -----  -----        ---          ------   ---------------
58371 wtg48y 04/24@05:31  04/28@06:13  4+00:42  started, rerun
58394 yz49   04/24@08:45  04/27@19:46  3+11:01  started, rerun
58607 tom43  04/26@08:56  04/28@04:30  1+19:34  started
58685 sha40x 04/27@01:54  04/28@18:35  1+16:41  started, rerun
58690 zed51a 04/27@09:39  04/27@16:08  0+06:29  started, rerun

An obvious difference on the J90 is that jobs may run for several days (or weeks). At ARSC, J90 jobs are rarely checkpointed: only very long running jobs, and prior to system maintenance (which is held to a minimum). At the time of the power outage on April 20, the last J90 reboot had been on January 1 after the Y2K shutdown. Consequently, all executing jobs were rerun from the beginning. For really long running jobs this is an argument for preventive checkpointing. What do the resource reports look like:
chilkoot: /usr/lib/acct/csagfef -f csagcon.out -D queue=0 $SDIR/csagfef.chilkoot
#
#nq_quename   pb_t_phimem_max pb_t_  pu_p_  pe_t_je nq_    pb_t    ctime cpu*m je|pb  nq_  nq_reqname
#                     us:time mutime time*   qwtime _ctime                           seqno
#             MWrds  CPUtime mutime WallClk  Q_Wait Connect  Expansion-Factors
#Queue  Project  Userid HiMem (hours) hours  (hours) (secs) (hours) Conn.   CPU Lapse SeqNo RequestName
#=====  ======= ====== ===== ======= ====== ======= ====== ======= ===== ===== ===== ===== ===========
Small   NWUMCD   wtg48y  29.7   8.202    6.8  96.698      2   6.791 14.24 11.79  1.00 58371 gb
Small   NWUMCD   yz49    18.2   6.817    3.3  83.019   1743   3.332 24.91 12.18  1.01 58394 B
Medium  NSWCC002 tom43  106.9   5.493    0.0  43.569 109201   5.409  8.06  7.93  3.29 58607 iflow
Small   HURRICAN sha40x   4.9  10.775   10.4  40.694   6773  10.427  3.90  3.78  1.05 58685 CHILBAL28f31loo
Small   ONRDC158 zed51a  51.9   0.651    0.0   6.483   3691   0.630 10.28  9.96  1.19 58690 ns.grid
*root   root         .     0   0.000    0.0       .      .   0.000     .     .     .
*min    minimum      .    10   0.001    0.0       .      .   0.000     .     .     .
#=====  ======= ====== ===== ======= ====== ======= ====== ======= ===== ===== =====
# 15    5:nqs 0:tty        5    31.9   20.5   270.5 121410    26.6 10.17  8.47  1.14

With the J90 we calculate three expansion factors: connect, CPU, and elapsed (the Conn., CPU, and Lapse columns above).
What do the process accounting and multi-tasking records look like?
chilkoot: uakpacct -cpu 300 -f pacct1 -mut
Date Time        jobid userid group  Command  flg sig  ucpu_sec scpu_sec  elapsed   pid  ppid
---- ----        ----- ------ -----  -------  --- ---  -------- --------  -------   ---  ----
20000427@104128: 16779:zed51a #2148 :a.out    :000   0:    484.7      6.4   1394.0 72685 72666
20000427@151708:    41:zed51a #2148 :a.out    :000   0:    480.9      4.1    905.8  1709  1503
20000427@153341:    41:zed51a #2148 :a.out    :000   0:    428.5      7.6    754.1  3096  1503
20000427@151646:    39:yz49   #974  :flapw.9  :004   9:  24093.4    439.3  16179.9  1656  1521
20000427@151646/    39:yz49   :#974 /Multi-task: 4|12/ 24093-u 11990-c 1*5102 2*3221 3*2118 4*1549
20000427@151644:    48:tom43  #1708 :iflow36  :000   0:  19468.3    303.6  47640.6  1584  1500
20000427@151644:    38:wtg48y #974  :run      :004   9:  28784.3    739.5  53818.6  1601     1
20000427@151644/    38:wtg48y :#974 /Multi-task: 8|12/ 28782-u 24439-c 1*22764 2*677 3*328 4*196 5*146 6*161 7*139 8*29
20000427@151644:    43:sha40x #912  :CHILBAL2 :004   0:  37531.3   1249.1  98333.3  1599  1510
20000427@151644/    43:sha40x :#912 /Multi-task: 4|12/ 37531-u 37530-c 1*37530 2*0 3*0 4*0
The zed51a job, seqno=58690, did have significant processing (in terms of its aggregate use) prior to the outage, which was effectively "double billed". Note that in the above only processes consuming more than 300 CPU seconds were requested; zed51a also had other, shorter processes which aggregate to the 0.630 hours, for example:
chilkoot: uakpacct -cpu 20 -f pacct1 -u zed51a -80
#End_Date/Time Start_hh:mm:ss   Userid  Command  CPU_Sec :tty(hex) flg sig
#------------- --------         ------  -------  ------- :-------- --- ---
04/27 11:04:42 10:41:28         zed51a  a.out      491.1 :ffffffff 000   0
04/27 11:06:10 11:04:43         zed51a  cft90       35.9 :ffffffff 000   0
04/27 15:32:14 15:17:08         zed51a  a.out      485.0 :ffffffff 000   0
04/27 15:33:22 15:32:15         zed51a  cft90       35.1 :ffffffff 000   0
04/27 15:46:15 15:33:41         zed51a  a.out      436.1 :ffffffff 000   0
04/27 15:47:40 15:47:04         zed51a  cft90       20.4 :ffffffff 000   0
04/27 15:50:56 15:48:02         zed51a  cft90      109.2 :ffffffff 000   0
04/27 15:53:24 15:51:17         zed51a  cft90       72.5 :ffffffff 000   0
04/27 15:55:56 15:55:10         zed51a  cft90       25.8 :ffffffff 000   0
04/27 15:57:24 15:56:26         zed51a  cft90       34.6 :ffffffff 000   0
04/27 15:59:11 15:58:29         zed51a  cft90       26.3 :ffffffff 000   0
04/27 16:01:19 15:59:12         zed51a  cft90       71.8 :ffffffff 000   0
04/27 16:03:09 16:01:56         zed51a  cft90       39.3 :ffffffff 000   0
04/27 16:07:16 16:04:16         zed51a  cft90       90.5 :ffffffff 000   0
Both of the above packages are available from 'ftp.alaska.edu/pub/sois/'. The uakpacct kit contains the csa post-processing tools described above. Both packages are provided "as is"; both run under Unicos, Unicos/mk, Irix, and Digital UNIX (whatever its name presently is). They will likely be ported to Solaris and Linux eventually.