Internal Production: Last Updated 2009-11-06
ARSC HPC systems use workload management software to control the flow of user jobs. This software allows a user to submit a job into a queue. The job will wait in a queue until sufficient resources (e.g. processors) are available for the job to run.
midnight uses the PBS Pro workload management software to control the flow of work.
| Command | Overview |
|---|---|
| $(PBS_EXEC)/bin/qstat | Shows job, queue and PBS server information |
| /usr/local/bin/qmap | Shows job information and a matrix of work. (ARSC Developed) |
| $(PBS_EXEC)/bin/qsub | Submits a job to PBS |
| $(PBS_EXEC)/bin/qdel | Deletes a job from the PBS queues |
| $(PBS_EXEC)/bin/pbsnodes | Shows information about nodes used by PBS |
| $(PBS_EXEC)/bin/tracejob | Shows job scheduling, server and mom log information. Run on mpbs1. |
| $(PBS_EXEC)/bin/pbs_rsub | Creates a PBS Advanced Reservation |
| $(PBS_EXEC)/bin/pbs_rdel | Deletes a PBS Advanced Reservation |
| $(PBS_EXEC)/bin/qmgr | Displays or alters queue settings (administrative command) |
| $(PBS_EXEC)/sbin/pbs-report | Displays a summary of job statistics for a range of time. Run on mpbs1. |
| /usr/local/bin/show_usage | Shows the remaining allocation for a project (UNIX group). Run on midnight1 or midnight2. (ARSC Developed) |
(See /etc/pbs.conf for the value of PBS_EXEC)
pingo uses the PBS Pro workload management software and Cray ALPS to control the flow of work.
| Command | Overview |
|---|---|
| $(PBS_EXEC)/bin/qstat | Shows job, queue and PBS server information |
| /usr/local/bin/qmap | Shows job information and a matrix of work. |
| $(PBS_EXEC)/bin/qsub | Submits a job to PBS |
| $(PBS_EXEC)/bin/qdel | Deletes a job from the PBS queues |
| $(PBS_EXEC)/bin/pbsnodes | Shows information about nodes used by PBS |
| $(PBS_EXEC)/bin/tracejob | Shows job scheduling, server and mom log information. Run on sdb. |
| $(PBS_EXEC)/bin/pbs_rsub | Creates a PBS Advanced Reservation |
| $(PBS_EXEC)/bin/pbs_rdel | Deletes a PBS Advanced Reservation |
| $(PBS_EXEC)/bin/qmgr | Displays or alters queue settings (administrative command) |
| $(PBS_EXEC)/sbin/pbs-report | Displays a summary of job statistics for a range of time. Run on sdb. |
| /usr/bin/apstat | Displays information about jobs running on XT compute nodes. ALPS command. |
| xtnodestat | Displays a matrix summary of work on XT compute nodes. Cray command. |
| /usr/local/adm/bin/search_alps | Searches ALPS records for particular jobs. Run on sdb. |
| /usr/local/bin/show_usage | Shows the remaining allocation for a project (UNIX group). Run on login node (pingo1-6). |
(See /etc/pbs.conf for the value of PBS_EXEC)
Projects receive an allocation of CPU hours. This allocation allows a project to run jobs in a foreground queue.
| Queue | Description |
|---|---|
| standard | For most projects |
| challenge | Restricted to "challenge" projects. Higher priority than standard. |
| high | For projects which have received special access via S/AAA request. Higher priority than challenge. |
| urgent | For projects which have received special access from the director of the HPCMP. Highest priority. |
| special | For projects or users that have requirements that do not match normal standard queue requirements. Same priority as standard. |
| background | Background queue. Does not deduct CPU allocation |
| data | Queue with access to long term storage ($ARCHIVE_HOME) |
Other queues may be configured, but may not be available for general use.
The "high" queue is available to HPCMP customers that have a need to run jobs at a higher priority (than "standard") for a pre-determined period of time. This queue has limits and priority similar to the challenge queue. Requests for access to the "high" queue should be made through the project S/AAA.
Procedure:
The "special" queue is available to customers that have a need to run jobs that do not fit within the normal walltime or processor count limits. This queue has limits and priority similar to the standard queue. Requests for access to the "special" queue should be sent to the ARSC help desk for approval.
Procedure:
# check existing groups given access to special queue
qmgr -c "print server" | grep special
# add project to special queue
qmgr -c "set queue special acl_groups += project_name"
# double check to confirm group was added correctly
qmgr -c "print server" | grep special
# check existing groups given access to special queue
qmgr -c "print server" | grep special
# remove project from the special queue
qmgr -c "set queue special acl_groups -= project_name"
# double check to confirm group was removed correctly
qmgr -c "print server" | grep special
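Checking the `grep` output by eye is error-prone when many projects have access. The following is a minimal sketch of a membership check, assuming that `qmgr -c "print server"` emits one line per group in the form `set queue special acl_groups = name` or `set queue special acl_groups += name`. The function name `group_in_queue_acl` and the project names `projA` and `projB` are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if GROUP appears in QUEUE's acl_groups.
# Reads "qmgr print server" style output on stdin. Assumes lines look like:
#   set queue special acl_groups = projA
#   set queue special acl_groups += projB
group_in_queue_acl() {
    queue=$1
    group=$2
    awk -v q="$queue" -v g="$group" '
        $1 == "set" && $2 == "queue" && $3 == q && $4 == "acl_groups" && $6 == g { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Illustrative input only. On the PBS server one would instead pipe in:
#   qmgr -c "print server"
sample='set queue special acl_groups = projA
set queue special acl_groups += projB'

if printf '%s\n' "$sample" | group_in_queue_acl special projB; then
    echo "projB already has access to special"
fi
```

Matching on whole fields rather than with a plain `grep` avoids false positives when one project name is a prefix of another.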
The "background" queue is required by HPCMP policy. Jobs running in this queue do not use CPU allocation. This queue has the lowest priority. Consultants may alter the number of CPUs available to this queue to ensure that background work doesn't interfere with foreground work. Note "background" work should be allowed to run to completion even when foreground work is waiting.
Procedure:
# check existing limits
qmgr -c "print queue background_lg"
# raise or lower the limit
qmgr -c "set queue background_lg resources_available.ncpus = 128"
# check existing limits
qmgr -c "print queue background"
# raise or lower the limit
qmgr -c "set queue background resources_available.mppwidth = 128"
Allocations on midnight and pingo are enforced via the use of PBS acl_groups. These Access Control Lists (ACLs) are updated two times a day via the usermon crontab on midnight1 and pingo1. Updates in the allocation database are propagated at these times.
| Time | Command |
|---|---|
| 4:30 AM | /var/local/archives/consult/remain_alloc.prl |
| 4:45 AM | /usr/local/pkg/pbstools/sbin/set_acls -q |
| 1:30 PM | /var/local/archives/consult/remain_alloc.prl |
| 1:45 PM | /usr/local/pkg/pbstools/sbin/set_acls -q |

| Time | Command |
|---|---|
| 4:30 AM | /var/local/archives/consult/remain_alloc.prl --file /usr/local/pkg/pbstools/etc/alloc/remain_alloc.txt --tmpdir /wrkdir/usermon/remain_alloc --system pingo |
| 4:45 AM | /usr/local/pkg/pbstools/sbin/set_acls -q |
| 1:30 PM | /var/local/archives/consult/remain_alloc.prl --file /usr/local/pkg/pbstools/etc/alloc/remain_alloc.txt --tmpdir /wrkdir/usermon/remain_alloc --system pingo |
| 1:45 PM | /usr/local/pkg/pbstools/sbin/set_acls -q |
The set_acls script runs via the usermon crontab to set the UNIX groups that can access each queue (i.e. PBS queue "acl_groups"). The following files control the operation of the set_acls script:
| File | Purpose |
|---|---|
| /usr/local/pkg/pbstools/etc/alloc/remain_alloc.txt | The remaining allocation file |
| /usr/local/pkg/pbstools/sbin/set_acls.ini | The set_acls configuration file (1) |
NOTES:
(1) The set_acls.ini file defines which queues are available to standard work and
which queues are available to challenge work. This file also defines which projects
have access to the challenge queue. Each challenge project must be manually added
to this file to have the proper acl_groups set. Additional options are available,
see the configuration file for details.
Allocation record updates can be propagated to either midnight or pingo with the following command:
midnight1 % /projects/consult/sbin/update_allocations
pingo1 % /projects/consult/sbin/update_allocations
It may be necessary to block a particular user from the queues due to policy violations or their negative impact on other users' work on the system. In general, it is best to first attempt to contact the user via email or telephone. If you cannot reach the user, or if the user fails to act within the requested timeframe, the user should be blocked from the queues.
Procedure:
# check current acl_user settings.
qmgr -c "print server" | grep acl_user
# Add access for all users by default if it's not already set
qmgr -c "set queue standard acl_users = +"
# Remove access for the particular user
qmgr -c "set queue standard acl_users += -username"
# enable acls for the queue(s) if they aren't already enabled.
qmgr -c "set queue standard acl_user_enable=True"
NOTE: ACLs may need to be set on more than one queue in order to keep the
user from running any jobs (e.g. background, standard, data, debug)
# Restore access for the particular user
qmgr -c "set queue standard acl_users -= -username"
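Because the ACLs may need to be set on several queues, it can be convenient to generate the full set of commands first and review them before running anything. The following is a minimal dry-run sketch that only prints the `qmgr` commands from the procedure above for each queue. The function name `print_block_cmds` is hypothetical, and the queue list is taken from the NOTE above and may not match the queues configured on a given system.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the qmgr commands that would block USER
# from each queue listed after it. Nothing is executed; review the output and
# run the lines by hand on the PBS server.
print_block_cmds() {
    user=$1
    shift
    for queue in "$@"; do
        # Allow everyone by default. Skip this line if the queue already has
        # acl_users set, since "=" replaces any existing entries.
        printf 'qmgr -c "set queue %s acl_users = +"\n' "$queue"
        # Exclude the one user.
        printf 'qmgr -c "set queue %s acl_users += -%s"\n' "$queue" "$user"
        # Enable user ACL checking on the queue.
        printf 'qmgr -c "set queue %s acl_user_enable=True"\n' "$queue"
    done
}

print_block_cmds username standard background data debug
```

Printing rather than executing keeps the consultant in control of which lines are actually applied.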
The ordering of jobs within the queue can be altered by changing the value of "jprio" for
a job. (The default value for jprio is 0).
midnight% qalter -l jprio=1000 PBS_JOBID
Advanced reservation functionality is available within PBS. This functionality allows nodes to be reserved for a particular user or group at some point in the future. This may be useful for code porting, debugging, and approved daily (i.e. operational) activities such as weather forecasting.
Procedure:
# generate the reservation
pbs_rsub -l select=32:ncpus=4:node_type=4way -U username -R 200902231200 -D 86400 -I 120
# NOTE be sure the reservation is confirmed before continuing
# The pbs_rsub command will return the reservation id (i.e. reservation queue name)
# generate the reservation
pbs_rsub -l select=32:ncpus=4:node_type=4way -H '*' -G unix_group -U + -R 200902231200 -D 86400 -I 120
# NOTE be sure the reservation is confirmed before continuing
# The pbs_rsub command will return the reservation id (i.e. reservation queue name)
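The `-D` value in the commands above is a duration in seconds, which is easy to mistype. The following is a minimal sketch that builds the single-user form of the command from a node count, user, start time and a duration in hours, and prints it for review instead of running it. The function name `print_rsub_cmd` is hypothetical, and only options that already appear in the procedure above are used.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a pbs_rsub command for a single-user
# reservation, converting a duration in hours to the seconds passed to -D.
print_rsub_cmd() {
    nodes=$1      # number of 4-way nodes to reserve
    user=$2       # user allowed to submit to the reservation
    start=$3      # start time in the same form as above, e.g. 200902231200
    hours=$4      # reservation length in whole hours
    seconds=$((hours * 3600))
    printf 'pbs_rsub -l select=%s:ncpus=4:node_type=4way -U %s -R %s -D %s -I 120\n' \
        "$nodes" "$user" "$start" "$seconds"
}

# 24 hours corresponds to the -D 86400 used in the procedure above.
print_rsub_cmd 32 username 200902231200 24
```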
From: ARSC Help Desk <consult@arsc.edu>
Date: February 14, 2009 4:16:29 PM GMT-08:00
Subject: Reservation Created on Midnight for 12:00 on 2009-02-23
Dear User,
Per your request we have created a reservation for 128 CPUs on midnight.
Start Time: 2009-02-23 12:00
Resources: 128 CPUs (32 4-way nodes)
Queue Name: R12345
Duration: 8:00:00
You may submit one or more jobs to this reservation with the following syntax:
% qsub -q R12345 -l select=32:ncpus=4:node_type=4way myjob.pbs
or by adding the following PBS directives to your job script:
#!/bin/bash
#PBS -q R12345
#PBS -l select=32:ncpus=4:node_type=4way
#PBS -j oe
#PBS -l walltime=8:00:00
cd $PBS_O_WORKDIR
...
...
We request reservation cancellations be made at least 8 business hours before the start of the reservation. If you have questions about the use of this reservation, please let us know.
Thanks,
Don
--
ARSC Help Desk
Email: consult@arsc.edu
Web: http://www.arsc.edu/support/
Phone: (907) 450-8602
Fax: (907) 450-8601
Standing reservations are a special type of reservation which repeats over a period of time. Requests for standing reservations should follow the same guidelines as standard advanced reservations, however keep in mind that standing reservations will occur multiple times and may use significant resources.
Procedure:
# set the time zone (required for standing reservations)
export PBS_TZID="America/Anchorage";
# generate the reservation
pbs_rsub -l select=32:ncpus=4:node_type=4way -U username -R 200902231200 -r "FREQ=DAILY;UNTIL=20090323" -D 86400 -I 120
# NOTE be sure the reservation is confirmed before continuing
# The pbs_rsub command will return the reservation id (i.e. reservation queue name)
# set the time zone (required for standing reservations)
export PBS_TZID="America/Anchorage";
# generate the reservation
pbs_rsub -l select=32:ncpus=4:node_type=4way -H '*' -G unix_group -U + -R 200902231200 -r "FREQ=DAILY;UNTIL=20090323" -D 86400 -I 120
# NOTE be sure the reservation is confirmed before continuing
# The pbs_rsub command will return the reservation id (i.e. reservation queue name)
From: ARSC Help Desk <consult@arsc.edu>
Date: February 14, 2009 4:16:29 PM GMT-08:00
Subject: Standing Reservation Created on Midnight
Dear User,
Per your request we have created a reservation for 128 CPUs on midnight.
Start Time: 2009-02-23 12:00
Recurring: DAILY
End Date: 2009-03-23
Resources: 128 CPUs (32 4-way nodes)
Queue Name: S67890
Duration: 8:00:00
You may submit one or more jobs to this reservation with the following syntax:
% qsub -q S67890 -l select=32:ncpus=4:node_type=4way myjob.pbs
or by adding the following PBS directives to your job script:
#!/bin/bash
#PBS -q S67890
#PBS -l select=32:ncpus=4:node_type=4way
#PBS -j oe
#PBS -l walltime=8:00:00
cd $PBS_O_WORKDIR
...
...
We request reservation cancellations be made at least 8 business hours before the start of the reservation. If you have questions about the use of this reservation, please let us know.
Thanks,
Don
--
ARSC Help Desk
Email: consult@arsc.edu
Web: http://www.arsc.edu/support/
Phone: (907) 450-8602
Fax: (907) 450-8601
Several dynamic resource checks are configured on midnight and pingo. These resources ensure that the appropriate number of licenses will be available for a job at run-time and should be added to PBS scripts to indicate the number of licenses required by a job.
| Product | Resource Name | Notes |
|---|---|---|
| Cobalt | -l cobalt=num | Value should match the number of CPUs used by the job. (Applies to Pingo and Midnight) LM_LICENSE_FILE=1737@rls1.csi.hpc.mil |
| Abaqus | -l abaqus=num | Number of tokens used by a job is licenses = int(ncpus ^ 0.422 * 5) (Midnight Only) LM_LICENSE_FILE=1727@rls1.csi.hpc.mil |
| Fluent | -l fluent=1,fluentpar=num-1 | The value of fluentpar should be one less than the requested number of CPUs (Midnight Only) LM_LICENSE_FILE=1731@rls1.csi.hpc.mil |
| Abaqus (Academic License) | -l abaqus_acad=num | Number of tokens used by a job is licenses = int(ncpus ^ 0.422 * 5) (Midnight Only) LM_LICENSE_FILE=7219@license1.arsc.edu |
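The license counts in the table are simple functions of the CPU count, so they can be computed rather than worked out by hand. The following is a minimal sketch, assuming that the `num` given to `abaqus` and `abaqus_acad` is the token count from the formula int(ncpus ^ 0.422 * 5). The function names `abaqus_tokens` and `license_request` are hypothetical.

```shell
#!/bin/sh
# Abaqus tokens for a job: int(ncpus ^ 0.422 * 5), per the table above.
abaqus_tokens() {
    awk -v n="$1" 'BEGIN { print int(5 * n ^ 0.422) }'
}

# Hypothetical helper: print the -l license request for a product and CPU count.
license_request() {
    product=$1
    ncpus=$2
    case $product in
        cobalt)      echo "-l cobalt=$ncpus" ;;
        abaqus)      echo "-l abaqus=$(abaqus_tokens "$ncpus")" ;;
        abaqus_acad) echo "-l abaqus_acad=$(abaqus_tokens "$ncpus")" ;;
        fluent)      echo "-l fluent=1,fluentpar=$((ncpus - 1))" ;;
        *)           echo "unknown product: $product" >&2; return 1 ;;
    esac
}

license_request abaqus 128
license_request fluent 16
```

The printed string can be pasted into a `qsub` command line or placed after `#PBS` in a job script.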
License checking is done via the following:
/usr/local/pkg/license_check/bin/abaqus
/usr/local/pkg/license_check/bin/abaqus_acad
/usr/local/pkg/license_check/bin/cobalt
/usr/local/pkg/license_check/bin/fluent
/usr/local/pkg/license_check/bin/lic_buffer_client --app cobalt (returns the number of available cobalt licenses)