rnl@cray.com
ABSTRACT
It is an ongoing challenge to deliver the power of the Cray X1 system to user applications as efficiently as possible. psched, the Cray X1 system placement scheduler derived from its Cray T3E system predecessor, works directly with system hardware to offer a flexible and efficient scheduler to system administrators and users. psched works in cooperation with the PBS Pro resource manager from Altair Engineering, Inc. to organize and oversee the execution of the most demanding and complex high-performance computing workloads.
KEYWORDS:
Cray X1, scheduling
Applications require
three things before they can run on the Cray X1 system: the application
startup code, the kernel, and system placement by psched, the placement scheduler daemon. psched determines on which nodes an application should
run and returns this placement
information to the kernel. When the application is then initiated
with an exec() system call, the placement information guides
the startup process to organize the components of the application into
an optimized configuration on the system.
The remainder of this paper describes the psched design and its operation: application placement, gang scheduling, migration, overflow nodes, checkpoint/restart, and administration.
Throughout this paper the command aprun is used when application initiation is discussed. Except for some minor option differences, aprun and mpirun are equivalent.
The placement
scheduler used on the Cray X1 is derived from the scheduling daemon
used on the Cray T3E. While much of their source code is identical, a
certain amount of the code required extensive redesign to support the
Cray X1 system architecture with its MSP/SSP processors and multiple
processors per node.
A significant difference from the Cray T3E is that the UNICOS/mp kernel
does not directly support application placement. This means
applications cannot run on the Cray X1 system without psched.
The Cray X1 system is made up of two or more nodes connected through a high-speed network. Each node typically has 16 GB of memory and 16 single-streaming processors (SSPs), grouped into four multi-streaming processors (MSPs). The nodes are labeled with one or more flavors, which determine what work will be assigned to the node. These flavors are support, OS, and application.
To run applications on the Cray X1 system, each of the support, OS, and application flavors must be assigned to at least one node. The usual configuration has one node carrying both the support and OS flavors, with the remaining nodes assigned the application flavor. Flavors are configured with the snflv command (short for “set node flavor”) during system startup, but it is possible to alter flavors to an extent during operation. See the Overflow Nodes section for more information.
Some important features of psched are gang scheduling, load balancing, application migration, overflow node support, and cooperation with checkpoint/restart and the PBS Pro workload manager.
An application is started by executing an aprun command. This command initiates the following chain of events that result in an application running on one or more application nodes: aprun posts the application to psched through the daemon's RPC interface; psched selects the application nodes, creates the apteam entry, and attaches the placement information to it; the placement information is returned to the kernel; and a placed exec() starts the application processes on the assigned nodes.
Three aprun options control how an application is placed and set up in application space. These are the width, depth [-d], and shape [-N] options.
The depth [-d] option is used when multithreaded PEs make up
an application. psched will allocate enough resources (the product of width and depth) to run all of the threads at the same time on separate
processors.
The shape option is used to guarantee a certain application layout.
This might be useful for run-to-run consistency or to accommodate large
per-PE memory usage. Without the -N option, psched will choose the number of PEs per node based on
what free space happens to be available. Once the application has been
given a shape, that information becomes part of the application's
distributed memory addressing format and cannot be changed.
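For example, an application with a width of 8 PEs and a depth of 4 threads per PE needs 32 processors so that every thread can run simultaneously. Adding -N 2 forces 2 PEs onto each node, giving the application a fixed 4-node shape; without -N, the number of PEs per node depends on whatever free space psched finds.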
Distributed memory
applications typically have some degree of inter-PE synchronization to
maintain orderly progress among the parallel portions of the solution.
Implicit in the synchronization design is the assumption that the parallel portions execute at the same time. The system guarantees parallel execution by
scheduling at once (that is, in a gang) all of the processors
and memory that belong to an application. psched will gang schedule an application when there is
competition for its resources. If many
applications are present, one or more may run at the same time provided
that they do not compete with each other. The set of gangs eligible to run at once is called a party, and all of the gangs in a party execute during the same time slice. An application may be promoted
from one party to another and so get more processor cycles if it is
located such that it does not compete with the other gangs in the time
slice.
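The grouping of gangs into a party can be pictured as a conflict check over node sets. The sketch below illustrates that idea under a simplified model in which each gang is described only by the nodes it occupies; it is not psched source code, and the structure and function names are hypothetical.

    /* Sketch of party formation: gangs that occupy disjoint node sets
     * can run in the same time slice.  The structures and the greedy
     * grouping below are illustrative, not psched internals. */
    #include <stdbool.h>

    #define MAX_NODES 64

    struct gang {
        bool node_used[MAX_NODES];   /* nodes occupied by this application */
    };

    /* Two gangs compete if any node is occupied by both. */
    bool gangs_compete(const struct gang *a, const struct gang *b)
    {
        for (int n = 0; n < MAX_NODES; n++)
            if (a->node_used[n] && b->node_used[n])
                return true;
        return false;
    }

    /* Greedily collect gangs that do not compete with any gang already
     * chosen; the result (indices into gangs[], stored in party[], which
     * the caller sizes to at least ngangs) is one possible party. */
    int build_party(const struct gang *gangs, int ngangs, int party[])
    {
        int count = 0;
        for (int g = 0; g < ngangs; g++) {
            bool fits = true;
            for (int p = 0; p < count && fits; p++)
                fits = !gangs_compete(&gangs[g], &gangs[party[p]]);
            if (fits)
                party[count++] = g;
        }
        return count;
    }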
psched time slices typically range from a few seconds
to many minutes. The overhead cost of a context switch between parties
primarily involves loading memory for applications about to execute.
Distributed memory applications that span more than one node run with
their memory locked in place. This is required to satisfy off-node
memory references that cannot tolerate page faults.
Context switches are accomplished in two phases: the gangs of the outgoing party are stopped and their memory becomes eligible to be written out, and then memory for the incoming party is loaded and its gangs are started.
Unlocked memory pages are written to disk only when space must be made for an application beginning to execute. Context switching among applications whose combined memory usage does not exceed the physical capacity of the node does not require any memory loading. The duration of a time slice is partly determined by the time needed to swap memory for a new application: if much memory loading is needed, the time slice should be long enough to amortize the context switch overhead. Durations can be changed at any time to accommodate changes in system load or to experiment with different scheduling behavior.
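As a purely illustrative example of the amortization, if loading memory for an incoming party were to take on the order of ten seconds, a time slice of several minutes would keep the context switch overhead to a few percent of the node's time; actual loading times depend on how much memory must be swapped.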
Applications in accelerated mode must be placed in contiguous nodes. Migration is the process of moving a running application from one set of nodes to another (possibly overlapping) set of nodes. Migration is needed to merge empty gaps in the node range so larger applications can be placed.
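The contiguity requirement amounts to finding a run of adjacent free application nodes, and migration exists to create such a run by closing the gaps between running applications. The sketch below shows the search under that simplified free/busy model; it is illustrative only and is not psched's placement or migration algorithm.

    /* Minimal sketch: find the first run of `need` contiguous free
     * application nodes.  Illustrative only; not psched's algorithm. */
    #include <stdbool.h>

    /* Returns the first node of a free run of length `need`, or -1 if
     * no such run exists (a case migration can fix by compacting the
     * running applications toward one end of the node range). */
    int find_contiguous(const bool node_free[], int nnodes, int need)
    {
        int run = 0;
        for (int n = 0; n < nnodes; n++) {
            run = node_free[n] ? run + 1 : 0;
            if (run == need)
                return n - need + 1;
        }
        return -1;
    }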
Overflow nodes
facilitate applications that need all nodes available on the machine.
Overflow nodes are optional and are not recommended for general use by applications.
Ordinary application nodes are assigned an “application” flavor, and no
other. An overflow node is used for special needs and is typically
assigned flavors in addition to “application”, such as "support" or
"operating system." Overflow nodes are not dedicated to application
use, so they must accommodate the operating system and/or support work
as well. Heavy use of overflow nodes may impair system response to
commands and network traffic.
psched must be configured to provide overflow nodes, but those nodes must be protected from unnecessary use. To do this, an
overflow region must be configured in addition to the usual
application region, and assigned to the application domain. Nodes can be assigned to a region using either their node numbers or their flavor signature; the signature is the method used in most cases because it is easier to configure. The flavor signature
of pure application nodes is "ap", and the flavor signature of an overflow node
is usually "os su ap".
The overflow region must be protected from applications small enough to be placed without using overflow nodes. This is done by restricting the use of overflow nodes (via the “width” gate) to applications that require them. See the UNICOS/mp Resource Administration Guide for details about setting up overflow nodes and regions.
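As an example of such a restriction, on a system whose only overflow node also carries the support and OS flavors, the width gate on the overflow region might be set so that only applications too wide to fit within the ordinary application nodes are allowed to be placed there.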
The checkpoint process
does not need intervention from psched. When an application is checkpointed, its apteam state and the psched recovery information attached to the apteam are captured.
A restart of an application by cpr is accomplished with the cooperation of psched. It works as follows: cpr restores the application's processes and the apteam, including the attached recovery information, and psched uses that recovery information to place the application again before execution resumes.
A psched administrator is the root user, plus any users
with administrative privilege declared in the psched configuration
file. The psched daemon is started during network startup, and it
should remain running during system operation. The daemon is controlled through its RPC interface, which is used both for posting applications for placement and for administrative access.
Using the psmgr command, the administrator can alter the
configuration variables to control such scheduling features as the time
slice or load balancing parameters.
There are a number of administrative and user commands, including psmgr, apstat, aprun, and mpirun.
Please refer to the man pages for command information and the UNICOS/mp Resource Administration Guide for configuration and administrative details.
placement information: This is an array of structures attached to the apteam that tells the kernel and startup how the application is allocated. Each entry contains a node identifier and the number of SSPs the application will use on that node (see the sketch at the end of this section).
apteam: An apteam is the kernel's representation of the
application. It is accessed using the application ID (apid). Kernel and other information needed to manage the apteam is contained in or attached to the entry. For
example, the application recovery information needed by psched and checkpoint is attached to the apteam entry. The apstat command provides various views of one or more apteam entries. The apteam entry is created and destroyed by psched.
placed exec()/fork(): This is an exec() or fork() system call preceded by an apteam SelectPlace request. The request tells the kernel to locate the process making the call on the specified application node and then either complete the exec by loading the requested a.out file or perform the fork, as sketched below.