rnl@cray.com
ABSTRACT
It is an ongoing challenge to deliver the power of the Cray X1 system to user applications as efficiently as possible. psched, the Cray X1 system placement scheduler derived from its Cray T3E system predecessor, works directly with system hardware to offer a flexible and efficient scheduler to system administrators and users. psched works in cooperation with the PBS Pro resource manager from Altair Engineering, Inc. to organize and oversee the execution of the most demanding and complex high-performance computing workloads.
KEYWORDS:
Cray X1, scheduling
Applications require
three things before they can run on the Cray X1 system: the application
startup code, the kernel, and system placement by psched, the placement scheduler daemon. psched determines on which nodes an application should
run and returns this placement
information to the kernel. When the application is then initiated
with an exec() system call, the placement information guides
the startup process to organize the components of the application into
an optimized configuration on the system.
The remainder of this paper describes the psched design and its operation: application placement, gang scheduling, migration, overflow nodes, checkpoint/restart, and administration.
Throughout this paper the command aprun is used when application initiation is discussed. Except for some minor option differences, aprun and mpirun are equivalent.
The placement
scheduler used on the Cray X1 is derived from the scheduling daemon
used on the Cray T3E. While much of their source code is identical, a
certain amount of the code required extensive redesign to support the
Cray X1 system architecture with its MSP/SSP processors and multiple
processors per node.
A significant difference from the Cray T3E is that the UNICOS/mp kernel
does not directly support application placement. This means
applications cannot run on the Cray X1 system without psched.
The Cray X1 system is made up of two or more nodes connected through a high-speed network. Each node typically has 16 GB of memory and 16 single-streaming processors (SSPs), grouped into four multi-streaming processors (MSPs). The nodes are labeled with one or more flavors, which determine what work will be assigned to the node. These flavors are support, OS, and application.
To run applications on the Cray X1 system, each of the support, OS, and application flavors must be assigned to at least one node. The usual configuration has one node carrying both the support and OS flavors, with the remaining nodes assigned the application flavor. Flavors are configured with the snflv command (short for “set node flavor”) during system startup, but it is possible to alter flavors to an extent during operation. See the Overflow Nodes section for more information.
Some important features of psched are gang scheduling, load balancing, application migration, overflow node support, and cooperation with checkpoint/restart and the PBS Pro workload manager.
An application is started by executing an aprun command. This command initiates the following chain of events that result in an application running on one or more application nodes: aprun posts the application to psched through the daemon's RPC interface; psched selects the application nodes, creates the apteam entry, and attaches the placement information to it; the placement information is returned to the kernel; and a placed exec() starts the application processes on the assigned nodes.
Three aprun options control how an application is placed and set up in application space. These are the width, depth [-d], and shape [-N] options.
The depth [-d] option is used when multithreaded PEs make up
an application. psched will allocate enough resources (the product of width and depth) to run all of the threads at the same time on separate
processors.
The shape option is used to guarantee a certain application layout.
This might be useful for run-to-run consistency or to accommodate large
per-PE memory usage. Without the -N option, psched will choose the number of PEs per node based on
what free space happens to be available. Once the application has been
given a shape, that information becomes part of the application's
distributed memory addressing format and cannot be changed.
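For example, an application with a width of 8 PEs and a depth of 4 threads per PE needs 32 processors so that every thread can run simultaneously. Adding -N 2 forces 2 PEs onto each node, giving the application a fixed 4-node shape; without -N, the number of PEs per node depends on whatever free space psched finds.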
Distributed memory
applications typically have some degree of inter-PE synchronization to
maintain orderly progress among the parallel portions of the solution.
Implicit in the synchronization design is the assumption that the parallel portions execute at the same time. The system guarantees parallel execution by
scheduling at once (that is, in a gang) all of the processors
and memory that belong to an application. psched will gang schedule an application when there is
competition for its resources. If many
applications are present, one or more may run at the same time provided
that they do not compete with each other. The set of gangs eligible to run at once is called a party, and all of the gangs in a party execute during the same time slice. An application may be promoted
from one party to another and so get more processor cycles if it is
located such that it does not compete with the other gangs in the time
slice.
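The grouping of gangs into a party can be pictured as a conflict check over node sets. The sketch below illustrates that idea under a simplified model in which each gang is described only by the nodes it occupies; it is not psched source code, and the structure and function names are hypothetical.

    /* Sketch of party formation: gangs that occupy disjoint node sets
     * can run in the same time slice.  The structures and the greedy
     * grouping below are illustrative, not psched internals. */
    #include <stdbool.h>

    #define MAX_NODES 64

    struct gang {
        bool node_used[MAX_NODES];   /* nodes occupied by this application */
    };

    /* Two gangs compete if any node is occupied by both. */
    bool gangs_compete(const struct gang *a, const struct gang *b)
    {
        for (int n = 0; n < MAX_NODES; n++)
            if (a->node_used[n] && b->node_used[n])
                return true;
        return false;
    }

    /* Greedily collect gangs that do not compete with any gang already
     * chosen; the result (indices into gangs[], stored in party[], which
     * the caller sizes to at least ngangs) is one possible party. */
    int build_party(const struct gang *gangs, int ngangs, int party[])
    {
        int count = 0;
        for (int g = 0; g < ngangs; g++) {
            bool fits = true;
            for (int p = 0; p < count && fits; p++)
                fits = !gangs_compete(&gangs[g], &gangs[party[p]]);
            if (fits)
                party[count++] = g;
        }
        return count;
    }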
psched time slices typically range from a few seconds
to many minutes. The overhead cost of a context switch between parties
primarily involves loading memory for applications about to execute.
Distributed memory applications that span more than one node run with
their memory locked in place. This is required to satisfy off-node
memory references that cannot tolerate page faults.
Context switches are accomplished in two phases: the gangs of the outgoing party are stopped and their memory becomes eligible to be written out, and then memory for the incoming party is loaded and its gangs are started.
Unlocked memory pages are written to disk only when space must be made for an application beginning to execute. Context switching among applications whose combined memory usage does not exceed the physical capacity of the node does not require any memory loading. The duration of a time slice is partly determined by the time needed to swap memory for a new application: if much memory loading is needed, the time slice should be long enough to amortize the context switch overhead. Durations can be changed at any time to accommodate changes in system load or to experiment with different scheduling behavior.
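As a purely illustrative example of the amortization, if loading memory for an incoming party were to take on the order of ten seconds, a time slice of several minutes would keep the context switch overhead to a few percent of the node's time; actual loading times depend on how much memory must be swapped.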
Applications in accelerated mode must be placed in contiguous nodes. Migration is the process of moving a running application from one set of nodes to another (possibly overlapping) set of nodes. Migration is needed to merge empty gaps in the node range so larger applications can be placed.
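The contiguity requirement amounts to finding a run of adjacent free application nodes, and migration exists to create such a run by closing the gaps between running applications. The sketch below shows the search under that simplified free/busy model; it is illustrative only and is not psched's placement or migration algorithm.

    /* Minimal sketch: find the first run of `need` contiguous free
     * application nodes.  Illustrative only; not psched's algorithm. */
    #include <stdbool.h>

    /* Returns the first node of a free run of length `need`, or -1 if
     * no such run exists (a case migration can fix by compacting the
     * running applications toward one end of the node range). */
    int find_contiguous(const bool node_free[], int nnodes, int need)
    {
        int run = 0;
        for (int n = 0; n < nnodes; n++) {
            run = node_free[n] ? run + 1 : 0;
            if (run == need)
                return n - need + 1;
        }
        return -1;
    }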
Overflow nodes
facilitate applications that need all nodes available on the machine.
Overflow nodes are optional and are not recommended for general use by applications.
Ordinary application nodes are assigned an “application” flavor, and no
other. An overflow node is used for special needs and is typically
assigned flavors in addition to “application”, such as "support" or
"operating system." Overflow nodes are not dedicated to application
use, so they must accommodate the operating system and/or support work
as well. Heavy use of overflow nodes may impair system response to
commands and network traffic.
psched must be configured to provide overflow nodes, but those nodes must be protected from unnecessary use. To do this, an
overflow region must be configured in addition to the usual
application region, and assigned to the application domain. Nodes can be assigned to a region using either their node numbers or their flavor signature; the signature is the method used in most cases because it is easier to configure. The flavor signature
of pure application nodes is "ap", and the flavor signature of an overflow node
is usually "os su ap".
The overflow region must be protected from applications small enough to be placed without using overflow nodes. This is done by restricting the use of overflow nodes (via the “width” gate) to applications that require them. See the UNICOS/mp Resource Administration Guide for details about setting up overflow nodes and regions.
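As an example of such a restriction, on a system whose only overflow node also carries the support and OS flavors, the width gate on the overflow region might be set so that only applications too wide to fit within the ordinary application nodes are allowed to be placed there.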
The checkpoint process
does not need intervention from psched. When an application is checkpointed, its apteam state and the psched recovery information attached to the apteam are captured.
A restart of an application by cpr is accomplished with the cooperation of psched. It works as follows: cpr restores the application's processes and the apteam, including the attached recovery information, and psched uses that recovery information to place the application again before execution resumes.
A psched administrator is the root user, plus any users
with administrative privilege declared in the psched configuration
file. The psched daemon is started during network startup, and it
should remain running during system operation. The daemon is controlled through its RPC interface, which is used both for posting applications for placement and for administrative access.
Using the psmgr command, the administrator can alter the
configuration variables to control such scheduling features as the time
slice or load balancing parameters.
There are a number of administrative and user commands, including psmgr, apstat, aprun, and mpirun.
Please refer to the man pages for command information and the UNICOS/mp Resource Administration Guide for configuration and administrative details.
placement information: This is an array of structures attached to the apteam that tells the kernel and startup how the application is allocated. Each entry contains a node identifier and the number of SSPs the application will use on that node (see the sketch at the end of this section).
apteam: An apteam is the kernel's representation of the
application. It is accessed using the application ID (apid). Kernel and other information needed to manage the apteam is contained in or attached to the entry. For
example, the application recovery information needed by psched and checkpoint is attached to the apteam entry. The apstat command provides various views of one or more apteam entries. The apteam entry is created and destroyed by psched.
placed exec()/fork(): This is an exec() or fork() system call preceded by an apteam SelectPlace request. The request tells the kernel to locate the process making the call on the specified application node and then either complete the exec by loading the requested a.out file or perform the fork, as sketched below.