Cray XT Series System Hardware Monitoring and Management with the XTGUI

Jim Robanske
Cray Inc.
411 First Avenue S, Suite 600 Seattle, WA 98104-2860
(206) 701-2000
Fax: (206) 701-2500

jimr@cray.com


ABSTRACT:
The XTGUI (XT Graphical User Interface) is a new tool that makes it relatively simple to monitor and manage Cray XT series hardware.  The XTGUI is a Java-based client/server application.  The server side runs on the System Management Workstation (SMW), and the client side runs on network-attached workstations.  The XTGUI was released with the XT v1.5 software.
 
KEYWORDS:
XT3, XT4, XT Series, hardware, monitoring, management, CRMS
 

1 Introduction

Cray XT series systems can scale from small, single-cabinet systems to enormous systems configured into hundreds of cabinets with hundreds of thousands of components.  All system components are monitored and managed through the Cray RAS and Management System (CRMS), the integrated, independent system that monitors system components, manages hardware and software failures, controls startup and shutdown processes, and manages the interconnection network.

The XTGUI tool is a Java-based client/server application for the CRMS.  It simplifies the monitoring and management of the XT series hardware.  Released with XT v1.5, the XTGUI provides near real-time monitoring of the status of XT series system components (cabinets, blades, CPUs, Seastar processors, and Seastar links).  It allows quick and easy component fault identification.  The XTGUI can also configure and modify the state of system components, and it supports all xtcli commands except boot.


2 Architecture

The XTGUI application consists of a server side, written in a combination of C and Java, and a client side, which is a Java application. 

The server process runs on the Cray XT SMW.  It collects information on all the major components of the XT system (cabinets, blades, CPUs, Seastars, and Seastar links) by listening to CRMS events.  It then forwards this information to all connected XTGUI client processes.  The server also executes xtcli commands on behalf of XTGUI client processes.
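
The forwarding behavior described above can be pictured as a small publish/subscribe hub.  The following Java sketch is illustrative only: the class name EventHub, its methods, and the use of plain strings for events are assumptions made for this example and are not taken from the XTGUI code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical fan-out hub; names are illustrative, not the XTGUI API. */
final class EventHub {
    // CopyOnWriteArrayList lets clients connect or disconnect while an event is being delivered.
    private final List<Consumer<String>> clients = new CopyOnWriteArrayList<>();

    void connect(Consumer<String> client) {
        clients.add(client);
    }

    void disconnect(Consumer<String> client) {
        clients.remove(client);
    }

    /** Forwards one component-state event to every connected client. */
    void publish(String event) {
        for (Consumer<String> client : clients) {
            client.accept(event);
        }
    }

    public static void main(String[] args) {
        EventHub hub = new EventHub();
        hub.connect(event -> System.out.println("client A received: " + event));
        hub.connect(event -> System.out.println("client B received: " + event));
        hub.publish("c3-0c2s0n3 warning"); // event text is made up for the demo
    }
}
```

Every connected client receives the same event, which matches the description of the server forwarding component information to all connected XTGUI clients.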

The client side is the user interface portion of the application.  It presents a color-coded, graphical view of the current state of all XT system components, and it lets the client send commands back to the SMW to manage those components.

3 XTGUI Security

The XTGUI server process operates in either of two modes to allow XTGUI client connections:

XTGUI client password-level access control and encrypted data communications from remote sites can be accommodated.  vncserver is run on a secure workstation or the SMW.  You can ssh into a secure site from a remote location and then use vncviewer to run the XTGUI on the secure platform.  The XTGUI application is then displayed on the remote system.

4 XTGUI Server Process

The XTGUI server process is started on the SMW with a CRMS daemon.  The server process provides the following functions:

There are two log files associated with the XTGUI server:

The XTGUI server process reads a properties file (/opt/cray/etc/RsmsJServerProperties.txt) upon startup.  Here, various options may be set to modify the operational behavior of the server.  For example:

# The buffer size (number of lines to read) used when "tailing" server log files.
server.watchedFileBufferSize=200
# The frequency with which log files are examined for new data
server.watchedFileLatencyMilliseconds=5000
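
Because the file uses the standard key=value properties syntax, it can be read with java.util.Properties.  The sketch below shows one way to load the two keys listed above.  The class name ServerSettings and the fallback values are assumptions made for this example; the paper does not show the server's actual loading code or its defaults.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for the two documented settings; not the server's real class. */
final class ServerSettings {
    final int watchedFileBufferSize;
    final long watchedFileLatencyMillis;

    private ServerSettings(int bufferSize, long latencyMillis) {
        this.watchedFileBufferSize = bufferSize;
        this.watchedFileLatencyMillis = latencyMillis;
    }

    /** Reads the two keys shown above; the fallback values (200, 5000) are assumptions. */
    static ServerSettings load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source); // lines beginning with '#' are treated as comments
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int buffer = Integer.parseInt(
                props.getProperty("server.watchedFileBufferSize", "200").trim());
        long latency = Long.parseLong(
                props.getProperty("server.watchedFileLatencyMilliseconds", "5000").trim());
        return new ServerSettings(buffer, latency);
    }

    public static void main(String[] args) {
        String sample = "# sample settings\n"
                + "server.watchedFileBufferSize=200\n"
                + "server.watchedFileLatencyMilliseconds=5000\n";
        ServerSettings settings = load(new StringReader(sample));
        System.out.println(settings.watchedFileBufferSize + " lines, polled every "
                + settings.watchedFileLatencyMillis + " ms");
    }
}
```

To read the real file, pass a java.io.FileReader opened on /opt/cray/etc/RsmsJServerProperties.txt instead of the StringReader used in the demo.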

 

5 XTGUI Client Process

The XTGUI client process is summarized as follows:

        1. The XTGUI client process provides a color-coded visual representation of all major XT system components, clearly illustrating their current status.

        2. The client application may run on any workstation with network access to the SMW. 

        3. The client may be operated in “view only” mode, where no XT configuration actions are allowed. 

        4. The XTGUI supports all xtcli commands, except boot.

Upon client startup, the file RsmsClientProperties.txt (located in the user's home directory) is read.  Various user options are saved in this file, as well as a number of client configuration parameters.




Figure 1: XTGUI Main Frame






6 Six Main Views of the XTGUI

The XTGUI is presented in six main views.  These are described in the following sections.

 

View 1: System Map

The system map is a color-coded view of how cabinets are arranged on the computer floor.  It is presented at the top left of the application frame.  The system map provides a visual representation of each XT cabinet that indicates the state of all major system components.  This allows the system operator to easily see the status of all system components, including any existing error conditions. 

If any subcomponent within a cabinet is under a warning or alert condition, a colored rectangle within the affected cabinet on the system map indicates this.  For example, in Figure 1 above, a warning condition exists on CPU c3-0c2s0n3.  A corresponding yellow rectangle on the system map indicates the condition on a subcomponent within cabinet c3-0.  If the operator clicks on this cabinet, he or she will see a detailed view that shows all cabinet subcomponents.  Clicking on a cabinet in the system map also updates the information presented in the active subview, whether that is the Component Detail subview or the Error List subview.


Figure 2: System Map






View 2: Cabinet Detail

The Cabinet Detail view shows the state of a selected cabinet and its components.  Each blade within the cabinet is shown, including each blade's CPUs and Seastar processors.  The state of a Seastar or any of its links is indicated by color.  When you click on a component in the Cabinet Detail view, the information in the active subview (either the Component Detail subview, shown in Figure 4 below, or the Error List subview) is modified: the data is sorted to show the selected component at the top of the table, followed by any associated subcomponents.  For example, in Figure 1 above, the CPU in the warning state has been selected.  That component and all other components on that blade are shown at the top of the table in the Component Detail view (Figure 4).  The blade that houses the selected component is also highlighted as selected within the Cabinet Detail view.  This makes it easy to keep track of which component has been selected and where it is located within the cabinet.


Figure 3: Cabinet Detail Map






View 3: Component Detail

The component detail view provides information on each component within the selected cabinet or blade. 


Figure 4: Component Detail


 



When a column header is selected, the table re-sorts according to that column's data.  If the component detail table, system log, console log, or any other tab (except the error list tab) is selected when you click on a cabinet in the system map or a component in the cabinet detail view, the top of the component detail table fills with the records of the selected cabinet or component and its associated subcomponents.

In the Component Detail view, the right mouse button brings up a menu with the following options:

Clear Reserve

Create Nodelist

Diagnostics

Disable

Enable

Halt

Lock

Partition

Power Up

Power Down

Force Power Down

Reserve

Set Empty

Slot Up

Slot Down

Force Slot Down

View Console Output


 

All menu options in the Component Detail view are context-sensitive: a selected option affects only those selected components on which the operation is possible, given their type and current state.

 

Component Detail Menu Options

When any menu option is chosen, the user must confirm (Yes/No) before the operation proceeds.


Figure 5: Component Detail Menu Options






Clear Reserve

This option releases a reserved component to normal operation.  While a component is reserved, it cannot accept new jobs, although its current jobs run to completion.

Create Nodelist

This option creates a list of selected nodes.  If you select a cabinet or blade, all of their component nodes are added to the nodelist.  If you select individual nodes, only they are added to the nodelist.

This option is similar to the Save Node List option on the File menu, except that Save Node List creates a list of all nodes in the entire system.

By default, this file is written to your home directory and named RsmsNodeList-MMDD-HHMM.SS.txt.
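
As an illustration of the naming convention, the following Java sketch builds such a file name from a timestamp.  The class and method names are hypothetical, and the reading of the pattern (two-digit month, day, 24-hour hour, minute, and second) is an assumption, since the paper gives only the pattern itself.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper that formats a nodelist file name; not part of the XTGUI. */
final class NodeListFileName {
    // MM = month, dd = day, HH = hour (0-23, an assumption), mm = minute, ss = second.
    // The '-' and '.' characters are copied to the output unchanged.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("MMdd-HHmm.ss");

    static String forTime(LocalDateTime when) {
        return "RsmsNodeList-" + when.format(STAMP) + ".txt";
    }

    public static void main(String[] args) {
        // Prints a name based on the current local time.
        System.out.println(forTime(LocalDateTime.now()));
    }
}
```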

 

Diagnostics Menu Options


Figure 6: Diagnostics Menu







A progress dialog window is shown during the diagnostic test.



Diagnostics Operation

When the diagnostic test completes, a diagnostic summary tab is added to the XTGUI display.  Right-click on this window to bring up options to:




Figure 7: Diagnostic Summary






Detail Menu Options

Disable

If links, nodes, or Cray Seastar chips have hardware problems, you can mark the component as "downed" so that it cannot be reallocated into service.

Enable

Re-enable a component and return it to normal operation.

Halt

Stop a component.  The component ceases operation immediately, and any data or processes running on it are lost.

Lock

Locks a component manually.  Components are locked automatically when a command that can change their state is running.  As the command is started, the state manager locks the component so that nothing else can affect the component's state while the command is running.

When a manager is finished with a command, it unlocks the component automatically.  If the manager for some reason fails to unlock the component, it can be unlocked manually with the Show Locks option on the Actions menu.
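
The lock-while-running behavior described above follows a common pattern: acquire the lock, run the command, and release the lock in a finally block so that it is released even if the command fails.  The Java sketch below illustrates that pattern only.  The class ComponentLocks, its method names, and the use of session strings as lock owners are assumptions for this example and do not describe the CRMS state manager's implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical lock table keyed by component name; illustrative only. */
final class ComponentLocks {
    // component name -> session that currently holds the lock
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    /** Returns true if the lock was free and is now held by the session. */
    boolean tryLock(String component, String session) {
        return owners.putIfAbsent(component, session) == null;
    }

    /** Releases the lock only if the given session holds it. */
    void unlock(String component, String session) {
        owners.remove(component, session);
    }

    /** Manual release regardless of owner, comparable to dismissing a stale lock. */
    void dismiss(String component) {
        owners.remove(component);
    }

    boolean isLocked(String component) {
        return owners.containsKey(component);
    }

    /** Runs the command with the component locked; returns false if it was already locked. */
    boolean runLocked(String component, String session, Runnable command) {
        if (!tryLock(component, session)) {
            return false;
        }
        try {
            command.run();
            return true;
        } finally {
            unlock(component, session); // automatic unlock when the command finishes
        }
    }

    public static void main(String[] args) {
        ComponentLocks locks = new ComponentLocks();
        boolean ran = locks.runLocked("c3-0c2s0n3", "session-1",
                () -> System.out.println("locked while running: " + locks.isLocked("c3-0c2s0n3")));
        System.out.println("command ran: " + ran + ", still locked: " + locks.isLocked("c3-0c2s0n3"));
    }
}
```

The dismiss method corresponds loosely to the manual unlock described above, which is needed only when the automatic release did not happen.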

 


Power Up

Power up a component.  Power commands are hierarchical -- that is, there are a number of ways to power up or power down a lower-level component.  For example, to power up a node, you can power it up directly or power up a component of which it is a part, such as a blade.

Power Down

Power down a component.  Powering down a cabinet powers down all components within the cabinet, including the L0 controllers.

Force Power Down

Force a power down of a component.  If you choose this option, the power manager ignores the operational state of the components that are being powered down. 

Reserve

Reserve a component.  Once a component is reserved, it will not accept new jobs, but any jobs running on the component are completed as normal.



View 4: Error List

All components under a current warning or alert status are shown in the error list.  If the error list has been selected and you click on a cabinet in the system map, or on a component in the cabinet detail view, the records on the list are re-sorted so that the selected component and its associated subcomponents move to the top.  If you select a table column header, the table sorts on the contents of that column.
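
The "selected component first" ordering described above can be sketched with a stable sort.  The Java example below assumes that a subcomponent's name begins with its parent's name, as c3-0c2s0n3 begins with c3-0 in the example from the System Map section.  The class and method names are hypothetical, and the XTGUI's actual sorting code is not shown in this paper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of moving a selected component's rows to the top of a table. */
final class SelectedFirst {
    /** True if name is the selected component or one of its subcomponents. */
    static boolean belongsTo(String name, String selected) {
        if (!name.startsWith(selected)) {
            return false;
        }
        if (name.length() == selected.length()) {
            return true;
        }
        // Reject look-alikes such as c3-10 when c3-1 is selected.
        return !Character.isDigit(name.charAt(selected.length()));
    }

    static List<String> sort(List<String> rows, String selected) {
        List<String> sorted = new ArrayList<>(rows);
        // Boolean false orders before true, so matching rows come first.
        // List.sort is stable, so rows keep their relative order within each group.
        sorted.sort(Comparator.comparing((String name) -> !belongsTo(name, selected)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("c0-0c0s0n1", "c3-0c2s0n3", "c3-10c0s0n0", "c3-0c2s0");
        System.out.println(sort(rows, "c3-0"));
    }
}
```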

Components are selected with the left mouse button.  When a component is selected, the cabinet detail map displays the cabinet with the selected component, plus the blade that houses it.  This helps locate it. 

The right mouse button brings up a menu with the following options:

  1. Select All

  2. Clear Warning

  3. Clear Alert


Figure 8: Error List






View 5: Event Log

The event log displays recent information sent to files monitored by the XTGUI server.  It also displays all commands executed by the XTGUI server on behalf of the XTGUI client.


Figure 9: Event Log






View 6: Console Log

The console log displays console log messages from all XT nodes.  You can display console output from selected nodes through the View Console Output menu option on the Component Detail popup menu.


Figure 10: Console Log







XTGUI Window Toolbar Options

Various actions and configuration options are available through the toolbar menus in the upper left corner of the XTGUI window:  File, Actions, Preferences, and Help, listed here:

File:  Save Node List, Exit

Actions:  Show Components, Show Active Commands, Show Boot Configuration, Show Server Status, Import/Export Sections, Show Locks

Preferences:  General, Connection, Partition Configuration

Help:  Help, About



File Menu

Save Node List:

Create a text file listing all CPUs in the system, marking each "n" for empty or disabled, "i" for service, "c" for compute.  For example:

c0-0c0s0n0 i

c0-0c0s0n1 n

c0-0c0s0n2 n

c0-0c0s0n3 i

c0-0c0s1n0 c

c0-0c0s1n1 c

This text file is saved in the user's home directory, and the naming convention is: RsmsNodeList-mmdd-hhmm.ss.txt
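
Because each line pairs a node name with a one-letter flag, the file is easy to process with a short program.  The following Java sketch counts nodes by flag.  The class name, the enum names, and the assumption that the two fields are separated by whitespace are illustrative choices; the flag meanings ("n", "i", "c") are taken from the description above.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for saved node list lines such as "c0-0c0s0n0 i". */
final class NodeListSummary {
    enum Kind { EMPTY_OR_DISABLED, SERVICE, COMPUTE }

    /** Maps the documented one-letter flag to a node kind. */
    static Kind kindOf(String flag) {
        switch (flag) {
            case "n": return Kind.EMPTY_OR_DISABLED;
            case "i": return Kind.SERVICE;
            case "c": return Kind.COMPUTE;
            default: throw new IllegalArgumentException("unknown node flag: " + flag);
        }
    }

    /** Counts nodes of each kind; blank lines are skipped. */
    static Map<Kind, Integer> count(List<String> lines) {
        Map<Kind, Integer> totals = new EnumMap<>(Kind.class);
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] fields = trimmed.split("\\s+"); // assumes whitespace-separated fields
            if (fields.length != 2) {
                throw new IllegalArgumentException("unexpected line: " + line);
            }
            totals.merge(kindOf(fields[1]), 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "c0-0c0s0n0 i", "c0-0c0s0n1 n", "c0-0c0s0n2 n",
                "c0-0c0s0n3 i", "c0-0c0s1n0 c", "c0-0c0s1n1 c");
        System.out.println(count(sample));
    }
}
```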

 

Exit:

Exit the XTGUI application.



Actions Menu

Show Components:

Select "Show Components" for a dialog that allows you to select physical or logical groupings of components:


Figure 11: Component Selection



 



For example, choosing “Service Nodes” produces the result shown in Figure 12:

 


Figure 12: Component Selection Results







Show Active Commands

This option displays a table of currently active commands that have been started by the XTGUI client.


Figure 13: Show Active Commands


 



Show Boot Configuration

This option displays a dialog that shows the boot configuration of all XT partitions.


Figure 14: Show Boot Configuration


 



Show Server Status

This option displays a window that provides information on the XTGUI server process and lists all connected XTGUI clients.

 


Figure 15: Server Status






Import/Export Sections

This option displays a dialog to import or export sections of the XT system.  It is enabled only if more than one section has been defined.

 


Figure 16: Import/Export Sections


 

 



Show Locks

Show Locks displays all currently active session locks.  A left-click selects rows in the table.  A right-click pops up a menu that lets you see the affected components for each session and provides an option to dismiss the lock.

 


Figure 17: Show Locks


 



Preferences Menu

General

The General option displays a dialog that allows the configuration of three options:

  1. Mouse over mode (the cabinet detail window automatically switches to the cabinet in the system map that the mouse is currently hovering over).

  2. Deiconize on warning/alert.

  3. Tool tip delay.


Figure 18: General Preferences






Connection

Use the Connection option to configure the host name and port number of the primary and secondary SMW systems.


Figure 19: Connection Dialog






Partition Configuration

This option displays the dialog for the definition and modification of partitions.


Figure 20: Partition Configuration






Help Menu

Help

This option displays the XTGUI online help window.


Figure 21: Online Help


 





About

Select About to display the version number of the XTGUI application in a pop-up window.



Future Directions

This document describes the first release of the XTGUI.  The goal in the first release was to provide a tool that would allow quick and simple component-fault identification.  In succeeding releases, a greater variety of system information will be shown by the tool, including a number of environmental attributes such as temperatures, voltages, fan speeds, network counters, and so on.  We invite you to submit suggestions as to how the product could be improved to better meet your needs.



About the Author



Jim Robanske is the lead software engineer for the XTGUI project at Cray Inc. He may be reached at jimr@cray.com.