IRIX Scalability

Gabriel Broner
Silicon Graphics
655F Lone Oak Dr, Eagan, MN 55121, USA
E-mail: broner@sgi.com

CUG Minneapolis, May 1999


Copyright © 1999 Silicon Graphics, Inc. All rights reserved.


Abstract

This paper describes SGI's approach to scalability using both single systems and clusters. As a result of scalability and reliability enhancements to the IRIX operating system, we are now beta testing 256-processor Origin systems. In addition, clustering enhancements have made clusters a practical alternative for scaling.


Contents

1. Introduction
2. SMP Scalability
3. Reliability (RAS)
4. Cluster Scalability
5. Single System vs. Cluster
6. Conclusions

1. Introduction

This paper describes SGI's approach to scalability and the work done in the IRIX operating system to make it possible. Section 2 covers the work in SMP scalability to support large single ccNUMA machines. Section 3 covers the reliability enhancements necessary to support large single systems. Section 4 presents scaling as a cluster. Section 5 discusses the alternatives of scaling as a single system vs. scaling in a cluster environment. Section 6 presents the conclusions.

2. SMP Scalability

SMP scalability refers to the work in the IRIX operating system to support a large single ccNUMA machine with good performance characteristics. This work includes reducing lock contention in the operating system kernel and restructuring a number of operating system data structures. Global data structures have been replaced with local equivalents, which greatly reduces the cache coherency traffic for highly contended data structures.
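
As a hedged illustration of this kind of restructuring (a sketch only, not actual IRIX kernel code), the following C fragment shows the general idea: a single global counter updated by every processor bounces between caches, while per-CPU copies keep updates local and are summed only when a total is needed. The names NCPU, CACHE_LINE, and cpu_id() are assumptions made for the example.

    /*
     * Illustrative sketch only -- not IRIX kernel source.  A heavily
     * updated global counter is split into per-CPU copies so that the
     * common update path touches only the local cache line.
     */
    #define NCPU        256               /* assumed maximum CPU count      */
    #define CACHE_LINE  128               /* assumed cache line size, bytes */

    /* Pad each counter to its own cache line to avoid false sharing. */
    struct percpu_counter {
        long count;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static struct percpu_counter stats[NCPU];

    /* cpu_id() stands in for whatever "current processor" primitive
     * the kernel provides; it is an assumption in this sketch. */
    extern int cpu_id(void);

    /* Fast path: no global lock, no shared cache line. */
    void stat_increment(void)
    {
        stats[cpu_id()].count++;
    }

    /* Slow path: sum the per-CPU values when a total is needed. */
    long stat_total(void)
    {
        long total = 0;
        int  i;

        for (i = 0; i < NCPU; i++)
            total += stats[i].count;
        return total;
    }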

Many sites are currently running IRIX on 128-processor Origin 2000 systems, and we have had 256-processor systems available since the summer of 1998. Support for 256-processor systems was first introduced with IRIX 6.5.3, and the first two sites to run machines of this size were NASA Ames in late 1998 and NCSA in early 1999. Since the initial release, a number of enhancements have been added to IRIX, including the removal of scaling bottlenecks, scheduling enhancements, and general bug fixes.

Our approach to the IRIX scaling work has been to run large HPC applications on our in-house 256-processor system to identify bottlenecks. The obstacles encountered helped us identify the highest-priority scaling work. The SMP scaling work has not only improved the performance of very large systems, but has also considerably reduced scaling bottlenecks that are apparent on medium and large IRIX servers (some of which show up even on 8-processor servers).

Large SMP systems support the MPI, shmem, and OpenMP programming models. MPI, which was originally intended for clusters of machines, runs extremely well on a single machine. Our current single-machine MPI implementation uses shared memory for communication and achieves message latencies of 5 to 15 microseconds.
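
As a small, generic example of the shared memory programming model on such a machine (not code taken from this paper), the OpenMP fragment below parallelizes a simple loop across the processors of a single system; the array size and loop body are arbitrary choices for illustration.

    /* Minimal OpenMP example: the arrays live in one shared address
     * space and each thread (typically one per processor) works on a
     * slice of the iteration space. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;
        int i;

        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++) {
            a[i] = 0.5 * i;
            b[i] = 2.0 * i;
            sum += a[i] * b[i];
        }

        printf("dot product = %g (max threads = %d)\n",
               sum, omp_get_max_threads());
        return 0;
    }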

3. Reliability (RAS)

As we scale the IRIX system to support larger single-system servers, it is important that we also address system reliability. Because a large system has many hardware components, it is statistically more likely to encounter a hardware error than a small system is. Software errors are also more frequent, because the workloads of large systems are much more demanding than those of small systems; software bugs such as race conditions are hit much more frequently on large systems. As an example, the first time we ran the AIM 7 benchmark on a 256-processor system, the system crashed after one minute, while the same benchmark takes about a month to crash a medium-sized server.

We can divide the work done to improve the reliability of IRIX into two general areas: software practices and failure isolation features.

In the area of software practices, IRIX has gone from a very loose software development and release process to a much stricter one, with emphasis on reliability, testing, code review, and the scheduling of code submissions into the system. At the same time, a great deal of the IRIX work has focused on actual bug fixing. The overall result is that IRIX 6.5 is the most reliable IRIX release to date.

In the area of failure isolation features, we are using field data on system failures to prioritize resiliency projects. For example, memory errors are comparatively frequent, and IRIX previously always crashed if a memory error occurred inside the operating system. We identified a series of situations in which it is possible to continue operating after a memory failure, for example when a page is being zeroed out, or when the failure occurs in a read-only page that also resides in backing store. Tolerating the more frequent hardware errors improves overall system reliability. These features are being phased into the IRIX 6.5 releases.
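
A hedged sketch of the kind of decision described above (not the actual IRIX implementation; all type and function names are assumptions) would look roughly as follows: on a memory error, the handler checks whether the affected page is in a recoverable state before resorting to a panic.

    /*
     * Illustrative sketch only -- not IRIX source.  On a memory error
     * the handler asks whether the affected page can be recovered
     * instead of crashing the whole system.
     */
    enum page_state {
        PAGE_BEING_ZEROED,     /* contents irrelevant; restart the zeroing   */
        PAGE_CLEAN_READONLY,   /* identical copy exists in backing store     */
        PAGE_DIRTY             /* no other copy; the data would be lost      */
    };

    struct page {
        enum page_state state;
    };

    /* Hypothetical helpers standing in for real kernel services. */
    extern void discard_and_rezero(struct page *p);
    extern void refetch_from_backing_store(struct page *p);
    extern void system_panic(const char *why);

    void handle_memory_error(struct page *p)
    {
        switch (p->state) {
        case PAGE_BEING_ZEROED:
            /* The page had no useful contents yet: take a fresh page. */
            discard_and_rezero(p);
            break;
        case PAGE_CLEAN_READONLY:
            /* A good copy exists on disk: read it back in. */
            refetch_from_backing_store(p);
            break;
        default:
            /* No recovery possible: fall back to the old behavior. */
            system_panic("uncorrectable memory error");
        }
    }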

4. Cluster Scalability

Clustering permits scaling beyond the boundaries of a single system, and it does so with good fault containment characteristics. Clusters have traditionally not been very attractive to the HPC community, largely because of their low-bandwidth, high-latency interconnects, their divided disks and file systems, and the difficulty of programming and administering them.

A number of new developments at SGI are making clusters a more interesting alternative. We recently introduced the GSN (Gigabyte System Network) hardware interconnect. GSN offers 600 MBytes/second of measured bandwidth and a low hardware latency measured at 7 microseconds. IRIX supports the Scheduled Transfer (ST) protocol, which allows user-to-user communication over GSN without operating system intervention. MPI will run on top of ST, and SGI has committed to providing an MPI latency over GSN of under 30 microseconds.
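
As a generic illustration of how such message latencies are typically measured (this is not SGI benchmark code), the MPI ping-pong microbenchmark below times small round-trip messages between two ranks; the same program runs unchanged whether MPI is carried over shared memory inside a single Origin or over ST and GSN between cluster nodes.

    /* Generic MPI ping-pong latency microbenchmark (illustrative).
     * Ranks 0 and 1 exchange a small message many times; half the
     * average round-trip time approximates the one-way latency. */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERATIONS 10000

    int main(int argc, char **argv)
    {
        char buf[8];
        int rank, i;
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();

        for (i = 0; i < ITERATIONS; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         &status);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         &status);
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        t1 = MPI_Wtime();
        if (rank == 0)
            printf("approximate one-way latency: %.2f microseconds\n",
                   (t1 - t0) / (2.0 * ITERATIONS) * 1.0e6);

        MPI_Finalize();
        return 0;
    }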

SGI has also developed CXFS, a cluster file system that offers a single file system across a cluster with local file system performance characteristics. From a usability point of view, CXFS permits accessing the same files and directories from every machine (like NFS or DFS). From a performance point of view, CXFS is very different. In NFS, requests to access a file are forwarded to the NFS server, which in turn issues the I/O; latencies are high, and so is the overhead on the NFS server. In CXFS, clients obtain the necessary file system information to perform I/O locally, without going to the server. All disks in CXFS have direct connections to all machines via a Fibre Channel loop. The end result is local file system speed in most cases, combined with a cluster-wide global file system.
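
Because CXFS presents an ordinary file system interface, applications need no changes. The fragment below is plain POSIX I/O against an assumed CXFS mount point (/cxfs/shared is a hypothetical path used only for this example); the difference from NFS is that the reads are serviced over the direct Fibre Channel path rather than being forwarded through a file server.

    /* Ordinary POSIX I/O on an assumed CXFS mount point.  The same
     * code runs on every machine in the cluster; CXFS supplies the
     * metadata so the data transfer goes directly to the shared disks. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char block[65536];
        ssize_t n;

        int fd = open("/cxfs/shared/results.dat", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Data moves over the direct Fibre Channel path, not through
         * an NFS-style server, so throughput approaches local speed. */
        while ((n = read(fd, block, sizeof(block))) > 0)
            ; /* process the data here */

        close(fd);
        return 0;
    }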

In the area of cluster resource management, SGI has partnered with Platform Computing to use their LSF workload management system. LSF offers batch job submission and interacts with the IRIX resource management facilities.

5. Single System vs. Cluster

The previous sections outlined the enhancements SGI has made to improve scaling both within a single system and across a cluster. Which one is better? The single system will be better in terms of ease of use, administration, programming, access to all files and devices, and the performance of certain classes of applications. Specifically, we expect large parallel applications that do not impose heavy system call loads to perform extremely well on a single machine. The disadvantages of a large single server are: 1) the mean time between failures decreases as the system grows in size, and 2) the performance of some classes of applications will be worse than on a cluster. For example, applications that perform a wide variety of system calls and do not need high-bandwidth inter-process communication may run better on a cluster. The decision between a cluster and a large single system therefore depends on how the machine is to be used.

At the same time, large SGI Origin systems can be configured flexibly. For example, a single 256-processor Origin system can be reconfigured into a cluster of four 64-processor systems, so if a customer's workload profile changes, the system configuration can also change.

6. Conclusions

A number of enhancements to IRIX scalability and reliability have permitted us to support much larger single systems: there are many 128-processor systems in the field, and a few beta sites are running 256-processor systems. A series of enhancements in the area of clusters has allowed us to offer a compelling cluster solution; in particular, MPI over GSN and CXFS will significantly improve the performance of clusters running large parallel applications. The enhancements made in scaling, reliability, and clustering benefit our very high-end solutions and also improve our solutions in the general server world.