Jim Harrell
Silicon Graphics, Inc.
655-F Lone Oak Drive
Eagan, Minnesota 55121
ABSTRACT:
IRIX for Origins with more than 64 processors was released early in 1998. This talk covers the status of the "High-End" Origins in the field and the plans for software releases of IRIX that will support functionality important to the larger machines. The next releases of IRIX will be using a new model for software development and release. An overview of the release model is included in the discussion of plans.
This talk covers five separate areas of interest to "High-End" Origins, those Origins with more than 64 processors. The first area is the status of the machines in the field. The status is based on information available to Silicon Graphics. The second area is the release plans: the plans for the next releases supporting more than 64 processors, and the changes in the IRIX development and release model. The third area covers the feature plans relevant to large system sites. This is aimed at sites migrating from Unicos to IRIX and focuses on features that are being ported from Unicos to IRIX. The fourth area covers futures and is a view into the current work going on in operating system development. The fifth area is a discussion of issues. At this time there is one primary issue: the need for information from large Origin sites about the reliability of their systems.
Origin systems have been shipping for some time, but support for large Origins is more recent. The larger systems required more time so that support for the MetaRouter and IRIX support for the larger processor counts could be completed. The Manufacturing Release (MR) for these systems was in April 1998. These systems run the 6.5-SE IRIX release, a special release of IRIX produced to support the large Origins.
The large Origin systems, those with more than 64 processors, are distinguished by the presence of a MetaRouter. The MetaRouter provides the interconnection capacity for systems with more than 64 processors. Because there are so many Origin systems, and because groups of processors can easily be added or removed, it is easiest to count the large systems by the number of MetaRouters. Currently there are 45 systems with a MetaRouter and more than 64 processors. An additional 50 to 60 systems are expected to ship in calendar 1998. This is a small percentage of the 18,000 Origin systems already installed. An important aspect of this is the difference in software exposure compared with Cray Unicos systems. A Cray system running Unicos might have only 50 to 100 installations, so the software exposure is limited to those sites. When 18,000 systems can run the same operating system and associated software, the amount of code coverage and the resulting confidence are substantially higher. It is interesting to note that 16 Priority 1, 2, and 3 bugs have been reported against 6.5-SE. This is substantially fewer bugs than have usually been associated with a new Cray system.
A new version of IRIX, 6.5, was released in May. This version supports all current platforms with the exception of large Origins. As mentioned previously, the current release of IRIX for large systems is 6.5-SE. The 6.5-SE release was an early Beta version of 6.5, plus changes to support the larger systems. The 6.5-SE release supports only large Origins. This release was on time and reflects an effort in software development to stay on schedule. A group of patches released on 1 June 1998 fixes several problems in 6.5-SE and is recommended for use on all 6.5-SE systems. The 6.5-SE release will be replaced by the 6.5.1m maintenance release. Support for 6.5-SE will end one month after the release of 6.5.1m.
The next release for large Origins is 6.5.1m. This will be an all-platform release. The planned release date is early August 1998. The 6.5.1m release will merge 6.5-SE with the base 6.5 release, add support for hardware introduced since 6.5 was released, and include many bug fixes.
Changes to the software development and release processes are underway. The 6.5 release is the first release under the new release "family" mechanism now being developed. The next part of this discussion is an overview and status of those changes.
Both Cray Research and Silicon Graphics have been working on new software development and release processes. In general, the changes both organizations were making had very similar purposes. The new SGI process model will integrate the Cray and SGI models into one process strategy. The first basic purpose of the new model is to meet the need for shortened time to market, with predictable releases and delivery of software. This will increase responsiveness to market opportunities and allow customers to plan software upgrades predictably. Customers have strongly requested compatibility across a family of platforms and releases, and this compatibility is included in the goals of the process changes. Both Cray and SGI have been working on process changes that would improve software quality and customer satisfaction. Finally, as the companies merge their processes, there is a need to align the software development and release processes with the product development processes. SGI is adopting a new product development process called Integrated Product and Process Development (IPPD). This new process is similar to the existing product development processes at both Cray and SGI. The new software development and release processes will be integrated with the IPPD processes.
The first goal of the software development changes, that is, the changes that software developers are making in the way they do their work, is to increase the maturity of software at initial release. The next goal is to provide both backward and forward binary compatibility for user applications across a given family of platforms and releases. As mentioned earlier in the discussion of process changes, one of the goals is to have software products delivered on a predictable schedule. In order to make this change, many other changes are underway in software development. The new model for software development will have global criteria for quality and compatibility, but will encourage the responsible development organization to take and maintain ownership of the product. Finally, in order to be able to measure improvement, development will use quantifiable metrics.
The new releases will be organized as a release family. A release family is composed of a major release, followed by a group of intermediate releases and, if required, patches.
A major release, like 6.5, is the foundation for the other release types. This release type provides a place for major architectural changes in software. An example would be reorganizing VM (not that this is planned). A major release also provides an End of Life point for hardware and software. Major releases are infrequent, on the order of 18 months to two years apart.
The intermediate release is a subtype of the major release. Intermediate releases are frequent and stable, and are released at predictable intervals between major releases. There are two types of intermediate release. The first type is the maintenance release. This release type provides fixes since the last release and support for new hardware. The second type is the feature release. This release type provides early access to new software features, along with support for new hardware and fixes. The two types of intermediate release will be delivered at the same time.
Patches are an older release mechanism. A patch contains one or more bug fixes for critical problems. In the future, patches will be released only as required for critical problems.
Currently the changes are moving from definition to implementation with 6.5.1. There is a great deal of work underway, and many issues are left to resolve. As we move forward, further details will be made available.
A number of features relevant to large Origin customers are being developed. Some of these are features ported from Unicos, and others are IRIX features. The IRIX scheduler Miser is available in 6.5. Share II is available with 6.5. Partitioning support, which allows a system to be partitioned into multiple systems, is in 6.5, with support for more than 64 processors in 6.5.1. Unicos features such as limits, UDB, and Unicos accounting are to be released in the first half of 1999.
In the platform organization of software development we are working on a number of projects we believe will be important for the future of large-scale systems. We are currently prototyping on a 256-processor Origin in order to work through issues of scalability in preparation for Cellular IRIX and SN1. We are proceeding with the MIPS R12K "TREX" development, and the designs and initial SN1 simulations are underway. Some of the initial work for SN2 is in progress. Our plan is to use IRIX as the operating system for large-scale SN2 systems.
It has been our normal procedure to give mean time to interrupt (MTTI) statistics at Cray User Group meetings. We are monitoring MTTI on large Origins, but do not have complete information at this time. We believe the MTTI is growing rapidly. Our belief is based on the small number of bug reports and on the MTTI data that is available. We need help from sites to make MTTI data available to us. Our goal is to ensure that we are able to respond to issues and plan for the future.
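For sites that want to report reliability data, a minimal sketch of one way to compute MTTI (mean time to interrupt) from a log of unscheduled interruptions is given here. It assumes the common definition of MTTI as operational time divided by the number of interrupts in the reporting period, and it assumes that downtime is subtracted from the period before dividing. The record layout and the names `Interrupt` and `mtti_hours` are hypothetical and are not part of any SGI or Cray tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass
class Interrupt:
    """One unscheduled interruption of service (hypothetical record format)."""
    start: datetime  # when the system stopped providing service
    end: datetime    # when the system returned to service


def mtti_hours(period_start: datetime,
               period_end: datetime,
               interrupts: Sequence[Interrupt]) -> Optional[float]:
    """Return MTTI in hours for one reporting period.

    Assumed definition: (period length - total downtime) / interrupt count.
    Returns None when there were no interrupts, since MTTI is then undefined.
    """
    if period_end <= period_start:
        raise ValueError("period_end must be later than period_start")
    if not interrupts:
        return None
    # Sum the duration of every interruption; start from a zero timedelta.
    downtime = sum((i.end - i.start for i in interrupts), timedelta())
    uptime = (period_end - period_start) - downtime
    return uptime.total_seconds() / 3600.0 / len(interrupts)


if __name__ == "__main__":
    # Illustrative data only: a 30-day period with three 2-hour interruptions.
    start = datetime(1998, 6, 1)
    end = start + timedelta(days=30)
    log = [
        Interrupt(datetime(1998, 6, 5, 8), datetime(1998, 6, 5, 10)),
        Interrupt(datetime(1998, 6, 14, 22), datetime(1998, 6, 15, 0)),
        Interrupt(datetime(1998, 6, 27, 3), datetime(1998, 6, 27, 5)),
    ]
    print(mtti_hours(start, end, log))  # 714 uptime hours / 3 interrupts = 238.0
```

The dates and durations in the example are invented for illustration and do not describe any installed system. A site using a different definition, for example one that counts scheduled downtime as an interrupt, would need to adjust the uptime calculation accordingly.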