Cray Origin2000 System Support and Infrastructure for ProgEnv and OS
Cray Research, A Silicon Graphics Company
1660 Old Pecos Trail, Suite F, Santa Fe, New Mexico 87505
Cray Research, A Silicon Graphics Company
655F Lone Oak Drive, Eagan, Minnesota 55121
Cray Research is in the process of defining the infrastructure for developing, releasing and supporting software products on the Cray Origin2000 platform. This paper will explore Cray's involvement, current plans, status, and open issues, from both a Programming Environment and an Operating System perspective.
The requirements for the high end of the market are very different from the requirements of the desktop; therefore the Cray division within Silicon Graphics has been given the responsibility for the technical high performance computing (HPC) portion of SGI's business. For the Origin2000, this has been defined as 33 processors and above. Within this paper, references to Cray Origin2000 translate to those Origin2000 systems that have greater than 32 processors.
The Cray Origin2000 software team has put together a number of objectives for Cray to meet in order to be successful in the technical HPC market. For each objective, we will define the internal mechanisms that we will be using to meet it. You may not even be aware of some of these internal mechanisms, but they are the glue that keeps our processes flowing.
The first objective is to continue to track release content, commitments, and problems and fixes. This is critical in allowing us to provide our customers with the exceptional information flow and support that they require. In order to meet this objective, we will continue tracking all features going into each release via a feature database. We can then focus testing and documentation resources on these changes as well as be able to convey the changes to our customer base. We will continue tracking each commitment we make, to completion, to ensure that no commitments are lost; this will be done via a commitment database. Lastly, we will have a single problem and fix tracking tool. This is important so that no problems or fixes are lost, and so that we can easily retrieve a status of every problem and the associated fix.
The second objective identified by the Cray Origin2000 team is to continue to provide high quality releases to our customers. There are a number of internal mechanisms that are described here to allow us to meet this objective.
Our third objective is to continue to provide a high level of support. In addition to some of the processes and tools already mentioned, there are two additional mechanisms we will provide to meet this objective. The first is to continue sending critical information to our customers concerning problems, work-arounds and fixes via our Field Notice mechanism. Other pertinent information will be sent via the Pipeline, which is analogous to the Cray Research Service Bulletin (CRSB). Secondly, we have outlined our responsiveness goals for getting fixes to our customers. These goals are defined later in this paper.
Our fourth objective is to create a high-end infrastructure that minimizes duplication of tools, procedures and checkout between Eagan and Mountain View.
Our last objective is to create a basis for future Scalable Node products. We do not want to expend a lot of needless effort creating a process that only works for the current Origin2000 platform. We plan to spend the time and effort ensuring that our processes carry us through to future SN products.
Support Process Strategy
This section of the paper will focus more on the support process itself and how we intend to provide support to the Cray Origin2000 customers.
Problem reporting continues in the same manner as it occurs today. Customers who have an on-site analyst should write a Software Problem Report (SPR) whenever a problem is detected. Customers who do not have an on-site analyst will continue to call a designated call center. That call will then be escalated to the correct channels within Cray, and an SPR will be opened and tracked. The escalation path has been defined from the time a problem is reported, either via an SPR or a call to the call center, through customer service and finally through engineering. The Cray Customer Service groups will continue to be the front line of support, escalating problems as appropriate to the Cray engineering groups for resolution. Customers will continue to be able to track both problems and fixes via the SPR or CRInform tools.
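The reporting and escalation flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `escalation_path` and the stage labels are hypothetical and do not correspond to any actual Cray tool or interface.

```python
def escalation_path(has_onsite_analyst, needs_engineering):
    """Return the stages a reported problem passes through, in order.

    Hypothetical sketch: stage names are labels chosen for illustration.
    """
    # Entry point depends on whether the site has an on-site analyst.
    if has_onsite_analyst:
        path = ["on-site analyst"]
    else:
        path = ["call center"]

    # In both cases an SPR is opened so the problem can be tracked.
    path.append("SPR opened")

    # Customer Service is the front line for every reported problem.
    path.append("Cray Customer Service")

    # Engineering is involved only when escalation is appropriate.
    if needs_engineering:
        path.append("Cray engineering")
    return path


if __name__ == "__main__":
    print(" -> ".join(escalation_path(False, True)))
```

The sketch keeps the SPR as a common step on both entry routes, since the SPR is the record that customers later use to track the problem and its fix.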
Cray is using the same fix and delivery mechanism for the operating system as Mountain View does, namely binary patches. For the programming environment, fixes will be available via the patch facilities, but the exact format of those fixes has not yet been defined. Other products such as DMF and NQE will offer binary replacements via an FTP server.
As mentioned earlier in this paper, customer support information will be sent to customers via the Field Notice and Pipeline mechanisms as appropriate.
A Three-Tiered Support Approach
All products that are available for the low-end Origin2000 systems are also available on the Cray Origin2000 systems. This is good news for customers because it means commonality across the entire Origin line. However, this creates a much larger support load from a Cray software perspective. For example, Cray has not had to address problems for graphics or web-based tools in the past.
Because Cray does not have expertise in some of these areas, and because problems in these areas are generally of limited interest at the high-end, we are using a tiered approach to supporting the problems which may be reported by our Cray Origin2000 customers.
The first tier consists of those products which are critical to the majority of Cray Origin2000 customers. This list includes products such as the Cray Message Passing Toolkit (MPT), MipsPro Cray Fortran 90, MipsPro Cray C++, MipsPro C, SCSL, libf90, libu, libc, libC, assembler, loaders, NQE, the IRIX Operating System itself, and others. This is where Cray will expend most of its effort in analyzing, fixing, testing and tracking problems and fixes.
The second tier consists of those products that are being replaced by other products and are therefore being phased out. We need to support these products now, but not long term. This list includes Fortran 77 (multiprocessing), ProDev WorkShop tools, Complib, ProMPF Tools, and others.
The third tier consists of those products with limited interest on the high-end. This list includes Ada 95, ProAda Dev, Adobe, Framemaker, graphics, cosmo and others. These products will still be supported. Problems with some of them will be escalated straight to Mountain View; for the others, Cray will continue to provide support but will apply no additional effort to improve or further strengthen the products.
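The three tiers above amount to a lookup from product to support handling. The following Python sketch illustrates that idea only; the `Tier` enum, the `SUPPORT_TIER` table and the `support_tier` function are hypothetical names, not an actual Cray tool. The table lists only some of the products named in this paper, since each tier's list ends with "and others".

```python
from enum import Enum


class Tier(Enum):
    CRITICAL = 1     # critical to the majority of Cray Origin2000 customers
    PHASING_OUT = 2  # supported now, but being replaced by other products
    LIMITED = 3      # limited interest on the high-end


# Hypothetical, partial table built from product names given in the paper.
SUPPORT_TIER = {
    "Cray Message Passing Toolkit (MPT)": Tier.CRITICAL,
    "SCSL": Tier.CRITICAL,
    "libf90": Tier.CRITICAL,
    "NQE": Tier.CRITICAL,
    "IRIX": Tier.CRITICAL,
    "Fortran 77 (multiprocessing)": Tier.PHASING_OUT,
    "ProDev WorkShop": Tier.PHASING_OUT,
    "Complib": Tier.PHASING_OUT,
    "ProMPF Tools": Tier.PHASING_OUT,
    "Ada 95": Tier.LIMITED,
    "ProAda Dev": Tier.LIMITED,
}


def support_tier(product):
    """Return the Tier for a product, or None if it is not in the table."""
    return SUPPORT_TIER.get(product)


if __name__ == "__main__":
    for name in ("NQE", "Complib", "Ada 95"):
        print(name, "->", support_tier(name).name)
```

Returning `None` for an unlisted product, rather than guessing a tier, reflects the fact that the paper does not give the complete product lists.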
Cray is still in the process of implementing the various pieces of the infrastructure and support plans. The next two sections discuss the current status of each area. The first section discusses those areas that are in place and operational. The next section will discuss those areas that are not yet completely set up.
Plans in Place
Cray Origin2000 customer problems are currently being reported via site analysts and call centers, escalated through the customer service organization based on severity and expertise, and then handed off to the Cray engineering organization if a true problem or design issue is identified. These problems are being tracked via SPRs. Both the escalation and problem tracking mechanisms are very similar to what is used on our PVP and MPP systems. In many cases, Cray engineers do not yet have the expertise with IRIX and Developer Magic products to resolve identified problems quickly. Thus a mapping of Cray responsible product managers to their Mountain View counterparts has been developed to aid in sharing of expertise and knowledge to quickly and efficiently resolve problems. Currently Cray is supporting 4 customers with up to an additional 16 expected by mid-1997.
Within the High Performance Computing market, known turnaround on reported problems is very important. Thus, Cray has extended the SPR response guidelines used for PVP and MPP software products to Cray Origin2000 reported problems as well. It is not yet known when Cray will be able to start meeting these guidelines and, as with all the platforms, we do not always meet these goals but will be striving to do so. The goals are as follows:
Software Response Guidelines
As well as working to quickly turn around fixes to customer problems, Cray is also actively performing analysis on 64 and 128 processor systems. The aim is to better understand scalability issues, to help increase stability, and to understand the current performance of the software on the Origin2000. This information will be used to strengthen the products in the near term and to help focus longer term development directions.
To aid in learning about the hardware and software, as well as providing an environment much like that used by our customers, Cray has installed multiple Origin2000 systems in Eagan ranging from 8 to 128 processor systems. These systems are all centrally managed and run either released or pre-released software.
It is extremely important that we at Cray/SGI honor all our customer commitments. Thus, we are already tracking all Cray Origin2000 commitments to ensure these agreements are met.
Plans Not Yet Completed
There is still much work to do to create a solid infrastructure and support process to provide quality customer support for high-end Origin2000 customers, although progress is being made in many areas. This section will discuss those processes and mechanisms that are not yet completely operational.
On the problem tracking front, neither Cray's nor SGI's problem tracking systems meet the needs of the entire company. We therefore are working on a single problem tracking database and toolset. Until this is available, we will continue to use our separate mechanisms to minimize change and interruption at all levels of engineering with SGI. However, as a consequence, we must maintain much of the same information in two databases so that no information is lost. Cray is using SPR to track Cray Origin2000 problems, but much of the information must be manually updated and monitored. We cannot continue to operate in this mode and are actively working to further automate the process. Additionally, we need to choose and begin gathering information on metrics to track responsiveness.
Along the same lines, SGI/Cray is working to combine and streamline information on hardware and software issues currently provided via CRSB and Pipeline articles as well as Cray Field Notices (FN) and several internal SGI communication mechanisms. Cray is also increasing its involvement in resolving customer reported problems as it gains expertise with the various tools on the system and IRIX. This allows Cray to ensure quicker turnaround on fixes and also pinpoint areas of concern in meeting high-end customer needs.
We are beginning to release products that contain both SGI and Cray components, such as Fortran 90. This presents a new challenge for Cray in providing fixes for the entire SGI product line, not just for Cray platforms. We are still developing a mechanism for incorporating this more generalized support into the existing Cray development structure.
In support of release and delivery, Cray is working to develop a common mechanism for tracking features and release content, to arrange field tests specific to high-end customers, to consolidate documentation, and to ensure that users are able to access this SGI documentation from multiple workstation platforms. Additionally, much effort is being applied to ensuring that both IRIX and asynchronous software are pre-installed on each system and verified prior to shipment, as is the case for traditional Cray systems.
In conclusion, Cray is making good progress in creating an environment to support Cray Origin2000 customers, but there is much work to do. The process will continue to evolve as we learn more about what works effectively and as we work to merge Cray and SGI processes. As we continue to learn about the Origin2000 product, we need to ensure that problems found on Cray Origin2000 machines are reported solely to Cray for resolution. Without this, we will be less effective in supporting customers' needs: we will not be able to monitor stability via metrics information, and we will be slowed in our ability to create an effective environment to support high-end Origin2000 computing. Lastly, we need your help to keep us informed about what is effective and where we need to focus additional efforts to ensure high quality products and support for Cray Origin2000 systems.