The first part of this paper provides an update on the current plans for UNICOS/mk on the CRAY T3E system. It discusses progress toward completing the features related to UNICOS compatibility and the status of production system features in the field (e.g., political scheduling and checkpoint/restart). Development efforts are also discussed.
The second part of this paper discusses the status of performance and scalability from an operating system perspective. UNICOS/mk was designed to be a high performance distributed operating system. Measurements of the CRAY T3E operating system are presented and some comparisons are made with other relevant systems.
There are four general areas that make up the work currently being done for the UNICOS/mk project. All of these topics will be discussed in more detail in later subsections. The four areas are as follows:
The main goal of the project was to make the UNICOS/mk features and functionality equivalent to the content of the UNICOS system; that is, to make UNICOS/mk look and work like UNICOS for an MPP platform. Most of this work is nearly complete. This effort is shown in figure 1; the first three columns represent all the features currently available in the UNICOS/mk system, while the last column represents the work left to be done.
Figure 1. UNICOS/mk Feature Efforts
Most of the recent time and effort has gone into support. Work is being done to fix system bugs and feature problems and to prepare systems to go to the field.
More time has also been spent on resilience lately. Work has started to make UNICOS/mk more resilient to software and hardware errors. The goal is to allow a machine running user applications to continue when software or hardware errors occur that would have crashed an earlier version of the system.
Finally, performance is an area that needs improvement. The two areas we will be focusing on are scalability and general system performance.
Figure 2 shows the general trend in effort and use of resources for the UNICOS/mk project. As the feature and functionality enhancement work levels off, work on reliability and resiliency increases.
Figure 2. Direction of Effort
The monthly release cycle was planned to slow down at the 1.5 release, but because of feature slips and the need to get critical fixes to the field, this release schedule will continue until 4CY97 (as shown in figure 3). Note that support will also continue to be limited to the current release until the UNICOS/mk 2.0 release timeframe.
Figure 3. UNICOS/mk Release and Support Schedule
Starting with the UNICOS/mk 2.0 release, the plan is to lengthen the release cycle to three months. The cycle for releases after 2.0 may be longer than three months; the exact cycle time will be defined at a later date.
The weekly updates will continue to be produced as needed. This mechanism is a very good way to get critical fixes to the field in a timely manner. Customer feedback validates that it is an effective tool, and it will continue to be used indefinitely.
These features are discussed in more detail in the following paragraphs.
Political scheduling has been available since the UNICOS/mk 1.4.1 release and is being used at several sites. Initially, it had some problems, but they have been resolved since then. It is now functioning very well, and the documentation is being enhanced to describe the functionality in a more useful way.
This feature allows you to run gang scheduling on the application PEs or on both application and command PEs, and allows for load balancing. We believe this is a full-featured scheduling mechanism that includes a distributed version of the UNICOS fair-share scheduler functionality, and we recommend its use. Be aware that running multi-PE applications with gang scheduling where swapping is required means that you will need swap space. We highly recommend using fast devices for swap in this case (e.g., FCNs).
Checkpoint/restart development was planned in three phases. The first phase of the work was released in the UNICOS/mk 1.5 release. This work allows an application or program to be checkpointed and restarted, but does not support all of the UNICOS checkpoint functionality. The primary piece of functionality missing in 1.5 is support for checkpointing open files. Code fragments in the documentation show how applications using open files can still be checkpointed with this 1.5 version by closing the files before the checkpoint and reopening them after the restart.
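The close/reopen approach described above can be sketched roughly as follows. This is only an illustration in Python and is not one of the code fragments from the UNICOS/mk documentation; the class `ReopenableFile` and its method names are hypothetical, and the checkpoint/restart step itself is represented only by a comment.

```python
import os
import tempfile


class ReopenableFile:
    """Hypothetical helper: a file that can be closed before a checkpoint
    and reopened at the same offset after a restart."""

    def __init__(self, path, mode="r+b"):
        # "r+b" opens an existing file for read/write without truncating it,
        # so reopening with the same mode is safe.
        self.path = path
        self.mode = mode
        self.offset = 0
        self.handle = open(path, mode)

    def prepare_for_checkpoint(self):
        # Record the current position, then close the file so that no file
        # is open when the checkpoint is taken.
        self.handle.flush()
        self.offset = self.handle.tell()
        self.handle.close()
        self.handle = None

    def resume_after_restart(self):
        # Reopen the file and seek back to the recorded position.
        self.handle = open(self.path, self.mode)
        self.handle.seek(self.offset)


def demo():
    """Read part of a file, simulate a checkpoint/restart, read the rest."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(b"0123456789")
        rf = ReopenableFile(path)
        first = rf.handle.read(4)
        rf.prepare_for_checkpoint()
        # ... the checkpoint and the later restart would happen here ...
        rf.resume_after_restart()
        rest = rf.handle.read()
        rf.handle.close()
        return first, rest
    finally:
        os.remove(path)
```

The only state that has to survive the checkpoint in this sketch is the path, the open mode, and the offset. A file opened in a truncating mode would need a different reopen mode after the restart.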
The second phase will be released in the UNICOS/mk 1.6 system. This phase is a beta version of the UNICOS-compatible checkpoint/restart. We are looking for sites that are interested in verifying this version of the feature. Contact Jim Harrell if interested.
The third phase of the implementation will be available in the UNICOS/mk 2.0 release. It will be fully operational and fully compatible with the UNICOS version of checkpoint/restart.
Pcache is the logical successor to ldcache. It caches physical disk devices: each pcache is associated with a disk server and uses memory on the node running that disk server to cache reads and writes. We have found that even relatively small caches give substantial performance improvements. We believe that by using multiple disk servers and pcaches, you can improve I/O performance substantially. PEs with larger memory allow for larger caches, which increases the amount of data that can be cached on the node.
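To illustrate why cache size relative to the working set matters for a block cache of this kind, here is a toy model in Python. It is not the pcache implementation: the class `BlockCache`, the least-recently-used replacement policy, and the read-only behavior are all assumptions made for the sketch, since the paper does not describe the pcache internals.

```python
from collections import OrderedDict


class BlockCache:
    """Toy read cache in front of a block device (assumed LRU replacement)."""

    def __init__(self, capacity_blocks, read_block):
        self.capacity = capacity_blocks
        self.read_block = read_block  # callable: block number -> bytes
        self.blocks = OrderedDict()   # block number -> cached data
        self.hits = 0
        self.misses = 0

    def read(self, block_no):
        if block_no in self.blocks:
            # Cache hit: mark the block as most recently used.
            self.hits += 1
            self.blocks.move_to_end(block_no)
            return self.blocks[block_no]
        # Cache miss: fetch from the device and evict the oldest block
        # if the cache is now over capacity.
        self.misses += 1
        data = self.read_block(block_no)
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)
        return data


def hit_ratio(capacity_blocks, file_blocks, passes=2):
    """Scan a file of `file_blocks` blocks sequentially `passes` times and
    return the fraction of reads served from the cache."""
    cache = BlockCache(capacity_blocks, lambda n: bytes([n % 256]))
    for _ in range(passes):
        for n in range(file_blocks):
            cache.read(n)
    return cache.hits / (cache.hits + cache.misses)
```

In this model, a file that fits entirely in the cache is served from memory on every pass after the first, while a repeated sequential scan of a file larger than the cache gets no hits at all under LRU. Real workloads fall between these two extremes.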
Remote mount is available in the UNICOS/mk 1.5 release. It allows you to spread the system's file system tree across multiple file servers, each residing on its own PE. Each file server can manage one or more file systems.
Since UNIX is a file-based operating system, it is no surprise that the scalability of the file system has a significant impact on the system as a whole. Remote mount is therefore a very important scalability feature. It provides a way to spread CPU utilization and memory buffer caching across multiple PEs. Good performance improvements have been noted in cases where I/O can be spread across multiple file servers that are remotely mounted.
3.2.5 UNICOS/mk 1.6.1 Release
The UNICOS/mk 1.6.1 feature content is as follows:
3.2.7 UNICOS/mk 2.1 Release
The UNICOS/mk 2.1 feature content is as follows:
In addition, several features that affect performance are discussed.
The graph in figure 4 is taken from data that we have on system reliability in the field. The numbers themselves may be off by some percentage in either direction, but it is the best data available at this time.
The numbers show the 13-week rolling average of system MTTI for each machine size. The smaller systems have good MTTI, but MTTI goes down as the size of the system increases. Thus, the largest systems present the biggest challenge in this area. History shows that improvement on the large machines has a good effect on the smaller systems as well.
It should be no surprise that the larger system MTTIs are lower. If this graph could be extended back in time, you would see a ramp up of MTTI as each successively larger machine was introduced. Work on improving the large system MTTIs will continue.
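As an illustration of what a 13-week rolling MTTI figure involves, the following Python sketch computes one value per week from weekly operating hours and weekly interrupt counts. The paper does not state the exact formula behind figure 4, so the definition used here (total hours in the trailing window divided by total interrupts in that window) and the function name `rolling_mtti` are assumptions.

```python
from collections import deque


def rolling_mtti(weekly_hours, weekly_interrupts, window=13):
    """Return one MTTI value (hours per interrupt) per week, computed over
    the trailing `window` weeks.

    Weeks before a full window is available yield None, as do windows that
    contain no interrupts (MTTI is undefined there under this definition).
    """
    hours = deque(maxlen=window)       # keeps only the last `window` weeks
    interrupts = deque(maxlen=window)
    result = []
    for h, n in zip(weekly_hours, weekly_interrupts):
        hours.append(h)
        interrupts.append(n)
        if len(hours) < window or sum(interrupts) == 0:
            result.append(None)
        else:
            result.append(sum(hours) / sum(interrupts))
    return result
```

Pooling hours and interrupts over the window, rather than averaging per-week MTTI values, avoids dividing by zero in weeks with no interrupts. A different definition would give somewhat different numbers.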
Figure 4. CRAY T3E MTTI History
The UNICOS/mk 1.5 release contained a feature called analysis of tolerance of failing compute PEs. This feature is the first part of the software/hardware resiliency effort. It isolates a PE that has failed because of a hardware or software error and allows the system to continue running.
We have been successful in isolating many types of software errors and some of the hardware errors. Proving that we can isolate any particular error will take more time. The feature tries to clean up any application or program running on the PEs that have failed; some cases may need further work. We plan on extending this feature, as it is the centerpiece of the system resiliency work.
The following areas of performance are of importance:
Application performance was discussed by Kent Koeninger at the San Jose CUG; see his paper for more information on this topic.
Remote mount and FSA were discussed by Mark DuChene and Brian Gaffey at the San Jose CUG; see their paper or their tutorial for more information on this topic. This topic is also discussed briefly in the following paragraphs.
The pcache performance issues are described in more detail in the following paragraphs. For single disk performance, the differences between Cray products are discussed in the following paragraphs, using fiber disks as an example.
Figure 5 shows that when it is possible to direct I/O to a number of file servers, scaling is excellent. With the six file servers that were tested, no roll off in performance was noted.
Figure 5. Remote Mount Performance
Figure 6 illustrates the difference between using raw I/O and using FSA. It shows that preallocated FSA writes and FSA reads scale quite well with increasing numbers of files and PEs, while raw I/O does not scale as well. Using FSA is therefore an advantage for multi-PE applications, because I/O is processed locally. Note that un-preallocated writes currently perform poorly, as shown on the bottom line of figure 6. This situation will be improved in a future release.
Figure 6. FSA versus Raw I/O Performance
Figure 7 shows pcache performance using four different disk servers and, therefore, four different disk caches. The cache size is 20 MB per PE. This example is something of a best case, given that the file fits into the cache. The two items to note are that performance is good for a single cache and that it scales well up to the four caches tested.
Figure 7. CRAY T3E pcache Performance
Normally, Cray does not compare performance issues between two architectures, especially when they are as different as the CRAY J90se and the CRAY T3E are. However, it is sometimes useful to compare system activities to determine where improvements can be made.
Figure 8 shows that both systems are capable of getting maximum performance out of DD308s for both reads and writes. There is a difference in the rate at which the maximum read performance is achieved. This difference appears well before the 48 KB point; 48 KB is the buffer size used by the libraries. To see this difference, a program would have to do its own I/O buffering.
The difference between these two machines is that for small buffer sizes, the CRAY T3E does not get the read performance that the CRAY J90se does. We know that part of this will be regained with the use of the optimized compiler. There are also some planned performance changes that will regain part of the difference. Some of it may be due to memory speed and interrupt structure. It is important to remember that the CRAY T3E is a distributed system. However, this comparison shows that for general use, the differences are not obvious.
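A program that does its own I/O buffering, as mentioned above, could measure read rate as a function of buffer size along the following lines. This Python sketch is not the benchmark used for figure 8; the function names and the list of buffer sizes are illustrative assumptions, with 49152 bytes standing in for the 48 KB library buffer size.

```python
import os
import tempfile
import time


def read_with_buffer(path, buffer_size):
    """Read the whole file using requests of `buffer_size` bytes.
    Returns (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 disables Python's own buffer, so each read() issues one
    # request of at most `buffer_size` bytes.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def sweep(path, sizes=(512, 4096, 16384, 49152)):
    """Return {buffer_size: (bytes_read, MB_per_second)} for each size."""
    results = {}
    for size in sizes:
        total, elapsed = read_with_buffer(path, size)
        rate = total / elapsed / 1e6 if elapsed > 0 else float("inf")
        results[size] = (total, rate)
    return results


def demo(file_size=200_000):
    """Create a temporary file, sweep the buffer sizes, and clean up."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * file_size)
        return sweep(path)
    finally:
        os.remove(path)
```

On most current systems a small temporary file like this is served from the operating system's file cache, so the rates mainly reflect per-request overhead rather than disk speed. Measuring a device would require a file much larger than memory or a raw device path.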
Figure 8. CRAY J90se versus CRAY T3E DD308 Performance
Much work has been done to both port and improve the UNICOS/mk system on the CRAY T3E platform. Much of the feature/functionality work is done or near completion. This will allow us to refocus our efforts and resources on enhancing the resiliency, scalability, and performance of the CRAY T3E system such that it will be a very useful tool for our customers.