Migrating Users to the Cray T3E from the T3D and Intel Paragon
Jay Boisseau, Ken Steube, and Max Pazirandeh, San Diego Supercomputer Center
ABSTRACT: SDSC's CRAY T3D and Intel Paragon users were granted access to SDSC's CRAY T3E in January '97. In this paper we describe the issues involved in migrating all users from the T3D and Paragon to the T3E in just a few months. In particular, we discuss techniques for porting T3D-specific user applications (e.g., those using T3D features not yet implemented on the T3E, or Cray PVP front-end features) and Paragon-specific user applications (e.g., those using the NX message passing library), and for optimizing applications for the T3E. The discussion is sufficiently general to also be of value for T3E sites not migrating users from other platforms.
Many high performance computing centers purchasing CRAY T3E systems will have to address the issues involved in migrating users from other platforms. SDSC recently faced these issues when we migrated users from our Intel Paragon and CRAY T3D systems. The issues in porting from the T3D to the T3E are usually subtle ones due to the change from cf77 to f90 and from a hosted system to a self-hosted system. On the other hand, porting from an Intel Paragon to the T3E can be a much larger task. There are many potential problems that arise in porting codes in either case; our goal is to provide some assistance to those who will help others migrate to T3E systems.
SDSC received its T3E in November 1996. After system testing and acceptance, we opened it to friendly users for a brief period to prepare the system for production-style use. The system was available to all users on March 3, 1997. The T3D was decommissioned two weeks later, after users had had some time to port their codes to the T3E. Our Intel Paragon was removed from service to the general public two weeks after that. Thus, in the span of 4 weeks, all MPP users at SDSC had to migrate to the T3E.
SDSC's CRAY T3E has 256 processing elements (PEs). Each PE includes a 300 MHz DEC Alpha 21164 processor with 128 megabytes of core memory. 240 of these processors are available to parallel jobs, with the remaining processors devoted to interactive serial jobs and system functions. We have approximately 135 Gbytes of disk storage, the bulk of which is available to users in the /work file system. The /work file system is managed by an automated purge that deletes files as needed if they have not been accessed for over 80 hours.
Small jobs (32 processors for less than 60 minutes) may be run interactively, but jobs requiring more resources are forced to run in batch. Similar limits on interactive jobs were imposed on Paragon and T3D users. The purpose of this restriction is to force large jobs to be run in NQS (now called NQE) so that better resource management can be performed.
SDSC's T3D had only 128 processors, and T3D jobs were restricted to using 2^n processors. Except for a few dedicated runs, the largest jobs typically run on SDSC's T3D were on 64 processors. SDSC's T3E has twice as many processors, and there is no power-of-2 restriction on the number of processors per job on the T3E. SDSC users commonly use up to 128 processors on parallel jobs and can get 192 with relative ease. Dedicated runs on up to 240 processors are also available.
The list of major issues we addressed in porting user codes from the CRAY T3D to the CRAY T3E includes:
The change of shmem_udcflush and shmem_udcflush_line to no-ops might impact certain benchmark programs, since they commonly flush the data cache to normalize timing results. Without shmem_udcflush, timings will be somewhat less reproducible.
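One portable way to approximate the effect of a cache flush before a timed region is to walk a buffer larger than the caches, so that earlier contents are likely to be displaced. The following is only a minimal sketch under stated assumptions, not a replacement provided by Cray or a method used at SDSC: the names evict_cache, time_cold, and sample_kernel and the 4 Mbyte buffer size are hypothetical, and the buffer size would need to be chosen to exceed the caches of the machine being measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical size: assumed to be larger than every cache level on the
   machine under test. Adjust for the actual hardware. */
#define EVICT_BYTES ((size_t)4 * 1024 * 1024)

/* Write and read every byte of buf so that previously cached data is
   likely to be displaced. The checksum is returned so that the compiler
   cannot discard the loop. */
static unsigned long evict_cache(unsigned char *buf, size_t n)
{
    unsigned long sum = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        buf[i] = (unsigned char)(i & 0xffu);
        sum += buf[i];
    }
    return sum;
}

/* Stand-in for the code being benchmarked (hypothetical). */
static void sample_kernel(void)
{
    volatile double acc = 0.0;
    int i;
    for (i = 0; i < 100000; i++)
        acc += (double)i * 0.5;
}

/* Time one call of kernel() in CPU seconds after walking the eviction
   buffer. Returns a negative value if the buffer cannot be allocated. */
static double time_cold(void (*kernel)(void))
{
    unsigned char *buf = malloc(EVICT_BYTES);
    volatile unsigned long sink;
    clock_t t0, t1;

    if (buf == NULL)
        return -1.0;
    sink = evict_cache(buf, EVICT_BYTES);
    (void)sink;

    t0 = clock();
    kernel();
    t1 = clock();

    free(buf);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling time_cold(sample_kernel) several times and comparing the results gives a rough indication of how reproducible the timings are. This approach displaces data rather than invalidating it, so it is an approximation of a true flush.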
One important difference between the Paragon and the T3E is the difference in data sizes. By default, floating-point values have 64 bits of precision on the T3E instead of the 32 bits on the Paragon. The T3E does not have double precision and only allows access to 32-bit variables through the compiler's -s default32 option. Also, integer values are 64 bits on the T3E, while they are only 32 bits on the Paragon (see the -i 32 option for help with this). The T3E does not provide 16-bit integers.
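For C codes, assumptions about data sizes can be made explicit rather than left to each platform's defaults. The sketch below is illustrative only and assumes a present-day C99 compiler with <stdint.h>, which is not something the compilers discussed in this paper are known to provide; the names BITS_OF, fits_in_int32, and report_sizes are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Number of bits in a type on the platform where the code is compiled. */
#define BITS_OF(type) (sizeof(type) * CHAR_BIT)

/* Returns 1 if a 64-bit value can be stored in a 32-bit integer without
   loss, 0 otherwise. Useful when data moves between 64-bit and 32-bit
   representations. */
static int fits_in_int32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* Print the widths of the default types so that differences between
   platforms show up immediately after a port. */
static void report_sizes(void)
{
    printf("short : %zu bits\n", BITS_OF(short));
    printf("int   : %zu bits\n", BITS_OF(int));
    printf("long  : %zu bits\n", BITS_OF(long));
    printf("float : %zu bits\n", BITS_OF(float));
    printf("double: %zu bits\n", BITS_OF(double));
}
```

Running report_sizes() on both the old and the new system is a quick way to see which default widths differ before looking for code that depends on them.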
Porting a code from a 32-bit system such as the Paragon to a 64-bit system like the T3E can be easy, or it can lead to many difficulties. For example, in some C programs pointer values are stored in integer variables. This causes a serious problem on the T3E since integers are only half the precision of a pointer, and the result is a segmentation violation.
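The pointer-in-integer problem can be avoided by storing the pointer in an integer type that is wide enough to hold it. The following sketch assumes a present-day C99 compiler that provides uintptr_t in <stdint.h>; the function names are hypothetical and are not taken from any user code ported at SDSC.

```c
#include <assert.h>
#include <stdint.h>

/* Hazardous pattern (shown only as a comment): if int is narrower than a
   pointer, the upper bits are lost and the recovered pointer is invalid.
       int handle = (int)p;
       char *q = (char *)handle;
*/

/* Safer pattern: uintptr_t is an unsigned integer type able to hold a
   pointer to void and convert back to an equal pointer. */
static uintptr_t ptr_to_handle(void *p)
{
    return (uintptr_t)p;
}

static void *handle_to_ptr(uintptr_t h)
{
    return (void *)h;
}

/* Returns 1 if plain int is at least as wide as a pointer on this
   platform, 0 otherwise. A result of 0 means the hazardous pattern
   above can truncate pointers. */
static int int_can_hold_pointer(void)
{
    return sizeof(int) >= sizeof(void *);
}

/* Returns 1 if a pointer survives a round trip through a handle. */
static int roundtrip_ok(void)
{
    int x = 42;
    int *q = handle_to_ptr(ptr_to_handle(&x));
    return q == &x && *q == 42;
}
```

A check such as int_can_hold_pointer() can be run once on the target system to decide whether existing code that stores pointers in int variables needs to be changed.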
This section is far from complete, but it describes a few common problems that may arise in this part of the conversion. On both earlier systems, the Paragon and the T3D, the compiler was a Fortran 77 compiler with some extensions on each platform. The only compiler available on the T3E is Fortran 90-compliant. Some extensions to Fortran 77 had to be removed from users' codes or converted to standard Fortran 90 syntax/usage. For example, functions defined in external files must now be declared EXTERNAL. Another difference is that the Paragon bitwise operations had to be converted to the Fortran 90-standard calls (ibits replaces iibits, for example). Also, some parameters of the open statement used on the Paragon had to be changed in user codes to conform to Fortran 90 usage.
As expected, users reported significant speedups after porting their codes to the CRAY T3E. We have received only preliminary numbers so far, but a sample of these speedups is listed below for codes ported from the T3D to the T3E:
These numbers show some dispersion but are not far from the average speedup factor of 4 that we expected. We received very little Paragon/T3E performance data for comparison, but one user reported an improvement of a factor of 10 in porting GAMESS from the Paragon to the T3E, which was also expected. (We converted Paragon service units to T3E service units at a 10:1 ratio and T3D service units to T3E service units at a 4:1 ratio; these subsequent reports from users verified the validity of those ratios.)
Sources of performance improvement include the increase in clock speed from 150 to 300 MHz and the addition of a 96 kbyte on-chip level 2 cache. This cache is three-way set associative and feeds data into the 8 kbyte data and instruction caches that may be familiar from the CPU used in the T3D. The new CPU also has a Miss Address File, which allows the CPU to avoid some stalls for data loads where the data shares a cache line with another outstanding load. The six stream buffers, which help manage the latency of off-page memory references, would provide significant memory speedup if we were able to use them at SDSC.
Overall, we were generally pleased with the ease of migrating users from the Intel Paragon and CRAY T3D to the CRAY T3E. The CRAY T3E proved to be relatively stable and offered a fairly complete set of tools to enable users to port their codes. Our users reported only the problems mentioned above, with many users experiencing no problems at all. Users observed significant performance increases immediately, even before optimizing their codes for the different architecture of the T3E processors.