UPC: From Beowulfs to the X1

Phil Merkey
Michigan Tech
Houghton, MI 49931, 906-487-2220 merk@mtu.edu
Dave Strenski
Cray Inc
7077 Fieldcrest Road, Suite 202, Brighton, Michigan 48116, Phone (313) 317-4438, stren@cray.com
ABSTRACT:
UPC was developed for the T3E. It is now available on the full spectrum of high performance machines from Beowulf clusters to the X1. This talk will address the issues associated with developing a programming model that runs efficiently on platforms with widely varying characteristics. It will also address some of the payoffs, including the ability to use UPC in the classroom and provide students with hands-on experience in High Performance Computing.

KEYWORDS:
UPC, Cray X1, Beowulf Clusters

UPC History and Overview

UPC is an extension to C that provides a shared memory programming model. The first compiler for UPC was developed for the T3E by Bill Carlson and Jesse Draper at IDA/CCS. The first version of the language specification was the collaborative effort of Carlson, Draper, Culler, Yelick, Brooks, and Warren [1]. Since the first UPC workshop in May 2000, the language has evolved through the collaborative efforts of researchers from academia, HPC vendors, and government labs.

All major HPC vendors currently have some UPC development project, and there are two university-based open source development efforts. UPC and Co-Array Fortran are available on the X1 as compiler options; in the case of UPC, the option is -h upc to the cc compiler. No other vendor has direct compiler support for GAS (Global Address Space) programming models. The effort at Michigan Tech has been to provide a reference implementation of the current specification that is as platform independent as possible. It is based on an EDG source-to-source frontend and a runtime system, currently built on MPI, that is being developed at Michigan Tech. Berkeley's publicly available implementation of UPC is part of a larger effort that includes GASNet, which provides support for partitioned shared memory on various architectures including clusters, and Titanium, which provides partitioned shared memory on top of Java.

UPC provides the user with "partitioned shared memory" and a generic SPMD programming model. The "single program" is referred to as a "thread". In some implementations a thread is a full UNIX process. The current specification does not allow a thread to spawn new threads. The number of threads in an execution is given by the reserved word THREADS. This number can be set at runtime by the program that launches the parallel run, for example aprun, or it can be fixed at compile time. MYTHREAD is the reserved word for a thread's identity; one can treat MYTHREAD as an int in the range 0, 1, ..., THREADS-1. There is currently no notion of subsets of threads like communicators in MPI.
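As a minimal sketch of this SPMD model (the program below is ours, not from the original text), the following complete UPC program prints one line per thread using THREADS and MYTHREAD:

  #include <upc.h>      /* UPC library declarations; THREADS and MYTHREAD are language keywords */
  #include <stdio.h>

  int main(void)
  {
      /* Every thread executes main(); MYTHREAD identifies this thread. */
      printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
      upc_barrier;      /* wait for all threads before exiting */
      return 0;
  }

On the X1 this would be compiled with cc -h upc, and the number of threads is then fixed either by the launcher (for example aprun) or at compile time, as described above.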

One of the design philosophies is that UPC should be a minimal extension to C. The developers once jokingly put the "UPC documentation kit" on a business card. A more realistic indication of its complexity is that the UPC section in the C and C++ Reference Manual for the X1 is only 19 pages long [2]. In addition, this minimal philosophy implies that individual threads behave like C programs. The extensions to handle shared memory and to control the interaction among threads shouldn't interfere with the way one thinks about an individual thread.

Before we discuss the UPC constructs, there are two issues involving shared memory that cause some difficulty.

The idea of partitioned shared memory is not new, but we have adopted the term in an attempt to distinguish it from the many other types of shared memory systems built from physically distributed memory. Simply put, if one declares a variable as shared then all the threads have access to it. There is no restriction on the mechanism that performs the access. By default, variables are local; there is no private keyword in UPC. All shared variables must either be declared shared with global scope or be reached through a pointer to shared space that is allocated at runtime. In addition, partitioned shared memory introduces the notion of the affinity of a shared variable. Affinity is simply a map from the shared space onto the threads. In a distributed memory system the affinity of a variable to a thread naturally corresponds to the node structure of the machine.

Affinity is one of the key ideas in UPC. It is one of the things that distinguishes it from other shared memory programming facilities, for example OpenMP. From an application research point of view, the notion of affinity gives us a way to experiment with two layers of a (shared) memory hierarchy. (This assumes that the multiple layers from registers to cache to local memory are better managed by the compiler and the ILP mechanisms.) It also allows a programmer to abstractly control locality; by partitioning a program's memory use, the programmer can expose parallelism in a natural way. Affinity usually corresponds to references that are "local" to the node on which the thread is running, and there is usually a corresponding penalty in latency and bandwidth for off-affinity references. This penalty is much more significant on a Beowulf cluster than on an X1. Our ability to handle affinity will determine whether or not we can efficiently run applications on multiple architectures.

To declare a variable as shared one uses the keyword shared as a type qualifier. For example, given the declaration

  shared float x;
any thread can read or write the value of x with statements like y=x; or x=5;. As an example of how affinity is controlled, consider the following declarations:
  shared float X[N];
  shared [1] float Y[N];
  shared [BLOCK] float Z[N];
The data distributions for X and Y are the same; a declaration with no layout qualifier defaults to a blocking factor of one. The affinity of X[i] is i % THREADS. This is also referred to as the round robin distribution. The Z array is blocked with blocking factor BLOCK: Z[0], Z[1], ..., Z[BLOCK-1] have affinity to thread 0, Z[BLOCK], Z[BLOCK+1], ..., Z[2*BLOCK-1] have affinity to thread 1, and so on. For matrices, the declarations:
  shared [1] float A[THREADS][THREADS];
  shared [THREADS] float B[THREADS][THREADS];
give A a column striped distribution and B a row striped distribution. It is clear that if THREADS divides the order of the matrix, then the right choice of blocking factor will produce any row or column, blocked or striped distribution.
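As a brief illustration (the declarations and names below are ours, assuming an N by N array, THREADS fixed at compile time, and N a multiple of THREADS), the blocking factor alone selects several common layouts:

  /* Illustrative declarations; assumes N % THREADS == 0. */
  shared [1]           float Ccyc[N][N]; /* column striped: column j has affinity j % THREADS  */
  shared [N]           float Rcyc[N][N]; /* row striped: row i has affinity i % THREADS        */
  shared [N*N/THREADS] float Rblk[N][N]; /* row blocked: N/THREADS consecutive rows per thread */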

Finally, note that the blocking factor on arrays is the only control over data distribution that UPC provides. There is some debate on this point. More flexibility in data (affinity) distributions might help application developers; it might also hurt performance. Currently, if you want a more complicated data layout, you have to do the addressing arithmetic yourself, as sketched below. Some programmers dislike this; others like the fact that it is explicit and hence gives them complete control over the distribution.
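For example, a 2D block-cyclic layout over a grid of threads could be handled with explicit arithmetic of the following kind. The helper below is only a sketch; the names BR, BC, PR, and PC are ours, and we assume THREADS == PR*PC with the thread grid numbered row-major:

  /* Illustrative only: the thread that owns global element (i, j) under a
     2D block-cyclic decomposition with BR x BC blocks on a PR x PC thread grid. */
  int owner(int i, int j, int BR, int BC, int PR, int PC)
  {
      int pr = (i / BR) % PR;   /* thread-grid row of the block containing row i       */
      int pc = (j / BC) % PC;   /* thread-grid column of the block containing column j */
      return pr * PC + pc;      /* row-major numbering of the PR x PC thread grid      */
  }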

The other subtle point in UPC is the shared memory consistency model. UPC has two modes, strict and relaxed. The UPC specification defines strict and relaxed as follows:

  1. References to shared objects shall be either strict or relaxed. For relaxed references there is no change to the C Standard execution model as applied to an individual thread. This implies that translators are free to reorder and/or ignore operations (including shared operations) as long as the restrictions found in [ISO/IEC00: Sec. 5.1.2.3] are observed.
  2. A further restriction applies to strict references. For each strict reference, the restrictions found in [ISO/IEC00: Sec. 5.1.2.3] must be observed with respect to all threads if that reference is eliminated (or reordered with respect to all other shared references in its thread).
It has been shown that strict is not exactly sequential consistency and that relaxed is not exactly weak consistency [3]. Obviously, the relaxed references were defined in order to enable code optimization, whereas strict (or sequential) consistency disallows most of the usual C code optimizations, even on memory to which the thread has affinity.

It is important that these models are well understood, because we want UPC programs to behave the same across multiple platforms. The relaxed model allows out of order execution, caching, and prefetching of non-affinity references. We hope this will allow us to compensate for a low bandwidth, high latency network and give good performance on some applications across multiple platforms. We also need a precise understanding of the memory model as we introduce additional features like collective operations and parallel I/O.
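As an illustration of how the two kinds of references interact (this sketch is ours, not from the original text, and assumes at least two threads), a producer/consumer handshake can use a strict flag to order otherwise relaxed accesses:

  #include <upc_relaxed.h>         /* unqualified shared references default to relaxed */

  shared int data;
  strict shared int flag;          /* strict: acts as a fence for nearby relaxed accesses */

  void producer_consumer(void)
  {
      if (MYTHREAD == 0)
          flag = 0;
      upc_barrier;                 /* every thread sees flag == 0 before the handshake */

      if (MYTHREAD == 0) {
          data = 42;               /* relaxed write                                      */
          flag = 1;                /* strict write: data must be visible before the flag */
      } else if (MYTHREAD == 1) {
          while (flag == 0)        /* strict reads are not hoisted out of the loop       */
              ;
          /* thread 1 may now read data and will see 42 */
      }
  }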

On the other hand, any UPC program can be executed with only strict references. There may be a loss in performance potential, but it will execute correctly. For this reason, memory models are one of the last things to cover when introducing new students to parallel computing with UPC. Most introductory work can be done using the relaxed memory model, with upc_barrier (which is actually a split phase barrier) to synchronize the threads and upc_lock/upc_unlock pairs to protect critical sections. The difficulty with the memory model is seldom the programmer's intuition; rather, it is the legal constraints it places on implementors and developers.
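A minimal sketch of that introductory style (our example, with illustrative names) uses one lock around a shared accumulation and barriers before the result is read:

  #include <upc.h>
  #include <upc_relaxed.h>
  #include <stdio.h>

  shared double total;                          /* single shared scalar, affinity to thread 0 */

  int main(void)
  {
      upc_lock_t *lock = upc_all_lock_alloc();  /* collective: every thread gets the same lock */
      double my_part = 1.0 + MYTHREAD;          /* some per-thread contribution                */

      if (MYTHREAD == 0)
          total = 0.0;
      upc_barrier;                              /* total is initialized before anyone adds to it */

      upc_lock(lock);                           /* critical section around the shared update */
      total = total + my_part;
      upc_unlock(lock);

      upc_barrier;                              /* all contributions are in before the result is read */
      if (MYTHREAD == 0) {
          printf("total = %f\n", total);
          upc_lock_free(lock);                  /* free the lock exactly once */
      }
      return 0;
  }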

UPC in the Classroom

We have been using UPC to teach parallel programming in a number of different ways over the past two years. At least six masters students have completed their degrees on projects that were either directly involved in the development of our open source implementation or related to using UPC. We have had several individual projects, including a math professor and his students who used UPC to write their first parallel program. Finally, we used it alongside MPI in the parallel algorithms course this spring semester. This gives us several different views on the use of UPC in teaching.

A colleague is writing a UPC program to find maximal cliques in a random graph. This is a backtracking algorithm that he converted from the serial C program he runs on a Linux workstation. The work is being done on a T3E running Carlson's original UPC implementation. The speedup he is able to obtain is still minimal because of load balancing issues. The algorithm uses a lexicographic trick to dramatically prune the backtrack tree. Unfortunately, the resulting tree is dramatically skewed, which leads to the load balancing issues. Current research is focused on better ways to prune the tree and better work queues to handle the load imbalance.

In this case, simply the fact that he tried to write a parallel program is a major accomplishment. We credit this to the fact that he had access to a T3E, and hence significant potential speedup, and to a language that allowed him to incrementally convert his serial program into a parallel one.

In our parallel algorithms course this spring we used UPC and MPI on a Beowulf cluster as the example languages for shared memory and message passing programming, respectively. We used the cluster because: 1) the other machines were tied up with research projects, 2) we wanted the experience of working in an environment that is readily available in the university community, and 3) performance didn't matter. The algorithms we covered in class were first presented abstractly, from a mathematical or application point of view. Then the serial algorithm was usually presented in C pseudo code. Finally the shared memory and message passing codes were discussed.

Consider an easy grid based algorithm like solving the heat equation on a thin plate. The serial version looks like:

  while( notdone ){
   for( i=1; i < N-1; i++ ){
      for( j=1; j < N-1; j++ ){
        new_u[i][j] = 0.25*(u[i-1][j]+u[i+1][j]+u[i][j-1]+u[i][j+1]);
      }
   }
   < u[i][j] = new_u[i][j] for all i,j >
   < check if done >
  }

Now if we use a blocked row distribution by declaring u and new_u shared as follows:

  shared [N/THREADS*N] float u[N][N], new_u[N][N];
Then the main loop looks like:

  while( notdone ){
   for( i=1; i < N-1; i++ ){
      if( upc_threadof( &u[i][0] ) == MYTHREAD ){
        for( j=1; j < N-1; j++ ){
          new_u[i][j] = 0.25*(u[i-1][j]+u[i+1][j]+u[i][j-1]+u[i][j+1]);
        }
      }
   }
   < u[i][j] = new_u[i][j] for all i,j in my block >
   < check if done: compute l2 norm and allreduce >
  }
Domain decomposition problems like this are easy because in most cases we simply do the same thing, except only on the cells for which we have affinity. A few collective operations are required, like the allreduce needed to compute the norm, but most of the code is the same. Many students questioned these assignments because they thought they were missing something; the assignments were too easy. It is also easy because we ignore the performance hit of the off-affinity references along the boundaries between threads. To address the performance issue, one could turn this into the UPC version of a message passing program by moving the arrays into local memory, setting up shared arrays for the ghost cells, and using upc_memcpy-style bulk copies to move the ghost cell arrays; a sketch of this appears below. This supports the argument that one can incrementally write a UPC program, as opposed to an MPI program, where one has to do all the memory management and all the message passing before anything will run. It also exposes some of the performance differences between UPC on a cluster and UPC on a real shared memory machine like the X1.
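A minimal sketch of that ghost cell approach follows. It is our illustration: the array names and sizes are ours, we use upc_memput and upc_memget (the bulk copy routines for moving data between private and shared memory), and we assume THREADS is fixed at compile time and divides N. Each thread keeps its block of rows in private memory and publishes only its boundary rows through a small shared staging array:

  #include <upc.h>
  #include <upc_relaxed.h>

  #define ROWS (N/THREADS)              /* rows per thread; N as above, THREADS fixed at compile time */

  /* Shared staging area: rows t and t+THREADS both have affinity to thread t. */
  shared [N] float halo[2*THREADS][N];

  /* This thread's block of u, kept private, with two extra ghost rows (0 and ROWS+1). */
  float u_local[ROWS+2][N];

  void exchange_ghost_rows(void)
  {
      /* Publish my top and bottom interior rows into my own slots (local bulk writes). */
      upc_memput(&halo[MYTHREAD][0],         &u_local[1][0],    N*sizeof(float));
      upc_memput(&halo[MYTHREAD+THREADS][0], &u_local[ROWS][0], N*sizeof(float));

      upc_barrier;   /* all boundary rows are published before anyone reads them */

      /* Pull the neighbors' boundary rows into my ghost rows (bulk remote reads). */
      if (MYTHREAD > 0)
          upc_memget(&u_local[0][0],      &halo[MYTHREAD-1+THREADS][0], N*sizeof(float));
      if (MYTHREAD < THREADS-1)
          upc_memget(&u_local[ROWS+1][0], &halo[MYTHREAD+1][0],         N*sizeof(float));

      upc_barrier;   /* don't overwrite halo until every thread has copied from it */
  }

The interior update then runs entirely on the private array, and only these bulk transfers touch the network.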

UPC on Different Architectures

These observations are based on our experiences at MTU, discussions at the UPC workshops, and published benchmarking papers. Comparing UPC to MPI, it is clear that we should be able to convert any MPI program to a UPC program written in the message passing style [4]. The only counter argument is that most MPI implementations come with tuned implementations of the collectives. The UPC community has recently agreed on a UPC collectives specification [5] and there is a reference implementation, but it will be some time before its performance matches the collectives in MPI. The fact that we can at best match the performance of MPI means that it is unlikely that researchers with working MPI programs will ever convert them to UPC. UPC will be used to write new programs or programs that don't perform well in MPI.

The memory reference patterns in UPC programs tend to translate into fine grain communication. This makes it difficult to get good performance on cluster based systems, where one tries to have messages at least 1KB long. There is a lot of ongoing research on exploiting the memory model to improve this situation. A project at MTU is looking at strategies for software caching of remote references to improve performance without forcing the programmer to change programming style. Researchers at GWU have hand-coded optimization strategies into their benchmarks, including caching, prefetching, localization, and consolidation of strided memory accesses. We expect that future compilers will be able to do much of this automatically. The Cray X1 is being evaluated by several groups in the UPC community; the Berkeley group will report their findings in June at ICS04.

Summary

There are open source implementations of UPC. These make viable teaching platforms and we anticipate that they will soon match the performance of MPI on clusters for some applications.

Every major HPC vendor either has or is working on support for UPC.

UPC will work better on better hardware. In this case, better means hardware that has high bandwidth, low latency memory systems and mechanisms for fine-grain synchronization. Since many MPI programs don't require these capabilities, there is often a cost/performance penalty for running MPI programs on such machines. This has been a problem in high performance computing for the last several years.

UPC programs can be developed incrementally. This lowers the barrier to convincing people to try parallel programming. In addition, the fact that one can also incrementally tune the performance of a UPC program presents a path from algorithm development and research on high-end systems to more cost effective production code on cluster based systems. The Beowulf community often forgets that the algorithms that run well in the low bandwidth, high latency cluster environment were first developed on machines that were much more forgiving from a latency, bandwidth, and locality perspective. "One first finds the algorithm and then one finds a way to make it latency tolerant." Having UPC on the full spectrum of machines from the X1 to Beowulf clusters facilitates this migration of algorithms from high-end systems to less expensive systems. This can increase the productivity of the entire organization.

References

[1] William W. Carlson, Jesse M. Draper, David E. Culler, Kathy Yelick, Eugene Brooks, and Karen Warren, "Introduction to UPC and Language Specification", May 13, 1999.
http://www.gwu.edu/~upc/pubs.html

[2] Cray C and C++ Reference Manual (S-2179-52), April 2004.
http://cray.com

[3] W. Kuchera and C. Wallace, "The UPC Memory Model: Problems and Prospects", IPDPS, 2004.

[4] T. El-Ghazawi and F. Cantonnet, "UPC Performance and Potential: A NPB Experimental Study", SC2002.

[5] "UPC Collective Operations Specifications", UPC Consortium, http://www.gwu.edu/~upc/pubs.html