Abstracts of Papers

Lucian Anton, HPCX Consortium (HPCX), Dario Alfe, University College London, Randolph Hood, Lawrence Livermore National Laboratory, and David Tanqueray, Cray UK Limited, Improving CASINO Performance for Models with Large Number of Electrons CASINO is used for quantum Monte Carlo calculations whose core algorithms use sets of independent multidimensional random walkers and are therefore straightforward to run on parallel computers. However, some computations have reached the limit of the memory resources for models describing more than 1000 electrons because of the large amount of electronic-orbital data that is needed on each core. Moreover, for models with a large number of electrons it is interesting to study whether the swap over one configuration can be done in parallel in order to improve computation speed. We present a comparative study of several ways to solve these problems: i) a second level of parallelism for the configuration computation, ii) orbital data distributed with MPI or Unix System V shared memory, and iii) mixed-mode programming (OpenMP and MPI).
Katie Antypas, John Shalf, Shane Cannon, and Andrew Uselton, National Energy Research Scientific Computing Center (NERSC), MPI-I/O on Franklin XT4 System at NERSC As we enter the petascale computing era, the need for high-performing shared-file I/O libraries becomes more urgent. From simpler file management and data analysis to the portability and longevity of data, parallel shared-file I/O libraries increase scientific productivity and the potential for sharing simulation data. However, the low performance of MPI-IO on the Lustre file system makes it difficult for scientists to justify switching from a one-file-per-processor I/O model, despite all the benefits of shared files for post-processing, data sharing and portability. This paper will discuss some of the reasons for and possible solutions to low MPI-IO performance on the Lustre file system, and the implications for higher-level self-describing parallel I/O libraries such as HDF5 and Parallel-NetCDF.
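For readers unfamiliar with the shared-file model under discussion, a minimal collective MPI-IO write looks like the sketch below. It is illustrative only: the file name, element count, and layout are assumptions, not the benchmarks used in the paper.

    /* Minimal sketch of a shared-file collective write with MPI-IO.
     * File name, element count, and datatype are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int nlocal = 1 << 20;          /* elements written per process */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *buf = malloc(nlocal * sizeof(double));
        for (int i = 0; i < nlocal; i++) buf[i] = rank + 0.001 * i;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes a contiguous slab at its own offset; the collective
         * call lets the MPI-IO layer aggregate and reorder the requests. */
        MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, nlocal, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

Whether such a collective write approaches the bandwidth of one-file-per-processor output on Lustre is exactly the question the paper examines.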
Mike Ashworth and David Emerson, HPCX Consortium (HPCX), and Mario Chavez and Eduardo Cabrera, UNAM, Mexico, Exploiting Extreme Processor Counts on Cray XT Systems with High-Resolution Seismic Wave Propagation Experiments We present simulation results from a parallel 3D seismic wave propagation code that uses finite differences on a staggered grid with 2nd-order operators in time and 4th-order in space. We describe optimizations and developments to the code for the exploitation of extreme processor counts. The ultra-high resolution that we are able to achieve enables simulations with unprecedented accuracy, as demonstrated by comparisons with seismographic observations from the Sichuan earthquake in May 2008.
Mike Ashworth, Phil Hasnip, Keith Refson, and Martin Plummer, HPCX Consortium (HPCX), Band Parallelism in CASTEP: Scaling to More Than 1000 Cores CASTEP is the UK's premier quantum mechanical materials modeling code. We describe how the parallelism is implemented using a 3-layer hierarchical scheme to handle the heterogeneously structured dataset used to describe wave functions. An additional layer of data distribution over quantum states (bands) has enhanced the scaling by a factor of 8, allowing many important scientific calculations to efficiently use thousands of cores on the HECToR XT4 service.
Mike Ashworth, Andrew Sunderland, Cliff Noble, and Martin Plummer, HPCX Consortium (HPCX), Future Proof Parallelism for Electron-Atom Scattering Codes on the XT4 Electron-atom and electron-ion scattering data are essential in the analysis of important physical phenomena in many scientific and technological areas. A suite of parallel programs based on the R-matrix ab initio approach to variational solution of the many-electron Schrödinger equation has been developed and has enabled a large amount of accurate scattering data to be produced. However, future calculations will require substantial increases in both the numbers of channels and scattering energies involved in the R-matrix propagations. This paper describes how these huge computational challenges are being addressed by improving the parallel performance of the PRMAT code towards the petascale on the Cray XT4.
Troy Baer, National Institute for Computational Sciences (NICS) and Don Maxwell, Oak Ridge National Laboratory (ORNL), Comparison of Scheduling Policies and Workloads on the NCCS and NICS XT4 Systems at Oak Ridge National Laboratory Oak Ridge National Laboratory (ORNL) is home to two of the largest Cray XT systems in the world: Jaguar, operated by ORNL's National Center for Computational Sciences (NCCS) for the U.S. Department of Energy; and Kraken, operated by the University of Tennessee's National Institute for Computational Sciences (NICS) for the National Science Foundation. These two systems are administered in much the same way, and use the same TORQUE and Moab batch environment software; however, the scheduling policies and workloads on these systems are significantly different due to differences in allocation processes and the resultant user communities. This paper will compare and contrast the scheduling policies and workloads on these two systems.
Ann Baker, Oak Ridge National Laboratory (ORNL), Chair, XTreme This group works very closely with Cray, under Non-Disclosure Agreements, to provide valuable input into the Cray XT system development cycle. For this reason, these are "closed door" sessions.
Ben Bales and Richard Barrett, Oak Ridge National Laboratory (ORNL), Interesting Characteristics of Barcelona Floating Point Execution In almost all modern scientific applications, developers achieve the greatest performance gains by tuning algorithms, communication topologies, and memory access patterns. For the most part, instruction-level optimization is left to compilers. With increasingly varied and complicated architectures, it has become extraordinarily unclear what benefits these low-level code changes can even bring, and, due to time and complexity constraints, many projects cannot find out. In this paper we explore the gains of this last-mile effort for code executing on an AMD Barcelona processor, leaving readers able to decide whether investment in advanced optimization techniques makes sense for their codes.
Robert Ballance, Sandia National Laboratories (SNLA), Chair, Applications and Programming Environments SIG The Applications and Programming Environment SIG welcomes attendees with a focus on compilers and programming environments. Topics include compilers, scientific libraries, programming environments and the Message Passing Toolkit. SIG business will be conducted followed by open discussions with other attendees as well as representatives from Cray. All attendees are welcome to participate in this meeting.
Stephen Bique and Robert Rosenberg, Naval Research Laboratory (NRL), Fast Generation of High-Quality Pseudorandom Numbers and Permutations Using MPI and OpenMP on the Cray XD1 Random number generators are needed for many HPC applications such as real-time simulations. Users often prefer to write their own pseudorandom number generators. We demonstrate simple techniques to find and implement fast, high-quality customized parallel pseudorandom number generators and permutations. We present results of runs on the Cray XD1.
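A minimal sketch of the general approach, in which every (MPI rank, OpenMP thread) pair gets its own stream by seeding a simple generator differently; the constants and seeding scheme below are illustrative, not the customized generators of the paper.

    /* Per-thread linear congruential generator with distinct streams per
     * (MPI rank, OpenMP thread) pair.  Constants are Knuth's MMIX LCG;
     * the seeding scheme is illustrative only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdint.h>

    static inline double lcg_next(uint64_t *state)
    {
        *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
        return (*state >> 11) * (1.0 / 9007199254740992.0);  /* uniform in [0,1) */
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int nthreads = omp_get_max_threads();
        double sum = 0.0;
        #pragma omp parallel reduction(+:sum)
        {
            /* Mix rank and thread id into the seed so streams do not collide. */
            uint64_t state = 0x9E3779B97F4A7C15ULL
                             ^ ((uint64_t)rank << 32)
                             ^ (uint64_t)omp_get_thread_num();
            for (int i = 0; i < 1000000; i++)
                sum += lcg_next(&state);
        }
        printf("rank %d: sample mean = %f\n", rank, sum / (1000000.0 * nthreads));
        MPI_Finalize();
        return 0;
    }

A production generator would of course be chosen for statistical quality as well as speed, which is the trade-off the paper explores.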
Arthur Bland, Douglas Kothe, Galen Shipman, Ricky Kendall, and James Rogers, Oak Ridge National Laboratory (ORNL), Jaguar: The World's Most Powerful Computer The Cray XT system at ORNL is the world's most powerful computer, with several applications exceeding one petaflop of performance. This talk will describe the architecture of Jaguar with combined XT4 and XT5 nodes along with an external Lustre file system and login nodes. The talk will also present early results from Jaguar.
Dan Bonachea and Paul Hargrove, Lawrence Berkeley National Lab and Michael Welcome and Katherine Yelick, National Energy Research Scientific Computing Center (NERSC), Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT Partitioned Global Address Space (PGAS) Languages are an emerging alternative to MPI for HPC applications development. The GASNet library from Lawrence Berkeley National Lab and the University of California Berkeley provides the network runtime for multiple implementations of four PGAS Languages. This paper describes our experiences porting GASNet to the Portals network API on the Cray XT series.
Ron Brightwell, Sue Kelly, and Jeff Crow, Sandia National Laboratories (SNLA), Catamount N-Way Performance on XT5 This paper provides a performance evaluation of the Catamount N-Way (CNW) operating system on a dual-socket quad-core XT5 platform. CNW provides several operating system-level enhancements for multi-core processors, including the SMARTMAP technology for single-copy MPI messages and the ability to easily choose between small and large memory pages. Our evaluation will include an analysis of the performance of important micro-benchmarks and applications.
Ron Brightwell and Mike Heroux, Sandia National Laboratories (SNLA) and Al Geist and George Fann, Oak Ridge National Laboratory (ORNL), DOE IAA: Scalable Algorithms for Petascale Systems with Multicore Architectures The DOE Institute for Advanced Architecture and Algorithms (IAA) was established in 2008 to facilitate the co-design of architectures and applications in order to create synergy in their respective evolutions. Today's largest systems already have a serious gap between the peak capabilities of the hardware and the performance realized by high performance computing applications. The DOE petascale systems that will be in use for the next 3-5 years have already been designed so there is little chance to influence their hardware evolution in the near term. However, there is considerable potential to affect software design in order to better exploit the hardware performance features already present in these systems. In this paper, we describe the initial IAA project, which focuses on closing the "application-architecture performance gap" by developing architecture-aware algorithms and the supporting runtime features needed by these algorithms to solve general sparse linear systems common in many key DOE applications.
Shane Canon, Matt Andrews, William Baird, Greg Butler, Nicholas P. Cardo, and Rei Lee, National Energy Research Scientific Computing Center (NERSC), GPFS on a Cray XT The NERSC Global File System (NGF) is a center-wide production file system at NERSC based on IBM's GPFS. In this paper we will give an overview of GPFS and the NGF architecture. This will include a comparison of features and capabilities between GPFS and Lustre. We will discuss integrating GPFS with a Cray XT system. This configuration relies heavily on Cray DVS. We will describe DVS and discuss NERSC’s experience with DVS and the testing process. Performance and scaling for the configuration will be presented. We will conclude with a discussion of future plans for NGF and data initiatives at NERSC.
Nicholas P. Cardo, CUG Vice-President, National Energy Research Scientific Computing Center (NERSC), Chair, Open Discussion with CUG Board The CUG Board members are making themselves available for a period of open discussions. Site representatives are encouraged to attend to engage in discussions with the board members as well as other site representatives.
Nicholas P. Cardo, National Energy Research Scientific Computing Center (NERSC), Wrapping/Shepherding APRUN The APRUN command is used to launch applications onto compute nodes. By adding a wrapper around APRUN, or a shepherd process that runs in parallel with APRUN, it is possible to add new functionality for site use. Such enhancements could include scanning for dangerous applications or the enforcement of site-specific limits. Another benefit is the ability to monitor messages to stderr in order to trap specific error conditions and identify application success rates. Details of the APRUN wrapper/shepherd running at the National Energy Research Scientific Computing Center will be discussed in this paper.
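A wrapper of the kind described might be structured as in the sketch below: fork, route the child's stderr through a pipe so it can be scanned, then exec the real aprun. The error pattern and policy hooks are placeholders, not NERSC's implementation.

    /* Sketch of a shepherd-style wrapper around aprun.  The child's stderr is
     * routed through a pipe so the wrapper can scan it before passing it on. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(int argc, char **argv)
    {
        int pipefd[2];
        if (pipe(pipefd) != 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                        /* child: becomes the real aprun */
            dup2(pipefd[1], STDERR_FILENO);    /* send stderr to the wrapper    */
            close(pipefd[0]); close(pipefd[1]);
            argv[0] = "aprun";                 /* pass original arguments through */
            execvp("aprun", argv);
            perror("execvp aprun");
            _exit(127);
        }

        close(pipefd[1]);
        FILE *err = fdopen(pipefd[0], "r");
        char line[4096];
        while (fgets(line, sizeof line, err)) {
            fputs(line, stderr);               /* preserve the user's view */
            if (strstr(line, "OOM killer"))    /* placeholder error pattern */
                fprintf(stderr, "[wrapper] possible out-of-memory failure detected\n");
        }
        fclose(err);

        int status = 0;
        waitpid(pid, &status, 0);
        return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
    }

Policy checks (for example, rejecting a disallowed executable before the fork) would slot in ahead of the exec.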
Charlie Carroll, Cray Inc., Cray Operating System Road Map Cray continues to improve and advance its system software. This paper and presentation will review progress over the past year and discuss new and imminent features with an emphasis on increased stability, robustness and performance.
Yousu Chen, Shuangshuang Jin, Daniel Chavarria, and Zhenyu Huang, Pacific Northwest National Laboratory (PNNL), Application of Cray XMT for Power Grid Contingency Selection Contingency analysis is a key function for assessing the impact of various combinations of power system component failures. It involves a combinatorial number of contingencies that exceeds available computing power, so it is critical to select contingency cases within the constraints of that computing power. This paper presents a contingency selection method that applies graph theory (betweenness centrality) to the power grid graph to remove low-impact components using the Cray XMT. The implementation takes advantage of the graph processing capability of the Cray XMT and its programming features. Power grid sparsity is exploited to minimize memory requirements. The paper presents the performance scalability of the Cray XMT and a comparison with other multi-threaded machines.
Shreyas Cholia and Hwa-Chun Wendy Lin, National Energy Research Scientific Computing Center (NERSC), Integrating Grid Services into the Cray XT4 Environment The 38,640-core Cray XT4 "Franklin" system at NERSC is a massively parallel resource available to Department of Energy researchers that also provides on-demand grid computing to the Open Science Grid. The integration of grid services on Franklin presented various challenges, including fundamental differences between the interactive and compute nodes, a stripped-down compute-node operating system without dynamic library support, a shared-root environment and idiosyncratic application launching. In our work, we describe how we resolved these challenges on a running, general-purpose production system to provide on-demand compute, storage, accounting and monitoring services through generic grid interfaces that mask the underlying system-specific details for the end user.
James Craw, Nicholas P. Cardo, and Helen He, National Energy Research Scientific Computing Center (NERSC) and Janet Lebens, Cray Inc., Post-Mortem of the NERSC Franklin XT Upgrade to CLE 2.1 This paper will discuss the lessons learned from the events leading up to the production deployment of CLE 2.1 and the post-install issues experienced in upgrading NERSC's XT4 system, Franklin.
Lonnie Crosby, National Institute for Computational Sciences (NICS), Performance Characteristics of the Lustre File System on the Cray XT5 with Respect to Application I/O Patterns As the size and complexity of supercomputing platforms increase, additional attention needs to be given to the performance challenges presented by application I/O. This paper will present the performance characteristics of the Lustre file system utilized on the Cray XT5 systems and illuminate the challenges presented by applications that utilize tens of thousands of parallel processes.
Steve Deitz, Brad Chamberlain, Samuel Figueroa, and David Iten, Cray Inc., HPCC STREAM and RA in Chapel: Performance and Potential Chapel is a new parallel programming language under development at Cray Inc. as part of the DARPA High Productivity Computing Systems (HPCS) program. Recently, great progress has been made on the implementation of Chapel for distributed-memory computers. This paper reports on our latest work on the Chapel language. The paper provides a brief overview of Chapel and then discusses the concept of distributions in Chapel. Perhaps the most promising abstraction in Chapel, the distribution is a mapping of the data in a program to the distributed memory in a computer. Last, the paper presents preliminary results for two benchmarks, HPCC STREAM Triad and HPCC RA, that make use of distributions and other features of the Chapel language. The highlights of this paper include a presentation of performance results (2.78 TB/s on 4096 cores of an XT4), a detailed discussion of the core components of the STREAM and RA benchmarks, a thorough analysis of the performance achieved by the current compiler, and a discussion of future work.
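For reference, the STREAM Triad kernel that the Chapel version expresses over distributed arrays is, in conventional C with OpenMP, simply the loop below; the array size, scalar, and function name are illustrative and this is not the Chapel source discussed in the paper.

    /* STREAM Triad reference kernel, shown in C/OpenMP rather than Chapel. */
    #include <stddef.h>
    #include <omp.h>

    void stream_triad(double *a, const double *b, const double *c,
                      double alpha, long n)
    {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = b[i] + alpha * c[i];
    }

The reported bandwidth for Triad is conventionally counted as roughly 24 bytes of memory traffic per iteration (two 8-byte reads and one 8-byte write), which is the figure behind aggregate numbers such as the 2.78 TB/s quoted above.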
Steve Deitz, Cray Inc., Introduction to Chapel–A Next-Generation HPC Language Chapel is a new parallel programming language under development at Cray Inc. as part of the DARPA High Productivity Computing Systems (HPCS) program. Chapel has been designed to improve the productivity of parallel programmers working on large-scale supercomputers as well as small-scale, multicore computers and workstations. It aims to vastly improve programmability over current parallel programming models while supporting performance and portability at least as good as today's technologies. In this tutorial, we will present an introduction to Chapel, from context and motivation to a detailed description of Chapel concepts via lecture and example computations. We will also perform a live demonstration of using Chapel on a laptop to compile and run a sample program. Although not part of the tutorial, we will also provide interested audience members with several Chapel exercises in the form of handouts. We'll conclude by giving an overview of future Chapel activities.
Luiz DeRose, Cray Inc., The Cray Programming Environment: Current Status and Future Directions The Cray Programming Environment has been designed to address issues of scale and complexity of high end HPC systems. Its main goal is to hide the complexity of the system, such that applications can achieve the highest possible performance from the hardware. In this paper we will present the recent activities and future directions of the Cray Programming Environment, which consists of state of the art compiler, tools, and libraries, supporting a wide range of programming models.
Thomas Davis and David Skinner, National Energy Research Scientific Computing Center (NERSC), System Monitoring Using NAGIOS, Cacti, and Prism This paper will examine the issues in monitoring a large (9k+ node) installation using NAGIOS and Cacti. Thermal, fault, and performance display and notification will be examined.
Luiz DeRose and John Levesque, Cray Inc., A Methodical Approach for Scaling Applications to 100,000 Cores In this tutorial we will present tools and techniques for application performance tuning on the Cray XT system, with focus on multi-core processors. The tutorial will consist of a brief discussion of the Cray XT architecture, focusing on aspects that are important for understanding the performance behavior of applications; followed by a presentation on the Cray performance measurement and analysis tools; and concluding with optimization techniques, including discussions on MPI, numerical libraries and I/O.
David Fellinger, DataDirect Networks, Inc., Exploring Mass Storage Concepts to Support Exascale Architectures There are many challenges that must be faced in creating an architecture of compute clusters that can perform at multiple petaflops and beyond to exascale. One of these is certainly the design of an I/O and storage infrastructure that can maintain a balance between processing and data migration. Developing a traditional workflow including checkpoints for simulations will mean an unprecedented increase in I/O bandwidth over existing technologies, and designing a storage system that can contain and deliver the product of the computation in an organized file system will be a key factor in enabling these large clusters. Potential enabling technologies that could achieve dynamic scaling in both I/O bandwidth and data distribution will be discussed. Concepts will also be presented that could allow the distribution elements to act upon the data to simplify post-processing requirements and scientific collaboration. These technologies, if realized, would help to establish a much closer tie between computation and storage resources, decreasing the latency to data assimilation and analysis.
Manisha Gajbe and Richard Vuduc, Georgia Institute of Technology, and Andrew Canning, Lin-Wang Wang, John Shalf, and Harvey Wasserman, National Energy Research Scientific Computing Center (NERSC), Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 In this work we show optimization, performance modeling, and auto-tuning of 3-dimensional Fast Fourier Transforms (FFTs) on the Cray XT4 (Franklin) system. At the core of many real-world scientific and engineering applications is the need to compute 3D FFTs. The FFT is a very commonly used numerical technique in computational physics, engineering, chemistry, geosciences, and other areas of high performance computing, with many properties useful in both engineering and scientific computing applications (examples include molecular dynamics, materials science, fluid dynamics, etc.). The problem with a parallel FFT is that the computational work involved is O(N log N) while the amount of communication is O(N). This means that for small values of N (64 x 64 x 64 3D FFTs), the communication costs rapidly overwhelm the parallel computation savings. A distributed 3D FFT represents a considerable challenge for the communications infrastructure of a parallel machine because of the all-to-all nature of the distributed transposes required, and it stresses aspects of the machine that complement those addressed by other benchmark kernels, such as Linpack, which solves a system of linear equations, Ax = b. In addition, by examining the key characteristics of the kernel, we form an analytical performance model. The performance model can be a very useful tool for predicting the performance of many scientific applications that use 3-dimensional Fast Fourier Transforms, and it can be used to explore the achievable performance on future systems with increasing computation and communication performance.
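The computation-to-communication argument above can be made concrete with a back-of-the-envelope estimate (ours, not a figure from the paper). The ratio of arithmetic to data moved in the transposes grows only logarithmically with the transform size:

    \frac{W_{\mathrm{comp}}}{W_{\mathrm{comm}}} \sim \frac{O(N \log N)}{O(N)} = O(\log N),
    \qquad N = 64^3 = 262{,}144 \;\Rightarrow\; \log_2 N = 18.

So each element exchanged in the all-to-all transposes is amortized by only about 18 butterfly stages of floating-point work, and at this problem size the fixed latencies of the all-to-all messages dominate the run time, which is why the small-N cases stress the interconnect so heavily.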
James Glidewell, The Boeing Company (BOEING), Chair, User Support SIG The User Support SIG welcomes attendees interested in discussing issues related to supporting their user community. SIG business will be conducted followed by open discussions with other attendees as well as representatives from Cray. All attendees are welcome to participate in this meeting.
Chris Gottbrath, TotalView Technologies, Debugging Scalable Applications on the XT Debugging at large scale on the Cray XT can involve a combination of interactive and non-interactive debugging; the paper will review subset attach and provide some recommendations for interactive debugging at large scale, and will introduce the TVScript feature of TotalView which provides for non-interactive debugging. Because many users of Cray XT systems are not physically co-located with the HPC centers on which they develop and run their applications, the paper will also cover the new TotalView Remote Display Client, which allows remote scientists and computer scientists to easily create a connection over which they can use TotalView interactively. The paper will conclude with a brief update on two topics presented at the previous two CUG meetings: memory debugging on the Cray XT and Record and Replay Debugging.
Alan Gray, HPCX Consortium (HPCX), An Evaluation of UPC in the Ludwig Application As HPC systems continue to increase in size and complexity, Unified Parallel C (UPC), a novel programming language which facilitates parallel programming via intuitive use of global data structures, potentially offers enhanced productivity over traditional message passing techniques. Modern Cray systems such as the X2 component of HECToR, the UK's National High Performance Computing Service, are among the first to fully support UPC; the XT component of HECToR will offer such support with future upgrades. This paper reports on a study to adapt Ludwig, a Lattice Boltzmann application actively used in research, to utilize UPC functionality: comparisons with the original message passing code, in terms of performance (on HECToR) and productivity in general, are presented.
James J. Hack, Director, National Center for Computational Sciences, Oak Ridge National Laboratory (ORNL), Invited Talk: Challenges in Climate Change Research: Computation as an Enabling Technology The problem of quantifying the consequences of climate change over the next few decades is motivated by the increasingly urgent need to adapt to near term trends in climate and to the potential for changes in the frequency and intensity of extreme events. There are significant uncertainties in how climate properties will evolve at regional and local scales, where the background signal of natural variability is large. Consequently, the climate community is now facing major challenges and opportunities in its efforts to rapidly advance the basic science and its application to policy formation. Meeting the challenges in climate change science will require qualitatively different levels of scientific understanding, numerical modeling capabilities, and computational infrastructure than have been historically available to the community. This talk will touch on how climate science will need to develop in the context of the community’s scientific capabilities as assessed by the recent IPCC AR4 activity, and where it will need to be in order to accurately predict the coupled chemical, biogeochemical, and physical evolution of the climate system with the fidelity required by policy makers and resource managers.
Jim Harrell, Cray Inc., External Services The Cray XT architecture separated service from compute. This was a well-worn path used by MPP systems in the past, but one that is important for scaling. Now a number of customers and Cray are working on making the service nodes even more separate from the compute nodes by using more standard servers that are outside the High Speed Network. This BOF will offer an opportunity for Cray and customers to share experiences and directions for External Service Nodes.
Yun (Helen) He, National Energy Research Scientific Computing Center (NERSC), User and Performance Impacts from Franklin Upgrades The NERSC flagship computer "Franklin," a Cray XT4 with 9,660 compute nodes, has gone through two major upgrades (a quad-core upgrade and an OS 2.1 upgrade) during the last year. In this paper, we discuss various aspects of the user impacts, such as user access, user environment, and user issues. The performance impacts on kernel benchmarks and selected application benchmarks will also be presented.
Nicholas Henke, Cray Inc., Automated Lustre Failover on the Cray XT The ever increasing scale of current XT machines requires increased stability from the critical system services. Lustre failover is being deployed to allow continued operation in the face of some component failures. This paper will discuss the current automation framework and the internal Lustre mechanisms involved during failover. We will also discuss the impact of failover on a system and the future enhancements that will improve Lustre failover.
Lee Higbie, Arctic Region Supercomputing Center (ARSC), HPC Fortran Compilers A project to evaluate the compilers on ARSC's XT5 revealed some hazards of performance testing. In this study we compared the performance of "highly optimized" code to "default optimization." Some results on optimization were surprising, and we discuss those results and obstacles we faced in gathering our results.
Scott Jackson and Michael Jackson, Cluster Resources, Unifying Heterogeneous Cray Resources and Systems into an Intelligent Single-Scheduled Environment As Cray systems are expanded and updated with the latest chip sets and technologies (for example, memory and processors), system managers may want to allow users to run jobs across heterogeneous resources to avoid fragmentation. In addition, as next-generation platforms with key differences (such as partition managers like ALPS and CPA) are added, system managers want the ability to submit jobs to the combined system, automatically applying workload to the best-available resources and unifying reporting for managers. This paper will describe how Moab Workload Manager has been integrated with Cray technologies to provide support for running jobs across heterogeneous resources and disparate systems.
Geir Johansen and Barb Mauzy, Cray Inc., Cray XT Programming Environment's Implementation of Dynamic Shared Libraries This paper will describe the Cray Programming Environment's implementation of dynamic libraries, which will be supported in a future Cray XT CLE release. The Cray implementation will provide flexibility to users to allow specific library versions to be chosen at both link and run time. The implementation will provide support for running executables that were built using the Cray Programming Environment on other platforms. Also, the paper will discuss how executables built with software other than the Cray Programming Environment may be able to be run on the Cray XT.
Mary Johnston, Cray Inc., CrayPort BoF CrayPort has been evolving since its release. This BoF provides an opportunity for attendees to learn more about CrayPort as well as provide valuable feedback to Cray on the current version. All attendees are welcome to participate in this BoF.
Wayne Joubert, Oak Ridge National Laboratory (ORNL), Performance of Variant Memory Configurations for Cray XT Systems In late 2009 NICS will upgrade its 8,352-socket Cray XT5 from Barcelona (4 cores/socket) processors to Istanbul (6 cores/socket) processors, taking it from a 615 TF machine to nearly 1 PF. To balance the machine and keep 1 GB of memory per core, NICS is interested in reconfiguring the XT5 from its current mix of 8- and 16-GB nodes to a uniform 12 GB/node. This talk examines alternative memory configurations for attaining this, such as balancing the DIMM count between the sockets of a node vs. unbalanced configurations. Results of experiments with these configurations are presented, and conclusions are discussed.
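The 12 GB/node target follows from simple arithmetic (our gloss, assuming the dual-socket XT5 node described elsewhere in these abstracts):

    2 \ \text{sockets/node} \times 6 \ \text{cores/socket} \times 1 \ \text{GB/core} = 12 \ \text{GB/node},

compared with 8 GB/node for a quad-core Barcelona node provisioned at the same 1 GB per core.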
Shoaib Kamil, University of California, Berkeley, Cy Chan, Massachusetts Institute of Technology, John Shalf, National Energy Research Scientific Computing Center (NERSC), and Leonid Oliker and Sam Williams, Lawrence Berkeley National Laboratory, A Generalized Framework for Auto-tuning Stencil Computations This work introduces a generalized framework for automatically tuning stencil computations to achieve optimal performance on a broad range of multicore architectures. Stencil (nearest-neighbor) based kernels constitute the core of many important scientific applications involving block-structured grids. Auto-tuning systems search over optimization strategies to find the combination of tunable parameters that maximizes computational efficiency for a given algorithmic kernel. Although the auto-tuning strategy has been successfully applied to libraries, generalized stencil kernels are not amenable to packaging as libraries. We introduce a generalized stencil auto-tuning framework that takes a straightforward Fortran77 expression of a stencil kernel and automatically generates tuned implementations of the kernel in Fortran, C, or CUDA to achieve performance portability across diverse computer architectures, ranging from conventional AMD multicore processors to the latest NVIDIA GTX280 GPUs.
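For concreteness, a 7-point 3D stencil of the kind such a framework consumes looks like the untuned reference loop nest below (coefficients, layout, and bounds are illustrative, not the paper's test problems); the auto-tuner then explores blocking, unrolling, prefetching, and thread decompositions of this nest.

    /* Untuned 7-point 3D stencil in C.  Grid dimensions, coefficients, and
     * array layout are illustrative. */
    #include <stddef.h>

    void stencil7(int nx, int ny, int nz,
                  const double *in, double *out, double alpha, double beta)
    {
        #define IDX(i, j, k) ((size_t)(k) * ny * nx + (size_t)(j) * nx + (i))
        for (int k = 1; k < nz - 1; k++)
            for (int j = 1; j < ny - 1; j++)
                for (int i = 1; i < nx - 1; i++)
                    out[IDX(i, j, k)] =
                          alpha * in[IDX(i, j, k)]
                        + beta  * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)]
                                 + in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)]
                                 + in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
        #undef IDX
    }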
James Kasdorf, Pittsburgh Supercomputing Center (PITTSCC), Chair, Legacy Systems SIG The Legacy System SIG welcomes attendees still utilizing Legacy systems such as XD1 and X1. This meeting provides an opportunity for SIG business followed by open discussions with other attendees as well as representatives from Cray. All attendees are welcome to participate in this meeting.
Ricky Kendall and Don Maxwell, Oak Ridge National Laboratory (ORNL) and Cathy Willis and Jeff Becklehimer, Cray Inc., Slow Nodes Cost Your Users Valuable Resources: Can You Find Them? Many HPC applications have a static load balance, which is easy and cheap to implement. When one or a few nodes are not performing properly, the whole code slows down to the rate-limiting performance of the slowest node. We describe the use of a code called bugget, which has been used on Catamount and the Cray Linux Environment to quickly identify these nodes so they can be removed from the user pool until the next appropriate maintenance period.
Jeff Larkin, Cray Inc., Practical Examples for Efficient I/O on Cray XT Systems The purpose of this paper is to provide practical examples on how to perform efficient I/O at scale on Cray XT systems. Although this paper will provide some data from recognized benchmarks, it will focus primarily on providing actual code excerpts as specific examples of how to write efficient I/O into an HPC application. This will include explanations of what the example code does and why it is necessary or useful. This paper is intended to educate by example how to perform application checkpointing in an efficient manner at scale.
James Laros, Kevin Pedretti, Sue Kelly, John Vandyke, and Courtenay Vaughan, Sandia National Laboratories (SNLA) and Mark Swan, Cray Inc., Investigating Real Power Usage on Red Storm This paper will describe the instrumentation of the Red Storm Reliability Availability and Serviceability (RAS) system to enable collection of power draw and instantaneous voltage measurements on a per-socket basis. Additionally, we will outline modifications to the Catamount Light Weight Kernel Operating System which have realized significant power savings during idle periods. We will also discuss what we call Application Power Signatures and future plans for research in this area.
Brent Leback, Douglas Miles, Michael Wolfe, and Steven Nakamoto, The Portland Group, An Accelerator Programming Model for Multicore PGI has developed a kernels programming model for accelerators, such as GPUs, where a kernel roughly corresponds to a set of compute-intensive parallel loops with rectangular limits. We have designed directives for C and Fortran programs to implement this model, similar in design to the well-known and widely used OpenMP directives. We are currently implementing the directives and programming model in the PGI C and Fortran compilers to target x64+NVIDIA platforms. This paper explores the possibility of implementing the kernels programming model on multicore x64 CPUs as a vehicle for creating algorithms that efficiently exploit SIMD/vector and multicore parallelism on processors that are increasingly memory-bandwidth constrained.
YikLoon Lee, ERDC MSRC (ERDCMSRC), Some Issues in the Development of Overset Grid CFD Using One-Sided Communication Overset (overlapping) grids are very useful for CFD problems with multiple moving bodies. The parallel implementation of the grid connectivity algorithm with a two-sided communication model like MPI, however, poses a formidable challenge. One of the reasons is that only a few remote data items are known to be needed at any one time in the cell search process. A one-sided model is well suited to this problem. This paper describes the porting and development of a Navy rotorcraft CFD code using Coarray Fortran on the Cray X1.
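The paper itself uses Coarray Fortran on the X1; as a rough analogue in a more widely available model, the cell-search access pattern can be sketched with MPI-2 one-sided communication. The structure, window contents, and names below are our assumptions, not taken from the code.

    /* One-sided fetch of a single remote grid cell during the cell search:
     * the requesting rank pulls the data without the owner posting a
     * matching receive.  MPI-2 RMA stands in for Coarray Fortran here. */
    #include <mpi.h>

    typedef struct { double x, y, z; int donor_id; } cell_t;   /* illustrative */

    void fetch_remote_cell(MPI_Win win, int owner_rank, int cell_index,
                           cell_t *result)
    {
        MPI_Win_lock(MPI_LOCK_SHARED, owner_rank, 0, win);
        MPI_Get(result, sizeof(cell_t), MPI_BYTE,
                owner_rank,
                (MPI_Aint)cell_index * sizeof(cell_t),
                sizeof(cell_t), MPI_BYTE, win);
        MPI_Win_unlock(owner_rank, win);   /* completes the transfer */
    }

The point of the one-sided model is exactly this: the searching process does not know in advance which few cells it will need, so the owner cannot reasonably pre-post matching two-sided receives.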
Hwa-Chun (Wendy) Lin, National Energy Research Scientific Computing Center (NERSC), Understanding Aprun Use Patterns On the Cray XT, aprun is the command that launches an application onto a set of compute nodes reserved through the Application Level Placement Scheduler (ALPS). At the National Energy Research Scientific Computing Center (NERSC), interactive aprun is disabled; that is, invocations of aprun have to go through the batch system. Batch scripts can and often do contain several apruns, which either use subsets of the reserved nodes in parallel or use all reserved nodes in consecutive apruns. In order to better understand how NERSC users run on the XT, it is necessary to associate aprun information with jobs, which is more challenging than it sounds. In this paper, we describe those challenges and how we solved them to produce daily per-job reports for completed apruns. We also describe additional uses of the data, e.g., adjusting charging policy accordingly or associating node failures with jobs/users, and plans for enhancements.
Jay Lofstead, Georgia Institute of Technology and Scott Klasky, Karsten Schwan, and Chen Jin, Oak Ridge National Laboratory (ORNL), Petascale I/O Using the Adaptable I/O System ADIOS, the adaptable I/O system, has demonstrated excellent scalability to 16,000 cores. With the introduction of the XT5 upgrades to Jaguar, new optimizations are required to successfully reach 140,000+ cores. This paper explains the techniques employed and shows the performance levels attained.
Richard Loft, John Dennis, Mariana Vertenstein, and Nathan Hearn, National Center for Atmospheric Research, James Kinter, Center for Ocean Land Atmosphere Studies, and Ben Kirtman, University of Miami, Optimizing High-Resolution Climate Variability Experiments on the Cray XT4 and XT5 Systems at NICS and NCCS This paper will present XT4 (and hopefully XT5) performance and scaling data for a high-resolution (0.5° atmosphere and land surface coupled to 0.1° ocean/sea ice) development version of the Community Climate System Model (CCSM) in configurations capable of running efficiently on up to 6380 processors. Technical issues related to tuning the MPI runtime environment, load balancing multiple climate components, and I/O performance will also be discussed.
Don Maxwell, Josh Lothian, Richard Ray, Jason Hill, and David Dillow, Oak Ridge National Laboratory (ORNL) and Cathy Willis and Jeff Becklehimer, Cray Inc., XT9? Integrating and Operating a Conjoined XT4+XT5 System The National Center for Computational Sciences at Oak Ridge National Laboratory recently acquired a Cray XT5 capable of more than a petaflop in sustained performance. The existing Cray XT4 has been connected to the XT5 to increase the system's computing power to a peak of 1.64 petaflops. The design and implementation of the conjoined system will be discussed. Topics will include networks, Lustre, the Cray software stack, and scheduling with Moab and TORQUE.
Ian Miller, Cray Inc., Cray CX1 Overview This talk will introduce the technology and capability of the Cray CX1 personal supercomputer, including microprocessor advancements, product roadmap, application results and customer testimonials. In addition, there will be discussion on creating SMP machines using Cray CX1 blades and how the CX1 leverages GPGPU computing. Directly following this presentation will be an interactive BOF hosting a live demonstration of applications running on the Cray CX1.
Ian Miller, Cray Inc., CX1 BoF This is a continuation of the Cray CX1 Overview session. Details regarding the new CX1 system will be discussed as well as some presentations by Cray. Opportunities for discussions regarding the CX1 will also be available during this BoF.
Richard Mills and Forrest Hoffman, Oak Ridge National Laboratory (ORNL), Coping at the User-Level with Resource Limitations in the Cray Message Passing Toolkit MPI at Scale: How Not to Spend Your Summer Vacation As the number of processor cores available in Cray XT series computers has rapidly grown, users have increasingly encountered instances where an MPI code that has previously worked for years unexpectedly fails at high core counts ("at scale") due to resource limitations being exceeded within the MPI implementation. Here, we examine several examples drawn from user experiences and discuss strategies for working around these difficulties at the user level.
Richard Mills, Oak Ridge National Laboratory (ORNL), Glenn Hammond, Pacific Northwest National Laboratory (PNNL), Peter Lichtner, Los Alamos National Laboratory, and Barry Smith, Argonne National Laboratory, Experiences and Challenges Scaling PFLOTRAN, a PETSc-based Code for Subsurface Reactive Flow Simulations, Towards the Petascale on Cray XT Systems We will describe our initial experiences running PFLOTRAN (a code for simulation of coupled hydro-thermal-chemical processes in variably saturated, non-isothermal, porous media) on the petaflop incarnation of Jaguar, the Cray XT5 at Oak Ridge National Laboratory. PFLOTRAN utilizes fully implicit time-stepping and is built on top of the Portable, Extensible Toolkit for Scientific Computation (PETSc). We discuss the hurdles to "at scale" performance with PFLOTRAN and make some observations in general about implicit simulation codes on the XT5.
Hai Ah Nam and David Dean, Oak Ridge National Laboratory (ORNL) and Pieter Maris and James Vary, Iowa State University, Computing Atomic Nuclei on the Cray XT5 Understanding the structure of atomic nuclei is crucial for answering many fundamental questions such as the origin of elements in the universe. The computational challenge for large-scale nuclear structure calculations, particularly ab initio no-core shell model calculations, stems from the diagonalization of the Hamiltonian matrix, with dimensions in the billions, which can be stored in memory, on disk, or recomputed on-the-fly. Here, we discuss the issues of scaling MFDn, a nuclear shell model application, including the I/O demands on the Cray XT5.
Bill Nitzberg, Altair Engineering, Inc., Select, Place, and Vnodes: Exploiting the PBS Professional Architecture on Cray Systems In 2009, Altair is porting our new PBS Professional workload management and job scheduling architecture to Cray systems. This architecture significantly improves the extensibility, scalability, and reliability of systems by providing two key abstractions: jobs are described in generic “chunks” (e.g., MPI tasks and OpenMP threads), independent of hardware, and resources are described in terms of generic “vnodes” (e.g., blades and/or sockets), independent of jobs. By reducing both jobs and resources to their most basic components, PBS is able to run the right job at the right time in the right place, run it as fast as possible on any hardware, and reduce waste to zero. In this paper, we provide a detailed look at the PBS Professional architecture and how it is mapped onto modern Cray systems.
Oralee Nudson, Craig Stephenson, Don Morton, Lee Higbie, and Tom Logan, Arctic Region Supercomputing Center (ARSC), Benchmarking and Evaluation of the Weather Research and Forecasting (WRF) Model on the Cray XT5 The Weather Research and Forecasting (WRF) model is utilized extensively at ARSC for operational and research purposes. ARSC has developed an ambitious suite of benchmark cases and, for this work, we will present results of scaling evaluations on the Cray XT5 for distributed (MPI) and hybrid (MPI/OpenMP) modes of computation. Additionally, we will report on our attempts to run a 1km-resolution case study with over one billion grid points.
Ronald Oldfield, Andrew Wilson, Craig Ulmer, Todd Kordenbrock, and Alan Scott, Sandia National Laboratories (SNLA), Access to External Resources Using Service-Node Proxies Partitioning massively parallel supercomputers into service nodes running a full-fledged OS and compute nodes running a lightweight kernel has many well-known advantages but renders it difficult to access externally located resources such as high-performance databases that may only communicate via TCP. We describe an implementation of a proxy service that allows service nodes to act as a relay for SQL requests issued by processes running on the compute nodes. This implementation allows us to move toward using HPC systems for scalable informatics on large data sets that simply cannot be processed on smaller machines.
Mark Pagel, Kim McMahon, and David Knaak, Cray Inc., Scaling the MPT Software on the Cray XT5 System and Other New Features The MPT 3.1 release allowed MPI and SHMEM codes to run on over 150,000 cores and was necessary to help the Cray XT5 at ORNL achieve over a petaflop in performance on HPL. MPT 3.1 was also used in quickly getting numerous other applications to scale to the full size of the machine. This paper will walk through the latest MPT features, including both performance enhancements and functional improvements added over the last year, such as improvements for MPI_Allgather and MPI_Bcast as well as for MPI-IO collective buffering. New heuristics that provide better default values for a number of MPI environment variables, resulting in fewer application reruns, will also be discussed.
Kevin Peterson, Cray Inc., Scaling Efforts to Reach a PetaFlop This paper describes the activities leading up to reaching sustained petaflop performance on the Jaguar XT5 system fielded at the DOE Leadership Computing Facility at Oak Ridge National Laboratory (ORNL). These activities included software development, scaling emulation, system validation, pre-acceptance application tuning, application benchmarking, and acceptance testing. The paper describes what changes were necessary to the Cray software stack to execute efficiently on over 100K cores. Changes to Cray System Management (CSM), the Cray Linux Environment (CLE) and the Cray Programming Environment (CPE) are described, as well as the methodologies used to test the software prior to execution on the target machine.
Kevin Peterson, Cray Inc., Gemini Software Development Using Simulation This paper describes Cray's pre-hardware development environment, the activities, and the testing for the Gemini software stack. Gemini is the next generation high-speed network for the Cray XT-series. The simulation environment is based on AMD's SimNow coupled with a Gemini device model that can be aggregated to form multi-node systems. Both operating system and programming environment software components have been developed within this environment. The simulated batch environment, regression test suite, and development progress are also described.
Heidi Poxon, Steve Kaufmann, Dean Johnson, Bill Homer, and Luiz DeRose, Cray Inc., Enhanced Productivity Using the Cray Performance Analysis Toolset The purpose of an application performance analysis tool is to help the user identify whether or not their application is running efficiently on the computing resources available. However, the scale of current and future high-end systems, as well as increasing system software and architecture complexity, brings a new set of challenges to today's performance tools. In order to achieve high performance on these petascale computing systems, users need a new infrastructure for performance analysis that can handle the challenges associated with multiple levels of parallelism, hundreds of thousands of computing elements, and novel programming paradigms that result in the collection of massive sets of performance data. In this paper we present the Cray Performance Analysis Toolset, which is set on an evolutionary path to address the application performance analysis challenges associated with these massive computing systems by highlighting relevant data and by bringing Cray optimization knowledge to a wider set of users.
John Patchett, Dave Pugmire, Sean Ahern, and Daniel Jamison, Oak Ridge National Laboratory (ORNL), and James Ahrens, Los Alamos National Laboratory, Parallel Visualization and Analysis with ParaView on the Cray XT4 Scientific data sets produced by modern supercomputers like ORNL's Cray XT4, Jaguar, can be extremely large, making visualization and analysis more difficult, as moving large resultant data to dedicated analysis systems can be prohibitively expensive. We share our continuing work of integrating a parallel visualization system, ParaView, on ORNL's Jaguar system and our efforts to enable extreme-scale interactive data visualization and analysis. We will discuss porting challenges and present performance numbers.
Christopher Porter, Platform Computing, Challenges and Solutions of Job Scheduling in Large Scale Environments Many large computational processes running on large systems are embarrassingly parallel – large parallel jobs consisting of many small tasks. Typically, job schedulers only schedule the parallel job, leaving the application to handle task scheduling. To avoid such complexity, some developers leverage standard parallel programming environments like MPI. This presentation will discuss a solution using Platform LSF and Platform LSF Session Scheduler to handle task-level scheduling. As a result, application developers can focus more on the application itself rather than dealing with system and task scheduling issues.
Georg Hager, Erlangen Regional Computing Center (RRZE), Gabriele Jost, Texas Advanced Computing Center (TACC), Rolf Rabenseifner, High Performance Computing Center Stuttgart (HLRS), Communication Characteristics and Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-core SMP Nodes Hybrid MPI/OpenMP and pure MPI on clusters of multi-core SMP nodes involve several mismatch problems between the parallel programming models and the hardware architectures. Measurements of communication characteristics between cores on the same socket, on the same SMP node, and between SMP nodes on several platforms (including Cray XT4 and XT5) show that machine topology has a significant impact on performance for all parallelization strategies and that topology awareness should be built into all applications in the future. We describe potentials and challenges of the dominant programming models on hierarchically structured hardware. Case studies with the multizone NAS parallel benchmarks on several platforms demonstrate the opportunities of hybrid programming.
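A minimal skeleton of the hybrid model being compared, assuming one MPI process per SMP node or socket with OpenMP threads inside it; placement and affinity controls, which the paper shows to matter, are left to the batch system and environment here.

    /* Minimal hybrid MPI/OpenMP skeleton: one MPI process per node or socket,
     * OpenMP threads within it. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            #pragma omp master
            printf("rank %d runs %d threads (thread support level %d)\n",
                   rank, omp_get_num_threads(), provided);
            /* ... threaded compute phase; halo exchange done by the master ... */
        }

        MPI_Finalize();
        return 0;
    }

Where the MPI processes and their threads land relative to sockets and nodes determines which of the measured intra-socket, intra-node, and inter-node communication paths each message takes, which is the topology effect the paper quantifies.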
Mahesh Rajan, Douglas Doerfler, and Courtenay Vaughan, Sandia National Laboratories (SNLA), Red Storm/Cray XT4: A Superior Architecture for Scalability The benefits of the XT4 node and interconnect architecture for the scalability of applications are analyzed with the help of micro-benchmarks, mini-applications, and production applications. Performance comparisons with a large InfiniBand cluster with multi-socket nodes incorporating a similar AMD processor bring out the importance of architectural balance in the design of HPC systems. This paper attempts to quantify application performance improvements in terms of simple balance ratios.
Michael Ringenburg and Sung-Eun Choi, Cray Inc., Optimizing Loop-level Parallelism in Cray XMT Applications The Cray XMT, a massively multithreaded shared memory system, exploits fine-grained parallelism to achieve outstanding speed. The system contains a set of software, including a compiler and various performance tools, to help users take advantage of the parallelism opportunities provided by the architecture. Despite the strengths of this software, writing efficient codes for the XMT requires a level of understanding of the architecture and of the capabilities provided by the compiler, as well as a knowledge of how they differ from their more conventional counterparts. This paper will provide a brief overview of the XMT system before diving into the techniques used by the compiler to parallelize user codes. We will then describe how this knowledge can be used to write more efficient and parallelizable codes, and discuss some of the pragmas and language extensions provided by the compiler to assist this process. We will finish with a demonstration, using real application kernels, of how the XMT performance tools can be used to assess how the compiler transformed and parallelized user codes, and to determine what changes may be required to get the codes to run more efficiently.
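To give a flavor of the compiler assertions the paper covers, a loop annotated for the XMT C compiler looks roughly like the sketch below; the pragma spellings are from our recollection of the XMT programming documentation, so treat them as illustrative and the paper as authoritative.

    /* Loop-level assertions for the XMT C compiler (illustrative spellings). */
    void scale(int n, double *restrict a, const double *restrict b, double s)
    {
        /* Assert that iterations carry no dependences and may be run fully
         * in parallel, so the compiler generates multithreaded code. */
        #pragma mta assert nodep
        #pragma mta assert parallel
        for (int i = 0; i < n; i++)
            a[i] = s * b[i];
    }

The performance tools discussed in the paper can then confirm whether the compiler actually parallelized the loop as asserted.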
Jim Rogers, Oak Ridge National Laboratory (ORNL) and Bob Hoehn and Doug Kelley, Cray Inc., Deploying Large Scale XT Systems at ORNL Jaguar, the DOE leadership system at the ORNL Leadership Computing Facility, and Kraken, the NSF petascale system managed by UT-ORNL's National Institute for Computational Sciences, are two of the largest systems in the world. These systems rely on the Cray ECOphlex(tm) technology to routinely remove more than 10MW of heat. This presentation will describe the installation of these two systems, the unique packaging of the XT5, and how this system design can reduce power and cooling requirements, reduce energy consumption, and lower operating costs.
James Rosinski, Oak Ridge National Laboratory (ORNL), General Purpose Timing Library (GPTL): A Tool for Characterizing Performance of Parallel and Serial Applications GPTL is an open source profiling library that reports a variety of performance statistics. Target codes may be parallel via threads and/or MPI. The code regions to be profiled can be hand-specified by the user, or GPTL can define them automatically at function-level granularity if the target application is built with an appropriate compiler flag. Output is presented in a hierarchical fashion that preserves parent-child relationships of the profiled regions. If the PAPI library is available, GPTL utilizes it to gather hardware performance counter data. GPTL built with PAPI support is installed on the XT4 and XT5 machines at ORNL.
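Hand-instrumented use of GPTL typically looks like the sketch below; the call names are as we recall them from the GPTL distribution (check gptl.h of the installed version), and the timed region is a placeholder.

    /* Typical hand-instrumented GPTL usage. */
    #include <gptl.h>

    extern void solver_iteration(void);   /* placeholder for the real work */

    int main(void)
    {
        GPTLinitialize();

        GPTLstart("total");
        for (int it = 0; it < 100; it++) {
            GPTLstart("solver");          /* nested region: child of "total" */
            solver_iteration();
            GPTLstop("solver");
        }
        GPTLstop("total");

        GPTLpr(0);                        /* write the timing hierarchy, e.g. timing.0 */
        GPTLfinalize();
        return 0;
    }

The nesting of start/stop calls is what produces the parent-child hierarchy in the report, and building with PAPI support adds hardware counter columns to the same output.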
Philip C. Roth and Jeffrey S. Vetter, Oak Ridge National Laboratory (ORNL), Scalable Tool Infrastructure for the Cray XT Using Tree-Based Overlay Networks Performance, debugging, and administration tools are critical for the effective use of parallel computing platforms, but traditional tools have failed to overcome several problems that limit their scalability, such as communication between a large number of tool processes and the management and processing of the volume of data generated on a large number of compute nodes. A tree-based overlay network has proven effective for overcoming these challenges. In this paper, we present our experiences in bringing our MRNet tree-based overlay network infrastructure to the Cray XT platform, including a description and preliminary performance evaluation of proof-of-concept tools that use MRNet on the Cray XT.
Jason Schildt, Cray Inc., System Administration Data Under CLE 2.2 and SMW 4.0 With SMW 4.0 and CLE 2.2, Cray is making significant improvements in how system administrators can access information about jobs, nodes, errors, and health / troubleshooting data. This talk and paper will explain the changes and how administrators can use them to make their lives easier.
Michael Schultze, Rebecca Hartman-Baker, Richard Middleton, Michael Hilliard, and Ingrid Busch, Oak Ridge National Laboratory (ORNL), Solution of Mixed-Integer Programming Problems on the XT5 In this paper, we describe our experience with solving difficult mixed-integer linear programming problems (MILPs) on the petaflop Cray XT5 system at the National Center for Computational Sciences at Oak Ridge National Laboratory. We describe the algorithmic, software, and hardware needs for solving MILPs and present the results of using PICO, an open-source, parallel, mixed-integer linear programming solver developed at Sandia National Laboratories, to solve canonical MILPs as well as problems of interest arising from the logistics and supply chain management field.
Galen Shipman, David Dillow, Sarp Oral, and Feiyi Wang, Oak Ridge National Laboratory (ORNL) and John Carrier and Nicholas Henke, Cray Inc., The Spider Center Wide File System: From Concept to Reality The Leadership Computing Facility at Oak Ridge National Laboratory has a diverse portfolio of computational resources ranging from a petascale XT4/XT5 simulation system (Jaguar) to numerous other systems supporting visualization and data analytics. In order to support the I/O needs of these systems, Spider, a Lustre-based center wide file system, was designed to provide over 240 GB/s of aggregate throughput with over 10 Petabytes of capacity. This paper will detail the overall architecture of the Spider system, challenges in deploying a file system of this scale, and novel solutions to these challenges which offer key insights into file system design in the future.
Thomas C. Schulthess, Director, CSCS-Swiss National Supercomputing Centre (CSCS), The DCA++ Story: How New Algorithms, New Computers, and Innovative Software Design Allow Us to Solve Challenging Simulation Problems in High Temperature Superconductivity Staggering computational and algorithmic advances in recent years now make possible systematic Quantum Monte Carlo simulations of high temperature superconductivity in a microscopic model, the two dimensional Hubbard model, with parameters relevant to the cuprate materials. Here we report the algorithmic and computational advances that enable us to study the effect of disorder and nano-scale inhomogeneities on the pair-formation and the superconducting transition temperature necessary to understand real materials. The simulation code is written with a generic and extensible approach and is tuned to perform well at scale. Significant algorithmic improvements have been made to make effective use of current supercomputing architectures. By implementing delayed Monte Carlo updates and a mixed single/double precision mode, we are able to dramatically increase the efficiency of the code. On the Cray XT5 system of the Oak Ridge National Laboratory, for example, we currently run production jobs on up to 150 thousand processors that reach a sustained performance of 1.35 PFlop/s.
Timothy Stitt, CSCS–Swiss National Supercomputing Centre (CSCS), A Preliminary Performance Study of CASK-tuned Swiss Application Codes Cray's Adaptive Sparse Kernel (CASK) Library aims to provide adaptive runtime auto-tuning for application codes containing sparse matrix-vector (SpMV) kernels. In this paper I apply the new CASK library to a collection of PETSc-based Swiss scientific application codes and present some preliminary performance results.
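For readers unfamiliar with the term, an SpMV kernel of the kind CASK targets can be as simple as the textbook CSR loop below. This is purely illustrative; CASK's contribution is selecting and tuning such kernels at runtime, not this naive implementation.

```cpp
#include <cstdio>
#include <vector>

// Textbook CSR sparse matrix-vector product y = A*x (illustrative only).
void spmv_csr(const std::vector<int> &rowptr, const std::vector<int> &colind,
              const std::vector<double> &val, const std::vector<double> &x,
              std::vector<double> &y)
{
    for (std::size_t i = 0; i + 1 < rowptr.size(); ++i) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; ++j)
            sum += val[j] * x[colind[j]];
        y[i] = sum;
    }
}

int main()
{
    // 2x2 example: [[4, 1], [0, 3]] times [1, 2]
    std::vector<int> rowptr = {0, 2, 3}, colind = {0, 1, 1};
    std::vector<double> val = {4.0, 1.0, 3.0}, x = {1.0, 2.0}, y(2);
    spmv_csr(rowptr, colind, val, x, y);
    std::printf("y = [%g, %g]\n", y[0], y[1]);   // expected [6, 6]
    return 0;
}
```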
Timothy Stitt and Jean-Guillaume Piccinali, CSCS–Swiss National Supercomputing Centre (CSCS), Parallel Performance Optimization and Behavior Analysis Tools: A Comparative Evaluation on the Cray XT Architecture In the field of high-performance computing (HPC), optimal runtime performance is generally the most desirable trait of an executing application. In this paper we present a comparative evaluation of a set of community parallel performance and behavior analysis tools that have been ported to the Cray XT architecture. Using the vendor-supplied CrayPat and Apprentice tools as our benchmark utilities, we evaluate the remaining tools using a set of objective and subjective measurements, ranging from overhead and portability characteristics, through feature comparisons, to quality of documentation and ease of use. We hope such a study will provide a valuable reference for Cray XT users wishing to identify the analysis tools that best meet their specific requirements.
Olaf Storaasli, Oak Ridge National Laboratory (ORNL) and Dave Strenski, Cray Inc., Exceeding 100X Speedup/FPGA: Cray XD1 Timing Analysis Yields Further Gains Our CUG 2008 paper demonstrated a 100X per-FPGA speedup on Cray XD1s for very large human DNA sequencing, scalable to 150 FPGAs. This paper examines how FPGA time is distributed between computation and I/O. Testing shows that I/O time dominates to such an extent that the actual computation time taken by the FPGAs is almost negligible compared with the corresponding Opteron computations. The authors demonstrate procedures that significantly reduce I/O, yielding even greater speedup, far exceeding 100X and scalable to 150 FPGAs, for human genome sequencing.
Michael Summers, Oak Ridge National Laboratory (ORNL), DCA++: Winning the Gordon Bell Prize with Generic Programming The 2008 Gordon Bell Prize was won by the DCA++ code, the first code to sustain more than a petaflop, running at 1.35 PFlop/s. While many Gordon Bell Prize-winning codes are written in Fortran, the DCA++ code is fully object-oriented and makes heavy use of generic programming. This paper discusses the programming trade-offs that allowed us to achieve world-class performance while retaining the maintainability and elegance of modern software practice.
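As a toy illustration of the generic-programming style the abstract refers to (this is not DCA++ code), a single templated kernel can serve single- and double-precision builds alike, with the precision fixed at compile time and no duplicated source:

```cpp
#include <cstdio>
#include <vector>

// One templated kernel, instantiated for any scalar type at compile time.
template <typename Scalar>
Scalar norm_squared(const std::vector<Scalar> &v)
{
    Scalar sum = Scalar(0);
    for (const Scalar &x : v)
        sum += x * x;
    return sum;
}

int main()
{
    std::vector<float>  vf = {1.0f, 2.0f, 3.0f};
    std::vector<double> vd = {1.0, 2.0, 3.0};
    std::printf("float instantiation:  %f\n", (double)norm_squared(vf));
    std::printf("double instantiation: %f\n", norm_squared(vd));
    return 0;
}
```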
Jason Temple and Fabio Verzelloni, CSCS–Swiss National Supercomputing Centre (CSCS), DVS as a Centralized File System in CLE Using a centralized file system to share home directories between systems is essential for high-performance computing centers. In this paper, we discuss the installation and utilization of DVS in the Cray CLE environment, including current worldwide usage, benchmarks, usability, and missing or desired features.
Scott Thornton and Robert Harrison, University of Tennessee and Oak Ridge National Laboratory, Introducing the MADNESS Numerical Framework for Petascale Computing In preparation for the petaflop Cray XT5, MADNESS was multithreaded using Pthreads and concepts borrowed from the Intel Threading Building Blocks (TBB). MADNESS had been designed for multithreading, but Catamount did not support threads, which were instead emulated using an event queue. Transitioning to truly concurrent execution greatly simplified the overall implementation but required appropriate use of various mutual-exclusion mechanisms (atomic read and increment, spinlock, scalable fair spinlock, mutex, condition variable). Idiosyncrasies of the AMD memory architecture and a non-thread-safe MPI also had to be addressed. We discuss the implementation and performance of the multithreaded MADNESS, motivating the presentation with reference to an application from solid-state physics that is currently under development.
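The mutual-exclusion mechanisms listed in the abstract can be illustrated with a short example. The sketch below is not MADNESS code and uses C++11 primitives rather than the raw Pthreads calls named above; it contrasts a lock-free atomic increment with a mutex-protected update of a shared counter.

```cpp
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::atomic<long> counter_atomic{0};   // updated with atomic read-and-increment
long counter_mutex = 0;                // updated under a mutex
std::mutex mtx;

void worker()
{
    for (int i = 0; i < 100000; ++i) {
        counter_atomic.fetch_add(1);            // lock-free atomic increment
        std::lock_guard<std::mutex> g(mtx);     // heavier-weight mutual exclusion
        ++counter_mutex;
    }
}

int main()
{
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
    for (auto &t : pool) t.join();
    std::printf("atomic=%ld mutex=%ld\n", counter_atomic.load(), counter_mutex);
    return 0;
}
```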
Stanimire Tomov and Shirley Moore, UTK, Jerzy Bernholc, Jack Dongarra, and Heike Jagode, Oak Ridge National Laboratory (ORNL) and Wenchang Lu, NCSU, Performance Evaluation for Petascale Quantum Simulation Tools This paper describes our efforts to establish a performance evaluation "methodology" to be used in a much wider multidisciplinary project on the development of a set of open-source petascale quantum simulation tools for nanotechnology applications. The tools to be developed will be based on the existing real-space multigrid (RMG) method. In this work we take a reference set of these tools and evaluate their performance using state-of-the-art performance evaluation libraries and tools, including PAPI, TAU, KOJAK/Scalasca, and Vampir. The goal is to develop an in-depth understanding of their performance on teraflop leadership platforms and, moreover, to identify possible bottlenecks and suggest how to remove them. The measurements are being done on ORNL's Cray XT4 system (Jaguar), based on quad-core 2.1 GHz AMD Opteron processors. Profiling is being used to identify possible performance bottlenecks, and tracing is being used to try to determine the exact locations and causes of those bottlenecks. The results so far indicate that the methodology can be used to easily produce and analyze performance data, and that this ability has the potential to aid our overall efforts to develop efficient quantum simulation tools for petascale systems.
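As a hedged illustration of the counter-based measurements that PAPI enables, the snippet below instruments a toy loop with PAPI's low-level C API; the event choice and the loop are purely illustrative and are not taken from the paper.

```cpp
#include <cstdio>
#include <papi.h>   // link with -lpapi

int main()
{
    // Initialize the PAPI library and create an event set with one preset event.
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    int eventset = PAPI_NULL;
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_INS);   // total instructions; event chosen for illustration

    long long count = 0;
    PAPI_start(eventset);

    double sum = 0.0;                         // toy kernel to be measured
    for (int i = 0; i < 1000000; ++i) sum += 1.0 / (i + 1);

    PAPI_stop(eventset, &count);
    std::printf("instructions retired: %lld (sum=%f)\n", count, sum);
    return 0;
}
```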
Andrew Uselton, National Energy Research Scientific Computing Center (NERSC), Deploying Server-side File System Monitoring at NERSC NERSC has deployed the Lustre Monitoring Tool (LMT) on the "Franklin" Cray XT, and several months of data are now on file. The data may be used in real time for monitoring system operations and system testing, as well as for incident investigation and historical review. This paper introduces LMT and then presents several examples of insights gained through the use of LMT, as well as anomalous behavior that would otherwise have gone unrecognized.
Oreste Villa, Daniel Chavarria-Miranda, Vidhya Gurumoorthi, Andres Marquez, and Sriram Krishnamoorthy, Pacific Northwest National Laboratory (PNNL), Effects of Floating-point Non-associativity on Numerical Computations on Massively Multithreaded Systems Floating-point addition and multiplication are not necessarily associative. When performing those operations over large numbers of operands with different magnitudes, the order in which individual operations are performed can affect the final result. On massively multithreaded systems, when performing parallel reductions, the non-deterministic nature of numerical operation interleaving can lead to non-deterministic numerical results. We have investigated the effect of this problem on the convergence of a conjugate gradient calculation used as part of a power grid analysis application.
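A two-line example makes the effect concrete: the same three values summed in two different groupings give two different answers, which is exactly why a non-deterministic reduction order on a massively multithreaded machine can change numerical results from run to run.

```cpp
#include <cstdio>

int main()
{
    double a = 1.0e20, b = -1.0e20, c = 1.0;

    double left  = (a + b) + c;   // exact cancellation first: result is 1.0
    double right = a + (b + c);   // c is absorbed by the huge term: result is 0.0

    std::printf("(a + b) + c = %g\n", left);
    std::printf("a + (b + c) = %g\n", right);
    return 0;
}
```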
Joni Virtanen, CSC - IT Center for Science Ltd. (CSC), Chair, System Support SIG The System Support SIG welcomes attendees interested in discussing issues around managing and supporting XT systems of all scales. SIG business will be conducted followed by open discussions with other attendees as well as representatives from Cray. All attendees are welcome to participate in this meeting.
Michael D. Vose, National Institute for Computational Sciences (NICS), Simulating Population Genetics on the XT5 We describe our experience developing custom C code for simulating evolution and speciation dynamics using Kraken, the Cray XT5 system at the National Institute for Computational Sciences. The underlying quadratic complexity of the problem posed difficulties, and the numerical instabilities we faced would either compromise or severely complicate large-population simulations. We present lessons learned from the computational challenges encountered and describe how we have dealt with them within the constraints imposed by the hardware.
Troy Baer, Victor Hazlewood, Junseong Heo, Rick Mohr, and John Walsh, University of Tennessee, National Institute for Computational Sciences (NICS), Large Lustre File System Experiences at NICS The National Institute for Computational Sciences (NICS), located at Oak Ridge National Laboratory, installed a 66,000-core Cray XT5 in December 2008. A 2.3 PB (usable) Lustre file system was also configured as part of the XT5 system. NICS would like to present its experiences in configuring, cabling, building, tuning, and administering such a large file system. Topics to be discussed include MDS size and speed implications, purge policies, performance tuning, configuration issues, determining the default stripe size and block size, and reliability issues.
Michele Weiland, HPCX Consortium (HPCX) and Thom Haddow, Imperial College London, Performance Evaluation of Chapel's Task Parallel Features Chapel, Cray's new parallel programming language, is specifically designed to provide built-in support for high-level task and data parallelism. This paper investigates the performance of the task parallel features offered by Chapel, using benchmarks such as N-Queens and Strassen's algorithm, on a range of different architectures, including a multi-core Linux system, an SMP cluster and an MPP. We also give a user's view on Chapel's achievements with regard to the goals set by the HPCS programme, namely programmability, robustness, portability and productivity.
Nathan Wichman, Cray Inc., Early Experience Using the Cray Compiling Environment In 2008, Cray released its first compiler targeted at the x86 instruction set, and over the last several months code developers have begun to test its capabilities. This paper briefly reviews the history of the Cray compiler, how to use it, and its current capabilities. We then present performance numbers for both standard benchmarks and real applications.
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, and Katherine Yelick, National Energy Research Scientific Computing Center (NERSC), Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4 We apply auto-tuning to a hybrid MPI-pthreads lattice-Boltzmann computation running on the Cray XT4 at the National Energy Research Scientific Computing Center (NERSC) and the XT5 at Oak Ridge National Laboratory. Previous work showed that multicore-specific auto-tuning can significantly improve the performance of lattice-Boltzmann magnetohydrodynamics (LBMHD), by a factor of 4 when running on dual- and quad-core Opteron dual-socket SMPs. We extend these studies to the distributed-memory arena via a hybrid MPI/pthreads implementation. In addition to conventional auto-tuning at the local SMP node, we tune at the message-passing level to determine the optimal aspect ratio as well as the correct balance between MPI tasks and threads per MPI task. Our study presents a detailed performance analysis when moving along isocurves of constant hardware usage: fixed total memory, total cores, and total nodes. Overall, our work points to approaches for improving intra- and inter-node efficiency on large-scale multicore systems for demanding scientific applications.
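The structure of such a hybrid code can be sketched as follows. This is an illustrative skeleton, not the LBMHD implementation, and it uses std::thread for brevity where the paper uses Pthreads; the threads-per-task count is one of the parameters an auto-tuner would search over.

```cpp
#include <mpi.h>
#include <cstdio>
#include <thread>
#include <vector>

const int kThreadsPerTask = 4;   // example tuning parameter: threads per MPI task

void work(int tid)
{
    (void)tid;  // each thread would update its slice of the node's sub-domain here
}

int main(int argc, char **argv)
{
    int provided = 0, rank = 0;
    // Only the main thread calls MPI, so FUNNELED support is sufficient.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<std::thread> pool;
    for (int i = 0; i < kThreadsPerTask; ++i) pool.emplace_back(work, i);
    for (auto &t : pool) t.join();

    // Ghost-zone exchange with neighboring MPI tasks would follow here,
    // issued from the main thread only.
    if (rank == 0) std::printf("ran with %d threads per MPI task\n", kThreadsPerTask);

    MPI_Finalize();
    return 0;
}
```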
Patrick Worley, Oak Ridge National Laboratory (ORNL), Early Evaluation of the Cray XT5 We present preliminary performance data for the Cray XT5, comparing it with data from the Cray XT4 and the IBM BG/P. The focus is on single-node computational benchmarks, basic MPI performance both within and between nodes, and the impact of topology and contention on MPI performance. Example application performance, at scale, is used to illuminate how subsystem performance affects whole-system performance.
Xingfu Wu and Valerie Taylor, Texas A&M University, Using Processor Partitioning to Evaluate the Performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems Chip multiprocessors (CMPs) are widely used for high-performance computing. While this presents significant new opportunities, such as high on-chip inter-core bandwidth and low inter-core latency, it also presents new challenges in the form of inter-core resource conflict and contention. A major challenge to be addressed is how well current parallel programming paradigms, such as MPI, OpenMP and hybrid, exploit the potential offered by such CMP clusters for scientific applications. In this paper, we use the term processor partitioning to refer to how many cores per node are used for application execution, and we analyze and compare the performance of MPI, OpenMP and hybrid parallel applications on two Cray XT4 systems: quad-core Jaguar at Oak Ridge National Laboratory (ORNL) and dual-core Franklin at the DOE National Energy Research Scientific Computing Center (NERSC). We conduct detailed performance experiments to identify the major application characteristics that affect processor partitioning. The experimental results indicate that processor partitioning can have a significant impact on the performance of a parallel scientific application, as determined by its communication and memory requirements. We also use the STREAM memory benchmarks and Intel's MPI Benchmarks to explore the performance impact of different application characteristics. The results are then used to explain the processor-partitioning results obtained with the multi-zone NAS Parallel Benchmarks. In addition to these benchmarks, we use a flagship SciDAC fusion microturbulence code, the 3D particle-in-cell Gyrokinetic Toroidal Code (GTC), written in hybrid MPI/OpenMP, to analyze and compare the performance of MPI, OpenMP and hybrid programs on the dual- and quad-core Cray XT4 systems and to study their scalability on up to 8,192 cores. Based on the performance of GTC on up to 8,192 cores, we use the Prophesy system to generate its performance models online and predict its performance on more than 10,000 cores on the two Cray XT4 systems.
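A minimal hybrid MPI+OpenMP skeleton (illustrative, not the GTC code) shows why processor partitioning is a launch-time decision: the same source runs under any split of a node's cores into MPI tasks per node and OpenMP threads per task.

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv)
{
    int rank = 0, ntasks = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    // The tasks-per-node and threads-per-task split is chosen by the job
    // launcher and environment, not in this source code.
    #pragma omp parallel
    {
        #pragma omp master
        if (rank == 0)
            std::printf("%d MPI tasks x %d OpenMP threads per task\n",
                        ntasks, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```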
Zhengji Zhao, National Energy Research Scientific Computing Center (NERSC) and Lin-Wang Wang, Lawrence Berkeley National Laboratory, Applications of the LS3DF Method in CdSe/CdS Core/Shell Nano Structures The Linear Scaling 3-Dimensional Fragment (LS3DF) code is an O(N) ab initio electronic structure code for large-scale nano material simulations. The main idea of the code is a divide-and-conquer method, and the heart of the method is a novel patching scheme that effectively cancels out the artificial boundary effects that exist in all divide-and-conquer schemes. This method has made ab initio simulations of thousand-atom nano systems tractable in terms of simulation time, while yielding essentially the same results as traditional calculation methods. The LS3DF method won the Gordon Bell Prize at SC 2008 for its algorithmic achievement. We have applied this method to study the internal electric field in the CdS/CdSe core/shell nano structure, which has potential applications in electronic devices and energy conversion.