Troy Baer
troy@osc.edu
http://www.osc.edu/~troy/
The 64-bit Intel Itanium processor is an ambitious new entry into the arena of high performance computing, and the Ohio Supercomputer Center (OSC) has been at the forefront of HPC centers investigating and adopting this new technology. OSC installed the first cluster of production Itanium systems, based on the SGI 750 workstation, in July 2001. This system quickly became popular with users and is one of the most heavily used systems at OSC.
While used primarily as a production computing resource, OSC's Itanium cluster also serves as a research testbed for high performance computing technology. One area of research on this system is parallel input/output techniques. This is an area of active research in the Linux cluster community; as cluster systems become larger and more capable, the need for high capacity, high performance storage increases. This paper describes efforts to benchmark the I/O performance of the Itanium cluster with respect to locally attached storage, shared storage using NFS, and shared storage using a parallel file system.
OSC's Itanium cluster consists of a total of seventy-three SGI 750 systems running Linux. Each system contains two Itanium processors running at 733 MHz with 2 megabytes of L3 cache, 4 gigabytes of main memory, an 18 gigabyte SCSI system disk, an onboard 100 megabit Ethernet interface, and PCI-based Myrinet 2000 and Gigabit Ethernet interfaces. One system is designated as a "front end" node and file server and is equipped with additional disks and Ethernet interfaces, while the other seventy-two systems act strictly as computational nodes, scheduled using PBS and the Maui scheduler. Each computational node has approximately 5.3 gigabytes of space available on its local /tmp file system.
The presence of three network interfaces on each system is not accidental; it reflects part of OSC's cluster design philosophy, that administration, message passing, and I/O should each have its own network when possible. The 100 megabit Ethernet network is used for administration and low performance (NFS) file traffic, while the Myrinet network is dedicated to MPI message passing traffic. The Gigabit Ethernet network has two primary uses: high performance file I/O traffic and research into implementing MPI at the network interface level [1]. This paper will focus on the high performance I/O application of the Gigabit Ethernet network.
OSC's Gigabit Ethernet infrastructure is based around an Extreme Networks Black Diamond 6816 switch. All of OSC's HPC systems have at least one Gigabit Ethernet interface connected to this switch. In the case of OSC's HPC cluster systems, the front end nodes have direct Gigabit Ethernet connections to the switch, while compute nodes have 100 megabit Ethernet connections to one of several private switches internal to the cluster, each of which has a Gigabit Ethernet uplink to the Black Diamond switch. The Gigabit Ethernet interfaces on the Itanium cluster nodes are also directly connected to the Black Diamond switch and configured to use jumbo (9000 byte) frames instead of standard Ethernet frames. A separate HIPPI network is also used by several of the HPC systems for file service.
The primary mass storage server attached to this network is an Origin 2000 system, nicknamed mss. This eight-processor system supports approximately one terabyte of Fibre Channel disk storage and an IBM 3494 tape robot containing roughly sixty terabytes of tape storage, managed by SGI's Data Migration Facility (DMF) software. The mss system is responsible for home directory and archival file service to OSC's various HPC systems, using NFS over the Gigabit Ethernet and HIPPI networks.
Also connected to the Gigabit Ethernet infrastructure is a sixteen node cluster dedicated to parallel file system research. Each node in this cluster contains two Intel Pentium III processors running at 933 MHz, a gigabyte of RAM, an onboard 100 megabit Ethernet interface, a PCI-based Gigabit Ethernet interface which supports jumbo frames, and a 3ware 7810 EIDE RAID controller driving eight eighty-gigabyte drives in a RAID 5 configuration. Using the PVFS software developed by Clemson University [2,3], the disk space on these systems is presented to client systems as a unified parallel file system with an aggregate capacity of 7.83 terabytes and a peak network transfer rate of 2 gigabytes per second. The PVFS file system can be accessed by parallel applications using the ROMIO implementation of the parallel I/O chapter of the MPI-2 specification [4] or by serial applications using a user-space VFS driver for the Linux kernel. Similar VFS drivers for other systems from vendors such as SGI and Cray are currently under investigation.
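To make the two access paths concrete, the short C sketch below contrasts a serial POSIX write through the kernel VFS mount with a parallel open through ROMIO's MPI-IO interface. It is illustrative only; the file names under /pvfs are assumptions and error checking is omitted.

    /* Sketch of the two PVFS access paths; file names are illustrative. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;

        MPI_Init(&argc, &argv);

        /* Serial path: ordinary system calls go through the user-space
         * PVFS VFS driver once /pvfs is mounted. */
        int fd = open("/pvfs/serial.dat", O_WRONLY | O_CREAT, 0644);
        if (fd >= 0) {
            write(fd, "hello", 5);
            close(fd);
        }

        /* Parallel path: ROMIO implements MPI-IO directly on top of the
         * PVFS client library, bypassing the kernel. */
        MPI_File_open(MPI_COMM_WORLD, "/pvfs/parallel.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }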
bonnie
bonnie [5] is a widely used UNIX file system benchmark. It tests performance using the POSIX low-level I/O functions read() and write(), as well as stream-oriented character I/O using getc() and putc(). As a result, bonnie is a good indicator of the performance characteristics of sequential applications.
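The following fragment is a minimal sketch, not bonnie's actual source, of the two styles of sequential output the benchmark times: per-character output with putc() and block output with write() using bonnie's 8 kB records. The file name and size are illustrative.

    /* Illustrative sketch of the I/O styles bonnie measures. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define CHUNK 8192                        /* bonnie's 8 kB record size */

    int main(void)
    {
        char buf[CHUNK];
        long i, size = 100L * 1024 * 1024;    /* illustrative file size */

        /* Per-character sequential output with putc() */
        FILE *f = fopen("testfile", "w");
        for (i = 0; i < size; i++)
            putc(i & 0x7f, f);
        fclose(f);

        /* Block sequential output with write() */
        memset(buf, 'a', CHUNK);
        int fd = open("testfile", O_WRONLY | O_TRUNC);
        for (i = 0; i < size; i += CHUNK)
            write(fd, buf, CHUNK);
        close(fd);
        unlink("testfile");
        return 0;
    }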
Unfortunately, bonnie also has two problems with respect to large, modern high performance systems. First, it uses a C int to store the size of the file it uses for testing, which limits it to a maximum file size of two gigabytes on systems on which an int is 32 bits, including Linux on the Itanium. Second, it uses an 8 kB buffer for read() and write() operations, which is acceptable for file systems that have block sizes smaller than 8 kB but disastrous for file systems like PVFS that have much larger block sizes. To compensate for these problems, the author developed a variation of the code called bigbonnie, which can handle files larger than two gigabytes and uses 64 kB buffers.
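The sketch below illustrates the two changes just described; it assumes a Linux/glibc environment and is not the actual bigbonnie source. A 64-bit off_t replaces the int file size, and 64 kB records replace the 8 kB records.

    /* Illustrative sketch of bigbonnie's changes: 64-bit sizes, 64 kB records. */
    #define _FILE_OFFSET_BITS 64              /* 64-bit off_t on 32-bit platforms */
    #include <sys/types.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define CHUNK (64 * 1024)                 /* 64 kB records instead of 8 kB */

    int main(void)
    {
        char buf[CHUNK];
        off_t i, size = (off_t)8192 * 1024 * 1024;   /* 8 GB, larger than RAM */

        memset(buf, 'a', CHUNK);
        int fd = open("bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        for (i = 0; i < size; i += CHUNK)
            write(fd, buf, CHUNK);
        close(fd);
        return 0;
    }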
Figure 1: bonnie Results

                -----------Sequential Output-----------  ----Sequential Input----  ----Random----
                -Per Char-   ---Block---   -Rewrite---   -Per Char-   ---Block---   ---Seeks-----
Filesys    MB   K/sec %CPU   K/sec  %CPU   K/sec  %CPU   K/sec %CPU   K/sec  %CPU   /sec     %CPU
/tmp     2000    6099 99.2  262477  99.9  327502  99.9    3091 99.9  552000  99.9   90181.5 193.7
/home    2000     147  3.7    1562   2.3     859   1.7    1909 62.7  542824  99.9     805.5   3.0
/pvfs    2000    3909 63.5     199   1.1     201   2.4    2816 92.9    6614  36.3     227.2  12.1
Figure 2: bigbonnie Results

                --------Sequential Output---------  ---Sequential Input---  ---Random---
                -Per Char-  ---Block--  -Rewrite--  -Per Char-  ---Block--  ---Seeks---
Filesys    MB   MB/s %CPU   MB/s %CPU   MB/s %CPU   MB/s %CPU   MB/s %CPU   /sec  %CPU
/tmp     4096    5.9 99.6   24.6 10.1   17.2  7.1    3.9 96.6   34.2  9.7   452.9  6.8
/home    8192    0.1  3.2    1.5  2.0    0.8  1.4    1.7 42.3    3.0  1.8    74.6  3.5
/pvfs    8192    3.8 63.3   11.0  7.3   16.0 25.7    3.4 91.2   32.8 25.6   583.3 30.1
Figures 1 and 2 show the results of bonnie and bigbonnie when used on three file systems: /tmp on a node's local disk, /home mounted via NFS from the mass storage server, and /pvfs mounted using the Linux user-space PVFS driver from the parallel file system cluster. The results with the original bonnie are problematic, because the two gigabyte file size limit means that the largest file which the program can create is smaller than the main memory available on a node. This is why the original bonnie shows bandwidths in the hundreds of megabytes per second to /tmp; the file operations are being cached entirely in main memory, so the bandwidths observed are more in line with the performance of memory than of the SCSI disk where /tmp physically resides. With the larger file sizes available to bigbonnie, these cache effects are greatly reduced.
Another noteworthy difference between the bonnie and bigbonnie results is the much higher performance of bigbonnie for block writes and rewrites, especially with respect to the PVFS file system. This is caused by bigbonnie's use of 64 kB records; the original bonnie uses 8 kB records for block reads and writes, and this size turns out to be a pathological case for PVFS.
Given the poor performance of NFS demonstrated here and the limited size of the /tmp file systems on the SGI 750 compute nodes, all further benchmarks were run only against the PVFS file system.
perf
The ROMIO source code includes an example MPI-IO program called perf. In this program, each MPI process has a fixed-size data array, four megabytes by default, which is written using MPI_File_write() and read using MPI_File_read() at a fixed location equal to the process' rank times the size of the data array. There are two variations of each read and write test: one in which the time for a call to MPI_File_sync() is included in the timings (before read operations and after write operations), and one in which it is not. These tests provide an upper bound on the MPI-IO performance that can be expected from a given computer and file system.
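The following C sketch outlines the access pattern described above, with an explicit sync after the write. The file name and the per-process timing code are illustrative rather than taken from the perf source.

    /* Sketch of the perf access pattern: each rank writes its array at
     * offset rank * array size, optionally followed by MPI_File_sync(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define ARRAY_BYTES (4 * 1024 * 1024)     /* 4 MB per process by default */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf = malloc(ARRAY_BYTES);
        MPI_File fh;
        MPI_Offset offset;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        offset = (MPI_Offset)rank * ARRAY_BYTES;

        MPI_File_open(MPI_COMM_WORLD, "/pvfs/perf.out",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        t0 = MPI_Wtime();
        MPI_File_seek(fh, offset, MPI_SEEK_SET);
        MPI_File_write(fh, buf, ARRAY_BYTES, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_sync(fh);                    /* included only in the "with sync" case */
        t1 = MPI_Wtime();

        printf("rank %d: %.1f MB/s\n", rank, ARRAY_BYTES / (t1 - t0) / 1.0e6);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }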
Figure 3: perf Read with Sync Performance
Figure 4: perf Write with Sync Performance
Figure 5: perf Write without Sync Performance
Figures 3 through 5 show the aggregate bandwidth reported by perf using array sizes from 2 MB to 1 GB for the cases of reads preceded by MPI_File_sync(), writes followed by MPI_File_sync(), and writes without a sync, respectively. The read and write-without-sync cases both show peak bandwidths of approximately 1.5 GB/s, while the write-with-sync case peaks at 870 MB/s. All three cases show their best performance with very large array sizes; in the case of write-with-sync, array sizes of 32 MB and below rarely exceed 100 to 150 MB/s, even for large process counts, but array sizes of 64 MB and above scale favorably. All of the curves shown also exhibit drops in performance at process counts of 24 and 48; this is thought to be caused by saturation of the Gigabit Ethernet network. The Itanium compute nodes are connected to 12-port Gigabit Ethernet line cards in the Extreme Networks 6816, and these line cards are known to have a maximum bandwidth to the switch backplane of only 4 Gbit/s, whereas the 8-port line cards to which the I/O nodes are connected have a maximum bandwidth to the backplane of 8 Gbit/s. In cases where the 12-port line cards are heavily utilized, performance is somewhat degraded.
btio
The btio benchmarks are variations on the well-known bt application benchmark from the NAS Parallel Benchmark suite, developed at NASA Ames Research Center [6]. The base bt application solves systems of block-tridiagonal equations in parallel; the btio benchmarks add several different methods of doing periodic solution checkpointing in parallel, including Fortran direct unformatted I/O, MPI-IO using several calls to MPI_File_write_at() (known as the "simple" MPI-IO version), and MPI-IO using an MPI data type and a single call to MPI_File_write_at() (the "full" MPI-IO version).
Figure 6: btio Performance
Figure 6 shows the performance observed using class A (62³ volume) and class B (102³ volume) configurations of the full MPI-IO variant of btio, which has been instrumented to measure I/O bandwidth. The class A configurations averaged about 40 MB/s, while the class B configurations averaged about 100 MB/s. Class C (162³ volume) configurations would not run due to a bug in the ROMIO code with the version of MPICH/ch_gm in use on the system. However, both class A and class B configurations show significant drops in performance at odd process counts, with the worst being observed at 81 processes.
ASCI Flash, developed at the University of Chicago and used in the U.S. Department of Energy Accelerated Strategic Computing Initiative (ASCI) program [7], is a parallel application that simulates astrophysical thermonuclear flashes. It uses the MPI-IO parallel interface to the HDF5 data storage library to store its output data, which consists of checkpoints as well as two types of plottable data. The I/O portion of this application is sufficiently large that the developers have split it out into a separate code, known as the ASCI Flash I/O benchmark, for testing and tuning purposes [8]. The ASCI Flash I/O benchmark has been run on a number of platforms, including several of the ASCI teraflop-scale systems as well as the Chiba City cluster at Argonne National Laboratory [9], which, like OSC's Itanium cluster, uses PVFS for its parallel file system.
Figure 7 shows the performance of the ASCI Flash I/O benchmark on OSC's Itanium cluster. As seen in Argonne's studies with Flash I/O to PVFS on Chiba City, the total I/O time is dominated by data type processing. Flash I/O stores all of the data for a given point together in its output files, while in memory it stores adjacent values of the same variable together; the overhead associated with rearranging data from its layout in memory to its layout in the files has been observed to be quite large. The results seen on OSC's Itanium cluster are quite competitive with the ASCI systems; only the IBM SP system "Frost" at Lawrence Livermore National Laboratory demonstrates higher performance, and then only at process counts of 256 and above [8].
The Science and Technology Support group at OSC has developed a series of Fortran codes that solve Laplace's equation on a two-dimensional grid of fixed size. These codes are used primarily as examples in teaching concepts of application performance tuning and parallel computing approaches, such as vectorization [10], OpenMP, MPI, and hybrid OpenMP/MPI [11]. A natural extension of an MPI version of these codes was to add MPI-IO calls to write a checkpoint file, which was done in support of a workshop offered by OSC on parallel I/O techniques [12]. Several variations of this code exist; the best performing of them uses a call to MPI_File_write_all() after seeking to separate regions of the file to accomplish its output.
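The following C sketch shows the seek-then-collective-write pattern used by the best performing variant; the actual codes are Fortran, and the function name and row-slab decomposition here are assumptions for illustration.

    /* Sketch of the checkpointing pattern: each process seeks to its own
     * region of the file, then the local slab is written collectively. */
    #include <mpi.h>

    void write_checkpoint(const char *fname, double *slab,
                          int local_rows, int ncols, MPI_Comm comm)
    {
        int rank;
        MPI_File fh;
        MPI_Offset my_offset;

        MPI_Comm_rank(comm, &rank);

        /* Assume the fixed-size grid is decomposed into equal row slabs, so
         * each process' region starts at rank * local_rows * ncols doubles. */
        my_offset = (MPI_Offset)rank * local_rows * ncols * sizeof(double);

        MPI_File_open(comm, (char *)fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_seek(fh, my_offset, MPI_SEEK_SET);
        MPI_File_write_all(fh, slab, local_rows * ncols, MPI_DOUBLE,
                           MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }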
Figure 8 shows the I/O performance of the MPI Laplace solver application on an 8001x8001 grid (approximately 0.5 GB) for two cases, one with collective buffering and one without. Collective buffering is a parallel I/O optimization in which the data being written is reorganized into contiguous segments before it is sent to the I/O subsystem; in this case, however, the data being written is already contiguous, so collective buffering is unnecessary overhead. The bandwidth exhibited by this application does not scale well beyond 16 processors because, with the grid size fixed, the amount of data written by an individual process drops as the process count increases. As a result, the performance moves from following the 512 MB curve in Figure 5 (in the case of 1-2 processes) to following the 8 MB curve from the same figure (in the case of 64 processes).
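Whether ROMIO applies collective buffering can be controlled through MPI-IO hints. The sketch below, with an illustrative function name, shows collective buffering for writes being disabled via the standard ROMIO hint romio_cb_write.

    /* Disable ROMIO collective buffering for writes via an MPI_Info hint. */
    #include <mpi.h>

    MPI_File open_without_cb(MPI_Comm comm, const char *fname)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Info_create(&info);
        /* "disable" turns collective buffering off for writes; the other
         * accepted values are "enable" and "automatic". */
        MPI_Info_set(info, "romio_cb_write", "disable");

        MPI_File_open(comm, (char *)fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }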
These benchmarks demonstrate that PVFS on Itanium-based systems can perform very well for applications using MPI-IO, and roughly the same as a local disk for applications using conventional POSIX I/O when writing large blocks of data. Unfortunately, while it is a good candidate for high performance temporary storage, PVFS is not necessarily a good general purpose file system. Its RAID-0-like nature means that the catastrophic failure of an I/O node can destroy the file system; this can be mitigated to some extent by having on-site spares, and work is ongoing to add RAID-10-like behavior at both the file system and file level. Also, metadata operations such as fstat() on PVFS are very expensive, because the PVFS metadata manager daemon must communicate with each I/O daemon.
OSC has further plans for investigating parallel I/O on Linux clusters on several fronts. First among these is to connect the parallel file system to the center's Athlon-based IA32 cluster, using either 100 megabit Ethernet or possibly Myrinet 2000 as the I/O network. In addition, several more parallel I/O benchmarks should be run against the parallel file system, such as the effective I/O bandwidth benchmark b_eff_io [13] and the LLNL Scalable I/O Project's ior_mpiio [14]. Finally, the effect of collective buffering on the "full" version of the NAS btio benchmark needs to be investigated, as collective buffering could improve its performance significantly.
References
[5] http://www.textuality.com/bonnie/, 2000.
[6] http://parallel.nas.nasa.gov/MPI-IO/btio, NASA Ames Research Center, 1996.
[7] http://flash.uchicago.edu/info/info.html, University of Chicago, 2001.
[8] http://flash.uchicago.edu/~zingale/flash_benchmark_io/, University of Chicago, 2001.
[10] http://oscinfo.osc.edu/training/perftunvec/, OSC, 2001.
[11] http://oscinfo.osc.edu/training/multi/, OSC, 2000.
[12] http://oscinfo.osc.edu/training/pario/, OSC, Columbus, OH, 2002.
[14] http://www.llnl.gov/asci/purple/benchmarks/limited/ior/ior.mpiio.readme.html, Lawrence Livermore National Laboratory, Livermore, CA, 2001.
Troy Baer received his M.S. in aerospace engineering in 1998. Since then, he has worked in the Science and Technology Support group at the Ohio Supercomputer Center (OSC), where he has developed numerous workshop courses on topics related to high performance computing. He has also been involved in OSC's cluster development group. His research interests include parallel I/O, scheduling for cluster systems, and performance analysis tools.