Authors: Michael Moore (Hewlett Packard Enterprise), Ashwin Reghunandanan (Hewlett Packard Enterprise), Lisa Gerhardt (Lawrence Berkeley National Laboratory)
Abstract: HPC I/O workloads using shared file access on distributed file systems such as Lustre have historically achieved lower performance than an equivalent file-per-process workload. Optimizations at different levels of the application and file system stacks have alleviated many of the performance limitations for disk-based Lustre storage targets (OSTs). While many of the shared file optimizations in Lustre and MPI-IO provide performance benefits on NVMe-based OSTs, the existing optimizations do not allow full utilization of the high-throughput and random-access performance characteristics of NVMe OSTs on existing systems. A new optimization in HPE Cray MPI, part of the HPE Cray Programming Environment, builds on existing shared file optimizations and the performance characteristics of NVMe-backed OSTs to improve shared file write performance for those targets. This paper discusses the motivation and implementation of that new shared file write optimization, MPI-IO Local Aggregation as Collective Buffering, for NVMe-based Lustre OSTs such as those in the HPE Cray ClusterStor E1000 storage system. It also describes how to evaluate application MPI-IO collective operation performance using the MPI-IO statistics reported by HPE Cray MPI. Finally, results of benchmarks using the new collective MPI-IO write optimization are presented.
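The sketch below is not taken from the paper; it is a minimal example of the access pattern the optimization targets: a shared-file collective write issued through MPI-IO, where each rank writes a contiguous block at a rank-specific offset. The `romio_cb_write` hint shown is the standard ROMIO collective-buffering hint; any HPE Cray MPI specific hints or tunables tied to the Local Aggregation feature are not shown, since their exact names would be assumptions. When run under HPE Cray MPI, per-file collective I/O statistics can reportedly be enabled at run time via the `MPICH_MPIIO_STATS` environment variable, which is how the paper's evaluation workflow is described.

```c
/*
 * Minimal sketch of a shared-file collective write (not from the paper).
 * Each rank writes one contiguous 4 MiB block at its own offset using
 * MPI_File_write_at_all, the collective path that collective buffering
 * (and the new local-aggregation optimization) operates on.
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const MPI_Offset block = 4 * 1024 * 1024;   /* 4 MiB per rank   */
    char *buf = malloc(block);                  /* payload to write */

    /* Optional hint: request collective buffering (standard ROMIO hint). */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_file.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Collective write: all ranks participate, each at its own offset. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, (int)block,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

A run such as `MPICH_MPIIO_STATS=1 srun -n 64 ./shared_write` (launcher and rank count are illustrative) would exercise the collective write path and, on HPE Cray MPI, emit the MPI-IO statistics the paper uses to assess collective operation performance.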