CUG2020 Proceedings

Monday, November 9th

7:00am-8:00am

New Special Paper Session

Main

I/O Performance Characterization and Prediction through Machine Learning on HPC Systems

I/O Performance Characterization and Prediction through Machine Learning on HPC Systems

Lipeng Wan, Matthew Wolf, Feiyi Wang, Jong Youl Choi, George Ostrouchov, Jieyang Chen, Norbert Podhorszki, Jeremy Logan, Kshitij Mehta, Scott Klasky, and Dave Pugmire (Oak Ridge National Laboratory)

Abstract

When scientists run their applications on high-performance computing (HPC) systems, they often experience highly variable runtime I/O performance, and sometimes unexpected I/O performance degradations can dramatically slow down the applications' execution. This issue is mainly caused by I/O bandwidth contention, since the storage subsystem of HPC systems is usually shared by many concurrently running applications and the I/O performance of each application might be affected by I/O traffic from others. In order to mitigate the I/O bandwidth contention, scientific applications running on HPC systems need to schedule their I/O operations in a proactive and intelligent manner, which necessitates the capability of predicting the near-future runtime I/O performance. However, the runtime I/O performance prediction in production HPC environments is extremely challenging, as the storage subsystems are complex and the I/O operations of those running applications might have irregular patterns.

In this paper, we formulate the I/O performance prediction on production HPC systems as a classification problem and exploit a range of machine learning techniques to address it. Our I/O prediction model is lightweight and its effectiveness is validated using real performance traces collected from two Cray supercomputer systems. Our results show that transitions between different I/O performance states can be predicted by our model with high average accuracy (75% when SVM is used). Moreover, the prediction is robust even when only limited training data is available. We also study and demonstrate how to leverage the prediction results to improve the efficiency of I/O bandwidth utilization on these two Cray supercomputer systems.

pdf

Performance and Power Modeling and Prediction Using MuMMI and Ten Machine Learning Methods

Unified Model Global Weather Forecast Performance on HPE Cray EX

Utilizing Ensemble Learning for Performance and Power Modeling and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks

HPE Cray Supercomputers: System User Access; User Access Node or User Access Instance, Which is Right for Me?

Early User Experience on and Lessons Learned from the NERSC Cori GPU Cluster

A scheduling policy to improve 10% of communication time in parallel FFT

Deriving Workload Expectations: Monitoring and Analysis Using HPC Job Profiles

Towards Acceptance Testing at the Exascale Frontier

Advanced Topics in Configuration Management

Enabling Power Measurement and Control on Astra: The First Petascale Arm Supercomputer

Not All Applications Have Boring Communication Patterns: Profiling Message Matching with BMM

Special Paper