Authors: Jim Brandt (Sandia National Laboratories), Chris Morrone (Lawrence Livermore National Laboratory), Eric Roman (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Ann Gentile (Sandia National Laboratories), Tom Tucker (Open Grid Computing), Jeff Hanson (HPE), Kathleen Shoga (Lawrence Livermore National Laboratory), Alec Scott (Lawrence Livermore National Laboratory)
Abstract: Over the past decade, fine-grained monitoring of our HPC systems has given us new insights into application resource utilization and let us detect and diagnose problems with lower latency, all while incurring no statistically significant performance penalty.
The Cray EX community is exploring a variety of tools for telemetry data acquisition, with two major monitoring directions: a) the Cray implementation of monitoring using HPCM or CSM and b) customer-designed or customer-specified system software, including monitoring. Both include the Lightweight Distributed Metric Service (LDMS) for high-fidelity, high-volume node-level data collection, as well as for other features such as dynamically modifiable data collection rates and integration of both synchronous and event-driven data. LDMS is Linux-distribution agnostic and is used across a variety of OSs in both bare-metal and containerized environments.
In this collaboration between HPE and user sites, we explore these two approaches on early-availability platforms at NERSC and LLNL. We seek to ensure that LDMS directions continue to support the intended diversity of approaches and that user contributions to directions and code continue to serve the greater community. Further, we seek to educate sites on configuration, deployment features, and scalability requirements for extreme-scale systems and run-time analytics.