
Supporting Many Task Workloads on Frontier using PMIx and PRRTE

Authors: Wael Elwasif (Oak Ridge National Laboratory), Thomas Naughton (Oak Ridge National Laboratory)

Abstract: Large-scale many-task ensembles are increasingly used as the basic building block for scientific applications running on leadership-class platforms. Workflow engines are used to coordinate the execution of such ensembles, and they rely on lower-level system software to manage the lifetime of processes. PMIx (Process Management Interface for Exascale) is a standard for interaction with system resource and task management software. The OpenPMIx reference implementation provides a useful basis for workflow engines running on large-scale HPC systems.

In this paper we present early experience using PRRTE/PMIx to manage the execution of many-task ensemble workloads on HPE Cray EX systems, namely the early-access system Crusher at OLCF as well as the Frontier exascale system. We outline important considerations for achieving performance on this platform and highlight this alternative approach for user-driven task sub-scheduling (i.e., task scheduling within an existing job allocation). We report results from experiments run on Crusher and Frontier based on a synthetic many-task workload.
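To make the sub-scheduling idea concrete, the following is a minimal sketch (not taken from the paper) of how a driver script might use PRRTE's persistent distributed virtual machine (DVM) to launch many small tasks inside an existing batch allocation. Only the standard PRRTE tools prte, prun, and pterm are assumed to be available on PATH; the Python wrapper, the task count, and the "./my_task" binary are illustrative placeholders.

    # Hedged sketch: user-driven task sub-scheduling with PRRTE inside an
    # existing job allocation. Assumes PRRTE's prte/prun/pterm are on PATH;
    # "./my_task" stands in for an ensemble member executable.
    import subprocess

    # Start a persistent PRRTE DVM spanning the nodes of the current allocation.
    subprocess.run(["prte", "--daemonize"], check=True)

    # Submit independent single-process tasks to the running DVM; PRRTE handles
    # placement, so the system scheduler only sees the single enclosing job.
    procs = [subprocess.Popen(["prun", "-n", "1", "./my_task", str(i)])
             for i in range(64)]
    for p in procs:
        p.wait()

    # Shut down the DVM once the ensemble completes.
    subprocess.run(["pterm"], check=True)

A workflow engine would play the role of this driver, keeping the DVM alive for the duration of the allocation and streaming tasks to it rather than submitting them individually to the batch scheduler.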






