
Papers

Software-defined Multi-tenancy on HPE Cray EX Supercomputers

Authors: Richard Duckworth (HPE), Vinay Gavirangaswamy (HPE), David Gloe (HPE), Brad Klein (HPE)

Abstract: Sandia National Laboratories’ Red Storm system was designed to support “switching” hardware to isolate computation and data between data classification levels. This enabled Sandia and derivative system architectures to adapt investments in capability computing to evolving needs. Today, industry demand for multi-tenancy in modern converged HPC and AI platforms has not waned, but expectations around how the solution should be delivered have changed, as have the types of workloads being run. The industry is now strongly advocating for and investing in cloud-like platforms that treat multi-tenancy as a first-principles capability, align with modern DevOps management techniques, support resource elasticity, and enable customers to deliver their own IaaS, PaaS, and SaaS solutions. Enter HPE Cray Systems Management (CSM). CSM is a Kubernetes-based, turnkey, open-source, API-driven HPC systems software solution. Using CSM as a foundation, we have developed a software-defined multi-tenancy architecture anchored by a tenancy “controller hub” called the Tenant and Partition Management System (TAPMS). Through extant features in CSM, TAPMS inherits the availability, scale, resiliency, disaster recovery, and security properties of the platform. This paper presents TAPMS, the supporting architecture, and the resulting composable, declarative tenant configuration interfaces that TAPMS and the underlying Kubernetes Operator Pattern enable.
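
For readers unfamiliar with the Kubernetes Operator Pattern mentioned in the abstract, the core idea is a reconcile loop: a controller compares a declared desired state with the observed state and takes only the actions needed to close the gap. The following is a minimal, in-memory sketch of that idea in Python. It is not the TAPMS implementation or its API; `TenantSpec`, `Cluster`, `reconcile`, and the node names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TenantSpec:
    """Desired state for one tenant (hypothetical fields, not the TAPMS schema)."""
    name: str
    nodes: frozenset


@dataclass
class Cluster:
    """Observed state: which tenant currently owns each node."""
    owner_by_node: dict = field(default_factory=dict)


def reconcile(spec: TenantSpec, cluster: Cluster) -> list:
    """Drive the observed state toward the spec and return the actions taken."""
    actions = []
    current = {n for n, t in cluster.owner_by_node.items() if t == spec.name}

    # Release nodes the tenant holds but the spec no longer lists.
    for node in sorted(current - spec.nodes):
        del cluster.owner_by_node[node]
        actions.append(f"release {node}")

    # Claim nodes the spec lists that this tenant does not yet hold.
    for node in sorted(spec.nodes - current):
        owner = cluster.owner_by_node.get(node)
        if owner is not None:
            # The node belongs to another tenant: report it and leave it alone.
            actions.append(f"conflict {node} held by {owner}")
            continue
        cluster.owner_by_node[node] = spec.name
        actions.append(f"assign {node}")

    return actions
```

In this sketch, calling `reconcile` a second time with an unchanged spec returns an empty list, because the observed state already matches the declaration. That idempotence is what allows a controller to re-run the loop safely after any change or failure.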

Paper: PDF


