CUG Logo

Papers

Deploying Cloud-Native HPC Clusters on HPE Cray EX

Authors: Felipe A. Cruz (Swiss National Supercomputing Centre), Alejandro J. Dabin (Swiss National Supercomputing Centre), Manuel Sopena Ballesteros (Swiss National Supercomputing Centre)

Abstract: The software stack that manages a High-Performance Computing (HPC) cluster is a collection of applications and services put together by multiple engineers. Integrating all the software components is often complex and challenging. Therefore, the engineering effort frequently focuses on minimizing service disruption over the value delivery of new features. In this work, we introduce a cloud-native architecture for delivering an HPC cluster on top of HPE Cray EX that streamlines the development, operation, maintenance, and administration of the many services that compose an HPC cluster. Under a cloud-native approach, an HPC cluster is architected as a collection of small, loosely coupled services that can be independently delivered. Moreover, we leverage an on-prem cloud platform deployment that enables a self-service model for engineers to introduce controlled changes to the cluster while streamlining service and infrastructure automation. The presented cloud-native architecture is a starting point for delivering HPC clusters that are more resilient and scalable to operate.

Long Description: The software stack that manages a High-Performance Computing (HPC) cluster is a collection of applications and services put together by multiple engineers. Integrating all the software components is often complex and challenging. Therefore, the engineering effort frequently focuses on minimizing service disruption over the value delivery of new features. In this work, we introduce a cloud-native architecture for delivering an HPC cluster on top of HPE Cray EX that streamlines the development, operation, maintenance, and administration of the many services that compose an HPC cluster. Under a cloud-native approach, an HPC cluster is architected as a collection of small, loosely coupled services that can be independently delivered. Moreover, we leverage an on-prem cloud platform deployment that enables a self-service model for engineers to introduce controlled changes to the cluster while streamlining service and infrastructure automation. The presented cloud-native architecture is a starting point for delivering HPC clusters that are more resilient and scalable to operate.

Paper: PDF



Back to Papers Archive Listing