Authors: Matthew A. Ezell (Oak Ridge National Laboratory)
Abstract: The HPE Cray EX235a compute blade that powers Frontier packs significant computational power into a small form factor. These complex nodes contain one CPU, four AMD GPUs (which present as eight devices), four Slingshot NICs, and two NVMe devices. During Frontier's bring-up, as HPE and ORNL staff observed issues on nodes, they developed health checks to automatically detect each problem. A simple bash script called checknode collects these tests in one central location to ensure that each component in the node is working according to its specifications.
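As an illustration of the kind of test checknode performs, the fragment below verifies that the expected number of GPU, NIC, and NVMe devices is visible to the operating system. This is a minimal sketch rather than the actual ORNL script; the device paths, the hsn* interface naming, and the counting method are assumptions about how such checks could be written for an EX235a node.

    #!/bin/bash
    # Illustrative checknode-style device-count checks (assumed paths and counts).
    rc=0
    fail() { echo "checknode: $1" >&2; rc=1; }

    # 4 MI250X packages present as 8 GPU devices (assumed visible as DRM render nodes)
    [ "$(ls /dev/dri/renderD* 2>/dev/null | wc -l)" -eq 8 ] || fail "unexpected GPU device count"

    # 4 Slingshot NICs (assumed to appear as hsn0-hsn3 network interfaces)
    [ "$(ls -d /sys/class/net/hsn* 2>/dev/null | wc -l)" -eq 4 ] || fail "unexpected Slingshot NIC count"

    # 2 NVMe drives, one namespace each (assumed device naming)
    [ "$(ls /dev/nvme*n1 2>/dev/null | wc -l)" -eq 2 ] || fail "unexpected NVMe device count"

    exit $rc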
ORNL developed procedures that ensure checknode is run before nodes are made available to the workload manager. The full checknode script runs at boot before Slurm starts, and a reduced set of tests runs during the epilog of every Slurm job. Errors detected by checknode cause the node to be marked “drain” in Slurm, with the error message stored in the Slurm “reason” field. After a healthy run, checknode can automatically undrain/resume a node as long as the “reason” was set by checknode itself. A hedged sketch of this drain/resume logic follows.
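In the sketch below, the checknode install path, the "checknode:" reason prefix, and the way the current reason is read back are assumptions; the scontrol state transitions follow the workflow described above.

    #!/bin/bash
    # Sketch of the node state management described above (not ORNL's actual code).
    node=$(hostname)

    if ! msg=$(/opt/checknode 2>&1); then          # hypothetical install path
        # Drain the node and record the failure, prefixed so a later healthy
        # run can recognize that checknode itself set the reason.
        scontrol update NodeName="$node" State=DRAIN \
            Reason="checknode: $(echo "$msg" | head -n1)"
    else
        # Resume only if the existing drain reason was set by checknode.
        reason=$(scontrol show node "$node" | grep -o 'Reason=.*')
        if [[ "$reason" == "Reason=checknode:"* ]]; then
            scontrol update NodeName="$node" State=RESUME
        fi
    fi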
This presentation will discuss some of the checks present in checknode as well as outline the node state management workflow.
Paper: PDF