HPC Services engineer Michael Jennings gave a talk on the "Node Health Check (NHC)" on February 24, 2016, at the Stanford Conference and Exascale Workshop 2016, sponsored by the HPC Advisory Council. NHC, developed by Jennings, provides the framework and implementation for a highly reliable, flexible, extensible node health check solution. It is now widely recommended by major HPC job scheduler vendors and is in use at many large HPC sites and research institutions.
In this follow-up to his 2014 presentation at the Stanford HPC Advisory Council Conference, Michael provided an update on the latest developments in the LBNL NHC project, described new features in the latest release, and gave a brief overview of the roadmap for future development.
About Michael Jennings
Michael has been a UNIX/Linux Systems Administrator and a C/Perl developer for 20 years and has authored or contributed to numerous open source software projects, including Eterm, Mezzanine, RPM, Warewulf, and TORQUE. Additionally, he co-founded the Caos Foundation, creators of CentOS, and has been lead developer on three separate Linux distributions. He currently works as a Senior HPC Systems Engineer for the High Performance Computing Services group at Lawrence Berkeley National Laboratory and is the primary author and maintainer of the LBNL Node Health Check (NHC) project. He has also served for two years as President of SPXXL, the extreme-scale HPC users group.