Clustrx Watch monitors HPC clusters

 

Clustrx Watch

Clustrx Watch is an innovative cluster-wide monitoring system capable of monitoring millions of indicator sources on a high-performance computing (HPC) cluster in near-real time, while scaling linearly. It features a hierarchical architecture of service data collection, aggregation, distribution, processing and logging that has been engineered to serve multi-petascale and future exascale systems.

The system goes beyond mere monitoring and forms a controlling backbone for an HPC cluster, providing sophisticated decision-making logic and integration with whatever management subsystems exist on the cluster. It can work either as part of the Clustrx operating system or separately, as a system application on top of various HPC installations. Clustrx Watch is implemented in Erlang/OTP and inherits Erlang’s best features of scalability and reliability.

Guarding the safety

The primary task of Clustrx Watch is to monitor the health of a whole HPC cluster and make automated decisions to save the hardware from physical damage due to any kind of failure. The system controls an extensive set of hardware and software resources that maintain power supply, cooling, and network infrastructure to ensure both the physical safety of the cluster’s devices and the successful completion of computing jobs. Self-diagnostics are included.

To allow their safety to be monitored, compute nodes run software agents. The agents are designed to minimise processor time consumption. Where SNMP (Simple Network Management Protocol) is supported, the agents are not required. The list of supported hardware can be extended without limit. The performance of operating systems on compute nodes is also monitored via various metrics.

Integrating heterogeneous data

Each monitoring checkpoint (an agent or an SNMP source) gives hundreds of indications per second on a single node. The data flow to the decision-making management nodes travels via a hierarchy of transit nodes, which do intermediate processing. The collected data are analysed and used to make critical decisions, such as the emergency shutdown of an overheated node, reboot, hibernation, etc. The processed monitoring data are stored and can be used for further statistical analysis.
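
The intermediate processing done by transit nodes can be pictured as a hierarchical reduction. The following is a minimal Python sketch of that idea only; it is not Clustrx Watch code (the product itself is written in Erlang/OTP). The `Summary` format, the function names and the sample temperatures are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Compact summary of readings for one metric (hypothetical format)."""
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def mean(self) -> float:
        return self.total / self.count


def summarise(readings):
    """Leaf step: reduce the raw readings of one source to a Summary."""
    return Summary(len(readings), sum(readings), min(readings), max(readings))


def merge(summaries):
    """Transit step: combine child summaries without the raw readings."""
    return Summary(
        count=sum(s.count for s in summaries),
        total=sum(s.total for s in summaries),
        minimum=min(s.minimum for s in summaries),
        maximum=max(s.maximum for s in summaries),
    )


# Two transit nodes, each fed by two compute nodes (made-up temperatures, deg C).
rack_a = merge([summarise([61.0, 63.0]), summarise([58.0, 60.0])])
rack_b = merge([summarise([70.0, 92.0]), summarise([65.0, 67.0])])

# The management node sees one merged summary instead of eight raw readings.
cluster = merge([rack_a, rack_b])
```

Because each level forwards a fixed-size summary rather than every reading, the volume arriving at a management node in this sketch depends on the number of children it has, not on the total number of sources below it.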

This flow of information can be very heterogeneous and come from a great variety of physical or virtual devices. Clustrx Watch is capable of collecting, processing, passing through, ordering, storing and analysing nearly any kind of data flow produced by an HPC system, to generate a clean and comprehensible output for human administrators or analysts, or for further automated analysis and processing.

Sophisticated logic and control

Clustrx Watch is based on complex business logic and has access to a rule-based engine that facilitates automated response to events occurring on the cluster. Power and other resource consumption policies are defined by modifiable rules. Clustrx Watch can cooperate with any resource management subsystem to jointly resolve any issue such as the emergency shutdown of a piece of hardware or relocation of a compute job, using and updating a store of information in the process.
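
As a rough illustration of how modifiable rules can drive automated responses, here is a minimal Python sketch of a condition/action rule table. It does not reproduce the actual Clustrx Watch rule engine or its rule syntax; the `Rule` type, the metric names, the thresholds and the action names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """One modifiable policy entry: if condition(event) holds, request action."""
    name: str
    condition: Callable[[dict], bool]
    action: str


# Hypothetical policies; the thresholds are illustrative, not vendor values.
RULES = [
    Rule("overheat",
         lambda e: e.get("metric") == "temp_c" and e["value"] >= 90.0,
         "emergency_shutdown"),
    Rule("power_cap",
         lambda e: e.get("metric") == "power_w" and e["value"] > 400.0,
         "throttle"),
]


def decide(event, rules=RULES):
    """Return (rule name, action) for every rule that matches the event."""
    return [(rule.name, rule.action) for rule in rules if rule.condition(event)]


hot = decide({"node": "n042", "metric": "temp_c", "value": 93.5})
cool = decide({"node": "n042", "metric": "temp_c", "value": 60.0})
```

Keeping the policies in a plain data table, separate from the code that evaluates them, is one way to let an administrator change a threshold or add a rule without touching the evaluation logic.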

In addition to automated response tools, the administrator has extensive manual control over the monitoring and its data: scripting, access to information at the level of any transit node, statistics, etc. Command-line and graphical tools are available to administer Clustrx Watch from a single vantage point.

Usage scenarios

Thanks to its modular and expandable architecture, Clustrx Watch can also be used as a monitoring system for non-HPC purposes, including the monitoring of data centres, industrial installations and so on. An open API enables third-party developers to implement their own proactive system management logic, which can be installed on Clustrx Watch.

Scalability

Clustrx Watch was built to scale linearly with cluster size. The current version was tested on a model configuration of 50,000 compute nodes, with 300 parameters per second collected from each node.
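
To put the tested configuration in perspective, the short Python sketch below works out the aggregate data rate and estimates how many transit levels a hierarchy would need. The node and parameter counts come from the text above; the fan-in of 100 children per transit node is purely an assumption for illustration and is not a documented Clustrx Watch setting.

```python
import math

NODES = 50_000             # compute nodes in the tested configuration model
PARAMS_PER_NODE_S = 300    # parameters collected per second from each node
FAN_IN = 100               # ASSUMED children per transit node (illustrative)

# Aggregate rate if every parameter were delivered to a single point.
total_per_second = NODES * PARAMS_PER_NODE_S


def transit_levels(nodes, fan_in):
    """Count the levels needed to funnel `nodes` sources down to one root."""
    levels = 0
    while nodes > 1:
        nodes = math.ceil(nodes / fan_in)
        levels += 1
    return levels


levels = transit_levels(NODES, FAN_IN)

# Raw load on one first-level transit node under the assumed fan-in.
per_transit_node = FAN_IN * PARAMS_PER_NODE_S

print(f"{total_per_second:,} parameters/s in aggregate")
print(f"{levels} transit levels at fan-in {FAN_IN}")
print(f"{per_transit_node:,} parameters/s per first-level transit node")
```

Under these assumptions the aggregate rate is 15 million parameters per second, while each first-level transit node handles 30,000 per second, which suggests why a flat, single-collector design would be impractical at this scale.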

Product details

Applications

  • Can add an integral monitoring/supervision layer over almost any HPC installation.
  • Makes a useful administrator tool to keep a running cluster safe.
  • Flexible collection of statistics to analyse performance and resource usage.

Used by

  • Moscow University, on two HPC clusters, including one that ranks 13th on the Top500 Supercomputer Sites list