Clustrx OS makes scalable high performance computing easy

Clustrx®

Clustrx is an operating system (OS) for high-performance computing (HPC) that takes several innovative approaches. The Clustrx architecture is designed to keep clusters of any size and type manageable, thereby addressing petascale and upcoming exascale challenges. It is a completely new, integrated solution, developed from the ground up and not derived from any existing HPC stack.

Clustrx supports all levels of a cluster’s infrastructure, from bootable operating system images on compute nodes to user and administrator interfaces fully abstracted from the lower levels of the architecture. Most system components of Clustrx are implemented in Erlang/OTP and exploit the advantages Erlang provides to deliver a scalable set of robust, distributed services.
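
Although no Clustrx sources are reproduced here, the following minimal sketch illustrates the kind of OTP building block such services rest on: a gen_server registered under a local name. The module and function names are hypothetical.

    %% Minimal OTP-style service skeleton (illustrative only; not
    %% actual Clustrx code). A gen_server is the basic unit from
    %% which robust, supervised OTP services are composed.
    -module(demo_service).
    -behaviour(gen_server).
    -export([start_link/0, ping/0]).
    -export([init/1, handle_call/3, handle_cast/2]).

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Synchronous health probe.
    ping() ->
        gen_server:call(?MODULE, ping).

    init([]) ->
        {ok, #{}}.

    handle_call(ping, _From, State) ->
        {reply, pong, State};
    handle_call(_Msg, _From, State) ->
        {reply, ignored, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.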

Comprehensive Monitoring & Resource Management

The backbone of a Clustrx-driven computing cluster is its monitoring, management and control system, Clustrx Watch. Clustrx Watch is an innovative cluster-wide monitoring system capable of surveying millions of checkpoints (hardware sensors, SNMP data sources, traps, kernel and software metrics) in near real time while scaling linearly. It features a hierarchical architecture for service data collection, aggregation, distribution, processing and logging, engineered to serve multi-petascale systems.
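
As a rough sketch of the hierarchical idea (the names and the aggregation policy below are assumptions, not Clustrx internals), leaf collectors sample their sensors and report to an aggregator, which forwards only compact summaries upstream:

    %% Hierarchical metric aggregation sketch (illustrative only).
    -module(agg_sketch).
    -export([aggregator/2, collector/2]).

    %% Aggregator: after 10 readings, send their average upstream
    %% and start over, so the upper tier sees one message per batch.
    aggregator(Upstream, Readings) when length(Readings) >= 10 ->
        Avg = lists:sum(Readings) / length(Readings),
        Upstream ! {summary, node(), Avg},
        aggregator(Upstream, []);
    aggregator(Upstream, Readings) ->
        receive
            {reading, Value} -> aggregator(Upstream, [Value | Readings])
        end.

    %% Collector: periodically sample a sensor function and report.
    collector(Aggregator, SampleFun) ->
        Aggregator ! {reading, SampleFun()},
        timer:sleep(1000),
        collector(Aggregator, SampleFun).

An upstream tier receives the {summary, Node, Avg} messages and can aggregate them again, so each added tier multiplies the number of sources a single root can survey.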

Clustrx Watch includes an advanced power manager integrated with its resource manager, so any unused hardware can be switched off quickly. An emergency shutdown system guards hardware against critical failures, such as a loss of cooling or power.
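
A threshold rule of this kind might look as follows; the module name, the limit and the logging are hypothetical stand-ins for the real BMC/IPMI-level shutdown path:

    %% Emergency shutdown rule sketch (illustrative only; the
    %% critical limit and actions are assumed, not Clustrx values).
    -module(power_guard).
    -export([check/2]).

    -define(CRITICAL_TEMP_C, 60).

    %% Coolant temperature at or over the limit: power the node off.
    check(Node, TempC) when TempC >= ?CRITICAL_TEMP_C ->
        %% A real system would call the BMC/IPMI power interface here.
        io:format("~p at ~p C, over limit; shutting down~n", [Node, TempC]),
        {shutdown, Node};
    check(Node, _TempC) ->
        {ok, Node}.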

The nodes on which the monitoring system resides are mutually replaceable: as soon as any of them is found to be in trouble, the others take over its roles intelligently, smoothly and transparently. This contributes to a highly resilient architecture with no single point of failure.
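
Erlang makes this style of takeover natural. The sketch below (an illustration, not the actual Clustrx mechanism) shows a standby process monitoring the active node and promoting itself when that node goes down:

    %% Failover sketch (illustrative only): watch the active node
    %% and take over its role on failure.
    -module(failover_sketch).
    -export([standby/1]).

    standby(ActiveNode) ->
        %% Subscribe to connection-loss notifications for the node.
        erlang:monitor_node(ActiveNode, true),
        receive
            {nodedown, ActiveNode} ->
                %% Promote ourselves; a real system would also
                %% re-register services and resynchronise state.
                io:format("~p taking over from ~p~n", [node(), ActiveNode]),
                active
        end.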

The resource manager for Clustrx, based on SLURM (Simple Linux Utility for Resource Management), is tightly integrated with the monitoring system and relies heavily on it. Its purpose is to launch computing jobs on compute nodes and track their execution. If Clustrx Watch finds nodes in a critical state, the two layers use sophisticated logic to redistribute the computing load among the remaining nodes.
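
The exact interface between the two layers is internal to Clustrx, but since the resource manager is SLURM-based, a hand-off could in principle resemble draining a flagged node with the standard scontrol command, as in this hypothetical sketch:

    %% Sketch only: drain a node reported as critical so that no new
    %% jobs are scheduled onto it. Uses the standard SLURM scontrol
    %% syntax; the module and the shell-out approach are assumptions.
    -module(drain_sketch).
    -export([drain/2]).

    drain(NodeName, Reason) ->
        Cmd = io_lib:format("scontrol update NodeName=~s State=DRAIN Reason='~s'",
                            [NodeName, Reason]),
        os:cmd(lists:flatten(Cmd)).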

Robust Performance

The basis for robust performance and the absence of a single point of failure is the division of a cluster into computing and management nodes and the creation of a deep hierarchical control structure. The cluster’s hardware and software resources are redistributed by the OS management infrastructure transparently to the user, to achieve stable and safe operation.

System services are implemented as distributed, virtualised services that run on management nodes in a floating, mutually replaceable fashion. The most important services include AAA (user accounts, authentication and authorisation), highly controllable booting and configuration of compute nodes, and a single configuration database, dConf (Distributed Configuration), which allows customised access.
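
dConf’s API is not documented in this overview, so the following is only a guess at the shape of such a service, using Erlang’s built-in Mnesia database as a stand-in for a replicated configuration store:

    %% Configuration-store sketch (hypothetical; not the dConf API).
    %% Mnesia stands in for a distributed key/value store.
    -module(dconf_sketch).
    -export([init/0, store/2, fetch/1]).

    -record(conf, {key, value}).

    init() ->
        mnesia:start(),
        mnesia:create_table(conf, [{attributes, record_info(fields, conf)}]).

    store(Key, Value) ->
        mnesia:dirty_write(#conf{key = Key, value = Value}).

    fetch(Key) ->
        case mnesia:dirty_read(conf, Key) of
            [#conf{value = V}] -> {ok, V};
            []                 -> not_found
        end.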

Single-Point Administration

Clustrx OS views an HPC cluster as a single supercomputing machine: a "black box" that aggregates the computing power of a large number of nodes (potentially spanning entirely diverse hardware and system platforms) into a comprehensible and scalable service that can be deployed, controlled and distributed from a single point. This single point offers command-line and graphical interfaces, from which any administration task can be performed manually or automated through scripts and open APIs.

The whole suite can be deployed within hours, requiring only modest human effort. The queue of computing jobs, user access rights, and the assigned limits of computing resources are all controlled via a single administration interface.

Easy To Use

Users work with an HPC cluster under Clustrx OS as they would with a single Linux/UNIX machine, but with far more flexible power at their disposal. A user can specify requirements for a compute node’s boot image, check for the presence of particular tools and libraries, and request whatever resources are needed to run their compute jobs. Both command-line and graphical interfaces are available.
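
Since the resource manager is SLURM-based, requesting resources for a job can be pictured as follows; this sketch simply wraps the standard sbatch command, and the module itself is hypothetical:

    %% Job submission sketch (illustrative only): ask SLURM for a
    %% given number of nodes and a time limit in minutes.
    -module(submit_sketch).
    -export([submit/3]).

    submit(Script, Nodes, Minutes) ->
        Cmd = io_lib:format("sbatch --nodes=~b --time=~b ~s",
                            [Nodes, Minutes, Script]),
        os:cmd(lists:flatten(Cmd)).

For example, submit_sketch:submit("job.sh", 4, 30) would request four nodes for 30 minutes to run the batch script job.sh.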

Product details

Applications

  • A comprehensive working solution for an HPC cluster, built from the ground up.
  • An integral monitoring and supervision layer added on top of existing HPC installations.

Used by

  • Moscow University, on two HPC clusters, one of which is ranked 13th on the Top500 Supercomputer Sites list