HMDSA tools

The HMDSA tools provide the following capabilities:

Data Collection Tools

  • Lightweight Distributed Metric Service (LDMS) – gathers numeric data, including platform, application, and facilities data, and provides a variety of transport and storage options.
  • System probes – periodically probe, store, and display performance and state information, such as the latency of Lustre metadata and storage operations, to help identify problems.
  • Baler – aggregates log data and associated “patterns” into a distributed database
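
As a rough illustration of what a log “pattern” can look like, the sketch below masks variable-looking tokens with a wildcard and counts messages per resulting pattern. It is not Baler's algorithm or API: the function names, the sample messages, and the rule that any token containing a digit is variable are all assumptions made for this example.

```python
from collections import Counter
import re

# Hypothetical rule (an assumption, not Baler's): a token that contains
# a digit is treated as a variable field and replaced by a wildcard.
_HAS_DIGIT = re.compile(r"\d")

def to_pattern(message: str) -> str:
    """Return the message with variable-looking tokens replaced by '*'."""
    tokens = message.split()
    return " ".join("*" if _HAS_DIGIT.search(tok) else tok for tok in tokens)

def count_patterns(messages):
    """Group messages by pattern and return a Counter of pattern -> count."""
    return Counter(to_pattern(m) for m in messages)

# Made-up log lines, for illustration only.
logs = [
    "node nid00012 link failed on port 3",
    "node nid00047 link failed on port 1",
    "lustre client evicted by ost0004",
]
for pattern, count in count_patterns(logs).most_common():
    print(count, pattern)
```

Grouping by pattern rather than by raw message collapses many log lines that differ only in node names or counters into a small set of entries that can be tagged and correlated.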

Storage Tools

  • Short- and long-term data storage options.

Analysis Tools

  • Machine Learning (ML) based tools for runtime analysis:
    • Clustering and Classification techniques – extracting regions of interest and characterizing them in terms of severity, extent, and duration
    • Inferential techniques – probabilistic determination of root causes and relationships
    • LogDiver – identifies probabilistic relationships between log events
    • Baler – identifies and tags unique patterns of “tokens” associated with log messages and discovers spatio-temporal correlative relationships between tagged events
    • LDMS – provides capability for performing a variety of analytics on data along the transport path
  • ML collaborative partnerships:
    • Collaboration with Boston University (BU) to develop ML-based methods for runtime identification and diagnosis of performance-degrading anomalies using a variety of system monitoring data
    • Collaborations with New Mexico State University (NMSU) and other SNL staff to develop application profiling metrics
    • A multi-site collaborative effort that continuously determines and presents figures of merit for all system, subsystem, and performance metrics (e.g., network congestion state, communication performance) to enable comparison and diagnosis of job performance
    • The Blue Waters Project has an ongoing effort to architect and study ML models that process a system’s monitoring and usage data to produce a mixture of descriptive, predictive, and prescriptive analyses that assist systems managers and performance experts with near-real-time monitoring and diagnostics.
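
To make the idea of characterizing a region of interest concrete, the sketch below scans a single metric time series and reports each run of samples above a threshold, with its start, duration, and peak severity. It uses a plain threshold rather than the ML clustering or classification techniques listed above, and it leaves out extent (how many components are affected). The `Region` type, the `find_regions` function, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Region:
    start: int       # index of the first sample above the threshold
    duration: int    # number of consecutive samples above the threshold
    severity: float  # peak value in the region minus the threshold

def find_regions(samples: Sequence[float], threshold: float) -> List[Region]:
    """Return one Region per run of consecutive samples above threshold."""
    regions: List[Region] = []
    start = None
    peak = 0.0
    for i, value in enumerate(samples):
        if value > threshold:
            if start is None:
                start, peak = i, value   # a new region opens here
            else:
                peak = max(peak, value)  # the region continues
        elif start is not None:
            regions.append(Region(start, i - start, peak - threshold))
            start = None                 # the region has closed
    if start is not None:                # the series ended inside a region
        regions.append(Region(start, len(samples) - start, peak - threshold))
    return regions

# Made-up congestion values sampled at a fixed interval.
for region in find_regions([0.1, 0.9, 0.95, 0.2, 0.85], threshold=0.8):
    print(region)
```

A real analysis would work across many components and metrics at once; the point here is only the shape of the output, one record per region with fields that can be ranked or filtered.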

Tools for Feedback and Visualization of Actionable Intelligence

  • Integrated System Console (ISC) – provides a variety of dashboards and subsystem-specific notifications. The ISC operates on data in the short-term store.
  • LDMS – provides APIs and mechanisms for low-latency feedback of information to system and application processes
  • Baler – enables user exploration and tagging of events and event-to-event relationships via both GUI and CLI interfaces; will soon be configurable to send alerts on “Conditions Of Interest” (COI) to subscribers
  • LogDiver – provides continuous resiliency analysis for HPC systems and captures the relationships between error events, which helps trace error propagation