Documentation
Publications, Presentations, and Links
Publications and Presentations
- *Machine Learning Assisted System Monitoring and Diagnostics @Scale*. A. Saxton, M. Showerman, and Blue Waters Team Members. Information Only Preprint.
- Holistic Measurement-Driven System Assessment (HMDSA). ECP Annual Meeting (Poster) Jan 2019.
- Monitoring Large-Scale HPC Systems: Extracting and Presenting Meaningful System and Application Insights. SC18, Nov 2018. BoF Session Organizer.
- Holistic Measurement-Driven System Assessment. W. Kramer. April 2018.
- Holistic Measurement-Driven System Assessment. S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, G. Bauer, J. Enos, M. Showerman, L. Kaplan, B. Bode, A. Greiner, A. Bonnie, M. Mason, R. Iyer, and W. Kramer. Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sept 2017.
- Failure and Resiliency in the Shadow of Extreme Scale – Will our Current Assumptions Take Us in the Right Direction? W. Kramer. Workshop on Monitoring and Analysis for HPC Systems Plus Applications (HPCMASPA) - Keynote Speaker, May 2016.
- *Machine Learning with Blue Waters Monitoring Data, Status Update*. A. Saxton, M. Showerman, and Blue Waters Team Members. Information Only Preprint.
Selected Related Publications
- A ML-based Runtime System for Executing Dataflow Graphs on Heterogeneous Processors. S. S. Banerjee, A. P. Athreya, Z. Kalbarczyk, S. Lumetta, and R. K. Iyer. In Proceedings of the ACM Symposium on Cloud Computing (SOCC). ACM, Oct 2018.
- Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning. O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. Leung, M. Egele, and A. Coskun. IEEE Transactions on Parallel and Distributed Systems (Sep 2018), doi: 10.1109/TPDS.2018.2870403
- Characterizing Supercomputer Traffic Networks Through Link-Level Analysis. S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, and R. Iyer. Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sep 2018.
- Taxonomist: Application Detection through Rich Monitoring Data. E. Ates, O. Tuncer, A. Turk, V. J. Leung, J. Brandt, M. Egele, and A. K. Coskun. 24th Int’l European Conference on Parallel and Distributed Computing (Euro-Par), Aug 2018.
- Integrating Low-latency Analysis into HPC System Monitoring. R. Izadpanah, N. Naksinehaboon, J. Brandt, A. Gentile, and D. Dechev. 47th Int’l Conference on Parallel Processing (ICPP), Aug 2018.
- Supporting Failure Analysis with Discoverable, Annotated Log Datasets. S. Leak, A. Greiner, A. Gentile, and J. Brandt. Cray Users Group (CUG), May 2018.
- Runtime HPC System and Application Performance Assessment and Diagnostics. J. Brandt, A. Gentile, Jon Cook, B. Allan, Jeanine Cook, O. Aaziz, T. Tucker, N. Naksinehaboon, N. Taerat, E. Ates, O. Tuncer, M. Egele, A. Turk, and A. Coskun. Conference on Data Analysis (CODA), Mar 2018.
- Diagnosing Performance Variations in HPC Applications Using Machine Learning. O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun. ISC High Performance 2017 (ISC), Jun 2017.
- Understanding Fault Scenarios and Impacts Through Fault Injection Experiments in Cielo. V. Formicola, S. Jha, F. Deng, D. Chen, A. Bonnie, M. Mason, J. Brandt, A. Gentile, L. Kaplan, J. Repik, J. Enos, M. Showerman, A. Greiner, Z. Kalbarczyk, R. Iyer, and W. Kramer. Cray Users Group (CUG), May 2017.
- Runtime Collection and Analysis of System Metrics for Production Monitoring of Trinity Phase II. A. DeConinck, H. Nam, D. Morton, A. Bonnie, C. Lueninghoener (LANL), J. Brandt, A. Gentile, K. Pedretti, A. Agelastos, C. Vaughan, S. Hammond, B. Allan (SNL), M. Davis and J. Repik (Cray). Cray Users Group (CUG), May 2017.
- Resiliency of HPC Interconnects: A case study of interconnect failures and recovery in Blue Waters. S. Jha, V. Formicola, C. Di Martino, M. Dalton, W. Kramer, Z. Kalbarczyk, and R. Iyer. IEEE Transactions on Dependable and Secure Computing, doi: 10.1109/TDSC.2017.2737537.
- Final Report Workload Analysis of Blue Waters. M. D. Jones, J. P. White, M. I., R. L. DeLeon, N. Simakov, J. T. Palmer, S. M. Gallo, and T. R. Furlani (SUNY) and M. Showerman, R. Brunner, A. Kot, G. Bauer, B. Bode, J. Enos, and W. Kramer (NCSA), University of Illinois at Urbana-Champaign, ACI 1650758, Jan 2017.
- Attacking supercomputers through targeted alteration of environmental control: A data driven case study. K. Chung, V. Formicola, Z. Kalbarczyk, R. Iyer, A. Withers, and A. J. Slagell. In 2016 IEEE Conference on Communications and Network Security (CNS), pp. 406-410. IEEE, 2016.
- Measuring the Resiliency of Extreme-Scale Computing Environments. C. Di Martino, Z. Kalbarczyk, R. Iyer, in Principles of Performance and Reliability Modeling and Evaluation: Essays in Honor of Kishor Trivedi on His 70th Birthday, L. Fiondella, A. Puliafito, Eds., Springer International Publishing AG Switzerland, pp. 609–655, 2016.
- Continuous Whole-System Monitoring Toward Rapid Understanding of Production HPC Applications and Systems. A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson. Parallel Computing (2016), Elsevier B. V., http://dx.doi.org/10.1016/j.parco.2016.05.009
- Large-Scale Persistent Numerical Data Source Monitoring System Experiences. J. Brandt, A. Gentile, M. Showerman, J. Enos, J. Fullop, and G. Bauer. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Parallel and Distributed Processing Symposium (IPDPS). May 2016.
- Network Performance Counter Monitoring and Analysis on the Cray XC Platform. J. Brandt, E. Froese, A. Gentile, L. Kaplan, B. Allan, and E. Walsh. Cray Users Group (CUG), May 2016.
- Dynamic Model Specific Register (MSR) Data Collection as a System Service. G. H. Bauer, J. Brandt, A. Gentile, A. Kot, and M. Showerman. Cray Users Group (CUG), May 2016.
- Design and Implementation of a Scalable HPC Monitoring System for Trinity. A. DeConinck, A. Bonnie, K. Kelly, S. Sanchez, C. Martin, and M. Mason (LANL), J. Brandt, A. Gentile, B. Allan, and A. Agelastos (SNL), M. Davis and M. Berry (Cray). Cray Users Group (CUG), May 2016.
- Analysis of Gemini Interconnect Recovery Mechanisms: Methods and Observations. S. Jha, V. Formicola, C. Di Martino, Z. Kalbarczyk, W. Kramer, and R. Iyer. Cray User’s Group (CUG), Apr 2016.
- Infrastructure for In Situ System Monitoring and Application Data Analysis. J. Brandt, K. Devine, and A. Gentile. In Situ Infrastructures for Enabling Extreme-scale Analysis and Visualization (ISAV 2015) at IEEE/ACM Int’l. Conf. for High Performance Storage, Networking, and Analysis (SC15), Nov 2015.
- New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup. J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sept 2015.
- Extending LDMS to Enable Performance Monitoring in Multi-Core Applications. S. Feldman, D. Zhang, D. Dechev, and J. Brandt. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sept 2015.
- Toward Rapid Understanding of Production HPC Applications and Systems. A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson. IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sept 2015.
- Logdiver: A tool for measuring resilience of extreme-scale systems and applications. C. Di Martino, S. Jha, W. Kramer, Z. Kalbarczyk, and R. K. Iyer. In Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, pp. 11-18. ACM, 2015.
- Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs. C. Di Martino, W. Kramer, Z. Kalbarczyk, and R. Iyer. In 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 25-36. IEEE, 2015.
- Enabling Advanced Operational Analysis Through Multi-Subsystem Data Integration on Trinity. J. Brandt, D. DeBonis, A. Gentile, J. Lujan, C. Martin, D. Martinez, S. Olivier, K. Pedretti, N. Taerat, and R. Velarde. Cray User’s Group (CUG), April 2015.
- Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications. A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker, IEEE/ACM Int’l. Conf. for High Performance Storage, Networking, and Analysis (SC14) Nov 2014.
- It Takes a Village: Monitoring the Blue Waters Supercomputer. B. D. Semeraro, Robert Sisneros, Joshi Fullop, and Gregory H. Bauer. 2014 IEEE International Conference on Cluster Computing (CLUSTER), pp 392-399, 2014, doi:10.1109/cluster.2014.6968671
- Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping. J. Brandt, K. Devine, A. Gentile, and K. Pedretti. 1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sept 2014.
- Lessons Learned From the Analysis of System Failures at Petascale: The Case of Blue Waters. C. Di Martino, F. Baccanico, W. Kramer, J. Fullop, Z. Kalbarczyk, and R. Iyer. The 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2014), Jun 2014.
- Toward Understanding Congestion Protection Events on Blue Waters Via Visual Analytics. R. Sisneros and K. Chadalavada. Cray Users Group (CUG) 2014.
- A Diagnostic Utility For Analyzing Periods Of Degraded Job Performance. J. Fullop and R. Sisneros. Cray Users Group (CUG) 2014.
- Large Scale System Monitoring and Analysis on Blue Waters Using OVIS. M. Showerman, J. Enos, J. Fullop (NCSA), P. Cassella (Cray), N. Naksinehaboon, N. Taerat, T. Tucker (OGC), J. Brandt, A. Gentile, and B. Allan (SNL). Cray User’s Group (CUG), May 2014.
- High Fidelity Data Collection and Transport Service Applied to the Cray XE6/XK6. J. Brandt, T. Tucker, A. Gentile, D. Thompson, V. Kuhns, and J. Repik. Cray User’s Group (CUG), May 2013.
- Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-Scale HPC Systems. Ana Gainaru, Franck Cappello, and William Kramer. IEEE, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp 1168-1179, 2012, doi:10.1109/ipdps.2012.107
- Fault Prediction Under the Microscope: A Closer Look into HPC Systems. Ana Gainaru, Franck Cappello, Marc Snir, and William Kramer. IEEE Computer Society Press, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ‘12), pp 77:1-77:11, 2012.
- Filtering Log Data: Finding Needles in the Haystack. L. Yu, Z. Zheng, Z. Lan, T. Jones, J. Brandt, and A. Gentile. 42nd Annual IEEE/IFIP Int’l. Conf. on Dependable Systems and Networks (DSN), June 2012.
- Adaptive Event Prediction Strategy with Dynamic Time Window for Large-Scale HPC Systems. Ana Gainaru, Franck Cappello, Joshi Fullop, Stefan Trausan-Matu, and William Kramer. ACM Press, Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques (SLAML ‘11), pp 4:1-4:8, Cascais, Portugal, 2011, doi:10.1145/2038633.2038637
- Modeling and Tolerating Heterogeneous Failures in Large Parallel Systems. E. Heien, D. Kondo, A. Gainaru, D. LaPine, W. Kramer, and F. Cappello. ACM Press, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC ‘11), pp 45:1-45:11, Seattle, Washington, U.S.A., 2011, doi:10.1145/2063384.2063444
- Framework for Enabling System Understanding. J. Brandt, F. Chen, A. Gentile, C. Leangsuksun, J. Mayo, P. Pebay, D. Roe, N. Taerat, D. Thompson, and M. Wong. 4th Workshop on Resiliency (Resilience) in High Performance Computing at Euro-Par 2011, Aug 2011.
- Event log mining tool for large scale HPC systems. Ana Gainaru, Franck Cappello, and Bill Kramer. Proceedings of Euro-Par 2011, Aug-Sep 2011.
- Baler: Deterministic, lossless log message clustering tool. N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun. In: Computer Science - Research and Development, Volume 26, Numbers 3-4, pp. 285-295, doi: 10.1007/s00450-011-0155-3. Int’l. Supercomputing Conference (ISC), June 2011.
- Quantifying Effectiveness of Failure Prediction and Response in HPC Systems: Methodology and Example. J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. 1st Int’l Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS) at the 40th Annual IEEE/IFIP Int’l. Conf. on Dependable Systems and Networks (DSN), June 2010.
- Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems. J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. Workshop on Resiliency in High-Performance Computing (Resilience) in Clusters, Clouds, and Grids at the 10th IEEE Int’l. Symposium on Cluster, Cloud, and Grid Computing (CCGRID), May 2010.
- Combining Virtualization, Resource Characterization, and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation. J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. 6th Workshop on System Management Techniques, Processes, and Services (SMTPS) - Special Focus on Cloud Computing at the 24th IEEE Int’l. Parallel and Distributed Processing Symposium (IPDPS), Apr 2010.
- An Exascale Approach to Software and Hardware Design. W. Kramer and D. Skinner. International Journal of High Performance Computing Applications, Nov 2009, 23: 389-391, doi:10.1177/1094342009347768.
- Consistent Application Performance at the Exascale. W. Kramer and D. Skinner. International Journal of High Performance Computing Applications November 2009 23: 392-394, doi:10.1177/1094342009347700.
- Methodologies for Advance Warning of Compute Cluster Problems via Statistical Analysis: A Case Study. J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. Workshop on Resiliency in High Performance Computing (Resilience) at the 18th ACM Int’l. Symposium on High Performance Distributed Computing (HPDC), June 2009.
- Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing. J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. 5th Workshop on System Management Techniques, Processes, and Services (SMTPS) - Special Focus on Cloud Computing at the 23rd IEEE Int’l. Parallel and Distributed Processing Symposium (IPDPS), May 2009.
- Using Probabilistic Characterization to Reduce Runtime Faults on HPC Systems. J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong. Workshop on Resiliency in High-Performance Computing (Resilience) at the 8th IEEE Symposium on Cluster Computing and the Grid (CCGRID), May 2008.
- OVIS: A Tool for Intelligent, Real-time Monitoring of Computational Clusters. J. M. Brandt, A. C. Gentile, D. J. Hale, and P. P. Pébay. The 2nd Workshop on System Monitoring Tools for Large-Scale Parallel Systems (SMTPS) at the 20th IEEE Int’l. Parallel and Distributed Processing Symposium (IPDPS), Apr 2006.
- Meaningful Automated Statistical Analysis of Large Computational Clusters. J. M. Brandt, A. C. Gentile, Y. M. Marzouk, and P. P. Pébay. IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sep 2005.
Related Links
- The HMDR Project: Holistic, Measurement-Driven Resilience - Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact
- Blue Waters Machine Status (Live)
- Blue Waters Torus View (Live)
- Lightweight Distributed Metric Service (LDMS)
- OVIS Web Site
- Monitoring and Analysis of HPC Systems Plus Applications (HPCMASPA) Workshop Series