HPC System Testing

Survey

https://tinyurl.com/system-test-bof

Survey of responses taken during the session. Results to be found later and copied in here.

Lightning Talks

KAUST

  • ~200k cores
  • No special privileges for tests [different from Lboro acceptance tests]
  • Component tests (scheduler, etc.) – see the sketch after this list
  • Synthetic tests (performance)
  • VASP, WRF, etc. used, as well as SPEC tests.
  • Software tickets minimised to essentially zero within 24 hours of release to users; results more reproducible for users.
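
A scheduler component test like the one mentioned above might look roughly as follows. This is a minimal sketch only, assuming a Slurm-based system and pytest-style test functions; the commands, flags and timeouts are illustrative rather than anything shown in the talk.

    # Sketch of a scheduler component test (assumes Slurm; illustrative only).
    import subprocess

    def run(cmd, timeout=120):
        # Run a command; fail loudly on non-zero exit or timeout.
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout, check=True)

    def test_scheduler_responds():
        # The controller should answer a basic partition query.
        assert run(["sinfo", "--noheader"]).stdout.strip()

    def test_single_node_job():
        # A trivial one-node job should launch and print a hostname.
        assert run(["srun", "-N1", "-t", "00:02:00", "hostname"]).stdout.strip()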

NCSA

  • Testing of tests – debugging
  • Tracking of progress on acceptance tests
  • Initially manual, special job queue
  • Use of Jenkins (considered doing this for regression tests too – common approach)

DOE

  • Testing requirements may change over time – e.g. more stable as time goes on.
  • Statements of Work
  • Functionality tests
  • Reliability tests
  • System tests
  • Performance tests: performance changes over time, so run them continually and publish results to the web. This should be good practice. (See the ReFrame sketch after this list.)
  • Availability
  • Need to be flexible about new tech – vendors may not understand it yet
  • Boot/reboot testing important
  • Use ReFrame
  • Good configuration management and test systems for changes
  • Tests sourced from users.
  • Kibana and Grafana
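
To illustrate the "run continually, track over time" point, a ReFrame performance check can carry reference figures that flag regressions automatically. The sketch below assumes ReFrame's class-based Python API (3.x/4.x); the benchmark binary, output pattern and reference numbers are placeholders, not values from the talk.

    import reframe as rfm
    import reframe.utility.sanity as sn

    @rfm.simple_test
    class StreamTriadCheck(rfm.RunOnlyRegressionTest):
        valid_systems = ['*']
        valid_prog_environs = ['*']
        executable = './stream.x'   # placeholder benchmark binary

        # Reference bandwidth with a 5% lower tolerance (illustrative numbers).
        reference = {'*': {'triad': (100.0, -0.05, None, 'GB/s')}}

        @sanity_function
        def ran_to_completion(self):
            return sn.assert_found(r'Solution Validates', self.stdout)

        @performance_function('GB/s')
        def triad(self):
            # Extract the Triad bandwidth figure from the benchmark output.
            return sn.extractsingle(r'Triad:\s+(?P<bw>\S+)', self.stdout,
                                    'bw', float)

Run regularly (e.g. from cron or Jenkins) with something like "reframe -c checks/ -r --performance-report", and the resulting figures can be fed into the Kibana/Grafana dashboards mentioned above.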

Indiana

  • 4 systems of approx 500-1000 nodes each

OLCF

  • Hardware acceptance, functionality, performance, stability (2 weeks)
  • Application tests: isolated test of each, which must meet contracted metrics (LAMMPS, NWCHEM, NAMD, etc.)
  • Python test harness – open source, or in the process of being open-sourced

CSCS

  • Avoid tying the framework too closely to the system; use a high-level approach (see the configuration sketch after this list)
  • ReFrame: Python
  • Easier to read
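
The portability point above is typically handled by keeping machine-specific detail in ReFrame's site configuration, so the checks themselves stay generic (e.g. valid_systems = ['*']). A minimal sketch of such a configuration, assuming ReFrame's Python configuration format and with entirely hypothetical system, partition and environment names:

    # Illustrative ReFrame site configuration; all names are hypothetical.
    site_configuration = {
        'systems': [
            {
                'name': 'mycluster',
                'descr': 'Example cluster',
                'hostnames': [r'login\d+'],
                'partitions': [
                    {
                        'name': 'compute',
                        'descr': 'Standard compute nodes',
                        'scheduler': 'slurm',
                        'launcher': 'srun',
                        'access': ['-p compute'],
                        'environs': ['gnu'],
                        'max_jobs': 10,
                    },
                ],
            },
        ],
        'environments': [
            {'name': 'gnu', 'cc': 'gcc', 'cxx': 'g++', 'ftn': 'gfortran'},
        ],
    }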

Future Work

  • There is an interest in continuing to work together
  • Survey responses will be published on the web
  • Definitely something to look at within the UK community, e.g. common contract terms and their expression as tests using ReFrame