HPC-SIG Meeting, 2022-06-28, Birmingham

Location: University of Birmingham

AM Group Discussion

The first session was a chance for the community to reconnect and to discuss the challenges it currently faces. The challenges identified were:

  • Challenges – risk of silos developing with teams dedicated to RSE, Sysadmin, Infrastructure
  • Challenges – scalability and complexity in terms of:
    • People/teams involved
    • User demands and workflows (CPU/GPU)
    • Professionalism and best practices – e.g. the move from scripts controlled by a single owner to git-controlled, approved processes
  • Challenges – Organisations' approach to agile working
  • Challenges – User training during Covid
  • Challenges – Security:
    • MFA in HPC (particularly for external users)
    • Workable federated access
    • Authentication and authorisation are still unresolved challenges

Tier 2 Updates

An update from the EPSRC Tier-2 HPC services

  • Baskerville – Ed Edmondson, University of Birmingham
  • Cambridge Service for Data Driven Discovery (CSD3) – Paul Calleja, University of Cambridge
  • Cirrus (EPCC) – Kieran Leach, EPCC
  • Isambard (GW4) – Thomas Green, Cardiff University
  • Materials and Molecular Modelling Hub (UCL) – presented by Andy Turner on behalf of Heather Kelly
  • Bede (N8) – Alan Real, Durham University
  • HPC Midlands Plus – Matt Ismail, University of Warwick

UKRI Update (Justin O’Byrne)

Justin provided an update on UKRI Digital Research Infrastructure (DRI) plans.

  • See slides
  • Request for information on long lead times for hardware

HPC Potted History talks and discussion

  • The SIG was formed in 2005
  • Drivers were: sustainable funding models, developing a community, visible entity in the UK to inform
  • The need for the SIG is still there, but we may need to re-evaluate what it should be
  • HPC may no longer be the right tag; perhaps Research Computing
  • How do we work and collaborate with communities like the Society of RSE?
  • Important to avoid getting bogged down in politics and to maintain the technical agenda, so that it stays interesting to a wide variety of people
  • There were technical working groups, and that is something that should carry on
  • Important that it is a closed, vendor-free platform

Afternoon Discussions

In the first part of the afternoon, attendees split into smaller groups to discuss specific topics of interest. The lists below indicate the topics discussed in each group.

Recruitment (led by Andrew Edmondson)

  • Staff lured away to industry by salary and also remote working
  • Increasingly out of step with industry. Need to stop benchmarking against ourselves and instead benchmark against industry offerings
  • Market supplements have gone from rarities to being increasingly used
  • What training and growth opportunities are there?
  • How do we make job adverts work to recruit new people?
  • How do we help?
    • Embrace remote working
    • Help with career development by looking to grow people’s careers

Diversity and Inclusion (led by Matt West)

  • Temporary contracts and poor pay can have a greater impact on certain demographic groups
  • Have at least two members responsible for ongoing EDI work, reporting back to the group, e.g. on recruitment efforts, statistics, best practice
  • Example of DRI retreat – efforts to get diverse panel members
  • Recruit more widely, not just jobs.ac.uk
  • As a collective, how do we make the sector more attractive to work in?
  • Student cluster competitions give people an insight into the role
  • Are people not interested or just not aware of the role?

User Requirements and Engagement (led by Simon Burbidge)

  • Spending time with users to really work out what they need
  • Additional requirements around supporting ML/DL/AI
  • Should we be more proactive in supporting users?
  • Driving licences for HPC users
  • In person training helps build relationships
  • The SIG used to share training materials
  • Look at the RSE training on code management etc
  • Don’t have enough staff to give users what they expect, even if what they expect is unrealistic

Security (led by Jimmy Cross)

  • MFA, Federated Services/SSO are topics of interest
  • Cyber Essentials – is HPC in or out of scope?
    • Misunderstanding of requirements
    • ISO islands at Cambridge
  • Challenges of kernel patching in the context of IB/GPFS drivers etc
  • TRE/DSH may not be delivering a return on the effort put in

Net Zero (led by Owen Williams)

  • Emissions, not energy, are typically what we are talking about
  • Slurm energy accounting plugin reports back on the energy a job has consumed
  • Idle nodes account for a fair percentage of the peak energy
    • Seems to be linked to memory power usage rather than processors/GPU
  • There is a potential study to be done on energy use in HPC: what does the energy use landscape look like, and what changes could be made to improve energy use?
  • Underclocking GPU cards can significantly reduce power consumption without impacting performance too much
    • The same is true for CPUs
    • Performance impact does depend on application
  • Might be more down to HPC/RSE staff to tackle/raise awareness of this rather than the academic community
  • If you give people an allocation (even for free), as opposed to free general access, they will be more careful with their use
  • There is a DRI Net Zero project that includes work in this area: https://net-zero-dri.ceda.ac.uk/
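
As a rough illustration of the first two points in this list, the sketch below converts per-job energy figures into estimated emissions. It assumes the energy data has been exported as pipe-separated "JobID|ConsumedEnergyRaw" lines in joules (for example from Slurm's sacct, on a site where energy accounting is enabled). The sample data, the function names and the carbon intensity of 200 gCO2e/kWh are illustrative assumptions, not figures from the meeting.

```python
"""Sketch: estimate job emissions from energy figures (illustrative only)."""

JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6 MJ


def parse_energy_lines(text):
    """Parse 'JobID|ConsumedEnergyRaw' lines into {job_id: joules}.

    Lines with a missing or empty energy field are skipped.
    """
    energy = {}
    for line in text.strip().splitlines():
        fields = line.split("|")
        if len(fields) < 2 or not fields[1].strip():
            continue
        energy[fields[0].strip()] = float(fields[1])
    return energy


def joules_to_kwh(joules):
    """Convert joules to kilowatt-hours."""
    return joules / JOULES_PER_KWH


def estimate_emissions_kg(joules, intensity_g_per_kwh):
    """Emissions in kgCO2e = energy (kWh) x carbon intensity (g/kWh) / 1000."""
    return joules_to_kwh(joules) * intensity_g_per_kwh / 1000.0


# Hypothetical sample data; real values would come from the site's accounting.
SAMPLE = """\
1001|7200000
1002|
1003|1800000
"""

ASSUMED_INTENSITY = 200.0  # gCO2e per kWh, an assumed value

if __name__ == "__main__":
    for job_id, joules in parse_energy_lines(SAMPLE).items():
        kg = estimate_emissions_kg(joules, ASSUMED_INTENSITY)
        print(f"job {job_id}: {joules_to_kwh(joules):.2f} kWh, {kg:.3f} kgCO2e")
```

The same energy figure gives different emissions depending on the carbon intensity of the electricity supply at the time, which is why the distinction between energy and emissions matters when reporting.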

SIG Meeting Format (from Workshop discussion)

In the second part of the afternoon, the whole group discussed how HPC-SIG meetings should be organised in the future to meet the requirements of the community.

What formats should we have for HPC-SIG meetings in the future?

General principles:

  • Should have mixture of content including landscape/strategic stuff and technical stuff
  • Should be available/useful for staff at all grades

Proposed general HPC-SIG meeting format:

  • AM shared sessions:
    • Site updates
    • Reports from SC/ISC etc
    • Funding updates
    • General interest presentations
    • Lightning talks/interesting stuff
  • PM – split into two strands:
    • Proposals / management / careers
    • Technical / hack sessions

There was also interest in having separate technical sessions focussed on particular topics, giving people a chance to have a go at implementing or trying different technologies rather than just listening to presentations.

Proposed format

Technical meetings (0.5-1 day)

  • Short presentations to set the scene
  • Rest of the session spent in small groups working to implement something of use at each site – one “expert” per group
  • Online in first instance to give as much opportunity as possible for people to attend

Ideas for technical meetings

  • Monitoring
    • Presentations from sites using CheckMK, Icinga
    • Visualising monitoring data
    • Possibly presentations from AWS, Azure, Oracle…?
  • ReFrame regression testing
    • Presentation from the ReFrame developers?
    • Shared libraries of tests
  • Monitoring as a big data problem
    • e.g. Using ML to predict failures
  • Inter-site network monitoring
    • What are JANET/JISC currently doing
    • How much data actually flows between different HPC/data sites
  • Interconnect technologies and architecture
    • New/different technologies?
    • Separate networks for IO, management etc.?
  • Resource usage monitoring
    • % CPU/GPU, mem/cache, interconnect, storage B/W & MDS
    • Altair Mistral, DDN Lustre Monitoring, CDS View, Darshan, Slurm