HPC-SIG Meeting 2019-07-01

This was held at Queen’s University, Belfast, with an attendance of just under 25.

Agenda

The original agenda was:

Time Activity
10:00–10:30 Arrival, Registration and Coffee
10:30–10:45 Chair’s update
10:45–11:30 Carpentry Connect (Martin Callaghan)
11:30–12:00 Belfast City Deal, Project Goals and Timescales (Dr. Seamus Doyle)
12:00–12:30 Networking with colleagues from the Republic of Ireland
12:30–13:30 Lunch and networking
13:30–14:00 ICHEC Update (Dr. Simon Wong): ICHEC overview and the challenges of Big Data and AI / training
14:00–14:30 EPSRC-Funded CDT in AI and Big Data: UKRI Centre for Doctoral Training in Artificial Intelligence, Machine Learning & Advanced Computing – Impact and Objectives (Dr. Gert Aarts)
14:30–15:00 How are we supporting researchers using ML – hardware and software? (Vaughn Purnell, QUB)
15:00–15:30 Coffee break and networking
15:30–16:00 Panel session with speakers from the day: SME, AI, Infrastructure challenges – what can we do collaboratively to embrace these opportunities?
16:00–16:15 Wrap Up / AOB / DONM
16:20 Close


Additional notes from the meeting are as follows:

Chair’s Roundup

  • Jacky Pallas – following her sad death, there will be a memorial award at CIUK.
  • The UnConference is coming on 17th July, in Birmingham.
  • The Terms of Reference have undergone revision and will be placed on the website.
  • There are provisional dates for a new Tier 2 call: 11/7 – call opens; 1/8 – intent to submit; 10/9 – full proposal; Oct 2019 – review; Nov 2019 – confirmation; March 2020 – spend. Expenditure is 7.5m a year over two years for all centres, with an assumption of 4 years in service. This will be a service to supplement the existing Tier 2s, and is open to domain-specific and novel architectures as well as generic ones. There will be some funding for operations. It is not clear if there will be an additional call to replace, rather than supplement, existing Tier 2 systems. The requirement for access is primarily EPSRC-funded research, or research where the group of funders includes EPSRC.
  • There will be a break in ARCHER service in 2020 to allow installation of ARCHER2. EPSRC is looking at mitigation. ARCHER will cease on 19/2/2020 for 11 weeks. No message has been sent out to users yet. The RDF will go off support in April 2020 so is not suitable for data staging. Prioritisation is required. Rebecca (Howe?) at EPSRC will coordinate with the HPC community on this. Sites in Finland may be used as a mitigation, possibly Tier 2 sites.
  • RCUK news: NeI roadmap signed off. There are personnel changes ongoing at EPSRC.

Carpentries

  • Report from the three day event which included curriculum development. 
  • About 80 people present, mostly but not all from the UK
  • Aimed at building training communities, with a question being raised as to how the HPC community can get involved in this.
  • HPC Carpentry is being developed by Andy Turner (EPCC)
  • Machine Learning Carpentry is in development
  • Digital humanities carpentry is in development
  • Genomics carpentry is in development
  • Carpentry sources are open source but must be attributed and can only be badged as software carpentry if led by a qualified facilitator.
  • There are links with the USA via its cyber ambassador programme. See colby.github.io and the Campus Research Computing Consortium, which provides resources such as ‘Supercomputing in Plain English’.
  • PRACE is involved in similar areas with MOOCs
  • Mentoring was discussed as an option, and the use of placement students to cover skills gaps, but a concern was raised as to the capacity of current operations to bring people up to speed. In particular Birmingham noted work on this, and EPCC reported positively.
  • The use of training days for staff was noted.
  • It was noted that non-HPC IT staff might want to move over.
  • The lack of women was noted (only about three or four were present). There is interest at the undergraduate level, but this does not seem to carry through to senior staff. This may partly be due to the language used in adverts, which needs to be addressed.
  • UCL noted the usefulness of schemes to bring people back from maternity and other leave and schemes to support people coming in from the armed forces. FDM group mentioned.
  • Professional skills training was discussed, although much is RHCE (Red Hat) and so on. ITIL training is seen as useful, especially associated with things like risk assessment skills.
  • More use of vendor training could be made
  • A masters in HPC?
  • Leeds noted it struggles to recruit at the right level
  • Leeds noted its apprenticeship schemes, and this is ongoing in Salford and Northumbria too.
  • A lack of grade progression can dissuade staff from staying.
  • Jobs could be advertised via the RSE network, even for system administration roles, and via other channels, e.g. Twitter.

Belfast City Deal

  • Belfast’s current HPC system was outlined, which is currently 1200 cores, in operation from 2015. City Deal helps with funding this. Support is via 2 FTE. It provides services for teaching as well as research. Users are added at the rate of around 10 per month.
  • There is a strong software industry in Belfast
  • City Deal is 350m over several years and includes Queen’s University Belfast and Ulster University. Of this, 56m comes to QUB in some form, of which around 10m is for a ‘real time AI supercomputer’ to support areas such as medical research across six domains, aiming to support the creation of up to 20,000 jobs. KPMG will measure this impact. This is anticipated to be a Tier 2-sized system with CPU and GPGPU.
  • There will be liaison with government.
  • Christine Kitchen noted potential issues with licences under multi-tenancy.
  • Andy Turner asked about storage, e.g. HDD versus NVMe.
  • There will be new staff for this.
  • It was noted Edinburgh has a City Deal.
  • The timescale is 2020: Proof of Concept and procurement, 2021: installation. Procurement probably via a framework, if one is available. Andy Turner asked about vendor presence in NI.
  • City deal uses a phased release of funds over 10 years
  • Bursting to cloud may be examined
  • Google’s TPU may be of interest.
  • There will be a new data centre. There was a discussion of options in the meeting but there is nothing definitive yet. Power is a concern. It may not be at QUB if power is better elsewhere provided networking is sufficient.
  • There was a discussion about WEKA.
  • AI training will be a component
  • Andy Turner noted that Cirrus checks for node health in the job epilog.

ICHEC and AI

  • An overview was given. This is a national service free for RoI academics, hosted in Galway.
  • Time is allocated by an approvals panel including a science advisory council. Andy Turner asked how scientific merit is judged and what happens if the HPC time for a grant is not awarded. The process is not fully integrated, but communications seek to avoid this happening.
  • Condominium model (buying in) was discussed
  • Institutional allocations are handled by them.
  • There is an access tier for benchmarking, often required for PRACE applications. ICHEC assists with this.
  • Tech includes DDN, CUDA, FPGA, over multiple systems overlapping in time.
  • Burst buffers are in use (oil and gas users like these).
  • There is a supported Earth Observation portal called SPEIR.
  • Christine Kitchen asked how domain-specific time allocations are handled – this is handled by allocating RSE time.
  • The need to handle big data is emerging, especially satellite imagery and health and renewable energy forecasting.

Support for Machine Learning

  • Accredited training is offered.
  • Jupyter and JupyterHub are good resources. See especially Chris Woods’ work with JupyterHub, Kubernetes and Azure.
  • Carpentries are useful for training, and Andy Turner invited contributions on the code and pedagogy aspects. There may be concerns, however, about the pedagogy model for advanced technical material. ICHEC noted it will be focusing on low-level carpentry-type activities. Andy Turner asked if PRACE could assist.
  • Christine Kitchen asked how training can be accredited via a diploma. In RoI this is done via host institutions.
  • The existence of CDTs for AI was noted; these are still recruiting. There were 84 expressions of interest, 37 proposals and 16 awarded, with the first students due in October 2019 for 4-year PhDs offering a comprehensive mix of teaching, research and placements, with greater support than traditional PhDs. Of particular interest in the CDT the presenter discussed are spiking neural networks (cf. Steve Furber). Andy Turner noted a concern about the diversity of students, and the issue with quotas, but diversity seems reasonable in this instance. The CDT will support 55 students over 5 years at a cost of 5.5m plus industrial partner funding. Andy Turner noted the debate over hard KPI targets; it was suggested it is better to avoid hard ones, which prompted some debate. He also asked if placements supported diversity, e.g. students with children.

Panel Session

  • Discussion of the right type of GPU to support ML – V100s, versus high-throughput use of prosumer cards?
  • GPU versus FPGA was debated, but FPGA is harder to use. What are user requirements here?
  • Will there be new offerings from Intel in about 6 months?
  • Cascade Lake and reduced precision were noted, as were TensorFlow benchmarks on Tier 3.
  • Christine Kitchen noted that ML might mean a change to I/O architectures to better support it.
  • Singularity, Kubernetes and cloud were noted, as well as Shifter.
  • QMUL noted that HPC is too hard for many ML users, or is perceived to be. Part of the issue is not knowing when there might be convergence (are there ways this might be automated). Some do use HPC, though. Interaction models may also not fit HPC.
  • Will Nvidia continue its academic discount (currently around 50%)? There may still be some discounts. MiniDGX was mentioned.
  • It was noted that Nvidia plus high speed interconnects is still hard to do, especially with OPA. Andy Turner noted that most researchers look at single node work.
  • It was noted at the CDT directors’ meeting that HPC for AI/ML was not in large demand.
  • Inference engines (results of trained systems) need much lower resources.
  • Nvidia offers training materials. To become an Nvidia trainer takes 4 months.
  • Is there generic training that can be offered, or does it need to be domain-specific?
  • How much training can be reused? Are Coursera or other MOOCs useful or too generic?
  • Training on structuring for efficient I/O was seen as useful.
  • Is there a need for an ‘Is AI for me’ course?
  • Coaching through the RSE network might be an option.
  • Data carpentry – is this useful?
  • Keeping courses relevant is hard.
  • There was a concern about ML, ‘black boxes’ (rule extraction issues) and reproducibility. 
  • Is ML always the right tool?
  • Christine Kitchen noted that there is a lot of dabbling by providers.
  • ML versus standard data analytics?
  • A good element of ML is that it brings in new communities of users, e.g. humanities, archaeology, etc., but diverse communities can be hard to support with no additional resources.
  • Software frameworks make for large dependency stacks, so containerisation is needed.

DONM

  • Some time in October, to be canvassed via the email list
  • May include a presentation from Nvidia
  • Some talk of Edinburgh as a venue.

AOB

  • RSE conference: already full
  • GDPR update.