HPC-SIG Meeting, 2019-11-06


University of Manchester


Attendance: see Ti.to and head count.


Graeme Murphy welcomed everyone.

Welcome to Manchester

Pen Richardson provided an overview of the ITS team at Manchester, touching on the research computing core services (research virtual machines, research software engineers, business analysts, applications support). She noted that the funding is mixed and that this presents challenges for staff. Support is based on the CIR ecosystem (see slides when available) using the CSF. The main CSF cluster is funded by researchers, with £4m of funding providing 9,500 cores serving a varied workload (serial and small parallel jobs). There is also an HPC pool of 4,096 cores for larger parallel work. iCSF, the interactive element, provides 180 cores. In addition there is a Condor pool with 1,200 always-on cores and 3,000 on-demand cores. Sun Grid Engine (SGE) is used as the scheduler, but this will be changed to SLURM. There is ample storage, currently Lustre but moving to GPFS. Various gateways are provided, such as SSH, but also including mobile apps. AWS Spot bursting is being examined using software from Wisconsin. There is a considerable amount of 'edge' computing. Teaching support is handled via virtual labs and Jupyter notebooks.
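The SGE-to-SLURM migration mentioned above implies rewriting users' batch scripts. A minimal sketch of the directive mapping follows; the job name, core count and program name are hypothetical, and a site's actual parallel environments and partitions will differ:

```shell
# --- SGE (current scheduler) -----------------------------
#$ -N myjob              # job name
#$ -pe smp 16            # 16 slots in a shared-memory environment
#$ -l h_rt=01:00:00      # wall-clock limit
#$ -cwd                  # run in the submission directory

# --- SLURM (planned replacement) -------------------------
#SBATCH --job-name=myjob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=01:00:00
# SLURM runs jobs in the submission directory by default.

./myprog                 # hypothetical application
```

Submission changes from `qsub script.sh` to `sbatch script.sh`, and the common query commands change from `qstat` to `squeue`.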

David Salmon asked a question about AWS cloud bursting and it was agreed that case studies on this would be useful.

AI for Personalised Medicine

Presented by Martin Callaghan from Leeds

The slides will be linked here when available.

Briefly, this covered:

  • identification of the breadth of, and commonalities in, AI research across Leeds;
  • multi-modal data – structured, unstructured, images – and how to manage metadata, etc.;
  • reasoning about underground cables, many of which are unmapped;
  • road repair drones (with supporting video);
  • creative design – machine learning creates many candidate designs of new mechanical objects, which are then reviewed by humans;
  • science discovery, especially in biological research;
  • robotic systems for locomotion through bio-mimicry;
  • crowd analysis;
  • patient pathway modelling (NHS funded);
  • financial technology, which is also a new masters programme – examining risk analysis, audit, etc., and augmenting human auditors;
  • robotics.

Challenges to AI/ML were identified, including:

  • HPC, GPU and the use of ad-hoc desktops – the latter is very hard to support;
  • the need to access more compute, quickly – looking at cloud for this;
  • ML with sensitive data, a huge challenge;
  • GPU-backed JupyterHub;
  • Azure Lab Services for MSc teaching;
  • the introduction of a small elastic HPC cluster for the School of Computing.

There was a debate about whether cloud/elastic clusters are hard to use with sensitive data. This was seen more as an issue of getting users to understand how to work with sensitive data in a controlled environment than one of technology; however, the required change of culture is proving difficult and will need substantial user support. It was also suggested that ISO 27001 may not be fully used and is not a good return on investment for intermittent usage.


Mark McManus from Microsoft

The slides will be linked here when available.

Briefly, this discussed Azure overall:

  • its use at 104 universities in the UK (not necessarily for research computing), with more interest now for research;
  • Azure uses carbon offsets, which is a good selling point as it helps universities with carbon management;
  • Project Silica; Project Palix; 1 ZB in a rack within the next 5 years;
  • open source software is supported;
  • ISO compliance, etc., to ensure compliance with requirements for some research projects – easier than doing it in-house;
  • the management structure for Azure (Enterprise, Department, Account, Subscriptions (resources));
  • work on hybrid structures;
  • Newcastle's 1,024 GPUs – they could not procure these any other way cost-effectively for a short-term requirement;
  • VM options; NGC containers; Azure Lab; bare metal as an option;
  • Microsoft Cognitive Services.

ML demonstrations included: identification of scenes in videos, e.g. people sitting, plus extraction and transcription of audio from video – impressive stuff; and identification of document dates via pattern recognition, e.g. thickness of paper, type of ink, linguistics, etc. Other points: Azure allows centralising research offerings – researchers want things NOW, which can create shadow IT; a cluster creation tool; Azure CycleCloud; automated machine learning, e.g. pre-trained models and other services, popular frameworks, etc., with an easy GUI – does it make sense to use? Need to look at this.

Questions asked included:

  • Who controls access to resources? The answer was that there needs to be central organisational support to control this.
  • Licence management. Answer: Microsoft has been working on this. A list of vendors is available and will be sent through.
  • How can overspending of budgets be prevented? Answer: alerts and cost analysis can predict this eventuality, and there are some 'hacks' that institutions can put in place to force things to stop. Cliff Addison noted that it is possible to use orchestration layers (e.g. SLURM-in-the-cloud?) to help control this in a way that is more suitable for universities. A number of people expressed concern about the risk that universities may take on without the ability to impose a hard cut-off on spend.
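The orchestration-layer approach mentioned in the discussion can be made concrete with SLURM's accounting limits. A minimal sketch, assuming a SLURM cluster with accounting enabled; the account name and cap are illustrative:

```shell
# Cap a project's total usage at a fixed budget of CPU-minutes; once the
# group total is exhausted, further jobs are held rather than run (and,
# in a cloud-backed cluster, no further spend is incurred).
sacctmgr modify account cloudproj set GrpTRESMins=cpu=1000000

# Enforcement must be switched on in slurm.conf, e.g.:
#   AccountingStorageEnforce=limits,safe
```

With the `safe` flag, SLURM will not start a job unless it can complete within the remaining allocation, which approximates the hard cut-off the meeting was concerned about.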

SIG Business


There is £32k in the bank. Income and expenditure were discussed. Richard Martin then invited members to discuss how to spend the money, e.g. a report, publicity, website costs, etc. One suggestion from the floor was travel bursaries for younger/early-career members; another was travel support for people coming from the USA. These were all seen as quite possible.

Archer-2 from Andy Turner of EPCC

Slides to add here.

There are different NUMA options available, but the details are not yet known; this is not likely to be on-the-fly reconfiguration. The Cray Slingshot interconnect can handle various protocols, including Cray-specific ones, but also has a TCP/IP layer (with what support for RDMA?). There will be a collaboration platform – not much is known about this. The software will be similar to the current Archer, but using SLURM. There will be optimised Python tools, and they will include gdb4hpc.


  • RDF: situation unclear.
  • Not clear when there will be a full user service.
  • Transition time: via Finland, hopefully. NERC users using the Met Office? Tier 2s may also be included.
  • High up the Top 500?


We want to share more material, e.g. training and other material, across the community. There is money available to be spent on the website.

A question was asked about what the core skills required for HPC support are, and it became a discussion about skills. The required skills aren't really known. Edinburgh has worked on this with regard to bioinformatics – the answer was mostly core Linux skills as a lowest common denominator. An issue with training repositories was seen to be keeping them fresh and the sustainability of the material. SIGHPC is also working on this. Can professional societies be engaged on certification? BCS could be called on for this, as it is already involved in the modern apprenticeship area.

Tier 2

There is no current information on when the decision will be announced, but it may be this Friday or after the election. One bid was a joint bid for a training resource (Hull, Cardiff, Leeds, Strathclyde) – DevOps/ResOps – £0.7k plus effort in kind, and a technology exploration platform in conjunction with cloud providers. The HPC-SIG could help coordinate and/or curate training material.

Roving Reporter

Aaron Turner outlined his plans for Supercomputing 2019 (SC19).

GDPR Update

Aaron Turner described the current state, with an imminent purging of the HPC-SIG list on Friday.

Amazon Web Services

See slides.

This covered:

  • AWS Personalize;
  • Lancaster University and Alexa skills for students;
  • SageMaker (machine learning with ready-built models and stacks);
  • various tools: Rekognition; Transcribe – speech to text with an Echo 360 back end, useful for lecture capture, etc.; Translate; Polly – text to speech; Connect with Transcribe and sentiment detection; Comprehend – usable from boto3; Comprehend Medical; Lex – speech recognition and natural language understanding; Forecast – time series analysis; Personalize; Textract – printed documents to text;
  • support for ML workflows: Kubernetes, AWS Lambda, etc. for inference;
  • an overview of instance types and algorithm support;
  • labelling support: SageMaker Ground Truth, e.g. annotations by humans via Mechanical Turk, plus automatic labelling in parallel, depending on the confidence that items can be automatically labelled;
  • SageMaker Neo: train once, run anywhere, as it optimises for architectures;
  • in the future: reinforcement learning (simulated environment, scoring, RL algorithm); AWS DeepRacer;
  • AWS Docker containers – EKS and ECS container services;
  • the ML Solutions Lab, and university

Other items mentioned:

  • Working out what instance type to use to optimise.
  • Dynamic attachment of GPUs.
  • Inferentia – custom chip

Questions asked were:

  • About SageMaker availability – this is now available in UK regions.
  • The potential use of Comprehend for digital humanities was noted.



Nvidia

Slides will be added when available.

A video was shown.

There was discussion of: converged HPC and AI; neural networks in climate super-parameterisation to improve heuristic cloud models (the CAM model); modelling the Big Bang; drug discovery – similar accuracy, but MUCH faster; neutrino detection – improving the accuracy of the detector; gravity wave detection using convolutional neural networks; the Kokkos library.

There were discussions from the floor around:

  • The Big Bang – how does the neural network know it’s doing the physics correctly? The presenter did not know. Sensitivity analysis is required. Concerns were expressed regarding reproducibility. The need to ensure correct training data and ensuring correct validation, etc. was noted.
  • Regarding neutrino detection, questions were asked about what was actually done. More information will be provided.
  • A question was asked about other companies doing ML chips and whether they are a threat to Nvidia. Answer: Nvidia is doing lots of work on software in frameworks now, so it feels confident.
  • A question was asked about the biggest challenge for using AI in the HPC area – what is Nvidia's view? Answer: explainability of ML results. AI is now almost ubiquitous as an aid to HPC. Nvidia is also working on simulation acceleration, as well as improved ease of use, more training, and more on OpenACC.
  • It was noted that ML plus HPC is interesting, but how can it be exploited? Getting people to use GPUs is hard as researchers aren’t sure how to use them or what the potential is and why they should use them.
  • Concern was expressed about how to understand which parts of a workflow can be replaced with a neural network.
  • Questions were asked about explainable AI – will there be any conclusions? Answer: it's hard to do. Physics can be used to ensure a physics-based NN is constrained to sensible physics, but not much more than that at present. There is lots of work on this at the Alan Turing Institute. Replacing a component means more in terms of validation than explainability?
  • A question was asked about automating process for experiments. The answer was that this is being considered.
  • What are the upcoming developments? What improvements? Answer: Nvidia expects GPUs to double in speed over 2 years, but the GPU is only one part of the stack – software is very important as well.

Wrap up and close