The Role of Cyberinfrastructure in Science: Challenges and Opportunities

  • Cyberinfrastructure is important for delivering science; workflow management is now an especially important part of it.
  • The use of the Open Science Grid (USA) for this work was noted.
  • ‘CI Practitioner’, where CI stands for cyberinfrastructure.
  • Science Gateways Community Institute – interfaces to allow people to use systems easily and effectively.
  • Data is extremely important – collection, curation, metadata, availability, fusing data from many sources, etc. An example would be linking fire reporting to weather prediction sources for wildfire simulation. This can also include connecting scientific instruments to computational resources, both to collect data and to drive the instrument’s parameters so that the needed data are gathered.
  • Detection of gravitational waves (an effort which includes Prof. Martyn Guest of Cardiff, UK, leader of Supercomputing Wales).
  • This all adds complexity and needs a workflow to use the instruments, deal with the data and map onto the appropriate compute resources.
  • Pegasus is the workflow tool being promoted in this talk.
  • Resources change, and users don’t want to have to change the workflow each time, so there is a lot of work at the back end to map things efficiently.
  • Tool includes visualisation and monitoring.
  • Campus clusters, HPC, HTC, cloud
  • Pegasus uses:
    • files and individual applications
    • workflow description, in a portable way between back-end resources
    • captures provenance for reproducibility (cf. YouShare).
  • Abstract workflow vs. executable workflow (cf. Java workflows?) – see the abstract-workflow sketch after this list.
  • Use of data staging nodes to assist the workflow (in the mapping from the abstract to the concrete implementation). I’d be interested in how delegation of permissions works to enable these additional resources to be spawned on behalf of the user by the workflow engine, and in the security implications. This might not be an issue for, say, looking at gravitational waves, but trust relationships and brokering might be required to support more sensitive workflows. This is an interesting question to examine.
  • Connecting workflow management and resource provisioning systems: create the resources that are right for the workload when required, e.g. on-demand tornado tracking and prediction in Dallas.
  • Robust components come from leveraging HTCondor, so the project can focus on:
    • automating data management
    • workflow planning, re-planning
    • execution engines
    • provenance
    • data integrity (checksums); a toy checksum sketch follows this list
  • Use of AI to detect issues with workflows, given their complexity and the many, changing resources they run on (a toy sketch follows this list). This is an interesting approach and I can see value at other levels of HPC too, such as system monitoring using time-series analysis (this is also being done), cf. AURA-Alert, which I worked on.
  • There is a challenge in ensuring the correctness of results.
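
As a concrete illustration of the abstract/executable distinction, here is a minimal sketch assuming the Pegasus 5.x Python API (Pegasus.api); the file names and the "preprocess" transformation are my own illustrative choices, not examples from the talk.

```python
from Pegasus.api import Workflow, Job, File

# Logical files: no physical locations are given here; the planner
# resolves them when mapping onto concrete resources.
raw = File("f.a")
cleaned = File("f.b")

wf = Workflow("example")

# A job refers to a transformation (application) by logical name and
# declares its inputs/outputs, which is what lets Pegasus stage data
# and record provenance.
preprocess = (
    Job("preprocess")
    .add_args("-i", raw, "-o", cleaned)
    .add_inputs(raw)
    .add_outputs(cleaned)
)
wf.add_jobs(preprocess)

# Write the portable, resource-independent description; planning then
# turns it into an executable workflow for a campus cluster, HPC/HTC
# site, or cloud.
wf.write("workflow.yml")
```

Keeping the description abstract is what allows the same workflow to be re-planned when the back-end resources change, rather than editing the workflow itself.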
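
The data-integrity point above is essentially end-to-end checksum verification of data products as they move between sites. A toy illustration of the idea (my own sketch, not the Pegasus implementation):

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large data products fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected_digest: str) -> bool:
    """Compare the checksum after transfer with the one recorded at creation."""
    return sha256sum(path) == expected_digest
```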
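
On the AI/monitoring point, the simplest version of the time-series idea is to flag metric samples that deviate strongly from their recent history. A rolling z-score stand-in (my own toy example, not the tooling from the talk or AURA-Alert):

```python
from collections import deque
from statistics import mean, stdev

def anomalies(samples, window=30, threshold=4.0):
    """Return indices of samples more than `threshold` standard deviations
    away from the mean of the preceding `window` samples."""
    history = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(samples):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                flagged.append(i)
        history.append(x)
    return flagged
```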

Questions:

  • Can users use their own scheduling options? Yes, this is possible.
  • Sacrifices – compromise on speed/accuracy, for example? Yes, there are domain-specific interfaces, e.g. selection of models. [My criticism: this is not very dynamic. I would prefer to see something like an accuracy tolerance and a date by which the result is required, with a more automatic selection of model; however, it would be very difficult to model how long a job might take and the actual accuracy achieved in order to make such a scheduling decision. A hypothetical sketch follows these questions.]
  • Connectors to HPC resources (an API for HPC) would be good. This seems similar to things like CREAM-CE, SAGA, the various cloud-agnostic interfaces, or even DRMAA of old (see the DRMAA sketch below).
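
To make the criticism in the second question concrete, here is a hypothetical sketch of the policy I have in mind: given an error tolerance and a deadline, automatically pick a model whose estimated accuracy and runtime fit. The (error, runtime) estimates are simply assumed here, which is exactly the difficult part acknowledged above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Model:
    name: str
    expected_error: float        # estimated error of the model's results
    expected_runtime: timedelta  # estimated wall-clock time to completion

def select_model(models, error_tolerance: float, deadline: datetime):
    """Pick the fastest model that meets both the tolerance and the deadline,
    or None if no candidate fits."""
    time_left = deadline - datetime.now()
    feasible = [m for m in models
                if m.expected_error <= error_tolerance
                and m.expected_runtime <= time_left]
    return min(feasible, key=lambda m: m.expected_runtime) if feasible else None
```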
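
On the last question, DRMAA is the closest existing thing to a generic "API for HPC" job submission. A minimal sketch assuming the drmaa-python bindings and a DRMAA-enabled scheduler are available; the command is illustrative only:

```python
import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/bin/hostname"  # illustrative executable
    jt.args = []
    job_id = session.runJob(jt)
    # Block until the job finishes and report its exit status.
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print(f"job {job_id} exited with status {info.exitStatus}")
    session.deleteJobTemplate(jt)
```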