Resource Aware Scheduling

Compact and Exclusive (C-n-E): common in SLURM, SGE, etc., for example SLURM node-exclusive allocation. Self-contention and resource under-utilisation are issues.

Proposed: a Spread and Share (S-n-S) model, i.e. spread a job over more nodes and share those nodes with other jobs where their resource usage allows, and keep them apart where it does not.

LLC issues – spreading means fewer issues with cache usage.

Cache allocation technology – can indicate how the cache is shared over cores. Now common. Cf. ARCHER architecture choices in BIOS.

Example: 16/28 cores in C-n-E but 24/28 in S-n-S and faster execution (no self-contention for cache, etc.). [Would be interesting to consider contention for other resources, e.g. I/O].
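A minimal sketch of the kind of check a spread-and-share scheduler would need before co-locating jobs: do the combined core and LLC demands fit on one node? Intel CAT partitions the LLC by cache ways, so cache demand is counted in ways here. The `Job` fields, the 20-way cache and the per-job numbers are assumptions for illustration, not figures from the talk.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int      # cores requested on this node
    llc_ways: int   # LLC ways needed to avoid cache contention (hypothetical profile value)

def can_share(node_cores, node_llc_ways, jobs):
    """True if the jobs fit on one node without oversubscribing cores or cache ways."""
    return (sum(j.cores for j in jobs) <= node_cores
            and sum(j.llc_ways for j in jobs) <= node_llc_ways)

# Hypothetical 28-core node with a 20-way LLC.
a = Job("a", cores=16, llc_ways=12)
b = Job("b", cores=8, llc_ways=6)
c = Job("c", cores=8, llc_ways=10)

print(can_share(28, 20, [a, b]))  # True: 24/28 cores, 18/20 ways
print(can_share(28, 20, [a, c]))  # False: cores fit, but 22 ways are needed
```

The second case is the interesting one: the cores fit, but the cache does not, which is exactly the contention that sharing with no resource constraints ignores.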

S-n-S doesn’t benefit all codes. Important to benchmark codes across layouts, e.g. from 1 node with 16 cores/node to 8 nodes with 2 cores/node. Some applications have more cache misses when spread.
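As a rough illustration of the benchmarking sweep described above, the sketch below lists the even layouts for a fixed number of ranks, from fully compact to fully spread. The function name and the 28-core node size are assumptions; each layout would then be timed with the real code.

```python
def layouts(total_ranks, max_cores_per_node):
    """List (nodes, ranks_per_node) pairs that place total_ranks evenly, compact first."""
    result = []
    for per_node in range(min(total_ranks, max_cores_per_node), 0, -1):
        if total_ranks % per_node == 0:
            result.append((total_ranks // per_node, per_node))
    return result

print(layouts(16, 28))
# [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]
```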

Use of mpiP to measure things like inter-process communication time.

Question: use typical scaling behaviour to make scheduling decisions. This needs a database of historic profiling data. But is this plausible for codes with so many sub-elements (e.g. NWCHEM)? It would also need data for the hardware actually being used, so a big database. Or does the DB also need to cover things like command-line options, the models, etc.? This has been looked at before in terms of scaling/execution databases but never really seemed to go very far. ‘Only a few runs to capture program behaviour’ – I am not sure this is necessarily true, as noted above.

SNS Profiler on github. Uses Intel CAT and Linux perf. No program modification required.

Batch job resubmissions are common, so profiling can piggyback on normal runs at different scales of submission. But this uses CAT to change the LLC behaviour of user jobs, which might change job execution time, and I can see that annoying people! People can set an indication of what slowdown they would be happy with, but many will say none.

SNS daemon on each node. SNS means a new scheduler. Would probably be more popular as a set of plugins for scheduling decisions for existing ones like SLURM. I can’t see people changing to a new scheduler with unknown support and no commercial support.

Works with TensorFlow, Spark, MPI.

Discussion of experimental results. Node-sharing with no resource constraints was also compared. CS – 13.7% improvement over basic exclusive, SnS +20% on average, sometimes more. These figures are for throughput, but job run time also improves on average. Would like to see more information. CS can really slow down jobs (jobs contend for resources).

It can increase wait times on small clusters, as jobs need to be matched. But on a 32,000-core cluster, no issue. But how small is small? E.g. 2500 cores? Is that an issue for wait times? Does wait time also imply poor overall utilisation in terms of jobs NOT running? Would like to see more detail and additional policy options to deal with this.

Question: What if resource usage of a job changes over time? Yes, this can be an issue.

Scheduling and ML

ML to optimise workloads.

How to improve convergence for ML models when looking at levels of parallelism.

Didn’t really understand the zoom in to parallelism request and parallelism.

Use of an estimation function makes more sense.

Basically, use an estimation function to determine the deltas in performance given small perturbations of the configuration (i.e. parallelism), and use those to pick the optimal parallelism.
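My reading of the estimation-function idea, as a minimal sketch: perturb the parallelism by a small step in each direction, compare the estimated cost, and move while it improves. This is a plain local search, not the presenter's algorithm, and `toy_latency` is an invented cost model used only to make the example runnable.

```python
def tune_parallelism(estimate, start, lo, hi, step=1, max_iters=100):
    """Local search: follow the perturbation that lowers the estimated cost."""
    current = start
    for _ in range(max_iters):
        # Small perturbations of the current configuration, kept within bounds.
        candidates = [p for p in (current - step, current + step) if lo <= p <= hi]
        best = min(candidates, key=estimate, default=current)
        if estimate(best) >= estimate(current):
            return current  # no neighbour is better: a local optimum
        current = best
    return current

def toy_latency(p):
    """Hypothetical estimate: work shrinks with parallelism, overhead grows with it."""
    return 100 / p + 2 * p

print(tune_parallelism(toy_latency, start=1, lo=1, hi=32))   # 7
print(tune_parallelism(toy_latency, start=32, lo=1, hi=32))  # 7
```

This only finds a local optimum, so it relies on the estimate being reasonably well behaved around the starting point.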

Region-based reinforcement learning

Hard to follow.

Prototyped for improving throughput on running TensorFlow.

Result graphs shown. With Inception (the model being used), the approach does seem to improve the speed at which results are returned for the TensorFlow service. Request serving latency much reduced. This is for fairly high-frequency inference serving, though, which is an interesting scheduling question for, say, Alexa or Siri, but less so for our more traditional HPC applications, which don’t do high-frequency inference serving. It is still a valid research service for AI researchers.

Comment: Can result in reduced cost on things like AWS by reducing latency. Reduced latency would mean could use fewer instances to serve a workload.
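The link between latency and instance count can be made concrete with Little's law (requests in flight = arrival rate × latency). A back-of-envelope sketch follows; the request rate, latencies and per-instance concurrency are invented numbers, not results from the talk.

```python
import math

def instances_needed(requests_per_s, latency_ms, concurrent_per_instance):
    """Estimate instances needed to hold the average number of in-flight requests."""
    in_flight = requests_per_s * latency_ms / 1000  # Little's law: L = lambda * W
    return max(1, math.ceil(in_flight / concurrent_per_instance))

# Hypothetical service: 1000 requests/s, 8 concurrent requests per instance.
print(instances_needed(1000, latency_ms=200, concurrent_per_instance=8))  # 25
print(instances_needed(1000, latency_ms=50, concurrent_per_instance=8))   # 7
```

This covers average load only; a real deployment would add headroom for bursts.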

Slack Squeeze Coded Computation for Adaptive Straggler Mitigation (S2C2)

Hard to hear the presenter and didn’t follow this.

Results from any k of n nodes are enough to create a full result. E.g. support vector machines.

Wasted computations can be a concern.

Stragglers – nodes that compute slowly. Cf. the Apache Spark model, which offers timeouts and the ability to recompute results that a straggler would have produced. But when results from any k of n nodes are enough, recomputing a straggler's work is a wasted computation if we already have >= k results. Instead, spread the work (matrices) over multiple nodes.

At worst, S2C2 devolves to standard MDS, so can only win over that.
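To make the "any k of n" property concrete for myself, here is a sketch of plain MDS-style coded matrix-vector multiplication using a Vandermonde code over exact fractions. It shows only the baseline that S2C2 falls back to, not S2C2's adaptive scheme; all function names and the evaluation points are my own assumptions.

```python
from fractions import Fraction
from itertools import combinations

def matvec(block, vec):
    """Multiply a row-block (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in block]

def encode(blocks, n):
    """Turn k equally sized row-blocks into n coded blocks, one per worker."""
    k = len(blocks)
    rows, cols = len(blocks[0]), len(blocks[0][0])
    coded = []
    for i in range(n):
        x = Fraction(i + 1)  # distinct evaluation point per worker
        coeffs = [x ** j for j in range(k)]
        coded.append([[sum(c * blk[r][col] for c, blk in zip(coeffs, blocks))
                       for col in range(cols)] for r in range(rows)])
    return coded

def solve(matrix, rhs):
    """Gauss-Jordan elimination; rhs holds one result vector per matrix row."""
    size = len(matrix)
    aug = [list(row) + list(r) for row, r in zip(matrix, rhs)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = aug[col][col]
        aug[col] = [value / inv for value in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def decode(results, k):
    """Recover the full product from any k worker results (worker index -> vector)."""
    workers = sorted(results)[:k]
    vander = [[Fraction(w + 1) ** j for j in range(k)] for w in workers]
    parts = solve(vander, [results[w] for w in workers])
    return [value for part in parts for value in part]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
v = [1, 1]
k, n = 2, 4
coded = encode([A[:2], A[2:]], n)
expected = matvec(A, v)

# Any 2 of the 4 workers are enough; the other two can be stragglers.
ok = all(decode({w: matvec(coded[w], v) for w in pair}, k) == expected
         for pair in combinations(range(n), k))
print(ok)  # True
```

Each worker does the work of one block, so with n = 4 and k = 2 half of the computation is redundant when nobody straggles, which is the wasted-computation concern noted above.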