Heterogeneous HPC


  • Background:
    • Optimization:
      • Tiling for complex cache hierarchies on multicore
      • GPGPUs, task fusion.
      • FPGA
  • DaCe: a way to solve this?
    • Map application to a work and data-flow graph
    • Have a ‘performance engineer’ rework the dataflow graph to make it more efficient in terms of data usage, cache, etc.
    • This creates a template for that problem, to which transformations can then be applied for example workflows from the domain scientist?
    • Parallel dataflow programming:
      • Width of parallelisation [requires chunking in execution, surely?]
      • This is assuming data independence.
      • The formalism effectively uses an unrolled loop (replicator) component to handle this.
    • Various types of data container supported.
    • [I am a bit concerned by the idea that this requires a performance engineer. Good parallelising compilers do this sort of thing anyway, so isn’t it possible to leverage them? I was not seeing the direct relevance to heterogeneous computing, but this was addressed:]
      • Answer: objects are mapped to CPU/GPU/etc., which is the relevance to heterogeneous computing.
      • [But I am still confused as to why this doesn’t leverage existing technologies for decomposition. This having been said, people still explicitly parallelise with OpenMP. Of course multi-node with MPI is a wider issue.]
      • Indicated that hierarchical parallelism can be expressed [but then Chapel, X10, etc., do this too].
    • Implementations:
      • Python is the language of choice. (Becoming the de facto scientific computing language?)
      • Decorators are used over numpy code: @dace.program, which can also wrap your own functions, plus others like @dace.map.
      • [Touching back on the X10, Chapel comment, this is potentially less effort than the above, and more like OpenMP pragmas. My question would be how efficient the code is versus the programming effort between this and other approaches. The results were posted later, but a bit hard to read.]
    • DIODE – an IDE.
    • Performance graphs were shown, but the labelling was too hard to see at a distance and with heads in the way (the venue was absolutely packed). Up to 12x faster on FPGA in some instances? Need to read the paper to really understand the results.
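The tiling point above is the classic cache-blocking trick. A minimal sketch (the transpose example, the `tiled_transpose` name, and the tile size are my own illustration, not from the talk):

```python
import numpy as np

def tiled_transpose(a, tile=64):
    """Transpose in tile-sized blocks so each block stays cache-resident.

    `tile` is a tuning knob: on real hardware it would be chosen so a
    block of `a` plus the matching block of the output fit in one cache
    level, which is exactly what gets hard on complex cache hierarchies.
    """
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            # Work on one cache-friendly block at a time.
            out[j:j + tile, i:i + tile] = a[i:i + tile, j:j + tile].T
    return out

a = np.arange(12).reshape(3, 4)
result = tiled_transpose(a, tile=2)
```

The same blocking idea generalises to matrix multiply and stencils, where the payoff is much larger than for a transpose.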
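On the parallel-width/replicator point: my understanding of it, sketched with a stdlib thread pool (the `parallel_map` helper and the `width` parameter are hypothetical, not a DaCe API):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(f, data, width=4):
    """Apply f to independent elements with a fixed parallel width.

    The 'replicator' view: conceptually the map is fully unrolled over
    the data; a finite width means execution proceeds in chunks of that
    size, which is only safe because the elements are data-independent.
    """
    with ThreadPoolExecutor(max_workers=width) as pool:
        return list(pool.map(f, data))

squares = parallel_map(lambda x: x * x, range(8), width=4)
```

So yes, a width smaller than the data size implies chunked execution, as the bracketed question suspected.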


pip install dace

So I tried pip3 install dace, which also installed:

absl-py-0.8.1 astunparse-1.6.2 click-7.0 dace-0.9.0 flask-1.1.1 graphviz-0.13.2 itsdangerous-1.1.0 mpmath-1.1.0 networkx-2.4 ply-3.11 sympy-1.4 websockets-8.1

Not that I have a handy Python example to test with.


  • Tensorflow: traverses the tensorflow graph and builds the mapping and graph.
import dace, numpy

@dace.program
def f():
    a = numpy.zeros([10, 10])
    b = numpy.zeros([10, 10])
    c = a * b

Results in an error…

dace.frontend.python.newast.DaceSyntaxError: Function "numpy.zeros" is not registered with an SDFG implementation  in File test.py

So would need to look at this more.
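From what I can tell from the DaCe docs, arrays are meant to be declared as typed parameters of the @dace.program rather than allocated with numpy.zeros inside it. A sketch of that fix (unverified against this DaCe version; the try/except fallback is only so the snippet runs without DaCe installed):

```python
import numpy as np

try:
    import dace

    # Shapes/dtypes are given as annotations; allocation happens
    # outside the program, avoiding the numpy.zeros error above.
    @dace.program
    def multiply(a: dace.float64[10, 10],
                 b: dace.float64[10, 10],
                 c: dace.float64[10, 10]):
        c[:] = a * b
except ImportError:
    def multiply(a, b, c):  # plain-numpy stand-in for the sketch
        c[:] = a * b

a = np.full((10, 10), 3.0)
b = np.full((10, 10), 2.0)
c = np.zeros((10, 10))
multiply(a, b, c)  # c is now a * b elementwise
```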

Failed to install with Python 2.

Second Talk

  • Speaker was not clear, so I couldn’t follow much.
  • Streaming messages.
  • github.com/spcl/smi
  • FPGAs supported
  • Essentially creating channels down which messages can be sent, rather than establishing a channel per message? Includes collective channels.
  • PoC reference implementation.
  • Connecting multiple FPGAs via communication kernels as part of the FPGA code.
  • Benchmarks comparing against OpenCL and MPI for connecting the components
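My mental model of the streaming-channel idea, sketched with a plain Python queue (SMI itself targets FPGA-to-FPGA channels; this is only an analogy and all names here are mine):

```python
import queue
import threading

# A long-lived channel: senders push messages whenever they like and the
# receiver drains them; nothing is (re)established per message.
channel = queue.Queue()

def worker(rank):
    channel.put((rank, rank * rank))  # message = (source, payload)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A 'collective' over the channel: gather all four contributions.
results = dict(channel.get() for _ in range(4))
```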

Google (TPUs)

  • TPU – Tensor Processing Unit
  • TPU pod – multiple TPUs connected [in theory, for the right workload, 100+ PFlops. Cost?]
  • Optimisation of workflows on TPU requires reorganisation of matrices.
  • Hard to follow.
  • Couldn’t read table of results.
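If the matrix reorganisation is about tile alignment (an assumption on my part; TPU matrix units process fixed-size tiles), the padding step looks something like this sketch:

```python
import numpy as np

TILE = 128  # assumed tile edge; the exact size is TPU-generation-specific

def pad_to_tiles(m, tile=TILE):
    """Pad a matrix so both dimensions are tile multiples.

    The kind of 'reorganisation of matrices' the talk mentioned: ragged
    shapes waste systolic-array cycles, so pad (or reshape) to tile
    boundaries before dispatching to the accelerator.
    """
    r, c = m.shape
    pr = (-r) % tile
    pc = (-c) % tile
    return np.pad(m, ((0, pr), (0, pc)))

x = np.ones((200, 300))
padded = pad_to_tiles(x)
# padded.shape == (256, 384)
```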