Better Data Structures

Graph Processing

The examples of graphs given seemed limited (e.g. social media), when graphs have also been used for things like 3D representations of small molecules for drug discovery (AURA GraphMatcher and AURAMol, which I worked on), as well as in many other areas.

The issue of cache misses increases as more concurrent jobs are executed on the same platform (‘platform’ was not defined).

Where there are multiple jobs using the same graph, detect this to reduce redundant copies in memory. A question I would have with this approach is how it might affect the scheduling of jobs – would it be better to let users indicate that they are running concurrent jobs over the same graph? The cache-miss argument also assumes the jobs share a single node. The suggestion here is to create a graph storage system with an API, but I think users are unlikely to change their code to use such a system unless the improvements are particularly clear. A system that transparently shares a single graph across concurrent jobs would be difficult too, so I am not sure which approach would deliver the benefits and be taken up by users – either would involve changes to code and workflow.
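
Sketching my reading of the deduplication idea (the registry and loader below are my own hypothetical names, not the presented API): a per-node registry keyed by graph path, so concurrent jobs asking for the same graph share one in-memory copy.

```python
import threading

# Hypothetical sketch of deduplicating graph loads on one node; the
# registry and loader are illustrative, not the system's actual API.
_registry = {}
_lock = threading.Lock()

def _load_graph_from_disk(path):
    # Placeholder loader: a real system would parse an edge list, CSR, etc.
    return {"path": path, "adjacency": {}}

def get_shared_graph(path):
    """Return the in-memory graph for `path`, loading it at most once."""
    with _lock:
        if path not in _registry:
            _registry[path] = _load_graph_from_disk(path)
        return _registry[path]

# Two concurrent jobs now receive the same object (no redundant copy):
g1 = get_shared_graph("/data/social.el")
g2 = get_shared_graph("/data/social.el")
assert g1 is g2
```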

The GraphM storage system was discussed. It looked at graph partitioning, though it wasn’t clear what method was used to decide how to partition the graph.

The synchronisation options were talked about. These seemed to involve loading a partition and then doing the work for all the concurrent jobs that use that partition before moving on.
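
A minimal sketch of the scheme as I understood it (the partition format and job signature are my assumptions): each partition is loaded once, every concurrent job then does its work on that partition, and only then does processing advance to the next partition, amortising the load while the data is hot.

```python
def load_partition(pid):
    # Placeholder: a real system would read a block of edges from disk.
    return {"partition": pid, "edges": []}

def process_partitions(partition_ids, jobs):
    """`jobs` is a list of callables, each taking the loaded partition data."""
    for pid in partition_ids:
        data = load_partition(pid)  # one load shared by all concurrent jobs
        for job in jobs:
            job(data)               # each job's per-partition work
        # synchronisation point: all jobs advance to the next partition together

process_partitions(range(3), [lambda d: print("job A on", d["partition"]),
                              lambda d: print("job B on", d["partition"])])
```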

I am not sure how good this approach is compared to, say, distributing the graph partitions across multiple nodes, keeping them always active, and avoiding swapping between partitions. It presumes that a partition must more or less fill memory for the scheme to make sense – unless it means a partition sized to fit in cache, which would make more sense and might be the case, but it’s not 100% clear from the presentation itself (it might be clearer in retrospect or from a paper). Since cache is small, scheduling relative to cache would make sense for some things (although there was always an issue with graphs and relaxation by elimination, to suggest a way it might not work).

The experiments were run on CPUs with relatively low core counts but a 20MB 3rd-level cache (the LLC – so it is related to cache efficiency).

GridGraph framework – GridGraph-M, -S and -C flavours.

The presentation was hard to follow due to audio issues.

GridGraph-M seemed to be the best option.

The conclusion suggests no application changes are needed, which surprises me, so maybe I am misunderstanding something. The typical improvement in total execution time for jobs (16 concurrent) would seem to be around 5 to 20%.

In the future they will look at use on FPGAs, etc., and at security issues.

Semantic Query Transformations for Increased Parallelisation in Distributed Knowledge Graph Query Processing

Initial discussion of semantic relationships, and left-to-right processing in those relationships (domain – i.e. class).

Uses ontologies (in essence a schema).

Entailment rules allowing inference.
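
As a toy illustration of an entailment rule (my example, not one from the talk), forward chaining repeatedly applies rules such as the RDFS-style “if (x, type, C) and (C, subClassOf, D) then (x, type, D)” until no new triples appear:

```python
# Toy forward chaining over in-memory triples (illustrative example data).
triples = {
    ("aspirin", "type", "NSAID"),
    ("NSAID", "subClassOf", "Drug"),
}

changed = True
while changed:
    changed = False
    for (x, p, c) in list(triples):
        if p != "type":
            continue
        for (c2, p2, d) in list(triples):
            if p2 == "subClassOf" and c2 == c and (x, "type", d) not in triples:
                triples.add((x, "type", d))
                changed = True

print(sorted(triples))  # now also contains ('aspirin', 'type', 'Drug')
```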

(There have been a number of other languages to reason about graphs, so I may look up some links to paste here.)

(Would be interesting to consider graph similarity and mapping in this context).

There are various inference strategies, e.g. forward chaining, backward chaining, and query reformulation – rewriting the query in a more useful way in first-order logic. Reformulated queries are really hard to optimise data access for, which is the thrust of this paper/presentation.

Query as a union of other queries.

Uses an SQL-type query language.
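
Putting the last two notes together, a minimal sketch of reformulation as a union (my illustration, with made-up class names): a pattern over a class is rewritten into a union of patterns over that class and its subclasses, so the rewritten query needs no inference at evaluation time.

```python
# Illustrative one-level ontology; class names are invented for the example.
subclass_of = {"NSAID": "Drug", "Statin": "Drug"}

def subclasses_of(cls):
    return [c for c, parent in subclass_of.items() if parent == cls]

def reformulate(pattern):
    """Rewrite (?x, type, C) into a union of patterns, one per (sub)class."""
    s, p, c = pattern
    return [(s, p, c)] + [(s, p, sub) for sub in subclasses_of(c)]

print(reformulate(("?x", "type", "Drug")))
# [('?x', 'type', 'Drug'), ('?x', 'type', 'NSAID'), ('?x', 'type', 'Statin')]
```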

Real-life queries were tested over things like biomedical databases which use ontologies (familiar from personal experience, e.g. SNOMED, etc.).

Use graph partitioning and parallelisation – a single operator over multiple data partitions – e.g. Hadoop, Spark.

Shortening the execution flow -> less I/O and network data transfer in things like Apache Spark.
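
A stand-in sketch of the pattern using only the standard library (not Spark itself): one operator is applied to many data partitions in parallel and the results are combined at the end; the fewer stages such a flow has, the less intermediate I/O and shuffle traffic a real Spark job incurs.

```python
from concurrent.futures import ProcessPoolExecutor

def match_operator(partition):
    # A single operator, e.g. a triple-pattern match, run on each partition.
    return [t for t in partition if t[1] == "type"]

partitions = [
    [("a", "type", "Drug"), ("a", "label", "aspirin")],
    [("b", "type", "Drug")],
]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        matched = [t for part in pool.map(match_operator, partitions) for t in part]
    print(matched)  # [('a', 'type', 'Drug'), ('b', 'type', 'Drug')]
```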

Use of triples (tuples). This makes me wonder about other schemes, such as coordination languages like Linda, although in this context the tuple defines an object. I could potentially see that having operation A create something in tuple space that is then consumed by another process might be a tempting way to create workflows across processing units for this type of query, especially if it is possible to do it lazily. Whether this would be efficient is another matter.
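
To make the speculation concrete, a toy Linda-style tuple space (entirely my own sketch, nothing from the talk): one stage `out`s tuples and another `take`s tuples matching a template, which is the mechanism I imagine chaining query stages lazily.

```python
class TupleSpace:
    def __init__(self):
        self._tuples = []

    def out(self, tup):
        self._tuples.append(tup)

    def take(self, template):
        """Remove and return the first tuple matching `template`;
        None in the template matches anything."""
        for i, tup in enumerate(self._tuples):
            if len(tup) == len(template) and all(
                t is None or t == v for t, v in zip(template, tup)
            ):
                return self._tuples.pop(i)
        return None

ts = TupleSpace()
ts.out(("a", "type", "Drug"))         # stage A produces a tuple
print(ts.take((None, "type", None)))  # stage B consumes it
```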

From triple groups, create a type system? Such types might allow type-based optimisation and different indexing schemes.
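
A minimal sketch of what I took this to mean (my reading, not the paper’s exact scheme): group a subject’s triples, derive a record-like type from the predicates present, and keep a separate index per type so queries can be routed to the right index.

```python
from collections import defaultdict

# Illustrative triples; subjects and predicates are invented for the example.
triples = [
    ("a", "type", "Drug"), ("a", "label", "aspirin"),
    ("b", "type", "Drug"), ("b", "label", "ibuprofen"),
    ("p1", "type", "Protein"), ("p1", "sequence", "MKWV"),
]

# Group each subject's triples, then key the group by its predicate set.
by_subject = defaultdict(dict)
for s, p, o in triples:
    by_subject[s][p] = o

index_per_type = defaultdict(list)
for s, props in by_subject.items():
    record_type = tuple(sorted(props))  # e.g. ('label', 'type')
    index_per_type[record_type].append((s, props))

for rtype, rows in index_per_type.items():
    print(rtype, "->", rows)
```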

The method (Semistorm+) does seem to offer good performance, to be very scalable, and to use less storage compared to Hive.

Will need to read the paper too (link pending).

Hive?

Questions: time issues in relationships (when a relationship is true – temporal knowledge graphs). They have not looked at this yet.

HDF5

MIQS – a Metadata Indexing and Querying System that provides in-memory, persistent indexing of the contents of HDF5 files to speed up applications.

Use IDs rather than paths.

Use a radix tree for indexing the data.

Self-balancing search tree for fast search.

Use of data structures to ensure elements are grouped for fast access to the metadata index.
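
A minimal sketch of the idea as I understood it, using h5py: walk an HDF5 file once, pull each object’s attributes into an in-memory index from (attribute, value) to compact object IDs, so later metadata queries avoid re-walking the file. MIQS itself uses a radix tree plus a self-balancing search tree; a plain dict stands in for both here.

```python
import h5py

# Create a small example file to index.
with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset("run1/temperature", data=[1.0, 2.0])
    dset.attrs["units"] = "K"

index = {}    # (attr_name, attr_value) -> list of object IDs
obj_ids = {}  # object path -> compact integer ID (IDs rather than paths)

with h5py.File("example.h5", "r") as f:
    def visit(name, obj):
        oid = obj_ids.setdefault(name, len(obj_ids))
        for attr, value in obj.attrs.items():
            index.setdefault((attr, str(value)), []).append(oid)
    f.visititems(visit)

print(index)  # {('units', 'K'): [1]} -- the ID assigned to 'run1/temperature'
```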

Evaluated against MongoDB: 100 HDF5 files – 145 GB, 1.5 million data objects. Speed: about 60% of that for MongoDB. But is MongoDB the best comparator? I don’t know. This question was asked from the floor too, and the answer was not really satisfying. I would like to see more comparisons.