Compilation Techniques for Incremental Collection Processing

Many map-reduce frameworks as well as NoSQL systems rely on collection programming as their interface of choice due to its rich semantics along with an easily parallelizable set of primitives. Unfortunately, the potential of collection programming is not entirely fulfilled by current systems as they lack efficient incremental view maintenance (IVM) techniques for queries producing large nested results. This comes as a consequence of the fact that the nesting of collections does not enjoy the same algebraic properties underscoring the optimization potential of typical collection processing constructs. We propose the first solution for the efficient incrementalization of collection programming in terms of its core constructs as captured by the positive nested relational calculus (NRC+) on bags (with integer multiplicities). We take an approach based on delta query derivation, whose goal is to generate delta queries which, given a small change in the input, can update the materialized view more efficiently than via recomputation. More precisely, we model the cost of NRC+ operators and classify queries as efficiently incrementalizable if their delta has a strictly lower cost than full re-evaluation. Then, we identify IncNRC+, a large fragment of NRC+ that is efficiently incrementalizable and we provide a semantics-preserving translation that takes any NRC+ query to a collection of IncNRC+ queries. Furthermore, we prove that incrementalmaintenance for NRC+ is within the complexity class NC0 and we showcase how Recursive IVM, a technique that has provided significant speedups over traditional IVM in the case of flat queries, can also be applied to IncNRC+ . Existing systems are also limited wrt. the size of inner collections that they can effectively handle before running into severe performance bottlenecks. In particular, in the face of nested collections with skewed cardinalities developers typically have to undergo a painful process of manual query re-writes in order to ensure that the largest inner collections in their workloads are not impacted by these limitations. To address these issues we developed SLeNDer, a compilation framework that given a nested query generates a set of semantically equivalent (partially) shredded queries that can be efficiently evaluated and incrementalized using state of the art techniques for handling skew and applying delta changes, respectively. The derived queries expose nested collections to the same opportunities for distributing their processing and incrementally updating their contents as those enjoyed by top-level collections, leading on our benchmark to up to 16.8x and 21.9x speedups in terms of offline and online processing, respectively. In order to enable efficient IVM for the increasingly common case of collection programming with functional values as in Links, we also discuss the efficient incrementalization of simplytyped lambda calculi, under the constraint that their primitives are themselves efficiently incrementalizable.


  • Thesis submitted - Forthcoming publication

Related material