Many map-reduce frameworks as well as NoSQL systems rely on collection programming as
their interface of choice due to its rich semantics along with an easily parallelizable set of
primitives. Unfortunately, the potential of collection programming is not entirely fulfilled
by current systems as they lack efficient incremental view maintenance (IVM) techniques
for queries producing large nested results. This comes as a consequence of the fact that
the nesting of collections does not enjoy the same algebraic properties underscoring the
optimization potential of typical collection processing constructs.
We propose the first solution for the efficient incrementalization of collection programming
in terms of its core constructs as captured by the positive nested relational calculus (NRC+)
on bags (with integer multiplicities). We take an approach based on delta query derivation,
whose goal is to generate delta queries which, given a small change in the input, can update
the materialized view more efficiently than via recomputation. More precisely, we model the
cost of NRC+ operators and classify queries as efficiently incrementalizable if their delta has
a strictly lower cost than full re-evaluation. Then, we identify IncNRC+, a large fragment of
NRC+ that is efficiently incrementalizable and we provide a semantics-preserving translation
that takes any NRC+ query to a collection of IncNRC+ queries. Furthermore, we prove that
incrementalmaintenance for NRC+ is within the complexity class NC0 and we showcase how
Recursive IVM, a technique that has provided significant speedups over traditional IVM in the
case of flat queries, can also be applied to IncNRC+ .
Existing systems are also limited wrt. the size of inner collections that they can effectively
handle before running into severe performance bottlenecks. In particular, in the face of nested
collections with skewed cardinalities developers typically have to undergo a painful process of
manual query re-writes in order to ensure that the largest inner collections in their workloads
are not impacted by these limitations.
To address these issues we developed SLeNDer, a compilation framework that given a nested
query generates a set of semantically equivalent (partially) shredded queries that can be
efficiently evaluated and incrementalized using state of the art techniques for handling skew
and applying delta changes, respectively. The derived queries expose nested collections to
the same opportunities for distributing their processing and incrementally updating their
contents as those enjoyed by top-level collections, leading on our benchmark to up to 16.8x
and 21.9x speedups in terms of offline and online processing, respectively.
In order to enable efficient IVM for the increasingly common case of collection programming
with functional values as in Links, we also discuss the efficient incrementalization of simplytyped
lambda calculi, under the constraint that their primitives are themselves efficiently
incrementalizable.
EPFL_TH8019.pdf
openaccess
2.25 MB
Adobe PDF
cf0a10ecb7269f41e701788bd17cb0fc