Evaluating Top-k Queries over Incomplete Data Streams
We study the problem of continuous monitoring of top-k queries over multiple non-synchronized streams. Assuming a sliding window model, this general problem has been a well addressed research topic in recent years. However, most approaches assume the streams to be synchronized, that is, all attributes of an object are known simultaneously to the query processing engine, which is a crucial assumption for skyline based approaches that construct a k-skyband of potential top-k items. In many streaming scenarios though, different attributes of an item are reported in separate non-synchronized streams which do not allow for exact score calculations. We present how the traditional notion of object dominance changes in this case such that the k dominance set still includes all and only those objects which have a chance of being among the top-k results in their life time. Based on this, we propose an exact algorithm which builds on generating multiple instances of the same object in a way that enables efficient object pruning. We show that even with object pruning the necessary storage for exact evaluation of top-k queries is linear in the size of the sliding window. As data should reside in main memory to provide fast answers in an online fashion and cope with high stream rates, storing all this data may not be possible with limited resources. We present an approximate algorithm which leverages correlation statistics of pairs of streams to evict more objects while maintaining accuracy. We evaluate the efficiency of our proposed algorithms with extensive experiments.