Fast Correlation Discovery for Large-Scale Streaming Time-Series Data
The dramatic rise of streaming time-series data produced in a variety of contexts, such as stock markets, mobile sensing, sensor networks, and data centre monitoring, has fuelled the development of large-scale distributed real-time computation systems (e.g., Apache Storm, Spark Streaming, S4). However, it is still unclear how certain important tasks, which can be performed with relative ease in a centralized system, could be performed using such distributed systems. In this paper, we focus on one such task: continuously discovering correlations among a large number of streaming time series. In doing so, we address two key challenges: (1) the number of time-series pairs that have to be analyzed grows quadratically, O(n^2), in the number of time series n, giving rise to a quadratic increase in the communication cost between different nodes of the distributed system; and (2) as the size of the time series grows, the computational and communication costs again increase at a prohibitive rate.

To tackle these challenges, we propose an approach referred to as AEGIS. First, AEGIS approximates a group of streams using affine transformations and communicates only these stream groups, which are smaller in size, thereby significantly reducing the communication overhead. Second, AEGIS dramatically enhances computational efficiency by exploiting the properties of affine transformations to prune the number of evaluated correlations. As baselines, we adapt well-known centralized correlation computation approaches to the distributed environment. Our extensive experimental evaluations on real and synthetic datasets establish that AEGIS outperforms the baseline approaches in terms of communication cost, processing latency, and peak capacity.
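The two ideas summarized above, quadratic growth in the number of pairs and the use of affine relationships between streams, can be illustrated with a minimal sketch. This is not the AEGIS algorithm. The function names (`num_pairs`, `pearson`, `fit_affine`) and the sample data are hypothetical, and the sketch assumes Pearson correlation over a fixed window. It shows only that a stream which is an affine image of another (with positive slope) can be summarized by two coefficients, and that it has the same correlation with any third stream.

```python
import math


def num_pairs(n):
    """Number of unordered pairs among n time series: n(n-1)/2, i.e. O(n^2)."""
    return n * (n - 1) // 2


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length windows."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def fit_affine(x, y):
    """Least-squares fit of y ~ a * x + b; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((xi - mx) ** 2 for xi in x)
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / vx
    b = my - a * mx
    return a, b


# Hypothetical windows: `other` is an exact affine image of `base`.
base = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0]
other = [2.0 * v + 3.0 for v in base]
third = [5.0, 3.0, 6.0, 2.0, 7.0, 4.0, 8.0, 1.0]

# Two coefficients stand in for the whole window of `other`.
a, b = fit_affine(base, other)

# With a > 0, corr(other, third) equals corr(base, third), so one
# evaluation covers both pairs.
same = abs(pearson(base, third) - pearson(other, third)) < 1e-9
print(num_pairs(1000), a, b, same)
```

In this toy setting, sending `(a, b)` in place of the full window of `other` is what reduces communication, and reusing the correlation computed for `base` is what reduces the number of evaluated pairs. Real streams are only approximately affine, so any practical scheme must also bound the approximation error.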
2014