Efficient Incremental Data Analysis

Nikolic, Milos

doi:10.5075/epfl-thesis-7183

Nikolic, Milos

2016

Download

Formats

Format
BibTeX
MARCXML
TextMARC
MARC
DublinCore
EndNote
NLM
RefWorks
RIS

Files

Abstract

Many data-intensive applications require real-time analytics over streaming data. In a growing number of domains -- sensor network monitoring, social web applications, clickstream analysis, high-frequency algorithmic trading, and fraud detections to name a few -- applications continuously monitor stream events to promptly react to certain data conditions. These applications demand responsive analytics even when faced with high volume and velocity of incoming changes, large numbers of users, and complex processing requirements. Developing suitable online analytics engine that meets these requirements is challenging. In this thesis, we study techniques for efficient online processing of complex analytical queries, ranging from standard database queries to complex machine learning and digital signal processing workflows. First, we focus on the problem of efficient incremental computation for database queries. We have developed a system, called DBToaster, that compiles declarative queries into high-performance stream processing engines that keep query results (views) fresh at very high update rates. At the heart of our system is a recursive query compilation algorithm that materializes a set of supporting higher-order delta views to achieve a substantially lower view maintenance cost. We study the trade-offs between single-tuple and batch incremental processing in local execution, and we present a novel approach for compiling view maintenance code into data-parallel programs optimized for distributed execution. DBToaster supports millions of complete view refreshes per second for a broad range of queries and outperforms commercial database and stream engines by orders of magnitude. We also study the incremental computation for queries written as iterative linear algebra, which can capture many machine learning and scientific calculations. We have developed a framework, called LINVIEW, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over re-evaluation. We develop techniques based on matrix factorizations to contain such epidemics of change and make incremental view maintenance of linear algebra practical and usually substantially cheaper than re-evaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks. Our last research question concerns the integration of general-purpose query processors and domain-specific operations to enable deep data exploration in both online and offline analysis. We advocate a deep integration of signal processing operations and general-purpose query processors. We demonstrate that in-situ processing of tempo-relational and signal data through a unified query language empowers users to express end-to-end workflows more succinctly inside one system while at the same time offering orders of magnitude better performance than existing popular data management systems.

Details

Title Efficient Incremental Data Analysis

Author(s) Nikolic, Milos

Advisor(s)

Koch, Christoph

Pagination 181

Date 2016

Publisher Lausanne, EPFL

Keywords

incremental view maintenance (IVM); materialized views; databases; stream processing; batch processing; higher-order IVM; distributed IVM; DBToaster; incremental linear algebra; compilation; signal processing

Language English

DOI https://doi.org/10.5075/epfl-thesis-7183

Other identifier(s) urn: urn:nbn:ch:bel-epfl-thesis7183-6

Laboratories DATA

Record Appears in Scientific production and competences > I&C - School of Computer and Communication Sciences > IINFCOM > DATA - Data Analysis Theory and Applications Laboratory
Scientific production and competences > EPFL Theses
Work produced at EPFL
Published
Theses

Record creation date 2016-08-17

Files

Abstract

Details

PDF