000231939 001__ 231939
000231939 005__ 20190619041446.0
000231939 0247_ $$2doi$$a10.5075/epfl-thesis-7731
000231939 02470 $$2urn$$aurn:nbn:ch:bel-epfl-thesis7731-9
000231939 02471 $$2nebis$$a11056972
000231939 037__ $$aTHESIS
000231939 041__ $$aeng
000231939 088__ $$a7731
000231939 245__ $$aEfficient Online Processing for Advanced Analytics
000231939 260__ $$bEPFL$$c2017$$aLausanne
000231939 269__ $$a2017
000231939 300__ $$a146
000231939 336__ $$aTheses
000231939 502__ $$aProf. Willy Zwaenepoel (président) ; Prof. Christoph Koch (directeur de thèse) ; Prof. Karl Aberer, Prof. Markus Püschel, Dr Carlo Curino (rapporteurs)
000231939 520__ $$aWith the advent of emerging technologies and the Internet of Things, the importance of online data analytics has become more pronounced. Businesses and companies are adopting approaches that provide responsive analytics to stay competitive in the global marketplace. Online analytics allow data analysts to promptly react to patterns or to gain preliminary insights from early results that aid in research, decision making, and effective strategy planning. The growth of data-velocity in a variety of domains including, high-frequency trading, social networks, infrastructure monitoring, and advertising require adopting online engines that can efficiently process continuous streams of data.  This thesis presents foundations, techniques, and systems' design that extend the state-of-the-art in online query processing to efficiently support relational joins with arbitrary join-predicates (beyond traditional equi-joins); and to support other data models (beyond relational) that target machine learning and graph computations. The thesis is divided into two parts:  We first present a brief overview of Squall, our open-source online query processing engine that supports SQL-like queries on top of streams. Then, we focus on extending Squall to support efficient theta-join processing. Scalable distributed join processing requires a partitioning policy that evenly distributes the processing load while minimizing the size of maintained state and duplicated messages. Efficient load-balance demands apriori-statistics which are not available in the online setting. We propose a novel operator that continuously adjusts itself to the data dynamics, through adaptive dataflow routing and state repartitioning. It is also resilient to data-skew, maintains high throughput rates, avoids blocking during state repartitioning, and behaves as a black-box dataflow operator with provable performance guarantees. Our evaluation demonstrates that the proposed operator outperforms the state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time up to 7x. In the second part, we present a novel framework that supports the Incremental View Maintenance (IVM) of workloads expressed as linear algebra programs. Linear algebra represents a concrete substrate for advanced analytical tasks including, machine learning, scientific computation, and graph algorithms. Previous works on relational calculus IVM are not applicable to matrix algebra workloads. This is because a single entry change to an input-matrix results in changes all over the intermediate views, rendering IVM useless in comparison to re-evaluation. We present Lago, a unified modular compiler framework that supports the IVM of a broad class of linear algebra programs. Lago automatically derives and optimizes incremental trigger programs of analytical computations, while freeing the user from erroneous manual derivations, low-level implementation details, and performance tuning. We present a novel technique that captures $\Delta$ changes as low-rank matrices. Low-rank matrices are representable in a compressed factored form that enables cheaper computations. Lago automatically propagates the factored representation across program statements to derive an efficient trigger program. Moreover, Lago extends its support to other domains that use different semi-ring configurations, e.g., graph applications. Our evaluation results demonstrate orders of magnitude (10x-10
000231939 6531_ $$aonline query engine
000231939 6531_ $$atheta-joins
000231939 6531_ $$aefficient joins
000231939 6531_ $$askew-resilience
000231939 6531_ $$aadaptivity
000231939 6531_ $$amatrix algebra
000231939 6531_ $$aincremental view maintenance
000231939 6531_ $$arewrite systems
000231939 6531_ $$aincremental computation
000231939 6531_ $$acompiler optimization
000231939 700__ $$0246550$$g213525$$aEl Seidy, Mohamed Elsayed Mohamed Ahmed
000231939 720_2 $$aKoch, Christoph$$edir.$$g205917$$0244689
000231939 8564_ $$uhttps://infoscience.epfl.ch/record/231939/files/EPFL_TH7731.pdf$$zn/a$$s5314731$$yn/a
000231939 909C0 $$xU12327$$0252342$$pDATA
000231939 909CO $$pthesis-public$$pDOI$$pIC$$ooai:infoscience.tind.io:231939$$qGLOBAL_SET$$pthesis$$pthesis-bn2018$$qDOI2
000231939 917Z8 $$x108898
000231939 918__ $$dEDIC$$cIINFCOM$$aIC
000231939 919__ $$aDATA
000231939 920__ $$b2017$$a2017-10-26
000231939 970__ $$a7731/THESES
000231939 973__ $$sPUBLISHED$$aEPFL
000231939 980__ $$aTHESIS