000224052 001__ 224052
000224052 005__ 20180501105441.0
000224052 0247_ $$2doi$$a10.5075/epfl-thesis-7395
000224052 02470 $$2urn$$aurn:nbn:ch:bel-epfl-thesis7395-8
000224052 02471 $$2nebis$$a10806501
000224052 037__ $$aTHESIS_LIB
000224052 041__ $$aeng
000224052 088__ $$a7395
000224052 245__ $$aDistributed Time Series Analytics
000224052 269__ $$a2017
000224052 260__ $$aLausanne$$bEPFL$$c2017
000224052 300__ $$a165
000224052 336__ $$aTheses
000224052 502__ $$aProf. Boi Faltings (president); Prof. Karl Aberer (thesis director); Dr Martin Rajman, Prof. Albert Bifet, Dr Thanasis Papaioannou (examiners)
000224052 520__ $$aIn recent years, time series data has become ubiquitous thanks to affordable sensors and advances in embedded technology. Large amounts of time series data are continuously produced in a wide spectrum of applications, such as sensor networks and medical monitoring. The availability of such large-scale time series data highlights the importance of scalable data management, efficient querying and analysis. Meanwhile, in the online setting, time series carry valuable information about the real-time status of the involved entities or monitored phenomena, which calls for online time series data mining to support timely decision making or event detection. In this thesis we aim to address these important issues pertaining to scalable and distributed analytics techniques for massive time series data. Concretely, this thesis is centered around the following three topics: As the number of sensors that pervade our lives increases significantly (e.g., environmental sensors, mobile phone sensors, IoT applications, etc.), the efficient management of massive amounts of time series data from such sensors is becoming increasingly important. The unbounded nature of sensor data poses a serious challenge for query processing, even in a cloud infrastructure. Traditional raw sensor data management systems based on relational databases lack the scalability to accommodate large-scale sensor data efficiently. Thus, distributed key-value stores in the cloud are becoming a prime tool for managing sensor data. However, there are currently no techniques for indexing and/or query optimization of model-view sensor time series data in the cloud. In Chapter 2, we propose an innovative index for modeled segments in key-value stores, namely the KVI-index. The KVI-index consists of two interval indices, on the time and sensor value dimensions respectively, each of which has an in-memory search tree and a secondary list materialized in the key-value store.
The dramatic increase in the availability of data streams fuels the development of many distributed real-time computation engines (e.g., Storm, Samza, Spark Streaming, S4, etc.). In Chapter 3, we focus on a fundamental time series mining task in this new computation paradigm, namely continuously mining dynamic (lagged) correlations in time series via a distributed real-time computation engine. Correlations reveal the hidden and temporal interactions across time series and are widely used in scientific data analysis, data-driven event detection, financial markets and so on. We propose the P2H framework, consisting of a parallelism-partitioning based data shuffling method and a hypercube structure based computation pruning method, so as to enhance both the communication and computation efficiency of mining correlations in the distributed context. In numerous real-world applications, large datasets collected from observations and measurements of physical entities are inevitably noisy and contain outliers. The outliers in such large and noisy datasets can dramatically degrade the performance of standard distributed machine learning approaches such as regression trees. In Chapter 4, we present a novel distributed regression tree approach that utilizes robust regression statistics, i.e., statistics that are less sensitive to outliers, for handling large and noisy datasets. We then present an adaptive gradient learning method for recurrent neural networks (RNN) to forecast streaming time series in the presence of both outliers and change points.
000224052 6531_ $$atime series data mining
000224052 6531_ $$adistributed computing
000224052 6531_ $$atime series data management
000224052 6531_ $$arecurrent neural network
000224052 6531_ $$arobust regression
000224052 6531_ $$adecision tree
000224052 6531_ $$adata summarization
000224052 700__ $$0245910$$aGuo, Tian$$g211685
000224052 720_2 $$0240941$$aAberer, Karl$$edir.$$g134136
000224052 8564_ $$s8234224$$uhttps://infoscience.epfl.ch/record/224052/files/EPFL_TH7395.pdf$$yn/a$$zn/a
000224052 909C0 $$0252004$$pLSIR$$xU10405
000224052 909CO $$ooai:infoscience.tind.io:224052$$pIC$$pthesis-bn2018$$pDOI2$$pDOI$$pthesis
000224052 917Z8 $$x108898
000224052 917Z8 $$x108898
000224052 917Z8 $$x108898
000224052 918__ $$aIC$$cIINFCOM$$dEDIC
000224052 919__ $$aLSIR
000224052 920__ $$a2017-3-17$$b2017
000224052 970__ $$a7395/THESES
000224052 973__ $$aEPFL$$sPUBLISHED
000224052 980__ $$aTHESIS