Time- and Space-Efficient Spatial Data Analytics

Pavlovic, Mirjana

doi:10.5075/epfl-thesis-9130

2019

Abstract

Advances in data acquisition technologies and supercomputing for large-scale simulations have led to an exponential growth in the volume of spatial data. This growth is accompanied by an increase in data complexity, such as spatial density, but also by more varied data distributions. As data evolves, so do the needs of applications. Recently, we have observed a shift from predefined to ad-hoc workloads, driven by the data exploration trend among data-driven applications. At the same time, given the massive volume of data, it has become imperative to use computational and storage resources efficiently, where efficiency requirements typically vary across applications.

In this thesis, we show that traditional spatial data management techniques underperform as data size and complexity increase: they waste both computational and storage resources, and they support ad-hoc workloads inefficiently. To achieve time- and space-efficiency, we design spatial data management algorithms and storage layouts that leverage and adapt to data characteristics and workload access patterns. In particular, we revisit the design of spatial join algorithms, indexing techniques, and point cloud data management solutions.

First, we propose data-aware spatial joins that leverage and adapt to dataset characteristics to avoid wasting computational resources and achieve time-efficiency on non-uniform data distributions. GIPSY is designed to efficiently join two datasets with contrasting densities. It uses the sparser dataset to guide the join process and thus selectively retrieves and joins only the data needed. TRANSFORMERS achieves robust performance and time-efficiency on non-uniform data distributions by adapting to dataset characteristics. It detects local variations in distributions and adapts the join strategy and data layout to local dataset characteristics at run-time.

We next introduce incremental indexing approaches that take into account workload access patterns. This way, they minimize the data-to-insight time and avoid unnecessary preprocessing costs, substantially accelerating the exploratory analysis of spatial data. Incremental indexes are built as a side-effect of query execution and only for the parts of the data queried. Space Odyssey is designed for exploratory analyses of multiple spatial datasets that reside on disk. It takes advantage of workload access patterns to incrementally index the datasets and optimize accesses to parts frequently queried together. QUASII supports spatial data exploration in main memory. It reduces the data-to-insight time and curbs the cost of incremental indexing by gradually and partially sorting the data, while simultaneously producing a data-oriented hierarchical structure.

Finally, we propose a time- and space-efficient solution to storing and managing point cloud data in main-memory column-store database management systems. Our approach leverages point cloud data properties to employ dictionary-based compression in the spatial data management domain and enhances it with indexing capabilities by using space-filling curves. The proposed scheme also serves as a partitioning strategy: a middle ground between data- and space-oriented partitioning that accounts for the data distribution while preserving the simplicity of grid-like structures.
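The combination of dictionary-based compression with a space-filling curve, mentioned at the end of the abstract, can be illustrated with a minimal sketch. This is not the scheme from the thesis: the 2D Morton (Z-order) curve, the uniform grid, the `cell_size` parameter, and all function names below are assumptions chosen only for illustration, and coordinates are assumed to be non-negative.

```python
from bisect import bisect_left


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the 2D Morton (Z-order) code of non-negative cell coordinates."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return code


def dictionary_encode(points, cell_size):
    """Replace each point by an index into a sorted dictionary of cell codes.

    Hypothetical helper: points sharing a grid cell share one dictionary
    entry, so dense regions compress well.
    """
    codes = [
        interleave_bits(int(px // cell_size), int(py // cell_size))
        for px, py in points
    ]
    dictionary = sorted(set(codes))              # distinct codes, Z-order sorted
    encoded = [bisect_left(dictionary, c) for c in codes]
    return dictionary, encoded


def points_in_cell(dictionary, encoded, cx, cy):
    """Return the positions of all points that fall in grid cell (cx, cy)."""
    code = interleave_bits(cx, cy)
    pos = bisect_left(dictionary, code)          # binary search on sorted codes
    if pos == len(dictionary) or dictionary[pos] != code:
        return []                                # cell holds no points
    return [i for i, e in enumerate(encoded) if e == pos]


points = [(0.5, 0.5), (1.5, 0.2), (0.7, 0.9), (3.2, 3.9)]
dictionary, encoded = dictionary_encode(points, cell_size=1.0)
```

Because the dictionary holds only the codes of occupied cells and is sorted along the curve, it acts as both a compression table and a lookup structure: empty regions cost nothing, and a cell lookup is a binary search over the dictionary.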

Details

Title Time- and Space-Efficient Spatial Data Analytics

Author(s) Pavlovic, Mirjana

Advisor(s)

Ailamaki, Anastasia

Pagination 183

Date 2019

Publisher Lausanne, EPFL

Keywords

data management; database management systems; scientific data management; spatial data management; spatial data analytics; data exploration; spatial data compression; multidimensional data access methods; spatial joins; incremental indexing

Language English

DOI https://doi.org/10.5075/epfl-thesis-9130

Laboratories DIAS

Record Appears in DIAS - Data-Intensive Applications and Systems Laboratory (I&C - School of Computer and Communication Sciences, IINFCOM); EPFL Theses

Record creation date 2019-01-31
