Efficient Approximate Analytics via Adaptive Context-Conscious Query Processing
Industry and academia rely on ad-hoc data analysis to extract new value and timely insights. At the same time, growing data volume makes interactive ad-hoc analytics challenging even on modern in-memory analytical execution engines. While sampling provides a principled way of reducing the data volume, the overheads of runtime sampling make it impractical. In contrast, the strong workload assumptions of offline sampling make fast execution possible but are too inflexible for data exploration. In addition, most data growth consists of unstructured data, such as text and images, driven by social media and the widespread availability of multimedia devices. Such data contains human-understandable semantic context and presents a new source of value, especially in conjunction with associated structured data. While representation learning models enable contextual data processing, orchestrating exploratory queries over complex analytical pipelines is a tedious process that leads to inefficient execution and data underutilization.
This thesis aims to address the pressure from data volume, semantic data variety, and evolving hardware capabilities via context-conscious approximate analytical query processing. We abstract individual operations so that they are amenable to logical and physical optimizations informed by the workload-driven, real-time execution context. To this end, we design and implement techniques that are individually scalable through hardware-conscious design and efficient by adapting to the query and workload characteristics.
Regarding data volume, we propose a framework for ad-hoc sample construction designed for scale-up analytical engine processing. Decoupling data from the state makes sample materialization a low-overhead side effect of execution. To reduce the cost of ad-hoc data exploration, we introduce lazy sampling, which opportunistically reuses prior data access and computation, enabling partial sampling on the critical path of execution.
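As a rough illustration of sample materialization as a side effect of execution, the sketch below keeps a uniform sample while a scan computes its own exact aggregate. It uses textbook reservoir sampling (Algorithm R) as a stand-in, not the thesis's framework; the function names, the sample size, and the input data are hypothetical.

```python
import random


def scan_sum_with_sample(rows, k, rng):
    """Compute an exact sum while building a uniform reservoir sample of size k.

    Assumption: plain reservoir sampling (Algorithm R), used only to illustrate
    "sampling as a side effect of a scan"; it is not the thesis's technique.
    """
    total = 0
    reservoir = []
    for i, value in enumerate(rows):
        total += value                 # the query's own work on this row
        if i < k:
            reservoir.append(value)    # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive bounds on both ends
            if j < k:
                reservoir[j] = value   # replace with probability k / (i + 1)
    return total, reservoir


def approx_mean(sample):
    """Answer a later query from the stored sample instead of rescanning."""
    return sum(sample) / len(sample)


rng = random.Random(0)                 # fixed seed for repeatability
rows = list(range(1000))               # hypothetical column of values
total, sample = scan_sum_with_sample(rows, 50, rng)
estimate = approx_mean(sample)
```

Here the first query pays for the scan once, and a later exploratory query can reuse `sample` for an approximate answer without touching the base data again.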
Regarding unstructured semantic data variety, we extend the relational model principles for semantic data processing by separating the embedding models as context providers from the context-free vectors as processing formats. This separation of concerns provides tight integration of unstructured data based on neural vector embeddings in relational analytical engines and data-centric logical and physical optimizations catering to vector-relational processing, similarity operations, and access paths.
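The separation between a context provider and context-free vectors can be pictured with a minimal sketch. Everything here is an assumption made for illustration: `toy_embed` is a hypothetical letter-frequency stand-in for a real embedding model, and `similarity_filter` is a hypothetical operator name; neither comes from the thesis.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical context provider: a 26-dimensional letter-frequency vector."""
    counts = Counter(c for c in text.lower() if "a" <= c <= "z")
    return [counts.get(chr(ord("a") + i), 0) for i in range(26)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_filter(rows, query_vec, threshold):
    """Relational-style selection that sees only vectors, never the model."""
    return [r for r in rows if cosine(r["vec"], query_vec) >= threshold]


# The model runs once, at the boundary; operators then work on plain vectors.
table = [
    {"id": 1, "text": "listen"},
    {"id": 2, "text": "silent"},
    {"id": 3, "text": "zzz"},
]
for row in table:
    row["vec"] = toy_embed(row["text"])

matches = similarity_filter(table, toy_embed("enlist"), 0.99)
```

Because the operator depends only on the vector format, the embedding model could be swapped without changing the selection logic, which is the kind of separation of concerns the paragraph describes.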
This thesis redesigns approximate analytical processing to support interactive data exploration over structured and unstructured data while exploiting hardware and workload-level optimizations. Instead of being limited to static data, query, and workload assumptions, this thesis embraces runtime adaptivity. Overall, our design enables efficient and scalable analytics for ad-hoc workloads by exposing execution primitives amenable to inter- and intra-query optimization. As a result, users benefit from simpler and faster data exploration and richer insights from structured and unstructured data, improving the overall data utility and system efficiency.