Infoscience

doctoral thesis

Efficient Approximate Analytics via Adaptive Context-Conscious Query Processing

Sanca, Viktor  
2024

Industry and academia rely on ad-hoc data analysis to extract new value and timely insights. At the same time, growing data volume challenges interactive ad-hoc analytics on modern in-memory analytical execution engines. While sampling provides a principled way of reducing the data volume, the overheads of runtime sampling make it impractical. In contrast, the strong workload assumptions of offline sampling make fast execution possible but are inflexible for data exploration. In addition, most data growth consists of unstructured data, such as text and images, driven by social media and the widespread availability of multimedia devices. Such data contains human-understandable semantic context and presents a new source of value, especially in conjunction with associated structured data. While representation learning models enable contextual data processing, orchestrating exploratory queries over complex analytical pipelines is a tedious process that leads to inefficient execution and data underutilization.

This thesis aims to address the pressure from data volume, semantic data variety, and evolving hardware capabilities via context-conscious approximate analytical query processing. We abstract the individual operations amenable to logical and physical optimizations informed by the workload-driven real-time execution context. To this end, we design and implement techniques that are individually scalable through hardware-conscious design and efficient by adapting to the query and workload characteristics.

Regarding data volume, we propose a framework for ad-hoc sample construction designed for scale-up analytical engine processing. Decoupling data from the state makes sample materialization a low-overhead side effect of execution. To reduce the cost of ad-hoc data exploration, we introduce lazy sampling, which opportunistically reuses prior data access and computation, enabling partial sampling on the critical path of execution.
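
As a rough illustration of sample materialization as a side effect of execution, the sketch below computes an exact filtered sum in a single scan and keeps a uniform sample of the qualifying values along the way. It uses classic reservoir sampling (Algorithm R), not the thesis's lazy sampling framework, whose design is not described in this abstract. The names `scan_with_sample` and `estimate_sum` are hypothetical.

```python
import random
from typing import Callable, Iterable, List, Tuple


def scan_with_sample(
    rows: Iterable[float],
    predicate: Callable[[float], bool],
    k: int,
    seed: int = 0,
) -> Tuple[float, List[float], int]:
    """Sum the qualifying values and, in the same pass, keep a uniform
    reservoir sample of at most k of them (Algorithm R).

    Assumption: this is a generic textbook technique used for illustration,
    not the sampling operator designed in the thesis.
    """
    rng = random.Random(seed)
    reservoir: List[float] = []
    seen = 0       # number of qualifying rows observed so far
    total = 0.0    # exact aggregate computed by the scan itself
    for value in rows:
        if not predicate(value):
            continue
        total += value
        seen += 1
        if len(reservoir) < k:
            reservoir.append(value)
        else:
            # Keep the new value with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = value
    return total, reservoir, seen


def estimate_sum(reservoir: List[float], seen: int) -> float:
    """Estimate the sum of all qualifying rows from the sample alone:
    sample mean scaled by the number of qualifying rows."""
    if not reservoir:
        return 0.0
    return sum(reservoir) / len(reservoir) * seen


if __name__ == "__main__":
    exact, sample, n = scan_with_sample(range(1, 1001), lambda v: v % 2 == 0, k=50)
    print("exact sum:", exact)
    print("estimate from sample:", estimate_sum(sample, n))
```

A later query over the same predicate could read only the reservoir instead of rescanning the data, which is the kind of reuse of prior data access the paragraph above refers to.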

Regarding unstructured semantic data variety, we extend the relational model principles for semantic data processing by separating the embedding models as context providers from the context-free vectors as processing formats. This separation of concerns provides tight integration of unstructured data based on neural vector embeddings in relational analytical engines and data-centric logical and physical optimizations catering to vector-relational processing, similarity operations, and access paths.
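
The separation between a context provider and context-free vectors can be pictured with the minimal sketch below. It assumes a toy embedding function standing in for a real representation learning model, and a similarity join that sees only keys and vectors. All names (`toy_embed`, `cosine`, `similarity_join`) are hypothetical and do not come from the thesis.

```python
import math
from typing import List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 8) -> Vector:
    """Hypothetical context provider: buckets letter counts into dim slots.
    A real system would call an embedding model here."""
    vec = [0.0] * dim
    for ch in text.lower():
        if ch.isalpha():
            vec[(ord(ch) - ord("a")) % dim] += 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_join(
    left: List[Tuple[int, Vector]],
    right: List[Tuple[int, Vector]],
    threshold: float,
) -> List[Tuple[int, int]]:
    """Context-free operator: a nested-loop join on cosine similarity.
    It never touches the original text or the model, only the vectors."""
    return [
        (lk, rk)
        for lk, lv in left
        for rk, rv in right
        if cosine(lv, rv) >= threshold
    ]


if __name__ == "__main__":
    products = [(1, toy_embed("abc"))]
    reviews = [(10, toy_embed("cab")), (11, toy_embed("zzz"))]
    print(similarity_join(products, reviews, threshold=0.9))
```

Because the join operates only on vectors, the embedding function can be swapped without changing the operator, and the operator itself can be replaced by a different physical implementation, such as an index-based access path, without changing the model.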

This thesis redesigns approximate analytical processing to support interactive data exploration over structured and unstructured data while exploiting hardware and workload-level optimizations. Instead of being limited to static data, query, and workload assumptions, this thesis embraces runtime adaptivity. Overall, our design enables efficient and scalable analytics for ad-hoc workloads by exposing execution primitives amenable to inter- and intra-query optimization. As a result, users benefit from simpler and faster data exploration and richer insights from structured and unstructured data, improving the overall data utility and system efficiency.

Type
doctoral thesis
DOI
10.5075/epfl-thesis-9829
Author(s)
Sanca, Viktor  

EPFL

Advisors
Ailamaki, Anastasia  
Jury

Prof. Christoph Koch (president); Prof. Anastasia Ailamaki (thesis director); Prof. Anne-Marie Kermarrec, Prof. Carsten Binnig, Dr Justin Levandoski (examiners)

Date Issued

2024

Publisher

EPFL

Publisher place

Lausanne

Public defense year

2024-08-19

Thesis number

9829

Total of pages

232

Subjects

database management systems • analytical query processing • data analytics • real-time analytics • approximate query processing • unstructured data • query optimization • vector-relational data management • ML and databases • context-conscious analytics

EPFL units
DIAS  
Faculty
IC  
School
IINFCOM  
Doctoral School
EDIC  
Available on Infoscience
August 13, 2024
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/240718