Infoscience

doctoral thesis

Efficient Approximate Analytics via Adaptive Context-Conscious Query Processing

Sanca, Viktor  
2024

Industry and academia rely on ad-hoc data analysis to extract new value and timely insights. At the same time, growing data volume challenges interactive ad-hoc analytics on modern in-memory analytical execution engines. While sampling provides a principled way of reducing the data volume, the overheads of runtime sampling make it impractical. In contrast, the strong workload assumptions of offline sampling make fast execution possible but are inflexible for data exploration. In addition, most data growth consists of unstructured data, such as text and images, driven by social media and the widespread availability of multimedia devices. Such data contains human-understandable semantic context and presents a new source of value, especially in conjunction with associated structured data. While representation learning models enable contextual data processing, orchestrating exploratory queries over complex analytical pipelines is a tedious process that leads to inefficient execution and data underutilization.

This thesis aims to address the pressure from data volume, semantic data variety, and evolving hardware capabilities via context-conscious approximate analytical query processing. We abstract the individual operations amenable to logical and physical optimizations informed by the workload-driven real-time execution context. To this end, we design and implement techniques that are individually scalable through hardware-conscious design and efficient by adapting to the query and workload characteristics.

Regarding data volume, we propose a framework for ad-hoc sample construction designed for scale-up analytical engine processing. Decoupling data from the state makes sample materialization a low-overhead side effect of execution. To reduce the cost of ad-hoc data exploration, we introduce lazy sampling, which opportunistically reuses prior data access and computation, enabling partial sampling on the critical path of execution.
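As a rough illustration of sample materialization as a side effect of execution, the sketch below piggybacks classic reservoir sampling (Algorithm R) on a scan that is already computing an exact aggregate, so that a later exploratory query can be answered approximately from the retained sample. This is not the thesis's framework or its lazy sampling technique: the function names `scan_with_sample` and `estimate_sum`, the single-column rows, and the fixed seed are all assumptions made for the example.

```python
import random
from typing import Iterable, List, Tuple


def scan_with_sample(
    rows: Iterable[float], k: int, seed: int = 0
) -> Tuple[float, List[float], int]:
    """Compute an exact SUM over rows and, in the same pass, keep a
    uniform reservoir sample of at most k rows (Algorithm R).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    total = 0.0
    reservoir: List[float] = []
    n = 0  # number of rows seen so far
    for value in rows:
        total += value  # the "real" query work
        n += 1
        if len(reservoir) < k:
            # Fill phase: the first k rows are always kept.
            reservoir.append(value)
        else:
            # Replace phase: row n survives with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = value
    return total, reservoir, n


def estimate_sum(reservoir: List[float], n: int) -> float:
    """Scale the sample mean by the row count to estimate SUM."""
    if not reservoir:
        return 0.0
    return n * sum(reservoir) / len(reservoir)
```

A first query would call `scan_with_sample` and pay for one full pass; a follow-up query over the same data could then call `estimate_sum` on the retained reservoir instead of rescanning, which is the reuse idea in miniature.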

Regarding unstructured semantic data variety, we extend the relational model principles for semantic data processing by separating the embedding models as context providers from the context-free vectors as processing formats. This separation of concerns provides tight integration of unstructured data based on neural vector embeddings in relational analytical engines and data-centric logical and physical optimizations catering to vector-relational processing, similarity operations, and access paths.
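To make the separation between context providers and context-free vectors more concrete, the sketch below embeds text once with a stand-in model and then runs a combined relational and similarity predicate that touches only numbers. It is a toy under stated assumptions, not the thesis's design: `toy_embed` is a hypothetical bag-of-words function standing in for a learned embedding model, and `VOCAB`, `cosine`, `similarity_select`, and the row layout are invented for the example.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]

# Toy vocabulary (an assumption); a real system would use a learned model.
VOCAB = ["red", "blue", "shoe", "shirt"]


def toy_embed(text: str) -> Vector:
    """Context provider: maps text to a vector of word counts."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_select(
    rows: List[Dict], query_vec: Vector, threshold: float, max_price: float
) -> List[int]:
    """Processing side: a relational predicate (price) combined with a
    similarity predicate over vectors. No model is called here."""
    return [
        row["id"]
        for row in rows
        if row["price"] <= max_price
        and cosine(row["vec"], query_vec) >= threshold
    ]
```

For example, rows built as `{"id": 1, "price": 20.0, "vec": toy_embed("red shoe")}` can be filtered with `similarity_select(rows, toy_embed("red shoe"), 0.9, 50.0)`. Because the operator sees only vectors, the embedding function could be swapped without changing the query logic, and the cheap price predicate is evaluated before the costlier similarity computation.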

This thesis redesigns approximate analytical processing to support interactive data exploration over structured and unstructured data while exploiting hardware and workload-level optimizations. Instead of being limited to static data, query, and workload assumptions, this thesis embraces runtime adaptivity. Overall, our design enables efficient and scalable analytics for ad-hoc workloads by exposing execution primitives amenable to inter- and intra-query optimization. As a result, users benefit from simpler and faster data exploration and richer insights from structured and unstructured data, improving the overall data utility and system efficiency.

Name: EPFL_TH9829.pdf
Type: Main Document
Version: http://purl.org/coar/version/c_be7fb7dd8ff6fe43
Access type: openaccess
License Condition: N/A
Size: 2.48 MB
Format: Adobe PDF
Checksum (MD5): 89b775202b0ed388ce6619967e68791f


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved