Efficient Lineage Tracking for Scientific Workflows

Heinis, Thomas; Alonso, Gustavo

doi:10.1145/1376616.1376716

2008

Abstract

Data lineage and data provenance are key to the management of scientific data. Not knowing the exact provenance and processing pipeline used to produce a derived data set often renders the data set useless from a scientific point of view. On the positive side, capturing provenance information is facilitated by the widespread use of workflow tools for processing scientific data. The workflow process describes all the steps involved in producing a given data set and, hence, captures its lineage. On the negative side, efficiently storing and querying workflow-based data lineage is not trivial. All existing solutions use recursive queries and even recursive tables to represent the workflows. Such solutions do not scale and are rather inefficient. In this paper, we propose an alternative approach to storing lineage information captured as a workflow process. We use a space- and query-efficient interval representation for dependency graphs and show how to transform arbitrary workflow processes into graphs that can be stored using such a representation. We also characterize the problem in terms of its overall complexity and provide a comprehensive performance evaluation of the approach.
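
To illustrate the general idea behind an interval representation, the following is a minimal sketch of classic interval labeling on a tree, where a lineage question ("was this data set derived from that one?") becomes a constant-time interval containment test instead of a recursive traversal. It is not the paper's algorithm: the abstract does not describe the encoding in detail, and this sketch handles only trees, whereas the paper addresses arbitrary workflow graphs. The function names `label_intervals` and `is_ancestor` and the sample data set names are hypothetical.

```python
def label_intervals(children, root):
    """Assign each node a (start, end) interval by depth-first traversal.

    Assumes `children` describes a tree (illustrative simplification).
    A node's interval strictly encloses the intervals of its descendants.
    """
    intervals = {}
    counter = 0
    start = {root: counter}
    # Iterative DFS: each stack entry is (node, iterator over its children).
    stack = [(root, iter(children.get(root, ())))]
    while stack:
        node, remaining = stack[-1]
        child = next(remaining, None)
        if child is None:
            # All descendants visited: close this node's interval.
            stack.pop()
            counter += 1
            intervals[node] = (start[node], counter)
        else:
            # Open the child's interval and descend into it.
            counter += 1
            start[child] = counter
            stack.append((child, iter(children.get(child, ()))))
    return intervals


def is_ancestor(intervals, a, b):
    """Return True if `a` is a proper ancestor of `b` (interval containment)."""
    a_start, a_end = intervals[a]
    b_start, b_end = intervals[b]
    return a_start < b_start and b_end < a_end


# Hypothetical lineage tree: edges point from an input to data derived from it.
children = {"raw": ["cleaned"], "cleaned": ["stats", "plot"]}
intervals = label_intervals(children, "raw")
print(is_ancestor(intervals, "raw", "plot"))    # "plot" derives from "raw"
print(is_ancestor(intervals, "stats", "plot"))  # siblings, no lineage relation
```

Each node stores only two integers, and the containment check needs no recursion, which is the kind of space and query efficiency the abstract refers to.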

Details

Title Efficient Lineage Tracking for Scientific Workflows

Author(s) Heinis, Thomas ; Alonso, Gustavo

Published in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD'08)

Pages 1007-1018

Conference SIGMOD '08, Vancouver, Canada, June

Date 2008

Publisher ACM

ISBN 978-1-60558-102-6

DOI https://doi.org/10.1145/1376616.1376716

Laboratories DIAS

Record appears in Scientific production and competences > I&C - School of Computer and Communication Sciences > IINFCOM > DIAS - Data-Intensive Applications and Systems Laboratory; Work outside EPFL; Conference Papers; Published

Record creation date 2009-09-07
