Efficient Lineage Tracking for Scientific Workflows

Heinis, Thomas; Alonso, Gustavo

doi:10.1145/1376616.1376716

conference paper

2008

Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD'08)

SIGMOD '08

Data lineage and data provenance are key to the management of scientific data. Not knowing the exact provenance and processing pipeline used to produce a derived data set often renders the data set useless from a scientific point of view. On the positive side, capturing provenance information is facilitated by the widespread use of workflow tools for processing scientific data. The workflow process describes all the steps involved in producing a given data set and, hence, captures its lineage. On the negative side, efficiently storing and querying workflow-based data lineage is not trivial. All existing solutions use recursive queries and even recursive tables to represent the workflows. Such solutions do not scale and are rather inefficient.

In this paper we propose an alternative approach to storing lineage information captured as a workflow process. We use a space- and query-efficient interval representation for dependency graphs and show how to transform arbitrary workflow processes into graphs that can be stored using such a representation. We also characterize the problem in terms of its overall complexity and provide a comprehensive performance evaluation of the approach.
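
To illustrate the general idea of an interval representation, here is a minimal sketch of a generic depth-first interval encoding for a tree-shaped dependency graph. It is not taken from the paper and may differ from the authors' scheme. The function names (interval_encode, lineage) and the data set names are hypothetical, and the sketch assumes the dependencies already form a tree rooted at the final result; the transformation of arbitrary workflow processes mentioned in the abstract is not shown.

```python
from collections import defaultdict


def interval_encode(edges, root):
    """Label every node of a tree with a (start, end) interval.

    edges is a list of (derived, source) pairs, so the children of a
    node are the data sets it was derived from. start is the node's
    depth-first visit number and end is the largest visit number in
    its subtree, so a node's interval encloses those of everything
    it depends on.
    """
    children = defaultdict(list)
    for derived, source in edges:
        children[derived].append(source)

    intervals = {}
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in children[node]:
            visit(child)
        intervals[node] = (start, counter - 1)

    visit(root)
    return intervals


def lineage(intervals, node):
    """Return everything node depends on, using one range test per node."""
    start, end = intervals[node]
    return sorted(n for n, (s, _) in intervals.items() if start < s <= end)


# Hypothetical example: "result" is derived from "filtered" and "calib",
# and "filtered" is derived from "raw".
edges = [("result", "filtered"), ("result", "calib"), ("filtered", "raw")]
intervals = interval_encode(edges, "result")

print(intervals["result"])             # (0, 3)
print(lineage(intervals, "filtered"))  # ['raw']
print(lineage(intervals, "result"))    # ['calib', 'filtered', 'raw']
```

Once the intervals are stored, a lineage lookup is a single range comparison on the start values, which in a relational store could be a plain range predicate rather than a recursive query. The encoding pass itself is recursive here only for brevity and runs once, when the graph is stored.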

Type

conference paper

DOI

10.1145/1376616.1376716

Authors

Heinis, Thomas; Alonso, Gustavo

Publication date

2008

Publisher

ACM

Published in

Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD'08)

ISBN of the book

978-1-60558-102-6

Start page

1007

End page

1018

Peer reviewed

NON-REVIEWED

EPFL units

DIAS

Event name

SIGMOD '08

Event place

Vancouver, Canada

Event date

June

Available on Infoscience

September 7, 2009

Use this identifier to reference this record

https://infoscience.epfl.ch/handle/20.500.14299/42476