Scaling Similarity Joins over Tree-Structured Data

Tang, Yu; Cai, Yilun; Mamoulis, Nikos

doi:10.14778/2809974.2809976

2015

Abstract

Given a large collection of tree-structured objects (e.g., XML documents), the similarity join finds the pairs of objects that are similar to each other, based on a similarity threshold and a tree edit distance measure. State-of-the-art similarity join methods compare simpler approximations of the objects (e.g., strings) and use distance bounds derived from these approximations to prune pairs that cannot be part of the join result. In this paper, we propose a novel similarity join approach based on the dynamic decomposition of the tree objects into subgraphs, according to the similarity threshold. Our technique avoids computing the exact distance between two tree objects if they do not share at least one common subgraph. To scale up the join, the computed subgraphs are managed in a two-layer index. Our experimental results on real and synthetic data collections show that our approach outperforms the state-of-the-art methods by up to an order of magnitude.
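
The filter-and-verify scheme that the abstract attributes to earlier methods can be sketched as follows. This is only an illustration under stated assumptions, not the paper's subgraph decomposition or its two-layer index: trees are assumed to be ordered and encoded as nested `(label, children)` tuples, all edit operations are assumed to cost 1, the approximation is a bag of node labels (chosen here for simplicity), and every function name is hypothetical.

```python
from collections import Counter
from functools import lru_cache
from itertools import combinations

# Assumed encoding: a tree is (label, children), children being a tuple of trees.

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])

def labels(tree):
    """Bag (multiset) of node labels: the 'simpler approximation' of a tree."""
    label, children = tree
    bag = Counter([label])
    for child in children:
        bag += labels(child)
    return bag

def lower_bound(bag_a, bag_b):
    """Lower bound on unit-cost tree edit distance.

    One edit operation removes at most one unmatched label from each side,
    so at least max(unmatched in a, unmatched in b) operations are needed.
    """
    only_a = sum((bag_a - bag_b).values())
    only_b = sum((bag_b - bag_a).values())
    return max(only_a, only_b)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Exact unit-cost edit distance between two ordered forests.

    Plain memoized recursion on the rightmost roots; fine for small
    inputs, not tuned for large trees.
    """
    if not f1:
        return sum(size(t) for t in f2)
    if not f2:
        return sum(size(t) for t in f1)
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    delete = forest_distance(f1[:-1] + c1, f2) + 1   # delete rightmost root of f1
    insert = forest_distance(f1, f2[:-1] + c2) + 1   # insert rightmost root of f2
    match = (forest_distance(c1, c2)                 # map the two roots to each other
             + forest_distance(f1[:-1], f2[:-1])
             + (l1 != l2))                           # rename cost if labels differ
    return min(delete, insert, match)

def similarity_join(trees, tau):
    """Return index pairs (i, j), i < j, whose edit distance is at most tau."""
    bags = [labels(t) for t in trees]
    result = []
    for i, j in combinations(range(len(trees)), 2):
        if lower_bound(bags[i], bags[j]) > tau:
            continue  # filter: the bound already exceeds the threshold
        if forest_distance((trees[i],), (trees[j],)) <= tau:
            result.append((i, j))  # verify: exact distance confirms the pair
    return result
```

The sketch still enumerates every pair before filtering, which is the cost that an index over shared parts, such as the one the abstract describes, is meant to avoid.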

Details

Title Scaling Similarity Joins over Tree-Structured Data

Author(s) Tang, Yu ; Cai, Yilun ; Mamoulis, Nikos

Published in Proceedings of the VLDB Endowment

Pagination 12

Volume 8

Issue 11

Pages 1130-1141

Date 2015

Publisher New York, Association for Computing Machinery

ISSN 2150-8097

DOI https://doi.org/10.14778/2809974.2809976

Laboratories IINFCOM

Record Appears in Scientific production and competences > I&C - School of Computer and Communication Sciences > IINFCOM > UNATTRIBUTED-IINFCOM - IINFCOM - Unattributed publications
Peer-reviewed publications
Work produced at EPFL
Journal Articles
Published

Record creation date 2015-12-02