Emergent semantics: rethinking interoperability for large scale decentralized information systems

Cudré-Mauroux, Philippe

doi:10.5075/epfl-thesis-3690

2007

Abstract

In the past, the problem of semantic interoperability in information systems was mostly solved by means of centralization, at both the system and the logical level. This approach has been successful to a certain extent, but offers limited scalability and flexibility. Peer-to-Peer systems, a new breed of system architecture, indicate that the principles of decentralization and self-organization might offer new solutions to many problems, solutions that scale well to very large numbers of users or to systems where central authorities do not prevail. Therefore, we suggest a new way of building global agreements, i.e., semantic interoperability, based on decentralized, self-organizing interactions only.

In the first part of this thesis, we discuss traditional data integration techniques relying on global schemas, perfect schema mappings and contained query rewritings. We elaborate on the current ecology of the World Wide Web, where autonomous information sources come and go in dynamic and unpredictable ways. In the current environment, data, schemas and schema mappings can all be generated without human intervention and get encoded in syntactic structures with limited expressivity. We argue that traditional top-down integration techniques are inapplicable to that new context and propose a new integration architecture based on decentralized mappings and dynamic self-organization.

In the second part of this thesis, we propose a set of principles to foster semantic interoperability in very large scale information systems. We start by introducing new metrics for schema mappings, based on both syntactic losses (completeness) and semantic mismatch (soundness), to selectively reformulate queries in a decentralized network of heterogeneous parties. We detail analytical methods to evaluate our metrics, and show how to take advantage of those methods to gradually alleviate mapping inconsistencies across the network. We describe a fully decentralized message-passing scheme using belief propagation on transitive closures of schema mapping operations to efficiently evaluate the degree of semantic mismatch between pairs of acquainted information systems. Finally, we propose a graph-theoretic analysis of the network of mappings to quantify the quality of the global agreement that can be achieved in that way.

The third and last part of this thesis is devoted to the presentation of two systems illustrating the practical applicability of our ideas. The first system we introduce, GridVine, is a Semantic Overlay Network supporting decentralized data integration techniques through pairwise schema mappings and monotonic schema inheritance. GridVine follows the principle of data independence by separating a logical layer, the semantic overlay for managing and mapping data and schemas, from a physical layer consisting of a self-organizing Peer-to-Peer overlay network for efficient routing of messages. The second system, called PicShark, takes advantage of semi-structured metadata to meaningfully share pictures in collaborative settings. PicShark builds on our principles to dynamically create both annotations and mappings, and to gradually minimize information entropy (in terms of missing metadata and schematic heterogeneity) in a self-organizing and decentralized context.

Throughout this thesis, we advocate a holistic view of semantics in large-scale information systems: we model semantics as bottom-up and dynamic agreements among heterogeneous parties. We consider both the representation of semantics and the discovery of the interpretation of symbols as the result of a self-organizing process performed by distributed agents whose utility functions depend on the proper interpretation of the symbols. Our view sharply contrasts with previous top-down contributions analyzing data sources in isolation or focusing on global vocabularies and rigid sets of interpretations curated off-line. In a world where digital information is abundant but human attention remains scarce, we believe that autonomous, best-effort processes such as the ones proposed throughout this thesis will play an ever-increasing role in complementing traditional top-down integration approaches to handle massive amounts of digitized and heterogeneous information assets.
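
As a rough illustration of why transitive closures of mappings carry information about mapping quality, the sketch below follows attributes around a cycle of pairwise schema mappings and checks whether each one returns to itself. It is a deliberately simplified stand-in, not the belief propagation scheme of the thesis: the peers, attribute names, and the functions `compose` and `cycle_feedback` are all hypothetical, and mappings are assumed to be plain attribute-to-attribute renamings.

```python
from typing import Dict, List, Optional

# Assumption: a mapping is a simple renaming from source attributes to target attributes.
Mapping = Dict[str, str]


def compose(mappings: List[Mapping], attribute: str) -> Optional[str]:
    """Follow one attribute through a chain of mappings; return None if it is dropped."""
    current = attribute
    for mapping in mappings:
        if current not in mapping:
            return None  # syntactic loss: no counterpart in the next schema
        current = mapping[current]
    return current


def cycle_feedback(cycle: List[Mapping], attributes: List[str]) -> Dict[str, str]:
    """Classify each attribute by what happens to it around a closed cycle of mappings."""
    feedback = {}
    for attribute in attributes:
        image = compose(cycle, attribute)
        if image is None:
            feedback[attribute] = "lost"          # hints at incompleteness
        elif image == attribute:
            feedback[attribute] = "consistent"    # evidence the mappings agree
        else:
            feedback[attribute] = "mismatch"      # some mapping on the cycle is suspect
    return feedback


# Hypothetical three-peer cycle A -> B -> C -> A.
a_to_b = {"title": "name", "author": "creator", "date": "year"}
b_to_c = {"name": "label", "creator": "subject"}  # "creator" -> "subject" is questionable
c_to_a = {"label": "title", "subject": "date"}

result = cycle_feedback([a_to_b, b_to_c, c_to_a], ["title", "author", "date"])
print(result)
```

A single cycle only says that some mapping along it is suspect, not which one; combining the evidence from many overlapping cycles is where a probabilistic scheme such as the one described in the abstract would come in.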