Sharing of probabilistically correlated data in peer-to-peer networks
The impact of Peer-to-Peer (P2P) networks on the Internet landscape is undisputed. It has led to a series of new applications, e.g., as part of the so-called Web 2.0. The shift from the classical client-server based paradigm of the Internet, with a clear distinction between information providers and consumers, towards consumers sharing information among each other led to the rise of the P2P paradigm. The distributed setting enables users to share their content autonomously and locally, i.e, information remains at computers at the edge of the Internet and is not gathered and organized at central servers. Structured overlay networks were designed to organize the huge quantity of data shared in P2P networks by building a global, though distributed, index of shared information. Whereas the initial aim of these systems was to provide efficient lookup operations for single keyword operations, the need for more complex operations emerged very quickly. Peer Data Management Systems (PDMS) is one such example of application that enables data integration and complex query processing similar to (distributed) database systems on top of structured overlays. Complex query operators usually require joint access to multiple data entries whereas single key lookups usually only affect a single data entry. The partitioning of the distributed index of standard structured overlays is optimized towards single key lookups and joint data access as required by PDMS was neglected so far. (Distributed) databases have already shown that (index-)data organization supporting correlated data access is necessary and crucial for efficient processing, as network usage is minimized. We aim at applying this insight to structured overlays by clustering correlated data frequently accessed jointly by applications, including PDMS but also other types of applications. Data correlations can be derived from different sources, data properties, processing properties, users and applications. We study and present solutions for three different types of correlations in the context of different applications: (i) range queries where exploiting the order relationship of data is a fundamental basis for any database-like system; (ii) distributed probabilistic inference where exploiting user-defined complex data correlations gains importance through the Web 2.0; (iii) multi-term queries where exploiting data correlations derived from data access statistics enables simple but powerful keyword search to users. Our approach exploits properties of structured overlays, such as order-preserving hashing, to realize efficient range query processing. We further introduce a distributed clustering algorithm based on the spring relaxation technique to cluster strongly correlated data entries at one node respectively in its proximity. Joint data access is thus performed on a single or few nodes to reduce network bandwidth consumption and therefore to increase system performance. Our approaches are realized on top of the structured overlay network P-Grid although they are generic enough to be applied to other P2P networks with similar properties. This thesis presents details about the Java implementation of P-Grid, its architectural design and its evaluation on PlanetLab, today's standard testbed for P2P systems. The implementation of P-Grid in a Java application finally enabled us to build a PDMS system on top relying on P-Grid's efficient query processing. We present the UniStore system, a distributed data management system aiming at public data management and support for database-like query operators on heterogeneous data, and its evaluation on PlanetLab.
EPFL_TH4133.pdf
openaccess
2.74 MB
Adobe PDF
e5421f9393e4fde1cce4097d97e2a56b