Aggregation of a Term Vocabulary for Peer-to-Peer Information Retrieval: a DHT Stress Test

Klemm, Fabius; Aberer, Karl

conference paper

2005

LNCS

Third International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P 2005)

Research interest in full-text retrieval based on peer-to-peer (P2P) technology has been growing. So far, these research efforts have largely concentrated on distributing an index efficiently. However, ranking the results retrieved from the index is a crucial part of information retrieval. To determine the relevance of a document to a query, ranking algorithms use collection-wide statistics. Term frequency-inverse document frequency (TFIDF), for example, relies on the number of documents in the whole collection that contain a given term. Such global frequencies are not readily available in a distributed system. In this paper, we study the feasibility of aggregating global frequencies for a large term vocabulary in a P2P setting. We use a distributed hash table (DHT) for our analysis. Traditional applications of DHTs, such as file sharing, index keys on the order of tens of thousands. Aggregating a vocabulary of millions of terms places extreme demands on a DHT implementation. We study different aggregation strategies and propose optimizations that allow DHTs to process large numbers of keys efficiently.
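To make the aggregation problem concrete, the following is a minimal sketch and not the strategy evaluated in the paper. It assumes a toy DHT in which a term's SHA-1 hash, taken modulo the number of peers, selects the peer responsible for that term. The function names (`responsible_peer`, `aggregate_global_df`, `idf`) and the whitespace tokenizer are hypothetical and chosen only for illustration.

```python
import hashlib
import math
from collections import Counter, defaultdict


def responsible_peer(term, num_peers):
    """Map a term to a peer ID by hashing (a stand-in for a real DHT lookup)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_peers


def local_document_frequencies(documents):
    """Count how many of one peer's documents contain each term."""
    df = Counter()
    for doc in documents:
        # set() ensures a term is counted once per document.
        df.update(set(doc.lower().split()))
    return df


def aggregate_global_df(peer_collections):
    """Send every (term, local count) pair to the responsible peer and sum there.

    Returns the per-peer stores and the number of simulated messages,
    assuming one message per (sending peer, term) pair and no batching.
    """
    num_peers = len(peer_collections)
    stores = defaultdict(Counter)  # peer ID -> term -> global document frequency
    messages = 0
    for documents in peer_collections:
        for term, count in local_document_frequencies(documents).items():
            stores[responsible_peer(term, num_peers)][term] += count
            messages += 1
    return stores, messages


def idf(term, stores, num_peers, total_docs):
    """Inverse document frequency computed from the aggregated global counts."""
    df = stores[responsible_peer(term, num_peers)][term]
    return math.log(total_docs / df) if df else 0.0


if __name__ == "__main__":
    peers = [
        ["peer to peer retrieval", "distributed hash table"],
        ["hash table lookup", "peer ranking"],
        ["term ranking retrieval"],
    ]
    stores, messages = aggregate_global_df(peers)
    total_docs = sum(len(docs) for docs in peers)
    print("messages sent:", messages)
    print("idf('peer'):", idf("peer", stores, len(peers), total_docs))
```

In this naive scheme the message count grows with the number of distinct terms held by each peer, which hints at why a vocabulary of millions of terms puts pressure on a DHT.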

Name

klemm.pdf

Access type

openaccess

Size

174.7 KB

Format

Adobe PDF

Checksum (MD5)

ad1a36d7fe60176322fa6b13bb69d44b