Query-driven indexing in large-scale distributed systems

Efficient and effective search in large-scale data repositories requires complex indexing solutions deployed on a large number of servers. Web search engines such as Google and Yahoo! already rely upon complex systems to be able to return relevant query results and keep processing times within the comfortable sub-second limit. Nevertheless, the exponential growth of the amount of content on the Web poses serious challenges with respect to scalability. Coping with these challenges requires novel indexing solutions that not only remain scalable but also preserve the search accuracy. In this thesis we introduce and explore the concept of query-driven indexing – an index construction strategy that uses caching techniques to adapt to the querying patterns expressed by users. We suggest to abandon the strict difference between indexing and caching, and to build a distributed indexing structure, or a distributed cache, such that it is optimized for the current query load. Our experimental and theoretical analysis shows that employing query-driven indexing is especially beneficial when the content is (geographically) distributed in a Peer-to-Peer network. In such a setting extensive bandwidth consumption has been identified as one of the major obstacles for efficient large-scale search. Our indexing mechanisms combat this problem by maintaining the query popularity statistics and by indexing (caching) intermediate query results that are requested frequently. We present several indexing strategies for processing multi-keyword and XPath queries over distributed collections of textual and XML documents respectively. Experimental evaluations show significant overall traffic reduction compared to the state-of-the-art approaches. We also study possible query-driven optimizations for Web search engine architectures. Contrary to the Peer-to-Peer setting, Web search engines use centralized caching of query results to reduce the processing load on the main index. We analyze real search engine query logs and show that the changes in query traffic that such a results cache induces fundamentally affect indexing performance. In particular, we study its impact on index pruning efficiency. We show that combination of both techniques enables efficient reduction of the query processing costs and thus is practical to use in Web search engines.

Related material