000128890 001__ 128890
000128890 005__ 20181205220158.0
000128890 0247_ $$2doi$$a10.5075/epfl-thesis-4280
000128890 02470 $$2urn$$aurn:nbn:ch:bel-epfl-thesis4280-5
000128890 02471 $$2nebis$$a5686818
000128890 037__ $$aTHESIS
000128890 041__ $$aeng
000128890 088__ $$a4280
000128890 245__ $$aQuery-driven indexing in large-scale distributed systems
000128890 269__ $$a2009
000128890 260__ $$aLausanne$$bEPFL$$c2009
000128890 300__ $$a162
000128890 336__ $$aTheses
000128890 520__ $$aEfficient and effective search in large-scale data repositories requires complex indexing solutions deployed on a large number of servers. Web search engines such as Google and Yahoo! already rely upon complex systems to be able to return relevant query results and keep processing times within the comfortable sub-second limit. Nevertheless, the exponential growth of the amount of content on the Web poses serious challenges with respect to scalability. Coping with these challenges requires novel indexing solutions that not only remain scalable but also preserve the search accuracy. In this thesis we introduce and explore the concept of query-driven indexing – an index construction strategy that uses caching techniques to adapt to the querying patterns expressed by users. We suggest to abandon the strict difference between indexing and caching, and to build a distributed indexing structure, or a distributed cache, such that it is optimized for the current query load. Our experimental and theoretical analysis shows that employing query-driven indexing is especially beneficial when the content is (geographically) distributed in a Peer-to-Peer network. In such a setting extensive bandwidth consumption has been identified as one of the major obstacles for efficient large-scale search. Our indexing mechanisms combat this problem by maintaining the query popularity statistics and by indexing (caching) intermediate query results that are requested frequently. We present several indexing strategies for processing multi-keyword and XPath queries over distributed collections of textual and XML documents respectively. Experimental evaluations show significant overall traffic reduction compared to the state-of-the-art approaches. We also study possible query-driven optimizations for Web search engine architectures. Contrary to the Peer-to-Peer setting, Web search engines use centralized caching of query results to reduce the processing load on the main index. We analyze real search engine query logs and show that the changes in query traffic that such a results cache induces fundamentally affect indexing performance. In particular, we study its impact on index pruning efficiency. We show that combination of both techniques enables efficient reduction of the query processing costs and thus is practical to use in Web search engines.
000128890 6531_ $$alarge-scale systems
000128890 6531_ $$aquery-driven indexing
000128890 6531_ $$aP2P
000128890 6531_ $$aP2PIR
000128890 6531_ $$ainformation retrieval
000128890 6531_ $$aXML
000128890 6531_ $$aWeb search engines
000128890 6531_ $$acaching
000128890 6531_ $$aindex pruning
000128890 6531_ $$asystèmes de grande taille
000128890 6531_ $$aindexation guidée par les requêtes
000128890 6531_ $$aP2P
000128890 6531_ $$aP2PIR
000128890 6531_ $$arecherche d'information
000128890 6531_ $$aXML
000128890 6531_ $$amoteur de recherche Web
000128890 6531_ $$acache
000128890 6531_ $$aélagage d'index
000128890 700__ $$0240003$$aSkobeltsyn, Gleb$$g155081
000128890 720_2 $$0240941$$aAberer, Karl$$edir.$$g134136
000128890 8564_ $$s3740677$$uhttps://infoscience.epfl.ch/record/128890/files/EPFL_TH4280.pdf$$yTexte intégral / Full text$$zTexte intégral / Full text
000128890 909C0 $$0252004$$pLSIR$$xU10405
000128890 909CO $$ooai:infoscience.tind.io:128890$$pthesis-bn2018$$pDOI$$pIC$$pthesis$$qDOI2$$qGLOBAL_SET
000128890 918__ $$aIC$$bIC-SSC$$cIIF
000128890 919__ $$aLSIR
000128890 920__ $$b2009
000128890 970__ $$a4280/THESES
000128890 973__ $$aEPFL$$sPUBLISHED
000128890 980__ $$aTHESIS