One of the main differences between modern search engines and traditional ones is the adoption of link-based ranking algorithm in ordering Web documents. Google has claimed that it is its link-based ranking algorithm, PageRank that has made the quality of its search results superior. However, existing link-based ranking algorithms are mostly logically centralized although techniques of distributed data management and load balancing have been used in search engines to improve the performance and availability of the systems. With the scale of the Web becoming larger and larger, the scalability of traditional search engines is very much limited by the current centralized algorithms, models, and architectures. The centralization in link-based ranking computation has brought in several critical problems that are hard to solve: centralized bottleneck and single point of failure, prohibitively high computation cost, lag in availability of fresh information, etc. In response to these problems, we have explored several approaches to address them: incremental computation, new distributed model based algorithms, and a common decentralized framework. In particular, we have proposed the new distributed approach based on a Layered Markov Model to distinguish stochastic transitions among Web sites and Web documents. Based on this model, we propose two different methods for computation of ranking of Web documents, a centralized one and a decentralized one. Both produce a well-defined ranking for a given Web graph. We then formally prove that the two approaches are equivalent. This provides a theoretical foundation for decomposing link-based rank computation and makes the computation for a Web-scale graph feasible in a decentralized fashion, such as required for Web search engines having a Peer-to-Peer (P2P) architecture. Furthermore, personalized rankings can be produced by adapting the computation at both the local layer and the global layer. The empirical results show that the ranking generated by our model is qualitatively comparable to or even better than the ranking produced by PageRank. On the other hand, the common framework for Peer-to-Peer Information Retrieval (IR) is situated in a more general scope. It captures the essence of individual models and describes the overall behavior of P2P Web retrieval systems. Such a framework characterizes notions like ranking, ranking aggregation and combination, etc. Algebraic properties of them are explored and guidelines for ranking computation based on those properties are proposed. Application scenarios are then given to show how different search engines and their algorithms can be formalized within this common framework. As two representative examples, heuristic and model-guided distributed ranking compositions are elaborated to demonstrate the power and potentials of this common framework. Results of my research work can be used in modern search companies to build a truly distributed search engine system; they also form a solid base for future research on topics of distributed ranking computation for P2P Web retrieval, including specific layered decentralized algorithms, and problems within the general common framework.
EPFL_TH3361.pdf
restricted
21.69 MB
Adobe PDF
3e6380de9dad586743f0b19c9b51eff0