Infoscience

Thesis

Understanding the Web

The World Wide Web is one of the most widely used information resources. Understanding the web better will enable us to benefit more from it. In this thesis, we develop techniques to learn properties of web pages, such as language and topic, using only their URLs. Furthermore, we compare and evaluate web page sampling algorithms to learn about web properties such as content length, top-level domain and outdegree distribution.
In the first part of this thesis, we develop high-performance classifiers for web page language classification using only the URLs of web pages. We make a comprehensive study of features and algorithms and test the performance of our classifiers on various real data sets. For language classification, the quality of our URL-based classifiers rivals that of content-based classifiers. Language classification from URLs is useful when the content of the web page is not available and when classification speed is important. Language classifiers based on URLs can be used by crawlers of general and language-specific search engines to avoid wasting bandwidth.
In the second part of this thesis, we investigate whether web page topic classification can be done using only the URL. We explore this problem along several dimensions, experimenting with different algorithms, features, data sets and topics. URL-based topic classification is useful when the content of the web page is not available or when the content is hidden in images. Topic classifiers based on URLs can be used to filter information and in applications such as topic-focused crawlers. Although content-based topic classifiers perform better, our URL-based topic classifiers work reasonably well and can be used as a signal to improve the performance of content-based classifiers.
In the third part of this thesis, we compare state-of-the-art web page sampling algorithms and analyze the samples they return using web properties such as content length, top-level domain and outdegree distribution. We discuss the strengths and weaknesses of each algorithm and propose improvements based on experimental results. The sampling algorithms we run on the web are influenced by the structure of the web, so we investigate the relationship between the properties of the web and its structure. A uniform random sample of the web would be quite useful for learning about the composition and development of the web, since it is not possible to download all web pages to determine its properties.
