Leveraging User-Generated Content for Information Discovery on the Web

Budura, Adriana

doi:10.5075/epfl-thesis-4715

doctoral thesis

2010

The large-scale adoption of the Web 2.0 paradigm has revolutionized the way we interact with the Web today. End-users, so far mainly passive consumers of information, are now becoming active information producers, creating, uploading, and commenting on all types of digital content. As a consequence, the Web has evolved from a collection of static HTML pages to a highly interactive system, where information is published and consumed at high rates. This has tremendously increased the amount of data available on the Web, which brings about new challenges in terms of information management. At the same time, the increased user participation represents a new and extremely valuable source of data. While interacting with different Web 2.0 portals, users freely provide all types of information, such as annotations describing the shared resources and friendship links connecting similar users, which can be exploited to improve the methods designed to manage online content. A particularly interesting example of user-generated data is the so-called social annotations that users attach to resources in the context of collaborative tagging systems. This kind of meta-information opens up new opportunities for improved content search, new means to organize personal data, and ways of mining user profiles based on their annotations. Virtual friendship connections between users, as observed in social networks, are another rich source of information: they often group users with similar interests together, provide a means to study information diffusion, and lay the ground for enhanced expert finding. In this thesis, we leverage information extracted from user-generated data to solve current information management problems, such as data retrieval, mining, and integration. We explore different scenarios in which online content is enriched with user-defined meta-information, and we identify specific problems that we solve by leveraging this information.
We start by addressing the problem of context-based information discovery in collaborative tagging systems, where we take advantage of user-defined entity graphs, such as a citation graph of publications or a friendship graph of users. In this setting, effective search solutions require a certain amount of annotations; however, content is often poorly annotated. We therefore propose a method that exploits the context-related information embedded in the graph structure to automatically infer new annotations. Our approach propagates tags along the edges of the graph, based on the assumption that the neighborhood of a resource holds additional information about the resource itself. We see a similar graph structure in social communities, where users are connected via friendship links and where the neighborhood of a user reflects her community of interest. We adopt the hypothesis that users mainly annotate resources of interest to them, and we interpret the annotations (i.e., tags) as an interest profile. Hence, we propose a novel framework for tag-based community detection in collaborative tagging systems that considers both the tagging behavior of users and their friendship graph. Given a set of tags, our method returns a closely connected community of users whose tags jointly cover the initial set. To further investigate the issue of generating user profiles based on their annotations, we switch our attention from the open Web to an enterprise setting. Social software, such as collaborative tagging systems, has also been adopted in the enterprise space, where it opens up new opportunities to address the problem of expert search. We take advantage of the data extracted from two enterprise-internal portals and explore correlations between the users' tagging behavior and their corresponding areas of expertise. Based on these correlations, we devise a method that derives expertise profiles for users.
Finally, we investigate how user-generated meta-information can be exploited in the domain of structured data on the Web, i.e., data that is organized according to the relational model and complies with application-specific schemas. To enable transparent search solutions on such data, schema heterogeneity needs to be overcome by means of data integration techniques. We explore how such techniques can benefit from user-generated meta-information in the form of links between similar entries in different databases. Based on these links, we devise a method to create mappings between the elements of different schemas from a real-world collection of online bioinformatics databases.

Name

EPFL_TH4715.pdf

Access type

restricted

Size

1.1 MB

Format

Adobe PDF

Checksum (MD5)

0ee302a8595d01df32d35fe8073307da