Information retrieval (IR) systems such as search engines help people find what they need among the vast amount of data available in their organization or on the Internet. These systems let users search a large data collection by specifying queries that describe their information needs. Traditionally, the data elements in these collections are text documents with no explicit relationships between them. Because conventional IR systems are designed to handle text documents, queries are limited to multisets of textual keywords. However, with the rise of social media, data collections have become heterogeneous in modality, as it has become easier for users to share not only text but also images, audio, and video. In addition, data collections have evolved to also contain the relationships between data elements. For instance, in social networks, the relationships between users are as important as the users themselves. Given these changes, conventional bag-of-keywords queries fall short: they are unimodal and so cannot handle the heterogeneity of the data elements. Moreover, they ignore the relationships between query terms, treating the terms as independent. In this thesis, we show how to support queries that are context-rich in terms of both heterogeneity and interconnectivity by exploiting a common underlying graph model for the data collection. Our approach follows the vector space retrieval model: we design graph embedding techniques that represent each data element and each query as a vector, i.e., an embedding. Our embedding model is designed to capture both the heterogeneity and the interconnectivity present in a data collection or a query. Because the data collection is usually large, the resulting graph model is very large and cannot be handled by traditional graph embedding techniques. We therefore also propose an approach that makes graph embedding scalable to large graphs.
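To make the vector space retrieval idea concrete, the following is a minimal sketch of how ranking could work once every data element and every query has an embedding. It is not the thesis's method: the embeddings are made-up three-dimensional vectors, the element names and the helpers `cosine` and `rank` are hypothetical, and cosine similarity is assumed as the scoring function purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec, element_vecs, k=3):
    """Return the k (name, score) pairs whose embeddings are most similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in element_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical, hand-made embeddings. In the setting described above they would
# come from a graph embedding model over the data collection, so that elements
# of different modalities (image, text, user) share one vector space.
elements = {
    "photo_of_alps": [0.9, 0.1, 0.0],
    "blog_post_on_hiking": [0.7, 0.3, 0.1],
    "user_profile_chef": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # hypothetical query embedding

top = rank(query, elements, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

Because elements and queries live in the same space, a single similarity function can compare a query against an image, a text, or a user without modality-specific logic; the quality of the ranking then depends entirely on the embedding model.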
EPFL_TH8046.pdf