Learning from Images with Captions Using the Maximum Margin Set Algorithm

A large number of images with accompanying text captions are available on the Internet. These are valuable for training visual classifiers without any explicit manual intervention. In this paper, we present a general framework to address this problem. Under this new framework, each training image is represented as a bag of regions, associated with a set of candidate labeling vectors. Each labeling vector encodes the possible labels for the regions of the image. The set of all possible labeling vectors can be generated automatically from the caption using natural language processing techniques. The use of labeling vectors provides a principled way to include diverse information from the captions, such as multiple types of words corresponding to different attributes of the same image region, labeling constraints derived from grammatical connections between words, uniqueness constraints, and spatial position indicators. Moreover, it can also be used to incorporate high-level domain knowledge useful for improving learning performance. We show that learning is possible under this weakly supervised setup. Exploiting this property, we propose a large-margin discriminative formulation and an efficient algorithm to solve the proposed learning problem. Experiments conducted on artificial datasets and two real-world image-and-caption datasets support our claims.
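The bag-of-regions representation with candidate labeling vectors can be illustrated with a minimal sketch. This is not the paper's algorithm; it is a hypothetical toy in which each candidate labeling vector assigns one caption label to each region under a uniqueness constraint, and a linear model commits to the highest-scoring labeling in the candidate set (all names, dimensions, and the scoring function are illustrative assumptions):

```python
import numpy as np
from itertools import permutations

# Hypothetical setup: 3 regions, 3 caption labels, 5-dim region features.
rng = np.random.default_rng(0)
n_regions, n_labels, dim = 3, 3, 5
regions = rng.normal(size=(n_regions, dim))   # one feature vector per region

# Candidate labeling vectors: each assigns one caption label to each region;
# permutations enforce a one-to-one (uniqueness) constraint.
candidate_labelings = [np.array(p) for p in permutations(range(n_labels))]

# Per-label linear weights (these would be learned in practice).
W = rng.normal(size=(n_labels, dim))

def score(labeling):
    # Sum of region-label compatibility scores under this labeling.
    return sum(regions[r] @ W[labeling[r]] for r in range(n_regions))

# The model commits to the best labeling among the candidates.
best = max(candidate_labelings, key=score)
print(best, score(best))
```

In a learned version, the weights would be chosen so that, for each image, some labeling in its candidate set scores higher by a margin than every labeling outside it, which is the intuition behind the large-margin formulation described above.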


    • EPFL-REPORT-192507

    Record created on 2013-12-19, modified on 2016-08-09
