Infoscience

Thesis

Large-Scale Image Segmentation with Convolutional Networks

Object recognition is one of the most important problems in computer vision, yet visual recognition poses many challenges when one tries to reproduce it with artificial systems. A central challenge is variability: objects appear under huge variations in pose, appearance, illumination and occlusion, and a visual system needs to be robust to all of these changes. In this thesis, we are interested in pixel-level recognition problems, i.e., problems in which the objective is to partition a given image into multiple regions (overlapping or not) that are considered meaningful according to some criterion. We focus on algorithms that require minimal feature engineering and scale easily. Deep learning methods fit this objective well: they alleviate the need for engineered features by discriminatively training a system from raw data (pixels). More precisely, we propose convolutional neural network (CNN) based algorithms for three important segmentation problems: semantic segmentation, object proposal generation and object detection with segments.

The objective of semantic segmentation is to assign a categorical label to each pixel in a scene. We first study fully supervised semantic segmentation and propose a recurrent CNN that can consider a large input context (while limiting its capacity), which is essential for modeling long-range pixel label dependencies. This approach achieves state-of-the-art performance without relying on any post-processing smoothing step. However, densely labeled training images are expensive to obtain and require considerable human labor. We therefore also propose a CNN-based model that infers object semantic segmentation by leveraging only image-level object category information. This is achieved by casting the problem into a multiple instance learning (MIL) framework.
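To illustrate the multiple instance learning idea, the sketch below aggregates per-pixel class scores into a single image-level score with a smooth log-sum-exp pooling, so that only image-level labels are needed for training. The function names, array shapes and the particular pooling and loss choices are illustrative assumptions, not the exact model from the thesis.

```python
import numpy as np

def lse_pool(pixel_scores, r=5.0):
    """Smooth max (log-sum-exp) pooling of per-pixel scores for one class.

    pixel_scores: (H, W) array of per-pixel class scores from a CNN.
    r: sharpness; large r approaches max pooling, small r approaches mean.
    """
    s = pixel_scores.ravel()
    m = s.max()  # subtract the max for numerical stability
    return m + np.log(np.mean(np.exp(r * (s - m)))) / r

def image_level_loss(score_maps, labels):
    """Multiple-instance loss when only image-level labels are available.

    score_maps: (C, H, W) per-class pixel score maps.
    labels: (C,) binary vector of image-level class labels.
    Returns a one-vs-all logistic loss on the pooled scores.
    """
    pooled = np.array([lse_pool(score_maps[c]) for c in range(len(labels))])
    signs = 2 * labels - 1  # map {0, 1} labels to {-1, +1}
    # logistic loss: pushes pooled scores up for present classes, down otherwise
    return np.mean(np.log1p(np.exp(-signs * pooled)))
```

Because the pooled score is dominated by the highest-scoring pixels, gradients flow mainly to the locations most responsible for the image-level prediction, which is what lets pixel-level segmentations emerge from image-level supervision.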
This weakly supervised approach outperforms the previous state of the art by a large margin.

Object proposal algorithms generate a set of regions (segments) that are likely to contain objects, independently of their semantic category. In contrast to most approaches, which rely on low-level vision cues, we propose a CNN-based discriminative approach that learns segmentation proposals directly from raw pixels. This approach proves highly effective, achieving substantially higher recall with fewer proposals than other methods. We push the state of the art further with a new top-down network augmentation: the resulting bottom-up/top-down network combines rich low-level spatial information with high-level object semantic information to improve segmentation, while remaining fast at test time. Finally, we show that the proposals generated by our approach, when coupled with a standard state-of-the-art object detection pipeline, achieve considerably better performance than previous proposal methods.
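The bottom-up/top-down combination described above can be sketched as a single refinement step: a coarse top-down feature map is upsampled and fused with higher-resolution bottom-up (skip) features from the feed-forward pass. This is a minimal numpy sketch under assumed shapes, with fixed placeholder weights standing in for learned 1x1 convolutions; it is not the thesis implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def refine(top_down, skip, w_skip, w_merge):
    """One top-down refinement step (illustrative sketch).

    top_down: (C, H, W) coarse mask features from the layer above.
    skip: (C2, 2H, 2W) bottom-up features at the finer resolution.
    w_skip: (C, C2) 1x1 projection of the skip features.
    w_merge: (C, 2C) 1x1 fusion of the two pathways.
    All weights are illustrative placeholders, not learned parameters.
    """
    s = relu(np.einsum('ij,jhw->ihw', w_skip, skip))  # project skip features
    t = upsample2x(top_down)                          # bring the coarse map up 2x
    merged = np.concatenate([t, s], axis=0)           # stack both pathways
    return relu(np.einsum('ij,jhw->ihw', w_merge, merged))
```

Stacking one such step per pooling stage of the bottom-up network recovers spatial detail lost to pooling while keeping the object-level semantics of the coarse map, and each step costs only a pair of cheap 1x1 projections plus an upsampling.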
