Transformer Models for Vision

Cordonnier, Jean-Baptiste Francis Marie Juliette

doi:10.5075/epfl-thesis-9822

doctoral thesis

2023

Recent developments in deep learning cover a wide variety of tasks, such as image classification, text translation, playing Go, and protein folding.
All these successful methods depend on a gradient-based learning algorithm to train a model on massive amounts of data using significant computational power.
Even though this optimization algorithm is shared, deep learning relies on different model architectures to process the training data depending on the modality: Multi-Layer Perceptrons for vectors, Convolutional Neural Networks (CNNs) for images, Recurrent Neural Networks for text and sequences, and Graph Neural Networks for graphs. A recent addition to this family of models is the transformer architecture developed by Vaswani et al. (2017) for text translation. This fragmented landscape of architectures forces practitioners to select a model based on the data modality and to learn its specificities. This is detrimental when the problem at hand involves multiple data modalities, as in image captioning. A more systematic approach would be to employ a single architecture that processes all modalities and learns the structure of the input directly from the training data.

This work takes a transversal approach between Natural Language Processing and Vision to show that transformers, which were primarily designed to process text, can also handle images. First, we show that a self-attention layer, the building block of transformers, can provably express convolution, and we demonstrate empirically that the shallow layers of transformers do learn localized, translated filters similar to those of CNNs. Our proof relies on the attention heads mimicking the receptive field of a convolutional kernel. We then study how these attention heads interact and propose a new multi-head mechanism that leverages the shared representation extracted across heads. This thesis also presents two adaptations of transformer models for special kinds of images. We introduce a rotation-equivariant attention layer well suited to images whose orientation carries no information, such as satellite images or microscopic images of biological tissues. Finally, we adapt transformers to high-resolution images by extracting salient patches of the image and processing them with a smaller memory footprint. Our work and subsequent developments have made transformers a standard architecture for processing images.
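
The idea that attention heads can mimic the receptive field of a convolutional kernel can be illustrated with a small NumPy sketch. This is not the construction from the thesis, only an illustration under stated assumptions: one head per kernel offset, attention scores that depend only on the relative position between query and key pixels, and a sharpness parameter (here called `alpha`, a hypothetical name) large enough that each head attends almost exclusively to a single shifted pixel. The per-head value and output projections are folded into one matrix taken from the kernel.

```python
import numpy as np

def attention_as_convolution(image, kernel, alpha=50.0):
    """Multi-head self-attention with one head per kernel offset.

    image:  (H, W, C_in) array; kernel: (K, K, C_in, C_out) array with K odd.
    alpha is an assumed sharpness parameter, not a name from the thesis.
    """
    H, W, C_in = image.shape
    K = kernel.shape[0]
    r = K // 2
    # Flatten the image into N = H * W tokens and record each pixel position.
    grid = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack(grid, axis=-1).reshape(-1, 2)
    tokens = image.reshape(H * W, C_in)
    # rel[q, k] is the position of key pixel k relative to query pixel q.
    rel = coords[None, :, :] - coords[:, None, :]
    out = np.zeros((H * W, kernel.shape[-1]))
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            shift = np.array([di, dj])
            # The score peaks at the key located at query + shift.
            scores = -alpha * np.sum((rel - shift) ** 2, axis=-1)
            scores -= scores.max(axis=1, keepdims=True)  # stable softmax
            probs = np.exp(scores)
            probs /= probs.sum(axis=1, keepdims=True)
            # Folded value/output projection of this head: one kernel slice.
            out += probs @ tokens @ kernel[di + r, dj + r]
    return out.reshape(H, W, -1)

def direct_convolution(image, kernel):
    """Reference zero-padded convolution (cross-correlation form)."""
    H, W, _ = image.shape
    K = kernel.shape[0]
    r = K // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)))
    out = np.zeros((H, W, kernel.shape[-1]))
    for a in range(K):
        for b in range(K):
            out += padded[a:a + H, b:b + W] @ kernel[a, b]
    return out
```

On a random image and kernel, the two functions should agree on interior pixels up to numerical precision. At the image border they differ, because a softmax head must attend to some existing pixel while the reference convolution reads zero padding.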

Name: EPFL_TH9822.pdf

Type: N/a

Access type: openaccess

License Condition: copyright

Size: 34.61 MB

Format: Adobe PDF

Checksum (MD5): 9a9b888e6fa31df03432288c94dc026b