On the Relationship between Self-Attention and Convolutional Layers

Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis. Our code is publicly available.
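The expressiveness claim can be illustrated with a small sketch. It is not the authors' code or proof. It assumes an idealized layer with one head per kernel position (9 heads for a 3 x 3 kernel), where each head's attention is a hard one-hot distribution on the pixel at a fixed relative shift (the limit of a sharply peaked softmax), and it uses wrap-around borders so that every shifted pixel exists. The names conv2d_circular, hard_attention_layer, and max_abs_diff are hypothetical helpers introduced only for this illustration.

```python
import itertools
import random


def conv2d_circular(x, w):
    """Direct convolution with wrap-around borders.

    x[i][j] is a list of C_in floats; w[(di, dj)][c_in][c_out] holds the
    kernel weights for relative shift (di, dj).
    """
    H, W = len(x), len(x[0])
    c_out = len(next(iter(w.values()))[0])
    out = [[[0.0] * c_out for _ in range(W)] for _ in range(H)]
    for i, j in itertools.product(range(H), range(W)):
        for (di, dj), w_shift in w.items():
            src = x[(i + di) % H][(j + dj) % W]
            for ci, row in enumerate(w_shift):
                for co, weight in enumerate(row):
                    out[i][j][co] += src[ci] * weight
    return out


def hard_attention_layer(x, w):
    """Idealized multi-head attention: one head per kernel shift.

    Assumption: each head's attention matrix is exactly one-hot, so the
    head copies the pixel at its shift and applies its own projection.
    """
    H, W = len(x), len(x[0])
    n = H * W
    tokens = [x[i][j] for i in range(H) for j in range(W)]  # flatten pixels
    c_in = len(tokens[0])
    c_out = len(next(iter(w.values()))[0])
    out = [[0.0] * c_out for _ in range(n)]
    for (di, dj), w_head in w.items():
        # n x n attention probabilities: row q is one-hot at the shifted key.
        attn = [[0.0] * n for _ in range(n)]
        for q in range(n):
            qi, qj = divmod(q, W)
            attn[q][((qi + di) % H) * W + (qj + dj) % W] = 1.0
        for q in range(n):
            # Attention-weighted sum of the input tokens for this query.
            ctx = [sum(attn[q][k] * tokens[k][c] for k in range(n))
                   for c in range(c_in)]
            # Per-head projection, summed over heads.
            for co in range(c_out):
                out[q][co] += sum(ctx[ci] * w_head[ci][co]
                                  for ci in range(c_in))
    return [out[i * W:(i + 1) * W] for i in range(H)]


def max_abs_diff(a, b):
    """Largest absolute elementwise difference between two H x W x C grids."""
    return max(abs(p - q)
               for ra, rb in zip(a, b)
               for ca, cb in zip(ra, rb)
               for p, q in zip(ca, cb))


random.seed(0)
H, W, C_IN, C_OUT = 4, 4, 2, 3
x = [[[random.uniform(-1, 1) for _ in range(C_IN)] for _ in range(W)]
     for _ in range(H)]
w = {(di, dj): [[random.uniform(-1, 1) for _ in range(C_OUT)]
                for _ in range(C_IN)]
     for di in (-1, 0, 1) for dj in (-1, 0, 1)}

conv_out = conv2d_circular(x, w)
attn_out = hard_attention_layer(x, w)
print(max_abs_diff(conv_out, attn_out) < 1e-9)
```

In this sketch the two outputs agree up to floating-point rounding because each head selects exactly one neighbouring pixel and the per-head projections play the role of the kernel weights at that position. A trained layer uses soft attention and learned positional information, so this shows only that the idealized case is representable, not how such patterns are learned.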

Presented at:
Eighth International Conference on Learning Representations - ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020

 Record created 2020-01-10, last modified 2020-10-25
