Abstract

In human vision, perception of local features depends on all elements in the visual field and on their exact configuration. For example, when observers performed a vernier discrimination task (discriminating the offset direction of two nearly aligned vertical bars) and a surrounding square was added to the vernier, the task became much more difficult: a classic crowding effect. Crucially, adding more flanking squares improved performance again (uncrowding). In addition, in displays of squares and stars, small changes in the configuration strongly changed performance. Here, we show that convolutional neural networks fail to capture these global aspects of configuration for two reasons. First, the representations of the target and the flankers at a given layer are pooled within the receptive fields of the subsequent layer, leading to poor performance. Second, far-away elements cannot interact with the vernier to produce uncrowding. We show that capsule networks, a new kind of neural network that explicitly takes configuration into account, capture the experimental results well.
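To make the pooling argument concrete, here is a minimal NumPy sketch, not the model used in the paper: a toy 1-D feature map in which a unit responding to the vernier target gets pooled together with a unit responding to a flanking square, so downstream layers can no longer read out the target alone. All positions, activation values, and the pooling window below are illustrative assumptions.

```python
import numpy as np

def max_pool_1d(activations: np.ndarray, window: int) -> np.ndarray:
    """Non-overlapping 1-D max pooling over a feature map."""
    n = len(activations) // window * window
    return activations[:n].reshape(-1, window).max(axis=1)

# Toy 1-D feature map: a vernier "target" unit flanked by a square's edges.
feature_map = np.zeros(16)
feature_map[7] = 1.0   # response to the vernier target
feature_map[5] = 0.8   # response to the left edge of the flanking square
feature_map[9] = 0.8   # response to the right edge of the flanking square

pooled = max_pool_1d(feature_map, window=4)
print(pooled)  # [0.  1.  0.8 0. ] -- the target now shares a pooled unit with
               # a flanker, so later layers cannot read out the target alone
```

By contrast, capsule networks replace pooling with routing-by-agreement (Sabour, Frosst, & Hinton, 2017): lower-level capsules cast pose votes for higher-level capsules, and couplings strengthen only where the votes agree, which is one concrete sense in which configuration enters the computation explicitly. The sketch below follows that published routing step in NumPy; the shapes and the random inputs are assumed for illustration, not taken from the specific architecture evaluated here.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Shrink short vectors toward zero; keep long vectors near unit length."""
    norm2 = (s ** 2).sum(axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def route(u_hat: np.ndarray, iterations: int = 3) -> np.ndarray:
    """Routing-by-agreement (Sabour et al., 2017).

    u_hat: (num_in, num_out, dim) predictions that each lower-level capsule
    makes about each higher-level capsule's pose vector.
    """
    b = np.zeros(u_hat.shape[:2])                # routing logits
    for _ in range(iterations):
        c = softmax(b, axis=1)                   # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)   # weighted votes, (num_out, dim)
        v = squash(s)                            # output capsule poses
        b += (u_hat * v[None]).sum(axis=-1)      # agreeing votes gain weight
    return v

# Example: 6 lower-level capsules voting for 2 higher-level capsules
# in a 4-D pose space (all dimensions chosen arbitrarily here).
rng = np.random.default_rng(0)
v = route(rng.standard_normal((6, 2, 4)))
print(v.shape)  # (2, 4)
```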
