Abstract

Classically, visual processing is described as a cascade of local feedforward computations. This view has received striking support from the success of convolutional neural networks (CNNs). However, CNNs only roughly mimic human vision. For example, CNNs do not take the global spatial configuration of visual elements into account and thus fail to explain basic perceptual phenomena such as crowding and uncrowding. In crowding, the perception of a target deteriorates in the presence of neighboring elements. Classically, adding flanking elements is thought to always decrease performance. However, adding flankers even far away from the target can improve performance, depending on the global configuration (uncrowding). We showed previously that no classic model of crowding, including CNNs, can explain uncrowding. Here, we show that capsule networks, a type of deep network combining CNNs and object segmentation, explain both crowding and uncrowding. We trained capsule networks to recognize targets and groups of shapes. There were no crowding/uncrowding stimuli in the training set. When we subsequently tested the network on crowding/uncrowding stimuli, both crowding and uncrowding occurred. We show theoretically how crowding and uncrowding naturally emerge from neural dynamics in capsule networks. These powerful recurrent models offer a new framework to understand previously unexplained experimental results.
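The recurrent neural dynamics referred to above are the routing-by-agreement iterations that define capsule networks (Sabour et al., 2017): lower-level capsules send prediction vectors to higher-level capsules, and routing weights are iteratively increased toward the higher-level capsules whose outputs agree with those predictions. As an illustration only (the specific architecture and parameters used in the paper are not given in this abstract), a minimal NumPy sketch of one routing step looks like this; the shapes and iteration count are assumptions for the example:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Capsule nonlinearity: shrinks short vectors toward 0,
    # scales long vectors to length just below 1.
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def routing_by_agreement(u_hat, n_iter=3):
    # u_hat: prediction vectors from lower to higher capsules,
    # shape (n_lower, n_higher, dim)
    n_lower, n_higher, _ = u_hat.shape
    b = np.zeros((n_lower, n_higher))  # routing logits, start uniform
    for _ in range(n_iter):
        # softmax over higher capsules: each lower capsule's vote budget
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # weighted sum of predictions per higher capsule
        s = np.einsum('ij,ijd->jd', c, u_hat)
        v = squash(s)  # higher-capsule output vectors
        # increase logits where predictions agree with the output
        b = b + np.einsum('ijd,jd->ij', u_hat, v)
    return v, c

# Toy example: three lower capsules all predict the same vector for
# higher capsule 0, but disagree about higher capsule 1, so routing
# weight concentrates on capsule 0.
u_hat = np.zeros((3, 2, 2))
u_hat[:, 0] = [1.0, 0.0]
u_hat[0, 1] = [1.0, 0.0]
u_hat[1, 1] = [-1.0, 0.0]
v, c = routing_by_agreement(u_hat)
```

In this picture, crowding and uncrowding hinge on how the routing dynamics group target and flankers: when flankers are routed into the same higher-level object as the target, target information is lost; when the global configuration lets the flankers form their own group, the target is released.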
