Human vision has evolved to make sense of a world in which elements almost never appear in isolation. Surprisingly, the recognition of an element in a visual scene is strongly limited by the presence of other nearby elements, a phenomenon known as visual crowding. Crowding impacts vision at all levels and is thus a versatile tool for understanding the fundamental mechanisms of vision. For decades, visual crowding seemed well explained by traditional feedforward models of vision. In these models, vision starts with the detection of low-level features. This information is combined locally along the hierarchy of the visual cortex to build increasingly complex feature detectors, until neurons respond selectively and robustly to complex objects. Crowding happens when nearby elements interfere with this local feature-combination process and impair target recognition. However, recent studies have shown that crowding is determined not by local interactions but by the global configuration across the entire visual field. Depending on how elements group together, crowding can even almost disappear, a phenomenon called uncrowding. Hence, crowding is a complex, global, high-level phenomenon that simple feedforward models cannot explain.

In this thesis, I first analyse which models of crowding can explain uncrowding. I compare the performance of diverse models, selected to span different architectural and functional features: feedforward vs. recurrent architectures, local vs. global information processing, and the presence or absence of a grouping stage. I show that the only model that reproduces human behaviour is one that includes a dedicated recurrent grouping stage. Second, I show that global effects in crowding cannot be explained by low-level accounts. It has been argued that the Texture Tiling model, based on a complex and high-dimensional pooling stage, might account for global effects in crowding without requiring any recurrent grouping stage.
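The core intuition behind pooling accounts of crowding can be sketched in a few lines. The following toy model is purely illustrative (the feature encoding, positions, and pooling radius are hypothetical choices, not any model discussed in this thesis): when a flanker falls inside the pooling window around the target, their features are averaged together and the target signal is diluted.

```python
import numpy as np

def pool_features(features, positions, target_pos, radius):
    """Average all feature vectors whose positions fall within
    `radius` of the target position (a toy pooling stage)."""
    features = np.asarray(features, dtype=float)
    positions = np.asarray(positions, dtype=float)
    within = np.abs(positions - target_pos) <= radius
    return features[within].mean(axis=0)

# Hypothetical encoding: target as [1, 0], a flanker with
# orthogonal features as [0, 1].
target = [1.0, 0.0]
flanker = [0.0, 1.0]

# Isolated target: the pooled output preserves the target's features.
alone = pool_features([target], [0.0], target_pos=0.0, radius=1.0)
print(alone)    # -> [1. 0.]

# Flanked target inside the pooling window: features are mixed,
# degrading the target signal -- this is crowding in a pooling model.
crowded = pool_features([target, flanker], [0.0, 0.5],
                        target_pos=0.0, radius=1.0)
print(crowded)  # -> [0.5 0.5]
```

Because pooling is strictly local, such a model predicts that adding flankers can only ever mix in more irrelevant features; it has no mechanism by which a global configuration could release the target, which is exactly what uncrowding demonstrates.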
To test this model, I use a large set of recent crowding data. I show that the Texture Tiling model is equivalent to a simple pooling model and is therefore subject to the same limitations. Next, I focus on deep neural networks, which are firmly in the spirit of the feedforward framework of vision and have become state-of-the-art models in both computer vision and neuroscience. I test whether AlexNet and ResNet-50, which have been proposed as realistic models of the visual system, exhibit uncrowding. I show that these networks fail to reproduce uncrowding for principled reasons. Finally, I use a genetic algorithm to generate stimuli based on the performance of different models, i.e., in a bottom-up manner. The goal is to avoid using stimuli that favour models of grouping from the start. I compare the distribution of stimuli produced by the models to the distribution produced by humans, and show that only the models that include grouping and segmentation processes behave like humans.

Taken together, the results in my thesis highlight the importance of recurrent grouping and segmentation processes in human vision when large portions of the visual field are involved. These results can serve as direct guidelines for future models of vision, constraining how recurrent processing should be incorporated to improve the performance of deep neural networks and other feedforward models of vision, and to help them generalise to more complex visual inputs.
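The bottom-up stimulus-generation idea can be illustrated with a minimal genetic-algorithm loop. This is only a sketch under simplified assumptions: the stimulus is encoded as a hypothetical bit string of flanker on/off slots, and the fitness function is a toy placeholder standing in for a model's recognition performance, not the actual experimental setup.

```python
import random

def evolve(fitness, genome_len=8, pop_size=20, generations=30, seed=0):
    """Evolve bit-string 'stimuli' towards higher model fitness using
    elitist selection, one-point crossover, and point mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]           # keep the fittest half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)  # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(genome_len)       # point mutation
            child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Toy fitness: pretend the model performs best when all flanker slots
# form one homogeneous group (all ones).
best = evolve(fitness=sum)
print(best)
```

Because the stimuli are driven by each model's own performance rather than chosen by the experimenter, the distributions of evolved stimuli can then be compared across models and against humans without favouring grouping accounts from the start.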