In crowding, the perception of a target deteriorates in the presence of clutter. Crowding is usually explained within the framework of object recognition, where processing proceeds in a hierarchical and feedforward fashion from the analysis of low-level features, such as lines and edges, to high-level features, such as shapes and objects. Here, reviewing work from the last two years, we show evidence that these models fail to explain a large body of findings, which undermine the philosophy of this approach as such. We propose that the configuration of more or less all elements across the entire visual field determines crowding. Wholes, such as objects and shapes, determine performance on their constituent elements. Perceptual grouping and Gestalt, neglected for a long time, are key to understanding crowding and object recognition in general.