We propose an approach to overcome the two main challenges of 3D multiview object detection and localization: the variation of object features with changes in viewpoint, and the variation in the size and aspect ratio of the object. Our approach proceeds in three steps. Given an initial bounding box of fixed size, we first refine its size and aspect ratio. We then predict the viewing angle, under the hypothesis that the bounding box actually contains an object instance. Finally, a classifier tuned to this particular viewpoint verifies whether an instance is present. As a result, we can find the object instances and estimate their poses without having to search over all window sizes and potential orientations. We train and evaluate our method on a new object database specifically tailored to this task, containing real-world objects imaged over a wide range of smoothly varying viewpoints and under significant lighting changes. We show that successively estimating the bounding box and the viewpoint leads to better localization results.
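As a rough illustration of the three-step procedure described above, the following Python sketch shows only the control flow: fixed-size seed windows are refined, a viewpoint is predicted for each refined box, and a classifier specific to that viewpoint accepts or rejects the hypothesis. All names (`Box`, `Detection`, `fixed_size_seeds`, `detect`, `refine_box`, `predict_view`, `view_classifiers`) and the threshold value are hypothetical placeholders introduced for this sketch, not the implementation evaluated in this work, and the toy stand-ins at the bottom exist only to exercise the loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Box:
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass(frozen=True)
class Detection:
    box: Box
    view_deg: int
    score: float


def fixed_size_seeds(width: int, height: int, size: int, stride: int) -> List[Box]:
    """Enumerate square seed windows of a single fixed size on a regular grid."""
    return [
        Box(x, y, size, size)
        for y in range(0, height - size + 1, stride)
        for x in range(0, width - size + 1, stride)
    ]


def detect(
    image: object,
    seeds: List[Box],
    refine_box: Callable[[object, Box], Box],
    predict_view: Callable[[object, Box], int],
    view_classifiers: Dict[int, Callable[[object, Box], float]],
    threshold: float = 0.5,  # assumed acceptance threshold
) -> List[Detection]:
    detections = []
    for seed in seeds:
        # Step 1: refine the size and aspect ratio of the fixed-size seed box.
        box = refine_box(image, seed)
        # Step 2: predict the viewpoint, assuming the box contains an instance.
        view = predict_view(image, box)
        # Step 3: the classifier tuned to that viewpoint verifies the hypothesis.
        score = view_classifiers[view](image, box)
        if score >= threshold:
            detections.append(Detection(box, view, score))
    return detections


# Toy stand-ins (assumptions, for illustration only).
def toy_refine(image: object, b: Box) -> Box:
    return Box(b.x, b.y, b.w * 2, b.h)  # pretend the object is twice as wide


def toy_view(image: object, b: Box) -> int:
    return 0 if b.x < 32 else 90  # pretend the viewpoint depends on position


toy_classifiers = {
    0: lambda image, b: 0.9,   # frontal-view classifier accepts
    90: lambda image, b: 0.1,  # side-view classifier rejects
}

seeds = fixed_size_seeds(width=64, height=32, size=32, stride=32)
detections = detect(None, seeds, toy_refine, toy_view, toy_classifiers)
```

In this sketch each seed window is evaluated by exactly one viewpoint-specific classifier at one refined box, rather than by every classifier at every window size, which is the saving in search that the approach aims for.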