Object tracking and detection over a wide range of viewpoints is a long-standing problem in Computer Vision. Despite significant advances in wide-baseline sparse interest-point matching and the development of robust dense feature models, it remains a largely open problem. Moreover, the abundance of low-cost mobile platforms and novel application areas, such as real-time Augmented Reality, constantly push the performance limits of existing methods, which must be adapted to meet more stringent speed and capacity requirements.

In this thesis, we aim to overcome the difficulties arising from the multi-view nature of the object detection task. We significantly improve upon existing statistical keypoint matching algorithms to perform fast and robust recognition of image patches independently of object pose, and we demonstrate this on various 2D and 3D datasets.

Statistical keypoint matching approaches require massive amounts of training data covering a wide range of viewpoints. We have developed a weakly supervised algorithm that greatly simplifies their training for 3D objects, and we integrate this algorithm into a 3D tracking-by-detection system to perform real-time Augmented Reality.

Finally, we extend the use of a large training set with smooth viewpoint variation to category-level object detection. We introduce a new dataset with continuous pose annotations, which we use to train pose estimators for objects of a single category. By using these estimators' output to select pose-specific classifiers, our framework can simultaneously localize objects in an image and recover their pose. These decoupled pose estimation and classification steps yield improved detection rates.

Overall, we rely on image and video sequences to train classifiers that either operate independently of object pose or recover the pose parameters explicitly. We show that in both cases our approaches mitigate the effects of viewpoint changes and improve recognition performance.