Recent advances in Computer Vision are changing our way of living and enabling new applications for both leisure and professional use. Regrettably, in many industrial domains the spread of state-of-the-art technologies is made challenging by the abundance of nuisances that corrupt existing techniques beyond the required dependability. This is especially true for object localization and tracking, that is, the problem of detecting the presence of objects on images and videos and estimating their pose. This is a critical task for applications such as Augmented Reality (AR), robotic autonomous navigation, robotic object grasping, or production quality control; unfortunately, the reliability of existing techniques is harmed by visual features such as the abundance of specular and poorly textured objects, cluttered scenes, or artificial and in-homogeneous lighting. In this thesis, we propose two methods for robustly estimating the pose of a rigid object under the challenging conditions typical of industrial environments. Both methods rely on monocular images to handle metallic environments, on which depth cameras would fail; both are conceived with a limited computational and memory footprint, so that they are suitable for real-time applications such as AR. We test our methods on datasets issued from real user case scenarios, exhibiting challenging conditions. The first method is based on a global image alignment framework and a robust dense descriptor. Its global approach makes it robust in presence of local artifacts such as specularities appearing on metallic objects, ambiguous patterns like screws or wires, and poorly textured objects. Employing a global approach avoids the need of reliably detecting and matching local features across images, that become ill-conditioned tasks in the considered environments; on the other hand, current methods based on dense image alignment usually rely on luminous intensities for comparing the pixels, which is not robust in presence of challenging illumination artifacts. We show how the use of a dense descriptor computed as a non-linear function of luminous intensities, that we refer to as ``Descriptor Fields'', greatly enhances performances at a minimal computational overhead. Their low computational complexity and their ease of implementation make Descriptor Fields suitable for replacing intensities in a wide number of state-of-the-art techniques based on dense image alignment. Relying on a global approach is appropriate for overcoming local artifacts, but it can be un-effective when the target object undergoes extreme occlusions in cluttered environments. For this reason, we propose a second approach based on the detection of discriminative object parts. At the core of our approach is a novel representation for the 3D pose of the parts, that allows us to predict the 3D pose of the object even when only a single part is visible; when several parts are visible, we can easily combine them to compute a better pose of the object. The 3D pose we obtain is usually very accurate, even when only few parts are visible. We show how to use this representation in a robust 3D tracking framework. In addition to extensive comparisons with the state-of-the-art, we demonstrate our method on a practical Augmented Reality application for maintenance assistance in the ATLAS particle detector at CERN.