Dense image-based prediction methods have advanced tremendously in recent years, driven largely by the abundance of real-world imagery. While these methods perform well on photographs, they do not generalize to data-scarce domains such as cartoons, sketches, or comics, and the problem worsens as the gap between the photographic domain and the data-scarce domain grows. To overcome these limitations in such disparate domains, we pursue three solutions: (i) we leverage unsupervised image-to-image translation, a commonly used strategy that lets us exploit a large corpus of real-world annotations; (ii) we use the translated results to develop dense prediction methods spanning single-task prediction and several multitasking approaches; and (iii) we extensively study task relationships in multitasking frameworks to strengthen the generalizability of the developed methods across disparate domains and unseen tasks. Beyond analyzing our solutions on different photographic domains, we study their applicability to comics imagery, a highly discrepant domain with scarce data.
In the first of these research axes, we introduce a detection-based unsupervised image-to-image translation method that explicitly accounts for object instances in the translation process. Translating images from a data-restricted domain into real imagery is an essential step that addresses the lack of annotations in the former domain. Specifically, we extract separate representations for the global image and for the object instances, fuse them into a common representation, and generate the translated image from it. For comics imagery, this allows us to preserve the detailed content of the graphical instances within a panel while translating instance styles more accurately than state-of-the-art translation methods.
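As a rough illustration of the fusion idea only (not the thesis implementation; all module names, layer choices, and the additive fusion below are assumptions), a minimal PyTorch-style sketch might combine a global encoder and an instance encoder as follows:

```python
import torch
import torch.nn as nn

class InstanceAwareTranslator(nn.Module):
    """Toy sketch: fuse a global image representation with pooled instance
    representations before decoding the translated image."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # encodes the whole image (global context)
        self.global_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        # encodes each detected object instance from its crop
        self.instance_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # generates the translated image from the fused representation
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, image, instance_crops):
        # image: 1 x 3 x H x W; instance_crops: non-empty list of 1 x 3 x h x w crops
        g = self.global_encoder(image)                               # 1 x C x H/2 x W/2
        codes = [self.instance_encoder(c) for c in instance_crops]   # each 1 x C x 1 x 1
        inst = torch.stack(codes).mean(dim=0)                        # averaged instance code
        fused = g + inst.expand_as(g)                                # placeholder additive fusion
        return self.decoder(fused)
```

In the actual method, instance detection and fusion are part of the learned translation model; the simple averaging and addition above serve only to make the separate-then-fuse structure concrete.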
In the second axis, we study the 3D information embedded in comics imagery by developing a context-based depth estimation method. To that end, we use the aforementioned unsupervised image-to-image translation method to translate comics images into natural ones, and then apply an attention-guided monocular depth estimator to predict their depth. This lets us leverage the depth annotations of existing natural images to train the depth estimator. Furthermore, our model learns multi-modal information by distinguishing between text and images in comics panels, which reduces text-based artifacts in the depth estimates. However, we do not explicitly study the semantic units in this axis of research.
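The overall translate-then-estimate pipeline can be summarized by the sketch below, assuming pre-trained `translator` and `depth_estimator` modules (both names are hypothetical); the explicit masking of text regions shown here is only a crude stand-in for the learned multi-modal treatment described above:

```python
import torch

def comics_depth(panel: torch.Tensor, translator, depth_estimator, text_mask=None):
    """Sketch of the translate-then-estimate pipeline (names are hypothetical)."""
    natural = translator(panel)          # comics panel -> natural-image domain
    depth = depth_estimator(natural)     # attention-guided monocular depth estimate
    if text_mask is not None:
        # Crude stand-in for the multi-modal handling of text regions:
        # down-weight speech balloons / captions so they do not create artifacts.
        depth = depth * (1.0 - text_mask)
    return depth
```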
Unifying the dense image-based prediction tasks in a single framework calls for a multitasking method. We therefore introduce an end-to-end transformer-based multitasking framework that allows us to solve multiple dense tasks simultaneously. We achieve this unification by modeling pairwise task relationships from empirical observations. However, the predictions from the preceding methods do not transfer to novel tasks. To learn transferable and generalizable task relationships, we therefore introduce multitasking vision adapters that learn to solve multiple dense prediction tasks in a parameter-efficient manner while generalizing to unseen tasks and different domains.
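To make the parameter-efficient adapter idea concrete, the following is a minimal sketch, assuming a frozen transformer backbone that produces shared token features; the per-task bottleneck adapters, head shapes, and task names are illustrative assumptions, not the thesis design:

```python
import torch
import torch.nn as nn

class TaskAdapter(nn.Module):
    """Small residual bottleneck adapter trained per task on shared features."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual bottleneck

class MultiTaskAdapters(nn.Module):
    """One lightweight adapter and prediction head per dense task."""
    def __init__(self, dim=768, tasks=("depth", "segmentation", "normals")):
        super().__init__()
        self.adapters = nn.ModuleDict({t: TaskAdapter(dim) for t in tasks})
        self.heads = nn.ModuleDict({t: nn.Linear(dim, 1) for t in tasks})

    def forward(self, shared_tokens):
        # shared_tokens: B x N x dim features from a frozen transformer backbone
        return {t: self.heads[t](self.adapters[t](shared_tokens))
                for t in self.adapters}
```

Because only the adapters and heads carry task-specific parameters, adding a new task in this sketch amounts to training one extra bottleneck and head while the shared backbone stays fixed.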