Infoscience

doctoral thesis

Unsupervised Visual Entity Abstraction towards 2D and 3D Compositional Models

Besbinar, Beril  
2022

Object-centric learning has gained significant attention in recent years, as it can serve as a powerful tool for analyzing complex scenes as compositions of simpler entities. Well-established computer vision tasks, such as object detection and instance segmentation, are generally posed in supervised settings. The recent surge of fully unsupervised approaches to entity abstraction, which often tackle the problem with generative modeling or self-supervised learning, indicates a rising interest in structured representations in the form of objects or object parts. Indeed, such representations can benefit many challenging tasks in visual analysis, reasoning, forecasting, and planning, and offer a path toward combinatorial generalization. In this thesis, we exploit different consistency constraints to disambiguate entities in fully unsupervised settings. We first consider videos and infer entities that can be modeled by consistent motion between frames at different time steps. Unconventionally, we represent objects with amodal masks and investigate methods for accumulating information about each entity over time to obtain an occlusion-aware decomposition. Approximating motion with parametric spatial transformations enables us to impose long-term cyclic consistency, which helps in reasoning about unseen parts of entities. We then develop a video prediction model based on this decomposition scheme. Because the proposed decomposition decouples motion from entity appearance, we attribute the inherent stochasticity of the video prediction problem to our parametric motion model and propose a three-stage training scheme for more plausible predictions. After deterministic decomposition in the first stage, we train the new model for short-term prediction in stochastic settings. Long-term prediction, as the last stage, helps us learn the distribution of motion present in the dataset for each entity.
Finally, we focus on multi-view image settings and consider two different arrangements, in both of which the scene is observed from different viewpoints. We attempt to find correspondences between the volumetric representations of those observations, guided by differentiable rendering algorithms. By grouping the volume units based on consistent feature matching, we partition the volumetric representation, which allows each inferred entity to be rendered individually. We present promising results for all of the proposed unsupervised object-representation schemes on synthetic datasets and, as future work, outline ideas for scaling them up to real-world data.

Name: EPFL_TH8166.pdf
Type: N/a
Access type: openaccess
License Condition: copyright
Size: 17.23 MB
Format: Adobe PDF
Checksum (MD5): ca47ca4c84f8dae7d319d2e989f3142b

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved