conference paper

Model Non-Collapse: Minimax Bounds for Recursive Discrete Distribution Estimation

Kanabar, Millen  
•
Gastpar, Michael  
June 22, 2025
2025 IEEE International Symposium on Information Theory (ISIT)

Learning discrete distributions from i.i.d. samples is a well-understood problem. However, advances in generative machine learning prompt an interesting new, non-i.i.d. setting: after receiving a certain number of samples, an estimated distribution is fixed, and samples from this estimate are drawn and introduced into the sample corpus, undifferentiated from real samples. Subsequent generations of estimators now face contaminated environments, a scenario referred to in the machine learning literature as self-consumption. Empirically, it has been observed that models in fully synthetic self-consuming loops collapse: their performance deteriorates with each batch of training. Accumulating data, however, has been shown to prevent complete degeneration. This, in turn, raises the question: What happens when fresh real samples are added at every stage? In this paper, we study the minimax loss of self-consuming discrete distribution estimation in such loops. We show that even when model collapse is consciously averted, the ratios between the minimax losses with and without source information can grow unbounded as the batch size increases. In the data accumulation setting, where all batches of samples are available for estimation, we provide minimax lower and upper bounds that are order-optimal under mild conditions for the expected ℓ2² and ℓ1 losses at every stage. We provide conditions and examples for regimes where there is a strict gap in the convergence rates compared to the corresponding oracle-assisted minimax loss where real and synthetic samples are differentiated. We also provide a lower bound on the minimax loss in the data replacement setting, where only the latest batch of samples is available, and use it to find a lower bound for the worst-case loss for bounded estimate trajectories.
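
The self-consuming, data-accumulation loop described above can be illustrated with a minimal simulation sketch. The plug-in empirical-frequency estimator, the alphabet size, the batch size, and the number of stages below are arbitrary assumptions for illustration only; they are not the estimators or parameters analyzed in the paper.

```python
# Illustrative sketch of a self-consuming loop with data accumulation:
# at every stage a fresh batch of real samples arrives, and from stage 1 on
# the previous estimate also injects synthetic samples into the pooled corpus,
# indistinguishable from real ones. The estimator here is the plain empirical
# frequency over the accumulated corpus (a placeholder, not the paper's method).
import numpy as np

rng = np.random.default_rng(0)

k = 10        # alphabet size (assumed)
n = 1000      # samples per batch (assumed)
stages = 5    # number of generations in the loop (assumed)
p_true = rng.dirichlet(np.ones(k))   # unknown source distribution

corpus = np.zeros(k)   # pooled counts: real and synthetic, undifferentiated
p_hat = None

for t in range(stages):
    # Fresh real samples are added at every stage.
    corpus += rng.multinomial(n, p_true)
    # Synthetic samples drawn from the previous estimate contaminate the corpus.
    if p_hat is not None:
        corpus += rng.multinomial(n, p_hat)
    # Re-estimate from the accumulated corpus.
    p_hat = corpus / corpus.sum()
    l2_sq = np.sum((p_hat - p_true) ** 2)
    l1 = np.sum(np.abs(p_hat - p_true))
    print(f"stage {t}: l2^2 = {l2_sq:.2e}, l1 = {l1:.3f}")
```

Replacing the accumulation step with `corpus = ...` (keeping only the latest batch) would correspond to the data replacement setting mentioned in the abstract.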

Type
conference paper
DOI
10.1109/isit63088.2025.11195231
Author(s)
Kanabar, Millen (École Polytechnique Fédérale de Lausanne)
Gastpar, Michael (École Polytechnique Fédérale de Lausanne)
Date Issued
2025-06-22
Publisher
IEEE
Published in
2025 IEEE International Symposium on Information Theory (ISIT)
ISBN of the book
979-8-3315-4399-0
Start page
1
End page
6
Editorial or Peer reviewed
REVIEWED
Written at
EPFL
EPFL units
LINX  
Event name
2025 IEEE International Symposium on Information Theory
Event acronym
ISIT 2025
Event place
Ann Arbor, MI, USA
Event date
2025-06-22 - 2025-06-27
Available on Infoscience
October 22, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/255173