Repository logo

Infoscience

  • English
  • French
Log In
Logo EPFL, École polytechnique fédérale de Lausanne

Infoscience

  • English
  • French
Log In
  1. Home
  2. Academic and Research Output
  3. Journal articles
  4. Model Non-Collapse: Minimax Bounds for Recursive Discrete Distribution Estimation
 
research article

Model Non-Collapse: Minimax Bounds for Recursive Discrete Distribution Estimation

Kanabar, Millen  
•
Gastpar, Michael  
March 2026
IEEE Transactions on Information Theory

Learning discrete distributions from i.i.d. samples is a well-understood problem. However, advances in generative machine learning prompt an interesting new, non-i.i.d. setting: after receiving a certain number of samples, an estimated distribution is fixed, and samples from this estimate are drawn and introduced into the sample corpus, undifferentiated from real samples. Subsequent generations of estimators now face contaminated environments, a scenario referred to in the machine learning literature as self-consumption. Empirically, it has been observed that models in fully synthetic self-consuming loops collapse—their performance deteriorates with each batch of training—but accumulating data has been shown to prevent complete degeneration. This, in turn, begs the question: What happens when fresh real samples are added at every stage? In this paper, we study the minimax loss of self-consuming discrete distribution estimation in such loops. We show that even when model collapse is consciously averted, the ratios between the minimax losses with and without source information can grow unbounded as the batch size increases. In the data accumulation setting, where all batches of samples are available for estimation, we provide minimax lower bounds and upper bounds that are order-optimal under mild conditions for the expected ℓ22 and ℓ1 losses at every stage. We provide conditions for regimes where there is a strict gap in the convergence rates compared to the corresponding oracle-assisted minimax loss where real and synthetic samples are differentiated, and provide examples where this gap is easily observed. We also provide a lower bound on the minimax loss in the data replacement setting, where only the latest batch of samples is available, and use it to find a lower bound for the worst-case loss for bounded estimate trajectories.

  • Files
  • Details
  • Metrics
Loading...
Thumbnail Image
Name

2501.19273v4.pdf

Type

Main Document

Version

Submitted version (Preprint)

Access type

openaccess

License Condition

CC BY

Size

671.79 KB

Format

Adobe PDF

Checksum (MD5)

b8fa1d24d049f788454d572aa56e35d4

Logo EPFL, École polytechnique fédérale de Lausanne
  • Contact
  • infoscience@epfl.ch

  • Follow us on Facebook
  • Follow us on Instagram
  • Follow us on LinkedIn
  • Follow us on X
  • Follow us on Youtube
AccessibilityLegal noticePrivacy policyCookie settingsEnd User AgreementGet helpFeedback

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, tous droits réservés