Infoscience

master thesis

Quantifying Training Data Retention in Large Language Models: An Analysis of Pretraining Factors and Mitigation Strategies

Xu, Yixuan
August 29, 2025

Large language models (LLMs) have demonstrated remarkable capabilities but face a significant challenge: they inadvertently memorize portions of their training data, raising serious concerns about privacy and potential copyright violations. Most existing efforts to address this issue are reactive, filtering model outputs or fine-tuning after problematic content has already been encoded into the model. This thesis takes a proactive approach by investigating methods to control memorization during the pretraining phase. The study leverages Goldfish Loss, a modification of the training objective designed to discourage verbatim memorization of long sequences such as copyrighted documents, and compares two experimental conditions: Dense Gutenberg (extreme data repetition) and Sparse Gutenberg (sparse text inclusion). The experiments uncover a critical phenomenon dubbed the Offset Effect: minor shifts in a prompt's starting position can dramatically alter whether a model reproduces memorized text. The study also reveals a connection between memorization and text degradation: when models cannot retrieve memorized content, whether due to mitigation strategies or limited exposure, they tend to generate repetitive, lower-quality text, suggesting that limits on memory retrieval drive degenerate output. Building on these findings, the thesis argues for adding an offset dimension to verbatim-memorization evaluation frameworks, promising more accurate assessment methods and proactive strategies for mitigating memorization risks in large language models.
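The core idea behind Goldfish Loss, as described in the literature the abstract draws on, is to exclude a pseudorandomly chosen subset of tokens from the next-token training loss so that no long passage is ever supervised verbatim end to end. The sketch below is a minimal illustration of that masking step, not the thesis's implementation: the function name, the hashed-context window `h`, and the drop rate `1/k` are illustrative assumptions.

```python
import hashlib

def goldfish_mask(token_ids, k=4, h=13):
    """Return a 0/1 loss mask for a token sequence.

    For each position, hash the h preceding token ids; when the hash
    falls into a 1-in-k bucket, the token is dropped from the loss
    (mask 0), so roughly 1/k of tokens never contribute to the
    next-token objective. Hashing the local context makes the mask
    deterministic, so repeated copies of the same passage drop the
    same tokens every epoch.
    """
    mask = []
    for i in range(len(token_ids)):
        context = token_ids[max(0, i - h):i]
        digest = hashlib.sha256(repr(context).encode("utf-8")).digest()
        drop = int.from_bytes(digest[:4], "big") % k == 0
        mask.append(0 if drop else 1)
    return mask

tokens = list(range(32))  # stand-in for a tokenized training sequence
mask = goldfish_mask(tokens, k=4)
print(f"{sum(mask)}/{len(mask)} tokens kept in the loss")
```

In a real training loop this mask would be multiplied element-wise against the per-token cross-entropy before averaging, leaving the forward pass and generation untouched.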

Type
master thesis
Author(s)
Xu, Yixuan
Advisors
Bosselut, Antoine; Schlag, Imanol
Date Issued
2025-08-29
Publisher
EPFL
Publisher place
Lausanne
Number of pages
67
Written at
EPFL
EPFL units
NLP
Faculty
IC
Section
IC-SIN
Available on Infoscience
September 1, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/253615
  • Contact
  • infoscience@epfl.ch


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved.