Infoscience
EPFL, École polytechnique fédérale de Lausanne
Master thesis

Quantifying Training Data Retention in Large Language Models: An Analysis of Pretraining Factors and Mitigation Strategies

Xu, Yixuan
August 29, 2025

Large language models (LLMs) have demonstrated remarkable capabilities but face a significant challenge: they inadvertently memorize portions of their training data, raising serious privacy and copyright concerns. Most existing mitigations are reactive, filtering model outputs or fine-tuning after problematic content has already been encoded into the model. This thesis takes a proactive approach, investigating methods to control memorization during the pretraining phase. The study leverages Goldfish Loss, a modification to the training objective designed to discourage verbatim memorization of long sequences such as copyrighted documents, and compares two experimental conditions: Dense Gutenberg (extreme data repetition) and Sparse Gutenberg (sparse text inclusion). The experiments uncover a critical phenomenon, dubbed the Offset Effect: minor shifts in a prompt's starting position can dramatically alter whether a model reproduces memorized text. The study also reveals a connection between memorization and text degradation: when models cannot retrieve memorized content, whether because of mitigation strategies or limited exposure, they tend to generate repetitive, lower-quality text, suggesting that failed memory retrieval itself drives degenerate output. Building on these quantitative findings, the thesis argues for adding an offset dimension to verbatim-memorization evaluation frameworks, enabling more accurate assessment and proactive mitigation of memorization risks in large language models.
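The abstract does not spell out the exact formulation of Goldfish Loss used in the thesis, but the general technique can be sketched: a subset of token positions is excluded from the next-token cross-entropy loss, so the model never receives a gradient for those targets and cannot learn a complete verbatim copy of the sequence. The sketch below uses the simplest deterministic variant (drop every k-th token); the function names and the toy data are illustrative, not taken from the thesis.

```python
import numpy as np

def goldfish_mask(n_tokens, k=4):
    """Return a boolean mask where True marks tokens that contribute
    to the training loss; every k-th position is excluded, so the
    model gets no learning signal for those targets."""
    mask = np.ones(n_tokens, dtype=bool)
    mask[k - 1 :: k] = False
    return mask

def masked_cross_entropy(logits, targets, mask):
    """Mean next-token cross-entropy over unmasked positions only."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].mean()

# Toy example: 8 target tokens drawn from a vocabulary of 10.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 10))
targets = rng.integers(0, 10, size=8)
mask = goldfish_mask(8, k=4)  # positions 3 and 7 are dropped
loss = masked_cross_entropy(logits, targets, mask)
```

Because the dropped positions are never trained on, greedy decoding of a memorized passage is likely to diverge at those positions, which is what makes the approach a pretraining-time mitigation rather than an output filter.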

Name: MSc_Thesis.pdf
Type: Main Document
Version: Not Applicable (or Unknown)
Access type: Open access
License Condition: CC BY
Size: 2.16 MB
Format: Adobe PDF
Checksum (MD5): 2c02ccdaeec541c7342dcc90d6bbbe8b

Contact: infoscience@epfl.ch


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved.