Infoscience

master thesis

Data Augmentation Strategies for Historical Named Entity Recognition

Blinière, Léa
2026

Named Entity Recognition (NER) performs substantially worse on historical documents than on modern text because of OCR errors, language evolution, and limited annotated training data. While various strategies have been explored, data augmentation techniques, despite their proven effectiveness on modern NER benchmarks, remain to our knowledge unexplored for historical document processing. This thesis examines data augmentation strategies for historical NER through systematic error analysis and comparative evaluation on French historical newspapers. We evaluate two complementary approaches: internal augmentation, inspired by mention replacement techniques, which reuses entities extracted from the training data; and external augmentation, which injects LLM-annotated text from another corpus. The two approaches differ sharply in effectiveness. Internal augmentation yields substantial improvements, particularly for entity types underrepresented in the original training dataset. Furthermore, we show that systematically recombining existing training entities exposes models to greater surface form variation, which supports genuine generalization beyond the training vocabulary, as evidenced by improved performance on previously unseen entities. Conversely, our LLM-based annotation strategy systematically degrades performance because the pseudo-labels are of poor quality, a failure stemming from the complexity of the task-specific annotation guidelines. This work demonstrates that simple augmentation exploiting existing training data can meaningfully improve historical NER performance, while viable LLM-based augmentation will require more sophisticated annotation pipelines.
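To make the idea of internal augmentation concrete, the following is a minimal sketch of generic mention replacement on BIO-tagged sentences. It is not the thesis's implementation: the function names (`extract_spans`, `build_mention_pool`, `replace_mentions`), the BIO tagging scheme, the replacement probability `p`, and the toy sentences are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# A sentence is a list of (token, BIO tag) pairs. Assumed format, not taken from the thesis.
Sentence = list[tuple[str, str]]

def extract_spans(sentence: Sentence) -> list[tuple[int, int, str]]:
    """Return (start, end, type) for each entity span, with end exclusive."""
    spans, start, etype = [], None, None
    for i, (_, tag) in enumerate(sentence):
        # Close the open span when the current tag does not continue it.
        if start is not None and tag != f"I-{etype}":
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(sentence), etype))
    return spans

def build_mention_pool(corpus: list[Sentence]) -> dict[str, list[list[str]]]:
    """Collect every training mention, grouped by entity type."""
    pool = defaultdict(list)
    for sentence in corpus:
        for start, end, etype in extract_spans(sentence):
            pool[etype].append([tok for tok, _ in sentence[start:end]])
    return pool

def replace_mentions(sentence: Sentence, pool, rng: random.Random, p: float = 0.5) -> Sentence:
    """Swap each mention, with probability p, for a same-type mention from the pool."""
    out, cursor = [], 0
    for start, end, etype in extract_spans(sentence):
        out.extend(sentence[cursor:start])  # copy the context before the mention
        candidates = pool.get(etype)
        if candidates and rng.random() < p:
            tokens = rng.choice(candidates)
        else:
            tokens = [tok for tok, _ in sentence[start:end]]
        # Re-derive BIO tags, since the new mention may have a different length.
        tags = [f"B-{etype}"] + [f"I-{etype}"] * (len(tokens) - 1)
        out.extend(zip(tokens, tags))
        cursor = end
    out.extend(sentence[cursor:])  # copy the context after the last mention
    return out

if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    corpus = [
        [("M.", "O"), ("Jean", "B-PER"), ("Dupont", "I-PER"), ("visite", "O"), ("Lausanne", "B-LOC")],
        [("Marie", "B-PER"), ("arrive", "O"), ("à", "O"), ("Genève", "B-LOC")],
    ]
    pool = build_mention_pool(corpus)
    print(replace_mentions(corpus[0], pool, random.Random(0), p=1.0))
```

Because replacements are drawn only from mentions of the same type, the sentence context and label sequence stay consistent while the surface forms vary, which is the kind of recombination of existing training entities that the abstract describes.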

Name: EPFL_MAThesis_Report_Blinière.pdf
Type: Main Document
Version: Not Applicable (or Unknown)
Access type: openaccess
License Condition: CC BY-SA
Size: 6.13 MB
Format: Adobe PDF
Checksum (MD5): 5fbf7b467296bf2597ffaf3218df703b

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved