Data Augmentation Strategies for Historical Named Entity Recognition
Named entity recognition (NER) on historical documents performs substantially worse than on modern text because of OCR errors, language evolution, and limited annotated training data. Various strategies have been explored to close this gap, but data augmentation techniques, despite their proven effectiveness on modern NER benchmarks, remain unexplored for historical document processing to our knowledge. This thesis examines data augmentation strategies for historical NER through systematic error analysis and comparative evaluation on French historical newspapers. We evaluate two complementary approaches: internal augmentation, which is inspired by mention replacement techniques and draws its entities from the training data itself, and external augmentation, which injects text from another corpus annotated by an LLM. The results show sharply divergent effectiveness. Internal augmentation achieved substantial improvements, particularly for entity types underrepresented in the original training dataset. Furthermore, we demonstrate that systematically recombining existing training entities exposes models to greater surface form variation. This enables genuine generalization beyond the training vocabulary, as evidenced by improved performance on previously unseen entities. Conversely, our LLM-based annotation strategy produced systematic degradation: the pseudo-labels were of poor quality because the task-specific annotation guidelines were too complex for the LLM to apply reliably. This work demonstrates that simple augmentation exploiting existing training data can meaningfully improve historical NER performance, while viable LLM-based augmentation will require more sophisticated annotation pipelines.
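To make the idea of internal augmentation concrete, the following is a minimal sketch of generic mention replacement, not the thesis's actual implementation. It assumes BIO-tagged token sequences; the tag names (PER, LOC), the example sentences, the replacement probability `p`, and the function names are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def extract_mentions(tokens, tags):
    """Return (start, end, type) spans from BIO tags; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current mention
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def build_mention_pool(corpus):
    """Group every training mention (as a token list) by entity type."""
    pool = defaultdict(list)
    for tokens, tags in corpus:
        for start, end, etype in extract_mentions(tokens, tags):
            pool[etype].append(tokens[start:end])
    return pool


def replace_mentions(tokens, tags, pool, p=0.5, rng=random):
    """With probability p, swap each mention for a same-type pool mention."""
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, etype in extract_mentions(tokens, tags):
        # Copy the context between the previous mention and this one.
        new_tokens += tokens[cursor:start]
        new_tags += tags[cursor:start]
        mention = tokens[start:end]
        candidates = pool.get(etype)
        if candidates and rng.random() < p:
            mention = rng.choice(candidates)
        # Re-derive BIO tags, since the new mention may differ in length.
        new_tokens += mention
        new_tags += ["B-" + etype] + ["I-" + etype] * (len(mention) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Hypothetical toy corpus standing in for annotated training sentences.
corpus = [
    (["M.", "Dupont", "arrive", "à", "Lausanne"],
     ["O", "B-PER", "O", "O", "B-LOC"]),
    (["Jean", "Martin", "quitte", "Genève"],
     ["B-PER", "I-PER", "O", "B-LOC"]),
]
pool = build_mention_pool(corpus)
augmented = [replace_mentions(t, g, pool, p=1.0, rng=random.Random(0))
             for t, g in corpus]
```

Each augmented sentence keeps its original context but may carry a different entity of the same type, which is one way a model could be exposed to more surface forms per context without any new annotation.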
EPFL_MAThesis_Report_Blinière.pdf
Main Document
openaccess
CC BY-SA
6.13 MB
Adobe PDF
5fbf7b467296bf2597ffaf3218df703b