conference paper

Post-correction of Historical Text Transcripts with Large Language Models: An Exploratory Study

Boros, Emanuela • Ehrmann, Maud • Romanello, Matteo • Najem-Meyer, Sven • Kaplan, Frédéric
February 18, 2024
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
The 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

The quality of automatic transcription of heritage documents, whether from printed, manuscript or audio sources, has a decisive impact on the ability to search and process historical texts. Although significant progress has been made in text recognition (OCR, HTR, ASR), textual materials derived from library and archive collections remain largely erroneous and noisy. Effective post-transcription correction methods are therefore necessary and have been intensively researched for many years. As large language models (LLMs) have recently shown exceptional performance in a variety of text-related tasks, we investigate their ability to amend poor historical transcriptions. We evaluate fourteen foundation language models against various post-correction benchmarks comprising different languages, time periods and document types, as well as different transcription qualities and origins. We compare the performance of different model sizes and different prompts of increasing complexity in zero- and few-shot settings. Our evaluation shows that LLMs are anything but efficient at this task. Quantitative and qualitative analyses of the results allow us to share valuable insights for future work on post-correcting historical texts with LLMs.

Type
conference paper
Author(s)
Boros, Emanuela
Ehrmann, Maud
Romanello, Matteo
Najem-Meyer, Sven
Kaplan, Frédéric
Date Issued

2024-02-18

Publisher

Association for Computational Linguistics

Published in
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
ISBN of the book

979-8-89176-069-1

Start page

133

End page

159

Subjects

large language models • OCR post-correction • historical texts • evaluation

URL

Link ACL Anthology

https://aclanthology.org/2024.latechclfl-1.14/
Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
DHLAB  
Event name

The 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Event place

St Julian's, Malta

Event date

March 22, 2024

Available on Infoscience
February 18, 2024
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/203985