Document processing in data-scarce, domain-specific environments: The case of multilingual classical commentaries.
This dissertation investigates the challenges and opportunities of adapting state-of-the-art document processing algorithms to the intricate and relatively data-scarce field of classical scholarship. Focusing on classical commentaries (a type of publication that provides exegetical annotations on ancient texts), this study explores the unique difficulties posed by the frequent interweaving of ancient Greek with modern languages, by dense scholarly layouts, and by highly specialised prose.
Building on a collection of digitised commentaries, the study follows the main stages of a typical document processing pipeline, including text extraction, layout analysis, and information retrieval using domain-specific language modelling. At each stage, it investigates the best strategies for addressing the limited resources available in classical scholarship.
Given the pivotal role that data plays in contemporary machine learning, this study advocates a three-pronged approach: enhancing and augmenting in-domain data, specialising models, and applying these models to the specific challenges of document processing in the field. Ultimately, this research aims to derive findings that may generalise to other data-scarce, domain-specific fields, and it releases state-of-the-art specialised models for optical character recognition, page layout analysis, language modelling, and information extraction.
Prof. Jérôme Baudry (president); Prof. Frédéric Kaplan, Dr Matteo Romanello (thesis directors); Prof. Antoine Bosselut, Prof. David Smith, Prof. Simon Clematide (examiners)
2025
Lausanne
2025-01-31
11208