Document processing in data-scarce, domain-specific environments: The case of multilingual classical commentaries
This dissertation investigates the challenges and opportunities of adapting state-of-the-art document processing algorithms to the intricate and relatively data-scarce field of classical scholarship. Focusing on classical commentaries, a type of publication that provides exegetical annotations on ancient texts, this study explores the unique difficulties posed by the frequent interweaving of ancient Greek with modern languages, by dense scholarly layouts, and by highly specialised prose.
Building on a collection of digitised commentaries, the study follows the main stages of a typical document processing pipeline, including text extraction, layout analysis, and information retrieval using domain-specific language modelling. At each stage, it investigates the best strategies for coping with the limited resources available in classical scholarship.
Given the pivotal role that data plays in contemporary machine learning, this study advocates a three-pronged approach: enhancing and augmenting in-domain data, specialising models, and applying these models to the specific challenges of document processing in the field. Ultimately, this research aims to derive findings that may be applicable to other data-scarce, domain-specific fields, and it releases state-of-the-art specialised models for optical character recognition, page layout analysis, language modelling, and information extraction.
EPFL_TH11208.pdf