Infoscience

doctoral thesis

Document processing in data-scarce, domain-specific environments: The case of multilingual classical commentaries

Najem-Meyer, Sven  
2025

This dissertation investigates the challenges and opportunities of adapting state-of-the-art document processing algorithms to the intricate and relatively data-scarce field of classical scholarship. Focusing on classical commentaries, a type of publication that provides exegetical annotations on ancient texts, this study explores the unique difficulties posed by the frequent interweaving of ancient Greek with modern languages, by dense scholarly layouts, and by highly specialised prose.

Building on a collection of digitised commentaries, the study follows the main stages of a typical document processing pipeline: text extraction, layout analysis, and information retrieval using domain-specific language modelling. At each stage, it investigates the best strategies for coping with the limited resources available in classical scholarship.

Given the pivotal role that data plays in contemporary machine learning, this study advocates a three-pronged approach: enhancing and augmenting in-domain data, specialising models, and applying these to tackle the specific challenges of document processing in the field. Ultimately, this research aims to generalise findings that may be applicable to other data-scarce, domain-specific fields and releases state-of-the-art specialised models for optical character recognition, page layout analysis, language modelling, and information extraction.

Name: EPFL_TH11208.pdf
Type: Main Document
Access type: restricted
License Condition: N/A
Size: 12.59 MB
Format: Adobe PDF
Checksum (MD5): 0d12bff7bfdae1ecb3408e0e89256ea9


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved