Infoscience

master thesis

Exploring Large Vision-Language Models for Historical Newspaper Segmentation

Papadopoulos, Danae Christine
January 31, 2025

Historical newspapers are invaluable for understanding past events, editorial viewpoints, and societal norms. Yet, although libraries worldwide have digitized vast numbers of these documents, most newspaper pages remain unsegmented, making it harder for historians to pinpoint specific subjects of interest within the content. This thesis addresses the challenge of automatically segmenting and labeling historical newspaper layouts using Large Vision Models (LVMs), focusing on data from two major collections in the Impresso corpus: the Bibliothèque Nationale du Luxembourg (BNL) and the Bibliothèque Nationale de France (BNF).

We begin by investigating existing silver-standard annotations, which suffer from inconsistencies due to multiple, heterogeneous digitization campaigns. Through detailed layout exploration, we uncover and resolve key annotation errors, merging and cleaning the corpus to provide more uniform training data. We then implement a sampling strategy designed to capture the temporal, editorial, and structural diversity inherent in historical newspapers, ensuring robust generalization to a variety of page layouts.
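The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one common way to approach it: stratified sampling, here grouping pages by newspaper title and decade so that small or early groups are not drowned out by large ones. The record fields (`title`, `year`), the function name `stratified_sample`, and the toy data are all hypothetical and are not taken from the thesis.

```python
import random
from collections import defaultdict


def stratified_sample(pages, per_stratum, seed=0):
    """Draw up to `per_stratum` pages from each (title, decade) group.

    Assumption: each page is a dict with hypothetical keys "title" and "year".
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for page in pages:
        decade = page["year"] // 10 * 10
        strata[(page["title"], decade)].append(page)

    sample = []
    for key in sorted(strata):  # sorted keys keep the output order stable
        group = strata[key]
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample


# Toy records (made up): 40 pages of one title, 5 pages of another.
records = [("paper_a", 1850 + n) for n in range(40)]
records += [("paper_b", 1900 + n) for n in range(5)]
pages = [{"id": i, "title": t, "year": y} for i, (t, y) in enumerate(records)]

sample = stratified_sample(pages, per_stratum=2)
```

With these toy records there are five (title, decade) groups, so the sample contains two pages from each. A real pipeline would likely add a structural dimension (for example, number of columns or regions per page) to the stratum key.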

On top of this curated dataset, we train and evaluate a state-of-the-art YOLOS model to detect and classify nine distinct layout elements: article title, article, image, image caption, ad, table, death notice, weather, and earrings.

Our best-performing model achieves an overall mean Average Precision (mAP) of 0.23, with its strongest performance on visually distinct categories such as images (AP 0.53) and ads (AP 0.43). While smaller or less frequent elements (e.g., weather) remain challenging, these findings demonstrate that LVMs can effectively segment historical newspaper pages when trained on carefully cleaned data. We conclude that continued refinement of annotations, along with targeted domain-specific pretraining, holds promise for further improving automatic layout analysis in historical newspapers.
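Under the usual object-detection convention, mAP is the unweighted mean of the per-class AP values, which is why a few strong classes can coexist with a much lower overall score. Each AP in turn depends on matching predicted boxes to ground-truth boxes by intersection over union (IoU). The sketch below illustrates both quantities. The abstract does not state which IoU thresholds or AP variant were used, and the class names and numbers in the example are toy values, not results from the thesis.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_average_precision(ap_per_class):
    """Unweighted mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Toy values only (hypothetical classes and scores).
toy_ap = {"class_a": 0.5, "class_b": 0.25, "class_c": 0.0}
toy_map = mean_average_precision(toy_ap)

# Two 10x10 boxes overlapping on half their width.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Because the mean is unweighted, a rare class with an AP near zero lowers the overall score as much as a frequent one would.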

Name: MA_thesis_Danae_Papadopoulos.pdf
Type: Main Document
Version: http://purl.org/coar/version/c_be7fb7dd8ff6fe43
Access type: openaccess
License Condition: CC BY-SA
Size: 405.85 MB
Format: Adobe PDF
Checksum (MD5): 11e0179dfd4f801ae06f060ea75ce0cc
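
Since the file is large and the record lists an MD5 checksum, a download can be checked against it. The following is a minimal sketch using Python's standard `hashlib`. The helper name `md5_of_file` is hypothetical, and the local path is an assumption that the PDF was saved in the working directory under the name shown on this record.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum as listed on the record page.
EXPECTED_MD5 = "11e0179dfd4f801ae06f060ea75ce0cc"
# Assumed local path: the file name from the record, in the current directory.
LOCAL_PATH = "MA_thesis_Danae_Papadopoulos.pdf"

if os.path.exists(LOCAL_PATH):
    matches = md5_of_file(LOCAL_PATH) == EXPECTED_MD5
    print("checksum OK" if matches else "checksum mismatch")
```

Reading in chunks keeps memory use flat regardless of file size. MD5 is adequate for detecting a corrupted transfer but should not be relied on as a security guarantee.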


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved