Exploring Large Vision-Language Models for Historical Newspaper Segmentation
Historical newspapers are invaluable for understanding past events, editorial viewpoints, and societal norms. Yet, although libraries worldwide have digitized vast numbers of these documents, most newspaper pages remain unsegmented, making it harder for historians to pinpoint specific subjects of interest within the content. This thesis addresses the challenge of automatically segmenting and labeling historical newspaper layouts using Large Vision Models (LVMs), focusing on data from two major collections in the Impresso corpus: the Bibliothèque Nationale du Luxembourg (BNL) and the Bibliothèque Nationale de France (BNF).
We begin by investigating existing silver-standard annotations, which suffer from inconsistencies due to multiple, heterogeneous digitization campaigns. Through detailed layout exploration, we uncover and resolve key annotation errors, merging and cleaning the corpus to provide more uniform training data. We then implement a sampling strategy designed to capture the temporal, editorial, and structural diversity inherent in historical newspapers, ensuring robust generalization to a variety of page layouts.
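One way to picture such a sampling strategy is stratified sampling. The sketch below is only an illustration under stated assumptions, not the procedure used in the thesis: the field names `collection`, `title`, and `year`, the choice of decade as the temporal stratum, and the `stratified_sample` helper are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(pages, per_stratum, seed=0):
    """Draw up to `per_stratum` pages from each (collection, title, decade) group.

    `pages` is a list of dicts with the hypothetical keys
    "collection", "title", and "year".
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group pages by stratum so that rare titles and decades are not
    # drowned out by the largest digitization campaigns.
    strata = defaultdict(list)
    for page in pages:
        decade = page["year"] // 10 * 10
        strata[(page["collection"], page["title"], decade)].append(page)

    sample = []
    for key in sorted(strata):  # sorted for a deterministic iteration order
        group = strata[key]
        k = min(per_stratum, len(group))  # small strata contribute all pages
        sample.extend(rng.sample(group, k))
    return sample
```

Capping each stratum rather than sampling proportionally trades faithfulness to the corpus distribution for coverage of uncommon layouts; which trade-off suits a given corpus is a design decision.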
On top of this curated dataset, we train and evaluate a state-of-the-art YOLOS model to detect and classify nine distinct layout elements: article title, article, image, image caption, ad, table, death notice, weather, and earrings.
Our best-performing model achieves an overall mean Average Precision (mAP) of 0.23, with its strongest performance on visually distinct categories such as images (AP 0.53) and ads (AP 0.43). While smaller or less frequent elements (e.g., weather) remain challenging, these findings demonstrate that LVMs can effectively segment historical newspaper pages when trained on carefully cleaned data. We conclude that continued refinement of annotations, along with targeted domain-specific pretraining, holds promise for further improving automatic layout analysis in historical newspapers.
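For readers unfamiliar with the metric, the sketch below shows one common way to compute per-class AP and average it into mAP. It is a minimal illustration, not the evaluation code of the thesis: the IoU threshold of 0.5, the greedy matching rule, the single-page scope, and all function names are assumptions, and the abstract does not say which mAP variant (for example, one averaged over several IoU thresholds) was used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_thr=0.5):
    """AP for one class. `preds` is a list of (score, box); `gts` a list of boxes."""
    if not gts:
        return 0.0
    matched, flags = set(), []
    # Visit predictions from most to least confident; each one is matched
    # greedily to the best still-unmatched ground-truth box.
    for _score, box in sorted(preds, key=lambda p: -p[0]):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(box, gt)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_iou >= iou_thr:
            matched.add(best_j)
            flags.append(True)   # true positive
        else:
            flags.append(False)  # false positive

    precisions, recalls, tp = [], [], 0
    for i, flag in enumerate(flags, start=1):
        tp += flag
        precisions.append(tp / i)
        recalls.append(tp / len(gts))

    # Make precision non-increasing in recall (the usual interpolation step).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_by_class):
    """Unweighted mean of per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)
```

Because the mean is unweighted, a few hard classes with AP near zero pull the overall figure well below the scores of the strongest classes, which is consistent with an mAP of 0.23 alongside AP values of 0.53 and 0.43 for images and ads.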
MA_thesis_Danae_Papadopoulos.pdf
openaccess
CC BY-SA