Infoscience

conference paper

Toward Interoperable and Scalable Representations of Complex Heterogeneous Digitized Historical Media

Conti, Pauline Isabela; Clematide, Simon; Ehrmann, Maud
May 16, 2026
Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers (PressMint 2026)
First Workshop on Creating Interoperable Corpora of Historical Newspapers (PressMint-LREC2026)

The value of digitized historical media archives for computational historical research is now well established, yet an underexplored challenge concerns data management itself: how to represent and process, at scale, complex primary sources that vary widely in digitization granularity, refinement quality, and archival organization and curation practices. This paper presents the data representation framework designed for large-scale processing and indexing of historical newspapers and radio broadcasts developed within the Impresso project. Grounded in a structured characterization of the heterogeneity found in digitized historical media collections, it identifies the distinct dimensions along which collections diverge and the challenges they pose for a unified representation and processing framework. The framework navigates the competing demands of machine learning pipelines requiring uniform and lightweight document representations, information retrieval systems requiring well-defined indexable content units, user-facing interfaces requiring fidelity to original sources, and the need to return semantically enriched data to archival holders in interoperable formats. We describe the design principles guiding the framework and discuss how it reconciles these constraints across highly heterogeneous collections into a unified and research-ready corpus.

Name: Impresso_Data_Representation_LREC2026_PressMintWS.pdf
Type: Main Document
Version: Published version
Access type: open access
License Condition: CC BY-NC
Size: 6.79 MB
Format: Adobe PDF
Checksum (MD5): 9856de354a664f22f8d9ee51d2813af4
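
The MD5 checksum listed for the file can be used to confirm that a downloaded copy is intact. The following is a minimal sketch, assuming the PDF has been saved to the current working directory under the file name shown in the record; the helper names `md5_of_file` and `matches_record` are hypothetical and not part of any Infoscience tooling.

```python
import hashlib
from pathlib import Path

# Values copied from the file record; the local path is an assumption.
EXPECTED_MD5 = "9856de354a664f22f8d9ee51d2813af4"
PDF_NAME = "Impresso_Data_Representation_LREC2026_PressMintWS.pdf"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    local_copy = Path(PDF_NAME)  # assumed download location
    if local_copy.exists():
        print("checksum OK" if matches_record(local_copy) else "checksum MISMATCH")
    else:
        print(f"{PDF_NAME} not found in the current directory")
```

A mismatch most likely indicates an incomplete or corrupted download. MD5 serves here only as an integrity check, not as a security guarantee.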

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved