conference paper

Language Resources for Historical Newspapers: the Impresso Collection

Ehrmann, Maud  
•
Romanello, Matteo  
•
Clematide, Simon
Calzolari, Nicoletta
•
Béchet, Frédéric
May 11, 2020
Proceedings of the 12th Language Resources and Evaluation Conference
12th International Conference on Language Resources and Evaluation (LREC)

Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge, and real promise of digitization, is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this 'Big Data of the Past'. Yet, the application of text processing tools on historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the 'impresso - Media Monitoring of the Past' project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.

Type
conference paper
DOI
10.5281/zenodo.4641902
Web of Science ID

WOS:000724697201005

Author(s)
Ehrmann, Maud  
Romanello, Matteo  
Clematide, Simon
Ströbel, Phillip Benjamin
Barman, Raphaël  
Editors
Calzolari, Nicoletta
Béchet, Frédéric
Blache, Philippe
Choukri, Khalid
Cieri, Christopher
Declerck, Thierry
Goggi, Sara
Isahara, Hitoshi
Maegaard, Bente
Mariani, Joseph
Date Issued

2020-05-11

Publisher

European Language Resources Association

Publisher place

Paris

Published in
Proceedings of the 12th Language Resources and Evaluation Conference
ISBN of the book

979-10-95546-34-4

Total of pages

10

Start page

958

End page

968

Subjects

historical and multilingual language resources
historical texts
multi-layered historical semantic annotations
OCR
named entity processing
topic modeling
text reuse
digital humanities

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
DHLAB  
Event name

12th International Conference on Language Resources and Evaluation (LREC)

Event place

Marseille, France

Event date

May 11-16 2020

Relation and URL/DOI

IsSupplementedBy: https://doi.org/10.5281/zenodo.3706823
IsSupplementedBy: https://doi.org/10.5281/zenodo.3706833
IsSupplementedBy: https://doi.org/10.5281/zenodo.3709465
Available on Infoscience
September 16, 2020
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/171696