research article

An end-to-end pipeline for historical censuses processing

Petitpierre, Rémi Guillaume; Kramer, Marion; Rappo, Lucas Arnaud André
March 17, 2023
International Journal on Document Analysis and Recognition (IJDAR)

Censuses are structured documents of great value for social and demographic history, and they became widespread from the nineteenth century onward. However, the plurality of formats and the natural variability of historical data make their extraction arduous and often lead to recognition algorithms that do not generalize. We propose an end-to-end processing pipeline, based on optimization, that aims to reduce the number of free parameters. The layout analysis relies on semantic segmentation with neural networks for generic recognition of the explicit column structure. The implicit row structure is deduced directly from the position of the text segments. Handwritten text detection is complemented by an intelligent framing method that significantly improves the quality of the handwritten text recognition (HTR). Finally, we propose combining several post-correction approaches, neural networks, and language models to further improve performance. Our flexible methods make it possible to accurately detect more than 98% of the columns and 88% of the rows, despite the lack of graphical separators and the diversity of formats. Thanks to various reframing and post-correction strategies, HTR results reach a character error rate of 3.44% on these noisy nineteenth-century data. In total, more than 18,831 pages were extracted from 72 censuses spanning a century. This large historical dataset, together with the training data, is released open-access along with this article.

Type: research article
DOI: 10.1007/s10032-023-00428-9
Author(s): Petitpierre, Rémi Guillaume; Kramer, Marion; Rappo, Lucas Arnaud André
Date Issued: 2023-03-17
Published in: International Journal on Document Analysis and Recognition (IJDAR)
Editorial or Peer reviewed: REVIEWED
Written at: EPFL
EPFL units: DHI-GE; CDH
Relation (URL/DOI):
  • IsSupplementedBy: https://infoscience.epfl.ch/record/301983
  • IsSupplementedBy: https://infoscience.epfl.ch/record/301993
Available on Infoscience: March 22, 2023
Use this identifier to reference this record: https://infoscience.epfl.ch/handle/20.500.14299/196345