Lausanne Historical Censuses Dataset HTR 35k

Rappo, Lucas; Petitpierre, Rémi; Kramer, Marion

doi:10.5281/zenodo.7780712

dataset

2023

Zenodo

This training dataset comprises 34,913 manually transcribed text segments. It is intended for the handwritten text recognition (HTR) of historical sources, typically tabular records such as censuses. The dataset is based on a sample of 83 pages from the 19th-century (1805-1898) censuses of Lausanne, Switzerland. The primary language of the documents is French, although many Germanic names and toponyms also occur.

The training data are formatted on the model of the Bentham dataset. The format thus consists simply of a list of JPEG images, one per text segment, and their corresponding transcriptions, stored in a txt file. The file naming convention is 'yyyy-ppp-n', where 'y' stands for the year of publication of the census and 'p' for the page number. The digitized documents are provided by the Archives of the City of Lausanne.

The annotation and extraction methodology, as well as the complete evaluation of performance, including the HTR benchmark and post-correction performance, are published in:

Petitpierre R., Rappo L., Kramer M. (2023). An end-to-end pipeline for historical censuses processing. International Journal on Document Analysis and Recognition (IJDAR). doi: 10.1007/s10032-023-00428-9

The tabular datasets resulting from automatic extraction are also available on Zenodo:

Petitpierre R., Rappo L., Kramer M., di Lenardo I. (2023). 1805-1898 Census Records of Lausanne : a Long Digital Dataset for Demographic History. Zenodo. doi: 10.5281/zenodo.7711640
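The 'yyyy-ppp-n' naming convention described above lends itself to simple parsing, for instance to group segments by census year or page. The following is a minimal sketch, not part of the dataset itself. It assumes a four-digit year, a three-digit zero-padded page number, and a numeric trailing field 'n', whose meaning this record does not define. The function name `parse_segment_name` and the example file name '1835-012-7.jpg' are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumption: four-digit year, three-digit page number, numeric field "n".
# The record does not say what "n" denotes, so it is kept as a plain integer.
SEGMENT_NAME = re.compile(r"^(\d{4})-(\d{3})-(\d+)$")


class SegmentId(NamedTuple):
    year: int  # year of publication of the census
    page: int  # page number within that census
    n: int     # trailing field of the naming convention (undefined in the record)


def parse_segment_name(filename: str) -> Optional[SegmentId]:
    """Parse a name following 'yyyy-ppp-n'; return None if it does not match."""
    # Drop any leading directory and a trailing .jpg/.jpeg/.txt extension.
    stem = filename.rsplit("/", 1)[-1]
    stem = re.sub(r"\.(jpe?g|txt)$", "", stem, flags=re.IGNORECASE)
    match = SEGMENT_NAME.match(stem)
    if match is None:
        return None
    year, page, n = (int(group) for group in match.groups())
    return SegmentId(year, page, n)


# Hypothetical file name, used only to illustrate the convention.
segment = parse_segment_name("1835-012-7.jpg")
```

Returning `None` for names that do not match, rather than raising an exception, makes it easy to skip unrelated files when iterating over an extracted archive.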

Use this identifier to reference this record

https://infoscience.epfl.ch/handle/20.500.14299/196999

File(s)

Name: Images.zip
Type: n/a
Access type: openaccess
License Condition: CC BY
Size: 150.66 MB
Format: ZIP
Checksum (MD5): 8773a7b0aba17a5fd926738cc78faf4f

Name: README.md.txt