Historical newspaper semantic segmentation using visual and textual features

Barman, Raphaël

2019

Abstract

Mass digitization and the opening of digital libraries have given access to a huge number of historical newspapers. To bring structure to these documents, current techniques generally proceed in two distinct steps: they first segment the digitized images into generic articles, then classify the text of the articles into finer-grained categories. Unfortunately, because they lose the link between layout and text, these two steps cannot account for the fact that newspaper content items have distinctive visual features. This project proposes two main novelties. First, it merges the segmentation and classification steps, resulting in a fine-grained semantic segmentation of newspaper images. Second, it uses textual features, in the form of embedding maps, at the segmentation step. The semantic segmentation with four categories (feuilleton, weather forecast, obituary, and stock exchange table) is done using a fully convolutional neural network and reaches a mean intersection over union (mIoU) of 79.3%. The introduction of embedding maps improves overall performance by 3% and generalization across time and across newspapers by 8% and 12%, respectively. This shows strong potential for considering the semantic aspect in the segmentation of newspapers and for using textual features to improve generalization.
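
The two technical notions in the abstract, a textual embedding map and the mIoU score, can be illustrated with a minimal sketch. It is not taken from the thesis. It assumes that an embedding map is built by painting each OCR word's embedding vector over the pixels of its bounding box, and that mIoU is the unweighted mean of per-class IoU over the classes present in either mask. The names `build_embedding_map` and `mean_iou`, the box format, and the toy data are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def build_embedding_map(
    height: int,
    width: int,
    words: Sequence[Tuple[Box, Sequence[float]]],
    dim: int,
) -> List[List[List[float]]]:
    """Return a height x width x dim grid; pixels without text stay at zero.

    Assumption: every pixel inside a word's box receives that word's vector.
    """
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for (x0, y0, x1, y1), vector in words:
        # Clip the box to the image so out-of-range OCR boxes are harmless.
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                grid[y][x] = list(vector)
    return grid


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
) -> float:
    """Average per-class intersection over union.

    Classes absent from both masks are skipped rather than counted as 0 or 1.
    """
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += int(p == c and t == c)
                union += int(p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy usage: one word covering two pixels of a 2 x 4 image,
# and two small label masks with classes 0 and 1.
emb_map = build_embedding_map(2, 4, [((1, 0, 3, 1), [0.5, -0.5])], dim=2)
score = mean_iou([[0, 0, 1], [0, 1, 1]], [[0, 0, 1], [0, 0, 1]], num_classes=2)
```

In a setup like this, the embedding map would typically be stacked with the image channels before being fed to the segmentation network. How the thesis actually combines the two inputs is not stated in the abstract.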

Details

Title Historical newspaper semantic segmentation using visual and textual features

Author(s) Barman, Raphaël

Advisor(s)

Ehrmann, Maud
Ares Oliveira, Sofia
Clematide, Simon

Pagination 74

Date 2019-06-21

Laboratories DHMA
DHLAB

Record Appears in Scientific production and competences > CDH - College of Humanities and social sciences > Digital Humanities Institute > DHMA - Master’s project in Digital Humanities
Scientific production and competences > CDH - College of Humanities and social sciences > Digital Humanities Institute > DHLAB - Digital Humanities Laboratory
Scientific production and competences > CDH - College of Humanities and social sciences > DHMA - Master’s project in Digital Humanities
Work produced at EPFL
Student projects

Work type Master's Thesis

Record creation date 2019-10-14
