Studying Linguistic Changes on 200 Years of Newspapers

Large databases of scanned newspapers open new avenues for studying linguistic evolution. By studying a two-billion-word corpus corresponding to 200 years of newspapers, we compare several methods in order to assess how fast language is changing. After critically evaluating an initial set of methods for assessing textual distance between subsets corresponding to consecutive years, we introduce the notion of a lexical kernel, the set of unique words that maintain themselves over long periods of time. Focusing on linguistic stability instead of linguistic change allows building more robust measures to assess long term phenomena such as word resilience. By systematically comparing the results obtained on two subsets of the corpus corresponding to two independent newspapers, we argue that the results obtained are independent of the specificity of the chosen corpus, and are likely to be the results of more general linguistic phenomena.

Presented at:
Digital Humanities 2016, Kraków, Poland, July 11-16, 2016

 Record created 2016-05-03, last modified 2018-03-17

Download fulltext

Rate this document:

Rate this document:
(Not yet reviewed)