Studying Linguistic Changes on 200 Years of Newspapers
Large databases of scanned newspapers open new avenues for studying linguistic evolution. By studying a two-billion-word corpus corresponding to 200 years of newspapers, we compare several methods in order to assess how fast language is changing. After critically evaluating an initial set of methods for assessing textual distance between subsets corresponding to consecutive years, we introduce the notion of a lexical kernel, the set of unique words that maintain themselves over long periods of time. Focusing on linguistic stability instead of linguistic change allows building more robust measures to assess long term phenomena such as word resilience. By systematically comparing the results obtained on two subsets of the corpus corresponding to two independent newspapers, we argue that the results obtained are independent of the specificity of the chosen corpus, and are likely to be the results of more general linguistic phenomena.