Infoscience

Thesis

Analyse multi-échelle de n-grammes sur 200 années d'archives de presse

The recent availability of large corpora of digitized texts over several centuries opens the way to new forms of studies on the evolution of languages. In this thesis, we study a corpus of 4 million press articles covering a period of 200 years. The thesis tries to measure the evolution of written French on this period at the level of words and expressions, but also in a more global way by attempting to define integrated measures of linguistic evolution. The methodological choice is to introduce a minimum of linguistic hypotheses in this study by developing new measures around the simple notion of n-gram, a sequence of n consecutive words. The thesis explores on this basis the potential of already known concepts as temporal frequency profiles and their diachronic correlations, but also introduces new abstractions such as the notion of resilient linguistic kernel or the decomposition of profiles into solidified expressions according to simple statistical models. Through the use of distributed computational techniques, it develops methods to test the relevance of these concepts on a large amount of textual data and thus allows to propose a virtual observatory of the diachronic evolutions associated with a given corpus. On this basis, the thesis explores more precisely the multi-scale dimension of linguistic phenomena by considering how standardized measures evolve when applied to increasingly long n-grams. The discrete and continuous scale from the isolated entities (n=1) to the increasingly complex and structured expressions (1 < n < 10) offers a transversal axis of study to the classical differentiations that ordinarily structure linguistics: syntax, semantics, pragmatics, and so on. The thesis explores the quantitative and qualitative diversity of phenomena at these different scales of language and develops a novel approach by proposing multi-scale measurements and formalizations, with the aim of characterizing more fundamental structural aspects of the studied phenomena.

Related material