Infoscience

doctoral thesis

Analyse multi-échelle de n-grammes sur 200 années d'archives de presse [Multi-scale analysis of n-grams over 200 years of press archives]

Buntinx, Vincent  
2017

The recent availability of large corpora of digitized texts spanning several centuries opens the way to new forms of study of language evolution. In this thesis, we study a corpus of 4 million press articles covering a period of 200 years. The thesis seeks to measure the evolution of written French over this period at the level of words and expressions, and also more globally, by attempting to define integrated measures of linguistic evolution. The methodological choice is to introduce a minimum of linguistic hypotheses by developing new measures around the simple notion of the n-gram, a sequence of n consecutive words. On this basis, the thesis explores the potential of established concepts such as temporal frequency profiles and their diachronic correlations, and also introduces new abstractions such as the notion of a resilient linguistic kernel or the decomposition of profiles into solidified expressions according to simple statistical models. Using distributed computing techniques, it develops methods to test the relevance of these concepts on a large amount of textual data, which makes it possible to propose a virtual observatory of the diachronic evolutions associated with a given corpus. The thesis then explores more precisely the multi-scale dimension of linguistic phenomena by considering how standardized measures evolve when applied to increasingly long n-grams. The discrete and continuous scale running from isolated entities (n=1) to increasingly complex and structured expressions (1 < n < 10) offers an axis of study transversal to the classical distinctions that ordinarily structure linguistics: syntax, semantics, pragmatics, and so on.
The thesis explores the quantitative and qualitative diversity of phenomena at these different scales of language and develops a novel approach by proposing multi-scale measurements and formalizations, with the aim of characterizing more fundamental structural aspects of the phenomena studied.
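
As a rough illustration of two notions named in the abstract, the n-gram (a sequence of n consecutive words) and the temporal frequency profile, here is a minimal Python sketch. It is not the thesis's implementation: the function names `ngrams` and `frequency_profiles`, the whitespace tokenization, the per-year normalization, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_profiles(docs_by_year, n):
    """Map each n-gram to {year: relative frequency among that year's n-grams}.

    Assumption: texts are lowercased and split on whitespace; a real
    pipeline would need proper tokenization and OCR cleanup.
    """
    profiles = {}
    for year, docs in sorted(docs_by_year.items()):
        counts = Counter()
        for text in docs:
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        for gram, count in counts.items():
            profiles.setdefault(gram, {})[year] = count / total
    return profiles


# Toy data invented for illustration only (not drawn from the thesis corpus).
toy_corpus = {
    1900: ["le train arrive", "le train part"],
    1950: ["le train arrive en gare"],
}

unigram_profiles = frequency_profiles(toy_corpus, 1)
bigram_profiles = frequency_profiles(toy_corpus, 2)
```

Running the same function with increasing values of `n` gives one set of profiles per scale, which is the kind of comparison across n-gram lengths that the abstract calls multi-scale.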

Type
doctoral thesis
DOI
10.5075/epfl-thesis-8180
Author(s)
Buntinx, Vincent  
Advisors
Kaplan, Frédéric • Xanthos, Aris
Jury

Prof. Thomas Alois Weber (president); Prof. Frédéric Kaplan, Dr Aris Xanthos (thesis directors); Prof. Robert West, Dr Jean-Baptiste Michel, Prof. Jacques Savoy (examiners)

Date Issued

2017

Publisher

EPFL

Publisher place

Lausanne

Public defense date

2017-12-05

Thesis number

8180

Total pages

362

Subjects

big data • corpus analysis • frequency profile • linguistic distance • linguistic evolution • n-grams analysis • press corpus • resilient kernel • word resilience

EPFL units
DHLAB  
Faculty
CDH  
Doctoral School
EDMT  
Available on Infoscience
November 28, 2017
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/142360
Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved