Infoscience

doctoral thesis

Unleashing the power of semantic text analysis: a complex systems approach

Martini, Andrea  
2018

In the present information era, a huge amount of machine-readable data on scientific publications is available. This unprecedented wealth of data offers the opportunity to investigate science itself as a complex interacting system by means of quantitative approaches. Studies of this kind have the potential to provide new insights into the large-scale organization of science and the mechanisms driving its evolution. A particularly important aspect of these data is the semantic information contained in publications, as it grants access to the concepts scientists use to describe their findings. Nevertheless, the presence of so-called buzzwords, i.e. terms that are not specific and are used indiscriminately in many contexts, hinders the emergence of the thematic organization of scientific articles.

In this thesis, I summarize my original contribution to the problem of leveraging the semantic information contained in a corpus of documents. Specifically, I have developed an information-theoretic measure, based on the maximum entropy principle, to quantify the information content of scientific concepts. This measure provides an objective and powerful way to identify generic concepts acting as buzzwords, which increase the noise in the semantic similarity between articles. I prove that removing generic concepts is beneficial for the sparsity of the similarity network, thus allowing the detection of communities of articles related to more specific themes. The same effect is observed when the corpus of articles is described in terms of topics, namely clusters of concepts that compose the papers as a mixture. Moreover, I applied the method to a collection of web documents and obtained a similar effect, despite their differences from scientific articles. Regarding scientific knowledge, another important aspect I examine is the temporal evolution of concept generality, which may reveal typical patterns in the evolution of concepts and highlight how they are consumed over time.
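
As a rough illustration of the idea summarized above, the following Python sketch scores each concept by the Shannon entropy of its occurrence distribution over documents, flags concepts whose entropy is close to the maximum as generic, and compares the similarity network before and after removing them. This is a simplified stand-in and not the maximum-entropy measure developed in the thesis; the toy corpus, the 0.9 threshold, and the names `concept_entropy` and `similarity_edges` are all assumptions made for illustration.

```python
import math
from itertools import combinations


def concept_entropy(docs):
    """Shannon entropy (in bits) of each concept's occurrence distribution over documents."""
    totals = {}
    for doc in docs:
        for concept, count in doc.items():
            totals[concept] = totals.get(concept, 0) + count
    entropy = {}
    for concept, total in totals.items():
        h = 0.0
        for doc in docs:
            count = doc.get(concept, 0)
            if count > 0:
                p = count / total
                h -= p * math.log2(p)
        entropy[concept] = h
    return entropy


def similarity_edges(docs, excluded=frozenset()):
    """Link two documents if they share at least one non-excluded concept."""
    edges = set()
    for i, j in combinations(range(len(docs)), 2):
        shared = (set(docs[i]) - excluded) & (set(docs[j]) - excluded)
        if shared:
            edges.add((i, j))
    return edges


# Hypothetical toy corpus: each document maps concepts to occurrence counts.
docs = [
    {"model": 2, "entropy": 3},
    {"model": 2, "entropy": 1},
    {"model": 2, "graphene": 4},
    {"model": 2, "graphene": 2},
]

entropy = concept_entropy(docs)
max_entropy = math.log2(len(docs))  # reached when a concept is spread evenly over all documents

# Assumed threshold: a concept is "generic" if its entropy is within 10% of the maximum.
generic = {c for c, h in entropy.items() if h >= 0.9 * max_entropy}

edges_before = similarity_edges(docs)
edges_after = similarity_edges(docs, excluded=generic)
```

In this toy corpus the concept "model" appears evenly in every document, so it links all pairs of documents; once it is excluded, only the pairs sharing a more specific concept remain connected, which is the kind of sparsification the abstract describes.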

Type
doctoral thesis
DOI
10.5075/epfl-thesis-8473
Author(s)
Martini, Andrea  
Advisors
De Los Rios, Paolo  
Jury

Prof. Henrik Moodysson Rønnow (president) ; Prof. Paolo De Los Rios (thesis director) ; Prof. Robert West, Prof. Alessandro Flammini, Prof. Jesus Gomez-Gardeñes (examiners)

Date Issued

2018

Publisher

EPFL

Publisher place

Lausanne

Public defense year

2018-03-16

Thesis number

8473

Total of pages

153

Subjects

Complex systems • science of science • semantic networks • community detection • topic modeling • maximum entropy principle • applied statistical physics

EPFL units
LBS  
Faculty
SB  
School
IPHYS  
Doctoral School
EDPY  
Available on Infoscience
March 15, 2018
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/145581

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved