Abstract

In the present information era, a huge amount of machine-readable data about scientific publications is available. This unprecedented wealth of data offers the opportunity to investigate science itself as a complex interacting system by means of quantitative approaches. Studies of this kind have the potential to provide new insights into the large-scale organization of science and the mechanisms driving its evolution. A particularly important aspect of these data is the semantic information contained in publications, as it grants access to the concepts scientists use to describe their findings. Nevertheless, the presence of so-called buzzwords, i.e., terms that are not specific and are used indiscriminately in many contexts, hinders the emergence of the thematic organization of scientific articles. In this Thesis, I summarize my original contribution to the problem of leveraging the semantic information contained in a corpus of documents. Specifically, I have developed an information-theoretic measure, based on the maximum entropy principle, to quantify the information content of scientific concepts. This measure provides an objective and powerful way to identify generic concepts acting as buzzwords, which increase the noise in the semantic similarity between articles. I show that removing generic concepts is beneficial for the sparsity of the similarity network, thus allowing the detection of communities of articles related to more specific themes. The same effect is observed when describing the corpus of articles in terms of topics, namely clusters of concepts that compose the papers as a mixture. Moreover, I applied the method to a collection of web documents and obtained a similar effect, despite their differences from scientific articles.
Regarding scientific knowledge, another important aspect I examine is the temporal evolution of concept generality, as it may reveal typical patterns in the evolution of concepts and highlight the way in which they are consumed over time.
