Infoscience

doctoral thesis

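Intégration de connaissances syntaxiques et sémantiques dans les représentations vectorielles de textes : [application au calcul de similarités sémantiques dans le cadre du modèle DSIR]
(English: Integration of syntactic and semantic knowledge into vector representations of texts: [application to the computation of semantic similarities within the DSIR model])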

Besançon, Romaric  
2002

The notion of similarity between texts is fundamental to many applications of Natural Language Processing. It is particularly useful for applications that manage information in large textual databases, such as Information Retrieval and Automatic Text Structuring. Information Retrieval is the search for the documents most relevant to an information need expressed as a query, and can be implemented as a search for the documents most similar to the query. Automatic Text Structuring is often viewed as the clustering of documents according to their similarity. The similarity between documents relies on their representation. The most widely used textual representation is the Vector Space model, in which each document is represented by a vector; the similarity between documents is then computed with a similarity measure in this space, for instance the cosine of the vectors representing the documents.
We first present several vector space models used to compute similarities between documents. We then focus on the integration of additional knowledge into the vector space representation, and on the impact of this integration on the results obtained for several tasks. We first consider the integration of co-occurrences into the representation model, focusing on the DSIR model (Distributional Semantics based Information Retrieval), and show that this model has a probabilistic theoretical basis. We then consider the use of syntactic information to compute the co-occurrence frequencies. We also consider the integration of knowledge about compounds into the representation, taking into account morpho-syntactic and semantic variants of these compounds. Finally, we address the issue of word sense disambiguation, using synonymy relations to derive a vector space representation in which each dimension is associated with a meaning rather than a term.
We propose several evaluations for all these methods. We first validate the notion of similarity derived from a vector space representation in a multilingual framework: the idea is to verify that the similarity between two documents in one language is close to the similarity between their translations in another language. We also evaluate the different models in a standard Information Retrieval evaluation framework. Finally, we evaluate the models on a Word Sense Disambiguation task.
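
To make the vector space idea in the abstract concrete, the following is a minimal sketch of cosine similarity between term-frequency vectors, used to rank documents against a query. It is an illustration only, not the thesis's implementation or the DSIR model: the whitespace tokenization, the raw term-frequency weighting, the helper names `term_vector` and `cosine_similarity`, and the sample texts are all assumptions made for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a bag-of-words vector: term -> raw frequency.

    Assumption: lowercased whitespace tokens stand in for real indexing terms.
    """
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical sample data, invented for this illustration.
query = term_vector("semantic similarity of texts")
documents = {
    "doc_a": "vector space models for semantic similarity of texts",
    "doc_b": "cooking recipes for the winter season",
}

# Retrieval as described in the abstract: rank documents by similarity to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, term_vector(documents[name])),
    reverse=True,
)
print(ranked)
```

In this sketch each dimension of the space is a surface term. The additions studied in the thesis (co-occurrences, syntactic information, compounds, word senses) would change how the vectors are built, not how the cosine is computed.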

Name: EPFL_TH2508.pdf
Access type: restricted
Size: 2.96 MB
Format: Adobe PDF
Checksum (MD5): 8ce4cdbd1411c2f65459ec322921a966


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved