Infoscience — EPFL, École polytechnique fédérale de Lausanne
research article

Further results on latent discourse models and word embeddings

Khalife, Sammy • Goncalves, Douglas • Allouah, Youssef • Liberti, Leo
January 1, 2021
Journal of Machine Learning Research

We discuss some properties of generative models for word embeddings. In particular, Arora et al. (2016) proposed a latent discourse model implying the concentration of the partition function of the word vectors. This concentration phenomenon leads to an asymptotic linear relation between the pointwise mutual information (PMI) of pairs of words and the scalar product of their vectors. Here, we first revisit this concentration phenomenon and prove it under slightly weaker assumptions, for a set of random vectors symmetrically distributed around the origin. Second, we empirically evaluate the relation between PMI and the scalar products of word vectors satisfying the concentration property. Our empirical results indicate that, in practice, this relation does not hold with arbitrarily small error. This observation is further supported by two theoretical results: (i) the error cannot be exactly zero, because the corresponding shifted PMI matrix cannot be positive semidefinite; (ii) under mild assumptions, there exist pairs of words for which the error cannot be close to zero. We deduce that either natural language does not follow the assumptions of the considered generative model, or the current word vector generation methods do not allow the construction of the hypothesized word embeddings.
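The PMI / Gram-matrix argument in the abstract can be illustrated numerically: if PMI(i, j) equaled the scalar product of two word vectors, the PMI matrix would be a Gram matrix and hence positive semidefinite, which its smallest eigenvalue can test. The sketch below uses hypothetical toy co-occurrence counts (not data from the paper) to compute an empirical PMI matrix and its smallest eigenvalue:

```python
import numpy as np

# Hypothetical symmetric co-occurrence counts for a 4-word vocabulary
# (toy numbers for illustration only, not from the paper).
counts = np.array([
    [10.0, 4.0, 2.0, 1.0],
    [ 4.0, 8.0, 3.0, 2.0],
    [ 2.0, 3.0, 6.0, 4.0],
    [ 1.0, 2.0, 4.0, 9.0],
])

total = counts.sum()
p_joint = counts / total          # empirical joint probabilities p(i, j)
p_word = p_joint.sum(axis=1)      # marginal word probabilities p(i)

# Pointwise mutual information: PMI(i, j) = log( p(i, j) / (p(i) p(j)) )
pmi = np.log(p_joint / np.outer(p_word, p_word))

# If PMI(i, j) were exactly <v_i, v_j> for some word vectors v_i,
# the PMI matrix would be a Gram matrix, hence positive semidefinite.
# A negative smallest eigenvalue rules out that exact factorization.
min_eig = np.linalg.eigvalsh(pmi).min()
print(min_eig)
```

On real corpora the same check is applied to the shifted PMI matrix; the paper's result (i) says this matrix cannot be positive semidefinite, so the exact relation cannot hold.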

Type
research article
Web of Science ID
WOS:000765311400001
Author(s)
Khalife, Sammy
Goncalves, Douglas
Allouah, Youssef
Liberti, Leo
Date Issued
2021-01-01
Publisher
Microtome Publishing
Published in
Journal of Machine Learning Research
Volume
22
Subjects
Automation & Control Systems • Computer Science, Artificial Intelligence • Computer Science • generative models • latent variable models • asymptotic concentration • natural language processing • matrix factorization
Editorial or Peer reviewed
REVIEWED
Written at
EPFL
EPFL units
DCL
Available on Infoscience
March 28, 2022
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/186692
Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved.