Infoscience

 
doctoral thesis

Learning computationally efficient static word and sentence representations

Gupta, Prakhar  
2021

Most Natural Language Processing (NLP) algorithms involve, in one way or another, distributed vector representations of linguistic units (primarily words and sentences), also known as embeddings. These embeddings come in two flavours: static (non-contextual) and contextual. In a static embedding, the vector representation of a word is independent of its context, whereas in a contextual embedding the word representation incorporates additional information from its surrounding context.

Recent advances in deep learning have allowed contextual embeddings to outperform their static counterparts. However, this gain in performance over static embeddings has come at the cost of lower computational efficiency, in terms of both computational resources and training and inference times, as well as reduced interpretability and a higher environmental cost. Consequently, static embedding models, despite being less expressive and powerful than contextual embedding models, remain relevant in Natural Language Processing research.

In this thesis, we propose improvements to the current state-of-the-art static word and sentence embedding models in three different settings. First, we propose an improved algorithm for learning word and sentence embeddings from raw text, modifying the Word2Vec training objective formulation and adding n-grams to the training so as to incorporate local contextual information; this yields improved unsupervised static word and sentence embeddings. Our second major contribution is learning cross-lingual static word and sentence representations from sentence-aligned parallel bilingual corpora. The resulting word and sentence embeddings outperform other bag-of-words bilingual embeddings on cross-lingual sentence retrieval and monolingual word similarity tasks, while staying competitive on cross-lingual word translation tasks. In our last major contribution, we harness the expressive power of contextual embedding models by distilling static word embeddings from them, providing improved word representations for computationally light tasks. This allows us to exploit the semantic information captured by contextual embedding models while maintaining computational efficiency at inference time.
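To illustrate the idea behind the first contribution, composing a sentence embedding from the static vectors of its words and its contiguous n-grams so that n-grams inject local contextual information, a minimal sketch could average the available vectors. The function and the toy lookup tables below are hypothetical illustrations, not the thesis implementation:

```python
def sentence_embedding(sentence, word_vecs, ngram_vecs, n=2):
    """Sketch: a static sentence embedding as the average of the
    sentence's word vectors and its contiguous n-gram vectors.
    word_vecs / ngram_vecs are hypothetical lookup tables mapping
    strings to lists of floats (all of the same dimension)."""
    tokens = sentence.lower().split()
    # vectors for individual in-vocabulary words
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    # vectors for contiguous n-grams, adding local context
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        if gram in ngram_vecs:
            vecs.append(ngram_vecs[gram])
    if not vecs:  # nothing in vocabulary: return a zero vector
        return [0.0] * len(next(iter(word_vecs.values())))
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# toy 2-dimensional embedding tables, made up for illustration
word_vecs = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
ngram_vecs = {"good movie": [1.0, 1.0]}
emb = sentence_embedding("good movie", word_vecs, ngram_vecs)
```

Here the embedding averages three vectors: the two word vectors and the one bigram vector, so the bigram's contribution shifts the sentence representation away from a plain bag-of-words average.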

Type
doctoral thesis
DOI
10.5075/epfl-thesis-7959
Author(s)
Gupta, Prakhar  
Advisors
Jaggi, Martin  
Jury

Prof. Karl Aberer (president); Prof. Martin Jaggi (thesis director); Dr. James Henderson, Dr. Michael Auli, Dr. Fabio Rinaldi (examiners)

Date Issued
2021
Publisher
EPFL
Publisher place
Lausanne
Public defense date
2021-11-26
Thesis number
7959
Number of pages
109

Subjects
  • Machine learning
  • Natural Language Processing
  • Representation learning
  • Word representations
  • Sentence Representations
  • Distributional semantics

EPFL units
MLO  
Faculty
IC  
School
IINFCOM  
Doctoral School
EDIC  
Available on Infoscience
November 22, 2021
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/183175

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved.