Abstract

In this thesis, we present a transformer-based multi-lingual embedding model that represents sentences from different languages in a common space. The system uses a simplified transformer architecture with a byte-pair-encoding vocabulary shared between two languages (English and French) and is trained on publicly available parallel corpora. We also experiment with new training objectives, including a cross-lingual loss and a sentence alignment loss, to improve representation quality. We evaluate the resulting sentence representations on cross-lingual sentence retrieval (MUSE), multi-lingual zero-shot document classification (MLDoc), and natural language inference (XNLI), comparing against Bi-Bert2Vec (Sabet et al., 2020), LASER (Artetxe and Schwenk, 2019), and multilingual BERT (mBERT; Devlin et al., 2018). Our model obtains state-of-the-art results on cross-lingual sentence retrieval and also outperforms competitors such as Bi-Bert2Vec and LASER on MLDoc (Schwenk and Li, 2018). We further study the effect of model architecture, training objectives, and the choice of tensors used to represent sentences, and propose a new sentence alignment loss that has a positive impact on the quality of sentence representations.
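As an illustration, a sentence alignment loss over parallel English-French pairs could take a contrastive form. The PyTorch sketch below assumes in-batch negatives and cosine similarity, neither of which is specified in this abstract; the function and variable names are hypothetical and the snippet is only meant to make the idea concrete, not to reproduce the thesis's actual formulation.

    # Minimal, illustrative sketch of a cross-lingual sentence alignment loss.
    # ASSUMPTION: a contrastive loss with in-batch negatives over cosine
    # similarity; the thesis's exact loss is not given here.
    import torch
    import torch.nn.functional as F

    def alignment_loss(en_emb: torch.Tensor, fr_emb: torch.Tensor,
                       temperature: float = 0.05) -> torch.Tensor:
        """en_emb, fr_emb: (batch, dim) embeddings of parallel EN/FR sentences."""
        en = F.normalize(en_emb, dim=-1)
        fr = F.normalize(fr_emb, dim=-1)
        # Pairwise cosine similarities; the diagonal holds the true translation pairs.
        logits = en @ fr.t() / temperature
        targets = torch.arange(en.size(0), device=en.device)
        # Pull each sentence toward its translation and push it away from the
        # other sentences in the batch, symmetrically in both directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

A loss of this shape rewards embeddings in which each sentence is closer to its translation than to any other sentence in the batch, which is one plausible way to encourage the common cross-lingual space described above.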
