Infoscience

doctoral thesis

Discovering meaningful units from text sequences

Behjati, Melika  
2024

In recent years, the field of Natural Language Processing has been revolutionized by the introduction of Transformers, stacks of attention and non-linearity layers that can perform almost any task and serve as the backbone of large foundation models.
In contrast to traditional NLP pipelines, this architecture learns the features required for a specific task without any assumptions about the structure of language. The only remaining hard-coded aspect is the way text input is fed to these models. This preprocessing step, known as tokenization, divides the input into chunks that can be as fine-grained as bytes or as coarse-grained as words. The most popular approach is to use subwords, such as Byte Pair Encodings or word pieces, which lie between characters and words. However, hard-coding the input representations has been shown to have drawbacks; in particular, it can lead to suboptimal performance on downstream tasks. Moreover, different tasks call for different levels of representation. In this thesis, we define and address the novel task of inducing units from text in an unsupervised manner. This work is a step towards fully end-to-end models that can decide which level of representation is most suitable for performing a specific task.
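To make the granularity spectrum above concrete, here is a minimal, self-contained sketch of how Byte Pair Encoding builds subwords by repeatedly merging the most frequent adjacent symbol pair. The toy corpus, the merge count, and the function name learn_bpe_merges are illustrative assumptions, not material from the thesis.

```python
# Minimal sketch of Byte Pair Encoding merge learning (after Sennrich et al.).
# The toy corpus and merge count are illustrative assumptions only.
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn merge rules from {word: frequency}; returns the merges in order."""
    # Start from characters: each word is a tuple of single-character symbols.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere, growing symbols from characters to subwords.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe_merges(corpus, 4))
# [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')] -- units emerge between
# the character and word levels, which is the granularity subwords occupy.
```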

Our contributions are twofold.
First, we design models that induce units at different levels without supervision. Second, since the task is novel, it calls for novel evaluations; therefore, for every model we develop, we design and/or gather a set of tasks to evaluate and interpret its performance.

In the first chapter, we design a model that induces morpheme-like units from a sequence of characters, adapting Slot Attention, a method originally developed for object discovery in vision, to our purpose. We evaluate this model by introducing a bi-directional probing evaluation. In the second chapter, we design a model that induces word-like units from a sequence of characters by integrating a non-parametric variational information bottleneck into the last layers of a transformer encoder. In the third chapter, we move to the multi-modal domain: starting from subwords, we design a model that induces phrases from image captions by aligning them with the objects in the image. Lastly, we explore a task-driven approach to inducing entities.
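As context for the first chapter, the sketch below illustrates the core update of Slot Attention (Locatello et al., 2020), applied here to character embeddings rather than image features. It is a simplified, untrained sketch under stated assumptions: random projections stand in for learned ones, and a plain weighted-mean update replaces the GRU-plus-MLP update of the original method; it is not the thesis's model.

```python
# Simplified Slot Attention update over a character sequence (illustrative only).
import numpy as np

def slot_attention(inputs, num_slots=4, dim=16, iters=3,
                   rng=np.random.default_rng(0)):
    """inputs: (seq_len, dim) character embeddings; returns (num_slots, dim)."""
    slots = rng.normal(size=(num_slots, dim))
    # Untrained random projections standing in for the learned q/k/v maps.
    W_q, W_k, W_v = (rng.normal(size=(dim, dim)) for _ in range(3))
    k, v = inputs @ W_k, inputs @ W_v
    for _ in range(iters):
        q = slots @ W_q
        logits = k @ q.T / np.sqrt(dim)           # (seq_len, num_slots)
        # Softmax over *slots*: slots compete to explain each input position.
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        # Weighted mean of values per slot (normalize over positions).
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ v                     # simplified: no GRU/MLP update
    return slots

chars = np.random.default_rng(1).normal(size=(12, 16))  # a 12-character sequence
print(slot_attention(chars).shape)  # (4, 16)
```

The key design choice is that the softmax normalizes over slots rather than over positions, so slots compete to explain each character; this competition is what lets contiguous characters coalesce into morpheme-like units.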

Type
doctoral thesis
DOI
10.5075/epfl-thesis-10383
Author(s)
Behjati, Melika (EPFL)
Advisors
Alahi, Alexandre • Henderson, James
Jury
Prof. Caglar Gulcehre (president); Prof. Alexandre Massoud Alahi, Dr James Henderson (directors); Prof. Martin Jaggi, Prof. Ivan Titov, Prof. Rico Sennrich (examiners)

Date Issued
2024
Publisher
EPFL
Publisher place
Lausanne
Public defense date
2024-10-25
Thesis number
10383
Number of pages
128

Subjects
entity induction • object discovery • interpretability • grounding • representation learning • natural language understanding • NLP • deep learning • machine learning • artificial intelligence

EPFL units
VITA  
LIDIAP  
Faculty
ENAC  
School
IIC  
Doctoral School
EDIC  
Available on Infoscience
October 28, 2024
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/241742