Infoscience

doctoral thesis

Discovering meaningful units from text sequences

Behjati, Melika  
2024

In recent years, the field of Natural Language Processing has been revolutionized by the introduction of Transformers, stacks of attention and non-linearity layers that can perform almost any task and serve as the backbone of large foundation models.
In contrast to traditional NLP pipelines, this architecture learns the features required for a specific task without any assumptions about the structure of language. The only remaining hard-coded aspect is the way text input is fed to these models. This preprocessing step, known as tokenization, divides the input into chunks that can be as fine-grained as bytes or as coarse-grained as words. The most popular approach is to use subwords, such as Byte Pair Encodings or word pieces, which lie between characters and words. However, hard-coding the input representations has been shown to have drawbacks; in particular, it can lead to suboptimal performance on downstream tasks. Moreover, different tasks call for different levels of representation. In this thesis, we define and address the novel task of inducing units from text in an unsupervised manner. This work is a step towards fully end-to-end models that can decide which level of representation is most suitable for performing a specific task.
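To make the granularity spectrum above concrete, here is a minimal, self-contained sketch of how Byte Pair Encoding builds subwords by repeatedly merging the most frequent adjacent symbol pair. The toy corpus, the merge count, and the function name learn_bpe_merges are illustrative assumptions, not material from the thesis.

```python
# Minimal sketch of Byte Pair Encoding merge learning (after Sennrich et al.).
# The toy corpus and merge count are illustrative assumptions only.
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn merge rules from {word: frequency}; returns the merges in order."""
    # Start from characters: each word is a tuple of single-character symbols.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere, growing symbols from characters to subwords.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe_merges(corpus, 4))
# [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')] -- units emerge between
# the character and word levels, which is the granularity subwords occupy.
```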

Our contributions are twofold.
First, we design models that induce units at different levels without supervision. Second, since the task is novel, it calls for novel evaluations; therefore, for every model we develop, we design and/or gather a set of tasks to evaluate and interpret its performance.

In the first chapter, we design a model that induces morpheme-like units from a sequence of characters, adapting Slot Attention, a method originally developed for object discovery in vision, to our purpose. We evaluate this model by introducing a bi-directional probing evaluation. In the second chapter, we design a model that induces word-like units from a sequence of characters by integrating a non-parametric variational information bottleneck into the last layers of a transformer encoder. In the third chapter, we move to the multi-modal domain: starting from subwords, we design a model that induces phrases from image captions by aligning them with the objects in the image. Lastly, we explore a task-driven approach to inducing entities.
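As context for the first chapter, the sketch below illustrates the core update of Slot Attention (Locatello et al., 2020), applied here to character embeddings rather than image features. It is a simplified, untrained sketch under stated assumptions: random projections stand in for learned ones, and a plain weighted-mean update replaces the GRU-plus-MLP update of the original method; it is not the thesis's model.

```python
# Simplified Slot Attention update over a character sequence (illustrative only).
import numpy as np

def slot_attention(inputs, num_slots=4, dim=16, iters=3,
                   rng=np.random.default_rng(0)):
    """inputs: (seq_len, dim) character embeddings; returns (num_slots, dim)."""
    slots = rng.normal(size=(num_slots, dim))
    # Untrained random projections standing in for the learned q/k/v maps.
    W_q, W_k, W_v = (rng.normal(size=(dim, dim)) for _ in range(3))
    k, v = inputs @ W_k, inputs @ W_v
    for _ in range(iters):
        q = slots @ W_q
        logits = k @ q.T / np.sqrt(dim)           # (seq_len, num_slots)
        # Softmax over *slots*: slots compete to explain each input position.
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        # Weighted mean of values per slot (normalize over positions).
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ v                     # simplified: no GRU/MLP update
    return slots

chars = np.random.default_rng(1).normal(size=(12, 16))  # a 12-character sequence
print(slot_attention(chars).shape)  # (4, 16)
```

The key design choice is that the softmax normalizes over slots rather than over positions, so slots compete to explain each character; this competition is what lets contiguous characters coalesce into morpheme-like units.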

Type
doctoral thesis
DOI
10.5075/epfl-thesis-10383
Author(s)
Behjati, Melika (EPFL)
Advisors
Alahi, Alexandre • Henderson, James
Jury
Prof. Caglar Gulcehre (president); Prof. Alexandre Massoud Alahi, Dr James Henderson (directors); Prof. Martin Jaggi, Prof. Ivan Titov, Prof. Rico Sennrich (examiners)

Date Issued
2024
Publisher
EPFL
Publisher place
Lausanne
Public defense date
2024-10-25
Thesis number
10383
Number of pages
128

Subjects
entity induction • object discovery • interpretability • grounding • representation learning • natural language understanding • NLP • deep learning • machine learning • artificial intelligence

EPFL units
VITA  
LIDIAP  
Faculty
ENAC  
School
IIC  
Doctoral School
EDIC  
Available on Infoscience
October 28, 2024
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/241742