doctoral thesis

Responsibly Building Multilingual Language Models for Hundreds of Languages

Foroutan Eghlidi, Negar  
2025

Large Language Models (LLMs) have emerged as a transformative innovation in artificial intelligence, enabling systems capable of understanding, generating, and reasoning with human language at unprecedented scales. Powered by architectures such as the Transformer and trained on massive text corpora, these models demonstrate impressive generalization across diverse natural language processing tasks, from conversational agents to scientific discovery. However, the development of LLMs has been heavily skewed toward English and a few other high-resource languages, raising critical questions about inclusivity, linguistic fairness, and the equitable applicability of these technologies worldwide.

Multilingual Large Language Models (MLLMs) aim to address these disparities by extending LLM capabilities across hundreds of languages. Despite their promise, MLLMs face substantial challenges. Data scarcity limits performance in low-resource languages, and existing tokenization methods introduce structural biases that favor dominant languages, often over-segmenting text in underrepresented scripts and languages. Moreover, cross-lingual transfer mechanisms in these models remain poorly understood, and evaluation benchmarks tend to focus on high-resource languages, masking deficiencies in less-resourced ones. These issues hinder the goal of building AI systems that equitably serve diverse linguistic and cultural communities.
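
The tokenization disparity described above is easy to observe empirically. As a rough illustration (not taken from the thesis), the sketch below uses the English-centric GPT-2 byte-level BPE tokenizer from the Hugging Face transformers library to compare how many tokens the same kind of short text costs in different scripts; the sample sentences and the tokens-per-character "fertility" metric are our own assumptions.

```python
# Sketch: measuring tokenizer "fertility" (tokens per character) across scripts.
# Assumes the `transformers` package is installed and the GPT-2 tokenizer can
# be downloaded; the sample texts are illustrative, not from the thesis.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # English-centric BPE vocabulary

samples = {
    "English": "Hello, world!",
    "Greek": "Γειά σου, κόσμε!",
    "Hindi": "नमस्ते दुनिया",
}

for lang, text in samples.items():
    n_tokens = len(tokenizer.tokenize(text))
    # Higher tokens-per-character means the script is over-segmented,
    # inflating sequence length and inference cost for that language.
    print(f"{lang:8s} chars={len(text):3d} tokens={n_tokens:3d} "
          f"fertility={n_tokens / len(text):.2f}")
```

Non-Latin scripts typically fall back to byte-level fragments under such a vocabulary, which is exactly the structural bias a parity-aware tokenizer aims to reduce.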

This thesis investigates these challenges and proposes solutions to advance both the scientific understanding and practical development of MLLMs. First, it analyzes the mechanisms of cross-lingual transfer and representation learning, revealing how MLLMs generalize across typologically diverse languages and identifying factors that enable or impede effective knowledge sharing. Second, it examines the cross-lingual reasoning capabilities of MLLMs in monolingual, multilingual, and code-switched settings, proposing architectural enhancements that incorporate language-specific components to improve reasoning performance.
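
The abstract only names "language-specific components"; one common instantiation of that idea in the literature is a per-language adapter inserted after a Transformer sublayer. The PyTorch sketch below is a generic illustration under our own assumptions (class name, sizes, and routing are all hypothetical), not the thesis's actual architecture.

```python
# Generic sketch of a per-language adapter layer (an illustration of
# "language-specific components", not the architecture from the thesis).
import torch
import torch.nn as nn

class LanguageAdapter(nn.Module):
    """A small bottleneck MLP with separate weights per language
    and a shared residual path."""

    def __init__(self, hidden_size: int, bottleneck: int, languages: list[str]):
        super().__init__()
        self.adapters = nn.ModuleDict({
            lang: nn.Sequential(
                nn.Linear(hidden_size, bottleneck),
                nn.GELU(),
                nn.Linear(bottleneck, hidden_size),
            )
            for lang in languages
        })

    def forward(self, hidden: torch.Tensor, lang: str) -> torch.Tensor:
        # The residual connection keeps the shared representation intact;
        # only the selected language's parameters are applied on top.
        return hidden + self.adapters[lang](hidden)

# Usage: route a batch of hidden states through the French adapter.
adapter = LanguageAdapter(hidden_size=768, bottleneck=64, languages=["en", "fr", "sw"])
h = torch.randn(2, 16, 768)  # (batch, sequence, hidden)
out = adapter(h, lang="fr")
```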

To address data-related challenges, the thesis introduces ConLID, a novel language identification approach based on supervised contrastive learning, improving domain generalization for low-resource languages. It further investigates the data mixture problem in multilingual pretraining, analyzing how the composition of training data affects cross-lingual performance and proposing strategies to mitigate capacity imbalance across languages. Additionally, the work proposes a parity-aware byte pair encoding (BPE) algorithm designed to ensure more equitable tokenization. Finally, a comprehensive multilingual benchmark covering 44 languages is developed, enabling fine-grained assessment of MLLM performance across both high- and low-resource languages, with attention to regional and cultural content.
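
For readers unfamiliar with the objective underlying an approach like ConLID, the sketch below shows a standard supervised contrastive loss over language-labeled sentence embeddings: embeddings of the same language are pulled together and others pushed apart. The function name, shapes, and temperature are our assumptions; the thesis's exact formulation may differ.

```python
# Minimal sketch of a supervised contrastive loss over language-labeled
# embeddings (the general technique; ConLID's details are not reproduced here).
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (N, d) sentence vectors; labels: (N,) language ids."""
    z = F.normalize(embeddings, dim=1)                # unit-norm embeddings
    sim = z @ z.T / temperature                       # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Average log-probability of same-language ("positive") pairs per anchor,
    # skipping anchors whose language appears only once in the batch.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    mean_log_prob_pos = (log_prob.masked_fill(~pos_mask, 0).sum(dim=1)[valid]
                         / pos_counts[valid])
    return -mean_log_prob_pos.mean()
```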

Together, these contributions provide both theoretical insights and practical tools for building inclusive, fair, and capable multilingual AI systems. By embedding technical innovations within an ethical framework, this thesis advances the vision of AI technologies that leave no language behind.

Type
doctoral thesis
DOI
10.5075/epfl-thesis-10425
Author(s)
Foroutan Eghlidi, Negar (École Polytechnique Fédérale de Lausanne)
Advisors
Aberer, Karl • Bosselut, Antoine
Jury
Prof. Tanja Christina Käser Jacober (president); Prof. Karl Aberer, Prof. Antoine Bosselut (thesis directors); Prof. Martin Jaggi, Prof. Hinrich Schütze, Prof. Ivan Vulic (examiners)

Date Issued
2025
Publisher
EPFL
Publisher place
Lausanne
Public defense date
2025-11-28
Thesis number
10425
Number of pages
303

Subjects
Multilingual Large Language Models • Cross-Lingual Transfer • Language Identification • Multilingual Data Mixture • Tokenization • Multilingual Evaluation

EPFL units
LSIR  
Faculty
IC  
School
IINFCOM  
Doctoral School
EDIC  
Available on Infoscience
November 24, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/256280