Responsibly Building Multilingual Language Models for Hundreds of Languages
Large Language Models (LLMs) have emerged as a transformative innovation in artificial intelligence, enabling systems capable of understanding, generating, and reasoning with human language at unprecedented scale. Powered by architectures such as the Transformer and trained on massive text corpora, these models generalize impressively across diverse natural language processing applications, from conversational agents to scientific discovery. However, the development of LLMs has been heavily skewed toward English and a few other high-resource languages, raising critical questions about inclusivity, linguistic fairness, and the equitable applicability of these technologies worldwide.
Multilingual Large Language Models (MLLMs) aim to address these disparities by extending LLM capabilities across hundreds of languages. Despite their promise, MLLMs face substantial challenges. Data scarcity limits performance in low-resource languages, and existing tokenization methods introduce structural biases that favor dominant languages, often over-segmenting text in underrepresented scripts and languages. Moreover, cross-lingual transfer mechanisms in these models remain poorly understood, and evaluation benchmarks tend to focus on high-resource languages, masking deficiencies in less-resourced ones. These issues hinder the goal of building AI systems that equitably serve diverse linguistic and cultural communities.
This thesis investigates these challenges and proposes solutions to advance both the scientific understanding and practical development of MLLMs. First, it analyzes the mechanisms of cross-lingual transfer and representation learning, revealing how MLLMs generalize across typologically diverse languages and identifying factors that enable or impede effective knowledge sharing. Second, it examines the cross-lingual reasoning capabilities of MLLMs in monolingual, multilingual, and code-switched settings, proposing architectural enhancements that incorporate language-specific components to improve reasoning performance.
To address data-related challenges, the thesis introduces ConLID, a novel language identification approach based on supervised contrastive learning that improves domain generalization for low-resource languages. It further investigates the data mixture problem in multilingual pretraining, analyzing how the composition of training data affects cross-lingual performance and proposing strategies to mitigate capacity imbalance across languages. Additionally, the work proposes a parity-aware byte pair encoding (BPE) algorithm designed to tokenize all languages more equitably, reducing the over-segmentation of underrepresented scripts. Finally, a comprehensive multilingual benchmark covering 44 languages is developed, enabling fine-grained assessment of MLLM performance across both high- and low-resource languages, with attention to regional and cultural content.
Together, these contributions provide both theoretical insights and practical tools for building inclusive, fair, and capable multilingual AI systems. By embedding technical innovations within an ethical framework, this thesis advances the vision of AI technologies that leave no language behind.
École Polytechnique Fédérale de Lausanne
Prof. Tanja Christina Käser Jacober (president); Prof. Karl Aberer, Prof. Antoine Bosselut (thesis directors); Prof. Martin Jaggi, Prof. Hinrich Schütze, Prof. Ivan Vulic (examiners)
2025
Lausanne
2025-11-28