From Preprocessing to Evaluation: The Role of External Knowledge in Advancing Language Models and Machine Translation
Machine translation (MT) systems and language models (LMs) have demonstrated impressive performance in recent years, driven by advances in architectures, algorithms, and the scaling of parameters and training data. Despite their ability to internalize extensive knowledge, these models can still benefit in scenarios where relevant information was either absent from their training data or too difficult to learn; we refer to such information as external knowledge. This thesis systematically examines the integration of external knowledge at three key stages of the MT and LM lifecycle: preprocessing, inference, and evaluation.
Part I investigates the use of external knowledge during the preprocessing phase, focusing on tokenization in multilingual systems. Existing methods rely on shared scripts to map tokens with identical surface forms to the same representation, creating anchors across languages. We propose an alternative approach that identifies cross-lingual anchors using lexica derived from monolingual data. These lexica capture semantic relationships between tokens, eliminating the need for shared scripts or surface forms. This approach significantly enhances the transfer of monolingual LMs to new languages, improving both efficiency and performance. Furthermore, it performs on par with, and in some cases surpasses, traditional approaches in MT.
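To make the idea of lexicon-based anchors concrete, the following is a minimal sketch, not the implementation from the thesis: it assumes a pretrained source-language embedding matrix and a bilingual lexicon induced from monolingual data, and initializes the new language's token embeddings by averaging the embeddings of their anchored source tokens. All names (anchor_init, lexicon, the vocabularies) are illustrative.

```python
# Minimal sketch (not the thesis implementation): initializing embeddings for a
# new language's vocabulary by anchoring its tokens to a pretrained source model
# through a bilingual lexicon derived from monolingual data.
import numpy as np

def anchor_init(src_emb: np.ndarray,
                src_vocab: dict[str, int],
                tgt_vocab: dict[str, int],
                lexicon: dict[str, list[str]],
                seed: int = 0) -> np.ndarray:
    """Build target-language embeddings from source embeddings via lexical anchors.

    src_emb:  (|V_src|, d) embedding matrix of the pretrained source model.
    lexicon:  maps a target token to one or more source tokens judged equivalent.
    Tokens without an anchor fall back to a small random initialization.
    """
    rng = np.random.default_rng(seed)
    dim = src_emb.shape[1]
    tgt_emb = rng.normal(scale=0.02, size=(len(tgt_vocab), dim))

    for tgt_tok, tgt_id in tgt_vocab.items():
        anchors = [src_vocab[s] for s in lexicon.get(tgt_tok, []) if s in src_vocab]
        if anchors:
            # Average the embeddings of all anchored source tokens.
            tgt_emb[tgt_id] = src_emb[anchors].mean(axis=0)
    return tgt_emb
```

Because the anchors come from lexical equivalence rather than identical surface forms, the same procedure applies to language pairs that do not share a script.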
In Part II, we explore the integration of external knowledge during inference. Fine-tuning large-scale LMs and MT models is often computationally prohibitive or restricted by access limitations. To address this, we propose methods that incorporate task-specific knowledge directly at inference time. The first approach introduces a small LM trained to refine the predictions of a large LM (LLM) without requiring access to its weights. This plug-and-play module incorporates external knowledge from previously unseen training data, enhancing the performance of LLMs on various text generation tasks, including MT. The second approach uses quality estimation (QE) metrics, which reflect human preferences, to synthesize improved translations for LLMs and MT models. By merging spans from multiple candidate translations based on QE scores, this method consistently improves translation quality, outperforming reranking algorithms across diverse models and language pairs.
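The sketch below illustrates the QE-guided idea only; it is not the exact algorithm from the thesis. It assumes a reference-free QE scorer `qe` that maps a (source, translation) pair to a quality score, contrasts simple reranking with a toy span-merging routine, and assumes for simplicity that all candidates are pre-split into the same number of spans.

```python
# Illustrative sketch: using a quality estimation (QE) scorer to pick or combine
# candidate translations. `qe` is a stand-in for a reference-free neural QE model.
from itertools import product
from typing import Callable

def rerank(source: str, candidates: list[str],
           qe: Callable[[str, str], float]) -> str:
    """Baseline reranking: keep the single candidate with the highest QE score."""
    return max(candidates, key=lambda c: qe(source, c))

def merge_spans(source: str, candidate_spans: list[list[str]],
                qe: Callable[[str, str], float]) -> str:
    """Toy span merging: each candidate is pre-split into the same number of
    spans; enumerate per-position choices across candidates and keep the
    combination whose merged translation scores highest under QE."""
    best, best_score = "", float("-inf")
    for combo in product(*zip(*candidate_spans)):
        merged = " ".join(combo)
        score = qe(source, merged)
        if score > best_score:
            best, best_score = merged, score
    return best
```

In practice, the method described in the thesis aligns spans between candidates rather than assuming identical segmentation; the sketch only conveys how QE scores can guide the selection and combination of candidate spans.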
Part III focuses on the final stage of a model's lifecycle: evaluation. Current neural metrics assess translation quality at the sentence level, ignoring the broader document-level context available to human evaluators. To close this gap, we extend state-of-the-art neural MT metrics by incorporating document-level context as external knowledge. This enhancement allows MT metrics to identify context-related errors, leading to stronger correlation with human judgments.
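A minimal sketch of one way to add document-level context, assuming it is done by prepending the preceding sentences to each segment before scoring; the context construction used in the thesis may differ. Here `score_segment` stands in for a sentence-level neural metric such as a COMET-style scorer.

```python
# Hedged sketch: augmenting a sentence-level MT metric with document context by
# prefixing each segment with its preceding sentences before scoring.
from typing import Callable

def add_context(sentences: list[str], window: int = 2,
                sep: str = " </s> ") -> list[str]:
    """Prefix each sentence with up to `window` preceding sentences."""
    return [sep.join(sentences[max(0, i - window):i + 1])
            for i in range(len(sentences))]

def score_document(src_doc: list[str], hyp_doc: list[str], ref_doc: list[str],
                   score_segment: Callable[[str, str, str], float],
                   window: int = 2) -> float:
    """Average a sentence-level metric over context-augmented
    (source, hypothesis, reference) triples from one document."""
    srcs, hyps, refs = (add_context(d, window) for d in (src_doc, hyp_doc, ref_doc))
    scores = [score_segment(s, h, r) for s, h, r in zip(srcs, hyps, refs)]
    return sum(scores) / len(scores)
```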
Overall, this thesis presents a systematic exploration of integrating external knowledge at different stages of a model's lifecycle: preprocessing, inference, and evaluation. We demonstrate how these knowledge sources can effectively complement the internal knowledge of large-scale LMs and MT systems and enhance their performance in MT and other text generation tasks.