Don't Stop Pretraining! Efficiently Building Specialised Language Models in Resource-constrained Settings.
Developing specialised language models for low-resource domains typically involves a trade-off between two specialisation strategies: adapting a general-purpose model through continued pretraining, or retraining a model from scratch. While adaptation preserves the model's linguistic knowledge, retraining benefits from the flexibility of an in-domain tokeniser, a potentially significant advantage when handling rare languages. This study investigates the impact of tokenisation, specialisation strategy, and pretraining data availability using classical scholarship, a multilingual, code-switching and highly domain-specific field, as a case study. Through extensive experiments, we assess whether domain-specific tokenisation improves model performance, whether character-based models provide a viable alternative to subword-based models, and which specialisation strategy is optimal given the constraints of limited pretraining data. Contrary to prior findings, our results show that in-domain tokenisation does not necessarily enhance performance. Most notably, adaptation consistently outperforms retraining, even with limited data, confirming its efficiency as the preferred strategy for resource-constrained domains. These insights provide valuable guidelines for developing specialised models in fields with limited textual resources.
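A minimal sketch of the two tokenisation routes the abstract contrasts, assuming the Hugging Face `transformers` library, a multilingual base model (`bert-base-multilingual-cased`), and a placeholder in-domain corpus; these specifics are illustrative assumptions, not details taken from the paper.

```python
from transformers import AutoTokenizer

# Assumption: a general-purpose multilingual tokeniser as the starting point.
base_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def domain_corpus():
    # Placeholder: yield batches of in-domain text (e.g. classical scholarship).
    yield ["Arma virumque cano, Troiae qui primus ab oris ..."]

# Retraining route: learn a new in-domain subword vocabulary of the same type,
# then pretrain a model from scratch with it.
domain_tok = base_tok.train_new_from_iterator(domain_corpus(), vocab_size=32000)

# Adaptation route: keep base_tok unchanged and continue pretraining the
# existing model on the in-domain corpus (the strategy the abstract finds best).
print(base_tok.tokenize("apparatus criticus"))
print(domain_tok.tokenize("apparatus criticus"))
```

Comparing the two tokenisations of the same domain phrase gives a quick, informal view of how much the in-domain vocabulary reduces fragmentation before committing to either pretraining strategy.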
WOS:001667459200022
École Polytechnique Fédérale de Lausanne
École Polytechnique Fédérale de Lausanne
University of Lausanne
2025-01-01
Stroudsburg
979-8-89176-241-1
252-260
Link ACL Anthology
REVIEWED
EPFL
| Event name | Event acronym | Event place | Event date |
| --- | --- | --- | --- |
| | | Albuquerque, NM | 2025-05-04 |