 
conference paper

Don't Stop Pretraining! Efficiently Building Specialised Language Models in Resource-constrained Settings.

Najem-Meyer, Sven  
•
Kaplan, Frederic  
•
Romanello, Matteo
January 1, 2025
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)
9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL)

Developing specialised language models for low-resource domains typically involves a trade-off between two specialisation strategies: adapting a general-purpose model through continued pretraining or retraining a model from scratch. While adapting preserves the model's linguistic knowledge, retraining benefits from the flexibility of an in-domain tokeniser, a potentially significant advantage when handling rare languages. This study investigates the impact of tokenisation, specialisation strategy, and pretraining data availability using classical scholarship, a multilingual, code-switching, and highly domain-specific field, as a case study. Through extensive experiments, we assess whether domain-specific tokenisation improves model performance, whether character-based models provide a viable alternative to subword-based models, and which specialisation strategy is optimal given the constraints of limited pretraining data. Contrary to prior findings, our results show that in-domain tokenisation does not necessarily enhance performance. Most notably, adaptation consistently outperforms retraining, even with limited data, confirming its efficiency as the preferred strategy for resource-constrained domains. These insights provide valuable guidelines for developing specialised models in fields with limited textual resources.
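The two strategies contrasted above can be made concrete with a short sketch. The following minimal example, assuming a Hugging Face Transformers stack, shows adaptation (continued masked-language-model pretraining of a general-purpose checkpoint) alongside the retraining alternative (a freshly initialised model paired with an in-domain tokeniser). The model name, corpus file, vocabulary size, and training hyperparameters are illustrative assumptions, not the authors' actual setup.

# Illustrative sketch only: model name, corpus path, and hyperparameters are
# assumptions, not the configuration used in the paper.
from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# A small in-domain corpus, one document per line (hypothetical file).
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]

# Strategy 1: adaptation. Keep the general-purpose model and its original
# tokeniser, and simply continue masked-language-model pretraining in-domain.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Strategy 2: retraining. Train an in-domain tokeniser on the corpus and
# initialise the same architecture from scratch (uncomment to switch).
# tokenizer = tokenizer.train_new_from_iterator(corpus["text"], vocab_size=32_000)
# config = AutoConfig.from_pretrained("bert-base-multilingual-cased",
#                                     vocab_size=len(tokenizer))
# model = AutoModelForMaskedLM.from_config(config)  # random weights

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="specialised-model",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()

Under this framing, switching strategies changes only the initialisation and the tokeniser while the training loop stays identical, which mirrors the controlled comparison the abstract describes.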

Details
Type
conference paper
Web of Science ID

WOS:001667459200022

Author(s)
Najem-Meyer, Sven  

École Polytechnique Fédérale de Lausanne

Kaplan, Frederic  

École Polytechnique Fédérale de Lausanne

Romanello, Matteo

University of Lausanne

Editors
Kazantseva, A
•
Szpakowicz, S
•
Degaetano-Ortlieb, S
•
Bizzoni, Y
•
Pagel, J
Date Issued

2025-01-01

Publisher

Association for Computational Linguistics

Publisher place

Stroudsburg

Published in
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)
DOI of the book
https://doi.org/10.18653/v1/2025.latechclfl-1
ISBN of the book

979-8-89176-241-1

Start page

252

End page

260

URL

ACL Anthology

https://aclanthology.org/volumes/2025.latechclfl-1/
Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
MLO  
DHLAB  
Event name
9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Event acronym
LaTeCH-CLfL
Event place
Albuquerque, NM
Event date
2025-05-04

Funder
Swiss National Science Foundation (SNSF)
Grant Number
PZ00P1_186033

Available on Infoscience
February 24, 2026
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/260688