“Small” Large Language Models in the hospital: an evaluation study on real-world data in a resource-constrained setting
BACKGROUND Large Language Models (LLMs) offer promise for healthcare but face challenges of scale, privacy, and limited evidence in non-English settings. Smaller, locally deployable LLMs remain underexplored.
OBJECTIVE To assess the feasibility of small open-source LLMs (1–24B parameters) on French-language clinical tasks and to provide a reproducible hospital-based evaluation framework.
METHODS Six state-of-the-art small LLMs from the Mistral, Phi-4, Llama-3.1, Meditron-3, and Falcon 3 model families were tested in a zero-shot setting on de-identified discharge letters across seven use cases, including information extraction, translation, summarization, and clinical decision support. Performance was measured with F1 scores, readability indices, embedding similarity, and structured clinician reviews.
RESULTS The models achieved high recall on simple retrieval tasks (up to 99.6%) but performed poorly on protected health information detection, adverse-event extraction, summarization, and decision support. Translation quality varied, with general-purpose models outperforming medical-focused models.
CONCLUSIONS In localized, resource-constrained deployments, small LLMs are suitable for basic tasks but insufficient for complex reasoning or clinical decision-making. Our framework supports context-specific evaluation for safe adoption in hospitals.
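The METHODS combine F1 scores, readability indices, and embedding similarity. A minimal Python sketch of how such scoring might be wired together is shown below; the paper does not name its tooling, so scikit-learn, textstat, and sentence-transformers, as well as the function names and the embedding model, are illustrative assumptions rather than the authors' actual pipeline.

```python
# Illustrative sketch of the three automatic metrics named in METHODS.
# Library choices and helper names are assumptions, not the paper's code.
from sklearn.metrics import f1_score
from sentence_transformers import SentenceTransformer, util
import textstat

textstat.set_lang("fr")  # discharge letters are French-language


def score_extraction(gold_labels, predicted_labels):
    """F1 for an information-extraction use case (label-level, micro-averaged)."""
    return f1_score(gold_labels, predicted_labels, average="micro")


def score_similarity(reference_text, generated_text, embedder):
    """Cosine similarity between embeddings of reference and model output,
    e.g., for summarization or translation quality."""
    ref_emb = embedder.encode(reference_text, convert_to_tensor=True)
    gen_emb = embedder.encode(generated_text, convert_to_tensor=True)
    return util.cos_sim(ref_emb, gen_emb).item()


def score_readability(text):
    """Readability index; Flesch reading ease as a stand-in."""
    return textstat.flesch_reading_ease(text)


# Hypothetical multilingual embedding model; the paper does not specify one.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
```

Structured clinician reviews, the fourth measure, are human judgments and have no automated counterpart in this sketch.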