conference paper

Bridging the Data Gap: Using LLMs to Augment Datasets for Text Classification

Neshaei, Seyed Parsa; Davis, Richard Lee; Mejia-Domenzain, Paola; et al.
Mills, Caitlin; Alexandron, Giora; et al.
July 12, 2025
Proceedings of the 18th International Conference on Educational Data Mining, EDM 2025
18th International Conference on Educational Data Mining

Deep learning models for text classification are increasingly used in intelligent tutoring systems and educational writing assistants. However, data scarcity in many educational settings, together with imbalanced label counts in annotated educational datasets, limits the generalizability and expressiveness of classification models. Recent research positions large language models (LLMs) as a promising way to mitigate data scarcity in education. In this paper, we provide a systematic literature review of recent LLM-based approaches for generating textual data and augmenting training datasets in the broad areas of natural language processing and educational technology research. We analyze how prior works have approached data augmentation and generation across multiple steps of the model training process, and present a taxonomy consisting of a five-stage pipeline. Each stage covers a set of options representing decisions in the data augmentation process. We then apply a subset of the identified methods to three educational datasets across different domains and source languages to measure the effectiveness of the suggested augmentation approaches in educational contexts, finding improvements in overall balanced accuracy across all three datasets. Based on our findings, we propose our pipeline as a conceptual framework for future researchers aiming to augment educational datasets to improve classification accuracy.
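The abstract reports results in balanced accuracy, which matters here because the datasets have imbalanced labels. The sketch below is not taken from the paper; it assumes the standard definition (the mean of per-class recall) and uses hypothetical toy labels to show how balanced accuracy differs from plain accuracy when one class dominates.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[label] / total[label] for label in total) / len(total)


# Hypothetical toy data (not from the paper): 8 majority and 2 minority examples,
# scored against a classifier that always predicts the majority class.
y_true = ["on_topic"] * 8 + ["off_topic"] * 2
y_pred = ["on_topic"] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain)                              # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Plain accuracy rewards the majority-only classifier, while balanced accuracy drops to chance level for two classes. Augmenting the minority class is one way to push the balanced figure up.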

Name: 2025.EDM.long-papers.54.pdf
Type: Main Document
Version: Published version
Access type: Open access
License Condition: CC BY
Size: 1.85 MB
Format: Adobe PDF
Checksum (MD5): 8742d22f23425eb0a3aca56bfe47274b

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved