Infoscience

conference paper

Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction

Authors: Josifoski, Martin • Šakota, Marija • Peyrard, Maxime, et al.
Editors: Bouamor, Houda • Pino, Juan, et al.
2023
EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
The 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have great potential for synthetic data generation. This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by LLMs: for problems with structured outputs, it is possible to prompt an LLM to perform the task in the reverse direction, by generating plausible input text for a target output structure. Leveraging this asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. We demonstrate the effectiveness of this approach on closed information extraction, where collecting ground-truth data is challenging, and no satisfactory dataset exists to date. We synthetically generate a dataset of 1.8M data points, establish its superior quality compared to existing datasets in a human evaluation, and use it to finetune small models (220M and 770M parameters), termed SynthIE, that outperform the prior state of the art (with equal model size) by a substantial margin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data, and models are available at https://github.com/epfl-dlab/SynthIE.
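As a rough illustration of the two ideas in the abstract, the sketch below builds a reverse-direction prompt (target triplets in, request for text out) and computes micro-F1 and macro-F1 over predicted triplet sets. Everything here is an assumption for illustration: the prompt wording, the function names `build_reverse_prompt` and `micro_macro_f1`, and the choice to average macro-F1 per relation type are not taken from the SynthIE repository, and no LLM is actually called.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a fact is a (subject, relation, object) triplet.
Triplet = Tuple[str, str, str]


def build_reverse_prompt(triplets: List[Triplet]) -> str:
    """Build a prompt asking for text that expresses the given triplets.

    This is the reverse of extraction: the structured output is fixed first,
    and a language model would be asked to write a matching input text.
    The wording is illustrative only.
    """
    facts = "\n".join(f"- ({s}; {r}; {o})" for s, r, o in triplets)
    return (
        "Write a short, fluent passage that expresses exactly these facts "
        "and no others:\n" + facts + "\nPassage:"
    )


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[Triplet]], pred: List[Set[Triplet]]
) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) over exact-match triplets.

    Micro-F1 pools counts over all triplets. Macro-F1 here averages the
    per-relation F1 scores, which is an assumed convention.
    """
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0, 0])  # tp, fp, fn
    for g, p in zip(gold, pred):
        for t in p & g:
            counts[t[1]][0] += 1  # correctly predicted triplet
        for t in p - g:
            counts[t[1]][1] += 1  # predicted but not in gold
        for t in g - p:
            counts[t[1]][2] += 1  # in gold but missed
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0
    return micro, macro


# Toy data, invented for this example.
gold = [
    {("A", "born in", "B"), ("A", "works at", "C")},
    {("E", "born in", "F")},
]
pred = [
    {("A", "born in", "B"), ("A", "works at", "D")},
    {("E", "born in", "F")},
]
prompt = build_reverse_prompt(sorted(gold[0]))
micro, macro = micro_macro_f1(gold, pred)
```

On this toy data the frequent relation is predicted perfectly and the rare one is always wrong, so micro-F1 comes out higher than macro-F1. This is the sense in which macro-F1 weights rare relations as heavily as common ones.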

Name: 10.18653_v1_2023.emnlp-main.96.pdf
Type: Main Document
Version: Published version
Access type: openaccess
License Condition: CC BY
Size: 1.29 MB
Format: Adobe PDF
Checksum (MD5): 72c3f9a3eea1aff96219cdd7fd52eb87
