Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

Peyrard, Maxime

doi:10.18653/v1/P19-1502

2019

Abstract

In summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets have been created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate. This is problematic because the metrics disagree, yet we cannot decide which one to trust. This is a call for collecting human judgments for high-scoring summaries, as this would resolve the debate over which metrics to trust. It would also be greatly beneficial for further improving summarization systems and metrics alike.
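
The kind of comparison the abstract describes can be illustrated with a minimal sketch: compute a rank correlation between two metrics over all summaries, then again over only the summaries in a chosen scoring range. Everything below is an assumption made for illustration and is not taken from the paper: the names kendall_tau and agreement_in_range are hypothetical helpers, the scores in metric_a and metric_b are hand-made toy values rather than real data, and selecting the range by the first metric's score is just one possible choice.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # ties (sign == 0) count as neither concordant nor discordant
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def agreement_in_range(metric_a, metric_b, low, high):
    """Tau between two metrics over summaries whose metric_a score is in [low, high].

    Hypothetical helper: filtering on metric_a is an assumption of this sketch.
    """
    kept = [(a, b) for a, b in zip(metric_a, metric_b) if low <= a <= high]
    return kendall_tau([a for a, _ in kept], [b for _, b in kept])


# Toy scores for eight imaginary summaries (NOT real data).
# The two metrics rank the low-scoring summaries identically
# and the high-scoring summaries in opposite order.
metric_a = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
metric_b = [0.15, 0.22, 0.33, 0.41, 0.58, 0.52, 0.49, 0.45]

if __name__ == "__main__":
    print("all summaries:", kendall_tau(metric_a, metric_b))
    print("lower range  :", agreement_in_range(metric_a, metric_b, 0.10, 0.40))
    print("higher range :", agreement_in_range(metric_a, metric_b, 0.50, 0.80))
```

On this toy data the two metrics agree perfectly in the lower range and disagree completely in the higher range, while the correlation over all summaries stays positive, which is the sort of masking effect the abstract warns about. With real data, a tie-aware estimator such as tau-b would likely be a better choice than the tau-a used here.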

Details

Title Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

Author(s) Peyrard, Maxime

Published in 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)

Pages 5093-5100

Conference 57th Annual Meeting of the Association for Computational Linguistics (ACL), Jul 28-Aug 02, 2019, Florence, Italy

Date 2019-01-01

Publisher Stroudsburg, Association for Computational Linguistics (ACL)

ISBN 978-1-950737-48-2

DOI https://doi.org/10.18653/v1/P19-1502

Laboratories DLAB

Record creation date 2019-11-16
