Quotebank: A Corpus of Quotations from a Decade of News

Vaucher, Timote; Spitz, Andreas; Catasta, Michele; West, Robert

doi:10.1145/3437963.3441760

2021

Abstract

We present Quotebank, an open corpus of 178 million quotations attributed to the speakers who uttered them, extracted from 162 million English news articles published between 2008 and 2020. In order to produce this Web-scale corpus, while at the same time benefiting from the performance of modern neural models, we introduce Quobert, a minimally supervised framework for extracting and attributing quotations from massive corpora. Quobert avoids the necessity of manually labeled input and instead exploits the redundancy of the corpus by bootstrapping from a single seed pattern to extract training data for fine-tuning a BERT-based model. Quobert is language- and corpus-agnostic and correctly attributes 86.9% of quotations in our experiments. Quotebank and Quobert are publicly available at https://doi.org/10.5281/zenodo.4277311.

Details

Title Quotebank: A Corpus of Quotations from a Decade of News

Author(s) Vaucher, Timote ; Spitz, Andreas ; Catasta, Michele ; West, Robert

Published in WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining

Pages 328-336

Conference 14th ACM International Conference on Web Search and Data Mining (WSDM), Mar 08-12, 2021, Electronic Network (held online)

Date 2021-01-01

Publisher New York, Association for Computing Machinery

ISBN 978-1-4503-8297-7

DOI https://doi.org/10.1145/3437963.3441760

Record Appears in Peer-reviewed publications
Conference Papers
Work produced at EPFL
Published

Record creation date 2022-07-04