How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives

Cartoni, Bruno; Zufferey, Sandrine; Meyer, Thomas; Popescu-Belis, Andrei

2011

Abstract

In this paper, we question the homogeneity of a large parallel corpus by measuring the similarity between various sub-parts. We compare results obtained using a general measure of lexical similarity based on χ² and by counting the number of discourse connectives. We argue that discourse connectives provide a more sensitive measure, revealing differences that are not visible with the general measure. We also provide evidence for the existence of specific characteristics defining translated texts as opposed to non-translated ones, due to a universal tendency for explicitation.
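
As a rough illustration of the two kinds of measurement the abstract contrasts, the sketch below computes a χ² statistic over the most frequent words of two token lists and a per-1000-token rate of connectives. It is a minimal sketch under stated assumptions, not necessarily the procedure used in the paper: the function names, the `top_n` cutoff, the proportional-expectation formula, and the restriction to single-word connectives are all illustrative choices.

```python
from collections import Counter


def chi_square_similarity(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both corpora.

    Lower values indicate more similar word distributions.
    Assumes both token lists are non-empty (hypothetical helper).
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    joint = freq_a + freq_b
    chi2 = 0.0
    for word, total in joint.most_common(top_n):
        # Expected counts if the word were spread in proportion to corpus size.
        exp_a = total * n_a / (n_a + n_b)
        exp_b = total * n_b / (n_a + n_b)
        chi2 += (freq_a[word] - exp_a) ** 2 / exp_a
        chi2 += (freq_b[word] - exp_b) ** 2 / exp_b
    return chi2


def connective_rate(tokens, connectives, per=1000):
    """Occurrences of single-word connectives per `per` tokens (hypothetical helper)."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in connectives)
    return per * hits / len(tokens)


if __name__ == "__main__":
    original = "we left because it rained but we came back".split()
    translated = "we left because it rained however we came back because".split()
    print(chi_square_similarity(original, translated))
    print(connective_rate(original, {"because", "but", "however"}))
    print(connective_rate(translated, {"because", "but", "however"}))
```

Two sub-corpora can have a low χ² score on general vocabulary while still differing clearly in their connective rates, which is the kind of contrast the abstract describes.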

Details

Title How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives

Author(s) Cartoni, Bruno ; Zufferey, Sandrine ; Meyer, Thomas ; Popescu-Belis, Andrei

Published in BUCC '11: Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

Pages 78–86

Conference ACL - 4th Workshop on Building and Using Comparable Corpora, Portland, OR

Date 2011

Keywords

Comparable Corpora; Corpora; discourse connectives; Homogeneity; Measures; Parallel Corpora; Similarity

Laboratories LIDIAP

Record creation date 2011-07-06
