Using the Europarl corpus for cross-linguistic research

Cartoni, Bruno; Zufferey, Sandrine; Meyer, Thomas

doi:10.1075/bjl.27.02car

research article

2013

Belgian Journal of Linguistics

Europarl is a large multilingual corpus containing the minutes of the debates at the European Parliament. This article presents a method to extract different corpora from Europarl: monolingual and multilingual comparable corpora, as well as parallel corpora. Using state-of-the-art measures of homogeneity, we show that these corpora are very similar. In addition, we argue that they present many advantages for research in various fields of linguistics and translation studies, and we also discuss some of their limitations. We conclude by reviewing a number of previous studies that made use of these corpora, emphasizing in each case the possibilities offered by Europarl.

Type

research article

DOI

10.1075/bjl.27.02car

Authors

Cartoni, Bruno; Zufferey, Sandrine; Meyer, Thomas

Publication date

2013

Published in

Belgian Journal of Linguistics

Volume

27

Issue

1

Start page

23

End page

42

Subjects

discourse connectives...

Parallel Corpora

Peer reviewed

NON-REVIEWED

EPFL units

LIDIAP

Available on Infoscience

December 19, 2013

Use this identifier to reference this record

https://infoscience.epfl.ch/handle/20.500.14299/98086