Infoscience

conference paper

OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation

Parida, Shantipriya
•
Dash, Satya Ranjan
•
Bojar, Ondrej
2020
Proceedings of the WILDRE5 - 5th Workshop on Indian Language Data: Resources and Evaluation
WILDRE5 - 5th Workshop on Indian Language Data: Resources and Evaluation

The preparation of parallel corpora is a challenging task, particularly for languages that are under-represented in the digital world. In a multilingual country like India, the need for such parallel corpora is pressing for several low-resource languages. In this work, we provide an extended English-Odia parallel corpus, OdiEnCorp 2.0, aimed particularly at Neural Machine Translation (NMT) systems for translating English↔Odia. OdiEnCorp 2.0 includes existing English-Odia corpora, and we extended the collection using several other methods of data acquisition: scraping parallel data from many websites, including Odia Wikipedia, and applying optical character recognition (OCR) to extract parallel data from scanned images. Our OCR-based data extraction approach for building a parallel corpus is suitable for other low-resource languages that lack online content. The resulting OdiEnCorp 2.0 contains 98,302 sentences, with 1.69 million English tokens and 1.47 million Odia tokens. To the best of our knowledge, OdiEnCorp 2.0 is the largest Odia-English parallel corpus covering different domains, and it is freely available for non-commercial and research purposes.

Type
conference paper
Author(s)
Parida, Shantipriya
Dash, Satya Ranjan
Bojar, Ondrej
Motlicek, Petr
Pattnaik, Priyanka
Mallick, Debasish Kumar
Date Issued

2020

Publisher

European Language Resources Association (ELRA)

Published in
Proceedings of the WILDRE5 - 5th Workshop on Indian Language Data: Resources and Evaluation
ISBN of the book

979-10-95546-67-2

Start page

14

End page

19

URL

Link to IDIAP database

http://publications.idiap.ch/downloads/papers/2020/Parida_ELRA_2020.pdf

Link to paper

https://lrec2020.lrec-conf.org/media/proceedings/Workshops/Books/WILDRE-5book.pdf
Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
LIDIAP  
Event name
WILDRE5 - 5th Workshop on Indian Language Data: Resources and Evaluation

Event date
11–16 May 2020

Available on Infoscience
July 23, 2020
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/170335