UNIVERSAL DEPENDENCY TREEBANKS FOR LOW-RESOURCE INDIAN LANGUAGES: THE CASE OF BHOJPURI


Source: sanskrit.jnu.ac.in/conf/wildre5/slides/wildre-5_2020-0.7.pdf (25 May 2020)


Description
Resource Building
Syntactically annotated treebank
UD Framework
4,881 annotated tokens
ML-based Tagger and Parser
Data Source: BLTR
Domain: news and non-fiction
5,000 sentences (105,174 tokens)
254 sentences (4,881 tokens) manually annotated
XPOS and UPOS tags
Support: Hindi Treebank | BIS Tagset

Atul Kr. Ojha, Daniel Zeman
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
[email protected], [email protected]

Bhojpuri
Indo-Aryan language
Bihar | Jharkhand | Uttar Pradesh
Nepal | Trinidad | Mauritius | Guyana | Suriname | Fiji
Speakers: 50,579,447
Resource-poor language for ML

Accuracy
57.49% UAS | 45.50% LAS
79.69% UPOS | 77.64% XPOS

Statistics of morphological features

Statistics of UPOS tags
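Statistics like these can be derived directly from a treebank file. The following is a minimal sketch, assuming the treebank is stored in the standard 10-column CoNLL-U format; the function name `conllu_tag_stats` and the three-token sample (with placeholder word forms `w1`, `w2`) are hypothetical and are not taken from the Bhojpuri data.

```python
from collections import Counter


def conllu_tag_stats(conllu_text):
    """Count UPOS tags and Feature=Value pairs in CoNLL-U text."""
    upos_counts = Counter()
    feat_counts = Counter()
    for line in conllu_text.splitlines():
        line = line.strip()
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed rows, multiword-token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        upos_counts[cols[3]] += 1          # column 4: UPOS
        if cols[5] != "_":                 # column 6: FEATS
            for pair in cols[5].split("|"):
                feat_counts[pair] += 1
    return upos_counts, feat_counts


# Hypothetical toy sentence; forms and tags are placeholders.
SAMPLE_ROWS = [
    ["1", "w1", "w1", "NOUN", "NN", "Gender=Masc|Number=Sing", "2", "nsubj", "_", "_"],
    ["2", "w2", "w2", "VERB", "VM", "Number=Sing", "0", "root", "_", "_"],
    ["3", ".", ".", "PUNCT", "SYM", "_", "2", "punct", "_", "_"],
]
SAMPLE = "# sent_id = toy-1\n" + "\n".join("\t".join(r) for r in SAMPLE_ROWS) + "\n"

upos, feats = conllu_tag_stats(SAMPLE)
for tag, count in upos.most_common():
    print(tag, count)
```

The same two counters could be fed to any plotting library to produce bar charts of tag and feature frequencies.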

UD relations: 30 of the 37 universal relations are used.
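A coverage count of this kind can be checked mechanically. The sketch below assumes a CoNLL-U file and the 37 universal relations of UD version 2; the helper name `relation_coverage` and the toy rows are hypothetical. Language-specific subtypes such as `nsubj:pass` are folded into their universal relation before counting.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL_RELATIONS = {
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
}


def relation_coverage(conllu_text):
    """Return (used, unused) sets of universal relations in CoNLL-U text."""
    used = set()
    for line in conllu_text.splitlines():
        cols = line.strip().split("\t")
        # Keep only regular token rows (integer ID, 10 columns).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        base = cols[7].split(":")[0]       # column 8: DEPREL, subtype stripped
        if base in UNIVERSAL_RELATIONS:
            used.add(base)
    return used, UNIVERSAL_RELATIONS - used


# Hypothetical toy rows; word forms are placeholders.
ROWS = [
    ["1", "w1", "w1", "NOUN", "NN", "_", "2", "nsubj:pass", "_", "_"],
    ["2", "w2", "w2", "VERB", "VM", "_", "0", "root", "_", "_"],
    ["3", ".", ".", "PUNCT", "SYM", "_", "2", "punct", "_", "_"],
]
TOY = "\n".join("\t".join(r) for r in ROWS) + "\n"

used, unused = relation_coverage(TOY)
print(f"{len(used)} of {len(UNIVERSAL_RELATIONS)} universal relations used")
```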

Accuracy of a UDPipe model trained on the Hindi UD treebank (HDTB) and applied to the first 50 Bhojpuri sentences.

UDPipe accuracy of the conducted experiments
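For readers unfamiliar with the parsing metrics on this poster: UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the dependency label to match. The sketch below illustrates the two definitions under the simplifying assumption that gold and predicted tokens are already aligned one to one; it is not the evaluation script used for the reported figures, and the function name and toy analyses are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Compute (UAS, LAS) in percent.

    gold, predicted: equal-length lists of (head, deprel) tuples,
    one per token, assumed to share the same tokenization.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS: head index matches, label ignored.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both head index and label match.
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Hypothetical four-token sentence.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS {uas:.2f}% | LAS {las:.2f}%")  # UAS 75.00% | LAS 50.00%
```

In the toy example the third token has the right head but the wrong label, so it counts toward UAS only, while the fourth token has the wrong head and counts toward neither.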

Learning curve of the Bhojpuri models

Acknowledgements
This work has been supported by LINDAT/CLARIAH-CZ and Khresmoi, the grants no. LM2018101 and 7E11042 of the Ministry of Education, Youth and Sports of the Czech Republic, and FP7-ICT-2010-6-257528 of the European Union.