atul ud paper - jawaharlal nehru...

Post on 26-Feb-2021


UNIVERSAL DEPENDENCY TREEBANKS FOR LOW-RESOURCE INDIAN LANGUAGES: THE CASE OF BHOJPURI

Description
Resource Building
Syntactically annotated treebank
UD Framework
4881 annotated tokens
ML-based Tagger and Parser
Data Source: BLTR
Domain: news and non-fiction
5000 sentences (105,174 tokens)
254 sentences (4881 tokens) manually annotated
XPOS and UPOS tags
Support: Hindi Treebank | BIS Tagset

Atul Kr. Ojha, Daniel Zeman

Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics

shashwatup9k@gmail.com
zeman@ufal.mff.cuni.cz

Bhojpuri
Indo-Aryan Language
Bihar | Jharkhand | Uttar Pradesh
Nepal | Trinidad | Mauritius | Guyana | Suriname | Fiji
Speakers: 50,579,447
Resource-Poor Language for ML

Accuracy
57.49% UAS | 45.50% LAS | 79.69% UPOS | 77.64% XPOS

Statistics of morphological features

Statistics of UPOS tags

UD relations. Out of 37 we use 30
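Statistics of this kind (UPOS tags, morphological features, UD relations in use) can be read off a CoNLL-U file. The following is a minimal sketch, not the authors' script: the function name `conllu_stats` and the toy sentence with placeholder word forms are assumptions made for illustration, and relation subtypes are collapsed to their universal part.

```python
from collections import Counter


def conllu_stats(text):
    """Count sentences, tokens, UPOS tags, features and relations in CoNLL-U text."""
    upos, feats, deprels = Counter(), Counter(), Counter()
    sentences = tokens = 0
    in_sentence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:                 # a blank line ends the current sentence
            in_sentence = False
            continue
        if line.startswith("#"):     # sentence-level comments carry no tokens
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        upos[cols[3]] += 1                       # column 4: UPOS
        if cols[5] != "_":                       # column 6: FEATS
            feats.update(cols[5].split("|"))
        deprels[cols[7].split(":")[0]] += 1      # column 8: DEPREL, subtype dropped
    return {"sentences": sentences, "tokens": tokens,
            "upos": upos, "feats": feats, "deprels": deprels}


# Toy input with placeholder forms (hypothetical, not taken from the treebank).
SAMPLE = (
    "# sent_id = toy-1\n"
    "1\tw1\tw1\tPROPN\tNNP\tNumber=Sing\t2\tnsubj\t_\t_\n"
    "2\tw2\tw2\tVERB\tVM\t_\t0\troot\t_\t_\n"
    "3\tw3\tw3\tPUNCT\tSYM\t_\t2\tpunct\t_\t_\n"
    "\n"
)

stats = conllu_stats(SAMPLE)
print(stats["sentences"], stats["tokens"], len(stats["deprels"]))
```

Run on the whole treebank, `len(stats["deprels"])` would give the number of distinct universal relations in use.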

Accuracy of a UDPipe model trained on the Hindi UD treebank (HDTB) and applied to the first 50 Bhojpuri sentences.

UDPipe accuracy of the conducted experiments
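For reference, UAS is the share of tokens whose predicted head is correct, and LAS additionally requires the correct relation label. The sketch below illustrates both on toy values; it assumes gold and predicted analyses share the same tokenization, and the function name `attachment_scores` is hypothetical. It is not the evaluation script used for the reported figures.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent for aligned lists of (head, deprel) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    if not gold:
        raise ValueError("no tokens to score")
    total = len(gold)
    # UAS: the head index matches, whatever the label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both head index and relation label match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Toy four-token sentence (hypothetical values, not treebank data).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS {uas:.2f}% | LAS {las:.2f}%")  # UAS 75.00% | LAS 50.00%
```

In the toy case the third token has the right head but the wrong label, so it counts toward UAS only, while the fourth token has the wrong head and counts toward neither.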

Learning curve of the Bhojpuri models

Acknowledgements
This work has been supported by LINDAT/CLARIAH-CZ and Khresmoi, the grants no. LM2018101 and 7E11042 of the Ministry of Education, Youth and Sports of the Czech Republic, and FP7-ICT-2010-6-257528 of the European Union.
