
Post on 21-Aug-2015

RESOURCE-LIGHT BANTU PART-OF-SPEECH TAGGING

Guy De Pauw (UA) Gilles-Maurice de Schryver (UGent) Janneke van de Loo (UA)

Motivation

Many data-driven taggers are available, but they need extensive annotated corpora.

Unsupervised part-of-speech tagging techniques for resource-scarce languages have produced limited results on Sub-Saharan languages.

Increasingly available: digital dictionaries, lexicons, word lists, ...

Research questions

• What information can we use for part-of-speech tagging?

• Can we use this information to bootstrap accurate part-of-speech taggers for the languages under investigation?

• How does this technique compare to the state of the art in data-driven part-of-speech tagging?

Bag-of-Substrings

Adam/PROPNAME alionekana/V chumbani/N kwake/PRON hana/NEG fahamu/N ./FULL_STOP
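The transcript names the bag-of-substrings representation but does not spell out how the substrings are extracted. The sketch below shows one plausible reading: every contiguous substring of a word becomes a feature, marked by where it occurs in the word. The function name bag_of_substrings, the max_len parameter, and the W=/P=/S=/I= markers are assumptions made for illustration, not details taken from the poster.

```python
def bag_of_substrings(word, max_len=None):
    """Return the set of position-marked substrings of `word`.

    Marker scheme (an assumption for this sketch):
      W=  the whole word
      P=  a proper prefix
      S=  a proper suffix
      I=  a substring that touches neither edge of the word
    """
    n = len(word)
    features = {"W=" + word}
    for start in range(n):
        for end in range(start + 1, n + 1):
            if start == 0 and end == n:
                continue  # the whole word is already covered by W=
            if max_len is not None and end - start > max_len:
                continue  # optional cap on substring length
            piece = word[start:end]
            if start == 0:
                features.add("P=" + piece)
            elif end == n:
                features.add("S=" + piece)
            else:
                features.add("I=" + piece)
    return features


if __name__ == "__main__":
    # "hana" is tagged NEG in the example sentence above.
    for feature in sorted(bag_of_substrings("hana")):
        print(feature)
```

Because the features are collected in a set, a substring that occurs several times in the same position class counts once. The motivation for such a representation is that in conjunctively written languages the prefixes, suffixes, and inner morphemes of a word are attached directly to the stem, so substrings can expose them without a morphological analyser.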

Train a maximum entropy classifier and compare it to a memory-based tagger.
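As a rough illustration of the training step, the sketch below fits a maximum entropy model (multinomial logistic regression) on word/tag pairs using plain stochastic gradient ascent. Everything here is an assumption for the sake of a self-contained example: the function names, the hyperparameters, the use of unmarked substrings as features, and the toy lexicon, which simply reuses the tagged words from the example sentence above. It does not reproduce the authors' toolkit, settings, or results.

```python
import math
from collections import defaultdict

# Toy lexicon: the tagged words from the example sentence (illustrative only).
LEXICON = [
    ("Adam", "PROPNAME"), ("alionekana", "V"), ("chumbani", "N"),
    ("kwake", "PRON"), ("hana", "NEG"), ("fahamu", "N"),
]


def substrings(word):
    """All contiguous substrings of `word`, used as binary features."""
    return {word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)}


def predict_proba(weights, tags, features):
    """Softmax over the summed feature weights of each tag."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in features) for t in tags}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def train_maxent(examples, epochs=200, learning_rate=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood.

    Hypothetical settings; no regularisation is applied in this sketch.
    """
    tags = sorted({tag for _, tag in examples})
    data = [(substrings(word), tag) for word, tag in examples]
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, gold in data:
            probs = predict_proba(weights, tags, features)
            for tag in tags:
                # Gradient term: observed indicator minus predicted probability.
                error = (1.0 if tag == gold else 0.0) - probs[tag]
                for f in features:
                    # Scale the step by the number of active features.
                    weights[(f, tag)] += learning_rate * error / len(features)
    return weights, tags


def tag_word(weights, tags, word):
    """Return the most probable tag for `word`."""
    probs = predict_proba(weights, tags, substrings(word))
    return max(tags, key=lambda t: probs[t])


if __name__ == "__main__":
    weights, tags = train_maxent(LEXICON)
    for word, _ in LEXICON:
        print(word, tag_word(weights, tags, word))
```

A memory-based tagger, the baseline named above, would instead store the training instances and label a new word by its nearest stored neighbours. Note that this sketch classifies words in isolation, whereas a full tagger would normally also use the surrounding context.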

Experimental Results

Conclusion

In the absence of large annotated corpora, the bag-of-substrings approach provides a low-resource, high-accuracy bootstrapping method for part-of-speech tagging of conjunctively written Bantu languages.

Demos
