Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Computational Processing of Arabic Dialects: Challenges, Advances & Future Directions. Keynote, The 2nd Workshop on Arabic Corpora and Processing Tools, LREC, May 24, 2016. Nizar Habash, New York University Abu Dhabi. [email protected] CAMeL Lab


Page 1: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Computational Processing of Arabic Dialects: Challenges, Advances & Future Directions

Keynote, The 2nd Workshop on Arabic Corpora and Processing Tools

LREC, May 24, 2016

Nizar Habash, New York University Abu Dhabi

[email protected]

CAMeL Lab

Page 2: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Roadmap

•  Introduction
•  Orthographic processing
•  Morphological processing
•  Working on a new dialect?
•  Future directions

Page 3: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Introduction

•  Forms of Arabic
    –  Classical Arabic (CA)
        •  Classical historical texts
        •  Liturgical texts
    –  Modern Standard Arabic (MSA)
        •  News media & formal speeches and settings
        •  Only written standard
    –  Dialectal Arabic (DA)
        •  Predominantly spoken vernaculars
        •  No written standards
•  Dialect vs. Language

Page 4: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Arabic and its Dialects

•  Official language: Modern Standard Arabic (MSA)
    –  No one's native language
•  What is a ‘dialect’?
    –  Political and religious factors
•  Regional dialects
    –  Egyptian Arabic (EGY)
    –  Levantine Arabic (LEV)
    –  Gulf Arabic (GLF)
    –  North African Arabic (NOR): Moroccan, Algerian, Tunisian
    –  Iraqi, Yemenite, Sudanese, Maltese?
•  Social dialects
    –  City, Rural, Bedouin
    –  Gender, Religious variants

Page 5: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Introduction

•  Arabic Diglossia
    –  Diglossia is where two forms of the language exist side by side
    –  MSA is the formal public language
        •  Perceived as “language of the mind”
    –  Dialectal Arabic is the informal private language
        •  Perceived as “language of the heart”
•  General Arab perception: dialects are a deteriorated form of Classical Arabic
•  Continuum of dialects

Page 6: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Code Switching

الأنامابعتقدألنهعمليةالليعمبيعارضوااليومتمديدللرئيسلحودهمالليطالبوابالتمديدللرئيسالهراويوبالتاليموضوعمنهموضوعمبدئيعلىاألرضأنابحترمأنهيكونفينظرةديمقراطيةلألموروأنهيكونفياحترامللعبةالديمقراطيةوأنيكونفيممارسةديمقراطيةوبعتقدإنهالكلفي

علىموضوعإنجازاتبسبدييرجعلحظةأكثريةساحقةفيلبنانتريدهذااملوضوع،لبنانأوفيلبنانمنالنظامرئاسينظامفيلبنانالنظامعنإنجازاتالعهدلكنهليعنينعمنحكيالعهد

عمليابيدالحكومةمجتمعةوالرئيسلحودأثبتهيرئاسيوبالتاليالسلطةنظامبعدالطائفليسشخصمسؤولفيمنصبمعنيوأناعشتهذااملوضوعبأنهملابيكونفياألخيرةممارستهخالل

صالحةضمنخطابومبادئخطابملابياخدمواقفشخصيابممارستيفيموضوعاالتصاالتالسلطةالتنفيذيةألنهمنهرئيسجمهوريةهويكونرئيسمشمطلوبمنإنماهوإلىجانبهالقسم

عليهالتوجيهعليهإبداءاملالحظاتعليهبقىفيلبنانمابعدإتفاقالطائفرئيسالسلطةالتنفيذيةالوطنيةالشاملةكييظلفيمصالحةوطنيةكييظلالقولماهوخطأوماهوصحعليهتثميرجهود

باتجاهيروحتوافقمابنياملسلمواملسيحيفيلبنانيحتضنأبناءهذاالبلدمايتركاملسارفيوآمنوافيهاالليمشيوامعهالخطأنعمإنماخطابالقسمكانموضوعمبادئطرحتهوملتزمفيها

التزموافيهاأناأثبتخاللاألربعسنواتباملمارسةالحكوميةأنيالتزمتفيهاوملاالتزمنابهذاأنابتفهمتمامااملوضوعكانالرئيسلحودإلىجنبنافيهذااملوضوع،أمااملوضوعالديمقراطي

فتحإعادةانتخابهذاهالوجهةالنظربسماممكننقولإنهالدستورأوتعديلههوأوإمكانيةمسحهيئةفيجمهوريةبواليةثانيةهوديمقراطيضمناملجلسوالتصويتإلىماهنالكلرئيس

قناعتيفيهذااملوضوع.يعنيجوهرالديمقراطيةهذاباألقل

MSA and dialect mixing in speech
•  phonology, morphology and syntax

Aljazeera transcript: http://www.aljazeera.net/programs/op_direction/articles/2004/7/7-23-1.htm

Legend: MSA, LEV

Page 7: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Why is Arabic processing hard?

| | Arabic | English |
|---|---|---|
| Orthographic ambiguity | More | Less |
| Orthographic inconsistency | More | Less |
| Morphological inflections | More | Less |
| Morpho-syntactic complexity | More | Less |
| Word order freedom | More | Less |
| Dialectal variation | More | Less |

Page 8: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Computational Processing of Standard Arabic

•  There has been a large and growing amount of work on Standard Arabic processing:
    –  Multiple morphological analyzers and taggers
        •  BAMA/SAMA, Elixir, AlKhalil, ALMOR, MADAMIRA, etc.
    –  Multiple treebanks and parsers
        •  Penn ATB, Prague DTB, CATiB, Quran Corpus
    –  Large collections of monolingual text
        •  Gigaword, news collections, QALB, and others
    –  Large collections of bilingual/multilingual text
        •  UN corpus, news collections, etc.
    –  Sentiment resources
        •  ArSenL, SLSA, SAMAR, etc.
    –  Not to mention the traditional resources on lexicography, morphology and syntax!
•  Much more to do still!
•  Resources and work on dialects are very limited in comparison.

Page 9: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Why Work on Arabic Dialects?

•  Dialects are the primary form of Arabic used in all unscripted spoken genres: conversational, talk shows, interviews, etc.
    –  Speech recognition and dialogue systems must model dialects
•  Dialects are increasingly in use in new written media (newsgroups, weblogs, forums, etc.)
    –  Text analytics of Arabic must include dialectal modeling
•  Substantial Dialect-MSA differences impede direct application of MSA NLP tools

Page 10: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Computational Challenges

•  Enormous variety
    –  Many dialects and sub-dialects, code switching
•  Orthographic ambiguity
    –  Under-specification and inconsistency
•  Morphological complexity
    –  More clitics and fewer morphological features than MSA
•  Overall annotated resource poverty
    –  There is a lot of monolingual raw data
    –  Limited lexicons
    –  Limited treebanks, propbanks, etc.

Page 11: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Computational Solutions

•  Treat Arabic dialects as different languages
    –  Build resources and tools from scratch
        •  Morphological analyzers, annotated treebanks, parallel data…
    –  Pro: model different genres
    –  Con: expensive, effort duplication
•  Exploit similarity between dialects and MSA and among dialects
    –  Convert (or relate) dialectal resources to MSA or vice versa to adapt
    –  Pro: less duplication, exploits relationships
    –  Con: there is a limit to how well this will work
•  Hybrid approach
•  Community standards
    –  Orthography, morphological analysis, POS tag sets, treebanks, etc.

Page 12: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Roadmap

•  Introduction
•  Orthographic processing
•  Morphological processing
•  Working on a new dialect?
•  Future directions

Page 13: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Dialectal Phonological Variations

•  Major variants

| MSA | Dialects |
|---|---|
| ق /q/ | /q/, /k/, /ʔ/, /g/, /ʤ/ |
| ث /θ/ | /θ/, /t/, /s/ |
| ذ /δ/ | /δ/, /d/, /z/ |
| ج /ʤ/ | /ʤ/, /g/, /j/ |

•  Some of many limited variants
    –  /l/ → /n/  MSA: /burtuqāl/ → LEV: /burtʔān/ ‘orange’
    –  /ʕ/ → /ħ/  MSA: /kaʕk/ → EGY: /kaħk/ ‘cookie’
    –  Emphasis add/delete: MSA: /fustān/ → LEV: /fustān/ ‘dress’

Page 14: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

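Arabic Script Orthographic Variants

| | IRQ | LEV | EGY | TUN | MOR |
|---|---|---|---|---|---|
| /ʤ/ | ج | ج | چ | ج | ج |
| /g/ | گ | چ | ج | ڨ | ڭ |
| /tʃ/ | چ | تش | تش | تش | تش |
| /p/ | پ | پ | پ | پ | پ |
| /v/ | ڤ | ڤ | ڤ | ڥ | ڥ |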

Page 15: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Latin Script for Arabic?

•  Several proposals to the Arabic Language Academy in the 1940s
•  Said Akl Experiment (1961)
•  Web Arabic (Arabizi, Arabish, Franco-arabe)
    –  No standard, but common conventions

| عربي | IPA | Latin |
|---|---|---|
| أ إ آ ء ؤ ئ | /ʔ/ | ‘ 2 Ø |
| ث | /θ/ | th |
| ة | /a/, /t/ | a t |
| ط | /tʕ/ | t T 6 |
| ح | /ħ/ | H h 7 |
| ع | /ʕ/ | ‘ 3 Ø |
| خ | /x/ | kh 7’ x 8 |
| غ | /ʁ/ | g gh 3’ |
| ذ | /δ/ | th |
| ق | /q/ | q |
| ش | /ʃ/ | sh ch |
| ي | /y/ /ay/ /ī/ /ē/ | y, i, e, ai, ei, … |

Image: Akl 1961
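The ambiguity in these conventions can be made concrete with a small lookup. The sketch below is illustrative only: it encodes a subset of the table above, uses a plain ASCII apostrophe for the primed digits, and the names `ARABIZI_TO_ARABIC` and `candidates` are hypothetical rather than part of any released tool.

```python
# Illustrative subset of the Arabizi conventions in the table above.
# Names are hypothetical; an ASCII apostrophe stands in for the prime mark.
ARABIZI_TO_ARABIC = {
    "2": ["ء", "أ", "إ", "آ", "ؤ", "ئ"],
    "3": ["ع"],
    "3'": ["غ"],
    "7": ["ح"],
    "7'": ["خ"],
    "6": ["ط"],
    "sh": ["ش"],
    "kh": ["خ"],
    "gh": ["غ"],
    "th": ["ث", "ذ"],  # one Latin spelling, two Arabic letters
}


def candidates(symbol: str) -> list[str]:
    """Return the Arabic letters an Arabizi symbol may stand for."""
    return ARABIZI_TO_ARABIC.get(symbol, [])
```

A symbol such as `th` returns two candidates, which is one reason a character-by-character mapping is not enough and context has to decide.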

Page 16: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Lack of Orthographic Standards

•  Orthographic inconsistency
•  Egyptian /mabinʔulhalakʃ/
    –  mAbinquwlhAlak$  مابنقولهالكش
    –  mAbin&ulhAlak$  مابنؤلهالكش
    –  mAbin}ulhAlak$  مابنئلهالكش
    –  mAbinqulhAlak$  مابنقلهالكش
    –  …
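The Latin-letter strings above are in Buckwalter transliteration, a one-to-one romanization of Arabic script. As a rough sketch, the mapping for just the symbols used in these variants could look as follows. The table is deliberately partial, and `BW2AR` and `bw_to_arabic` are hypothetical names.

```python
import re

# Partial Buckwalter-to-Arabic table: only the symbols used in the variants above.
BW2AR = {
    "A": "ا", "b": "ب", "h": "ه", "k": "ك", "l": "ل", "m": "م", "n": "ن",
    "q": "ق", "w": "و", "$": "ش", "&": "ؤ", "}": "ئ",
    "a": "\u064E", "u": "\u064F", "i": "\u0650",  # fatha, damma, kasra
}
DIACRITICS = re.compile("[\u064B-\u0652]")  # Arabic short-vowel marks


def bw_to_arabic(text: str, keep_diacritics: bool = False) -> str:
    """Map Buckwalter symbols to Arabic script; unknown symbols pass through."""
    arabic = "".join(BW2AR.get(ch, ch) for ch in text)
    return arabic if keep_diacritics else DIACRITICS.sub("", arabic)
```

With diacritics dropped, `bw_to_arabic("mAbinquwlhAlak$")` yields the undiacritized spelling shown next to it on the slide.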

Page 17: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Spelling Inconsistency

•  Social media spelling variations
    –  +ak
    –  +aaaaak
    –  +k

Page 18: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Arabic Lexical Variation

•  Arabic dialects vary widely lexically
•  Arabic orthography allows consolidating some variations

| English | Table | Cat | Of | I_want | There_is | There_isn’t |
|---|---|---|---|---|---|---|
| MSA | Tāwila طاولة | qiTTa قطة | idafa Ø | ‘uridu اريد | yūjadu يوجد | lā yujadu لا يوجد |
| Moroccan | mida ميدة | qeTTa قطة | dyāl ديال | bγīt بغيت | kāyn كاين | mākāynš ماكاينش |
| Egyptian | Tarabēza طربيزة | ‘oTTa قطة | bitāς بتاع | ςāwez عاوز | fī في | mafīš مفيش |
| Syrian | Tāwle طاولة | bisse بسة | tabaς تبع | biddi بدي | fī في | māfi مافي |
| Iraqi | mēz ميز | bazzūna بزونة | māl مال | ‘arīd اريد | aku اكو | māku ما |

Page 19: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

CODA: A Conventional Orthography for Dialectal Arabic

•  Developed by CADIM for computational processing
•  Objectives
    –  CODA covers all DAs, minimizing differences in choices
    –  CODA is easy to learn and produce consistently
    –  CODA is intuitive to readers unfamiliar with it
    –  CODA uses Arabic script
•  Inspired by previous efforts from the LDC and linguistic studies

Page 20: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

CODA Examples

CODA: ماشفتش صحابي الفترة اللي قبل الامتحانات

| Word | Gloss |
|---|---|
| ماشفتش | I did not see |
| صحابي | my friends |
| الفترة | the period |
| اللي | which |
| قبل | before |
| الامتحانات | the exams |

Spelling variants

متحاناتإلا بلأ ـىاللـ هالفتر ـىصحابـ شفتشماـمتحاناتلـا بلا لليإ ةرطـالفـ حابيوصـ شفتشمـ

ناتـحـاالمتـ abl ـىللـإ هرطـالفـ ـىحابـوصـ فتشوماشـناتـحـمتـإلا qbl ـيلـا ilftra Su7abi فتشوشـماناتـحــمتـلـا qabl لىا sohaby فتشوشـمـ

ilimti7anat ـيإلـ mashoftish

limtihanaat إلى illi

Page 21: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

CODA Examples

| Phenomenon | Original | CODA |
|---|---|---|
| Spelling errors | الاجابه | الإجابة |
| Typos | شبب | سبب |
| Speech effects | كبيييييييير | كبير |
| Merges | اليومبريستيج | اليوم بريستيج |
| Splits | المع روف | المعروف |
| MSA root cognate | آلب، كلب | قلب |
| Dialectal clitic guidelines | عهلبيت، مشفناش | عهالبيت، ماشافناش |
| Unique dialect words | بردو، برضو | برضه |
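Speech effects such as elongated letters are the easiest of these phenomena to approximate with a rule. The sketch below is a naive heuristic, not the method used by any of the systems in this talk: it assumes a character repeated three or more times is an elongation and collapses it to one copy. `collapse_elongation` is a hypothetical name.

```python
import re


def collapse_elongation(token: str) -> str:
    """Collapse any character repeated three or more times into one copy.

    Doubled letters (e.g. the two lams in اللي) are left untouched.
    """
    return re.sub(r"(.)\1{2,}", r"\1", token)
```

The threshold of three is a design choice: legitimately doubled letters are common in Arabic script, while runs of three or more are almost always expressive lengthening. The other rows of the table (typos, merges, splits, cognates) need lexical knowledge and cannot be handled this way.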

Page 22: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

CODAfication: Raw Orthography to CODA Conversion

•  What:
    –  Converts from raw DA orthography to CODA
    –  Corrects typos and various speech effects
•  Approach
    –  Eskander et al. (2012) (CODAFY)
        •  Model specific phenomena: hamza, plural wA suffix, etc.
        •  Supervised learning
        •  Classification problem
    –  Farra et al. (2014)
        •  Generalized character replacement model
    –  Best results: integrated in morphological analysis (MADA-ARZ)
•  Example:
    –  Input: مشفتش صحابى الفتره الى فاتت (m$ft$ SHAbY Alftrh AlY fAtt)
    –  Output: ما شفتش صحابي الفترة اللي فاتت (mA $ft$ SHAby Alftrp Ally fAtt)
•  Evaluation: Egyptian Arabic

| CODAfication | Accuracy (tokens) | A/Y Norm. Accuracy (tokens) |
|---|---|---|
| Baseline (doing nothing) | 76.8% | 90.5% |
| CODAFY v0.4 | 91.5% | 95.2% |
| MADA-ARZ | 92.9% | 95.5% |
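The two accuracy columns differ only in whether Alef and Ya spelling differences count as errors. A minimal sketch of such a metric is given below, assuming that "A/Y Norm." means mapping the hamzated Alef forms to bare Alef and Alef Maqsura to Ya before comparing. The exact normalization used in the evaluation is not stated on the slide, and the function names are hypothetical.

```python
# Assumed A/Y normalization: hamzated Alef forms -> bare Alef, Alef Maqsura -> Ya.
AY_TABLE = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ى": "ي"})


def ay_normalize(token: str) -> str:
    return token.translate(AY_TABLE)


def token_accuracy(predicted, gold, normalize=False):
    """Fraction of aligned tokens that match, optionally after A/Y normalization."""
    if len(predicted) != len(gold):
        raise ValueError("token sequences must be aligned one-to-one")
    if normalize:
        predicted = [ay_normalize(t) for t in predicted]
        gold = [ay_normalize(t) for t in gold]
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

Under this reading, the high baseline in the normalized column (90.5%) suggests that a large share of raw spelling differences are Alef/Ya variants.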

Page 23: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

3Arrib: Arabizi-to-Arabic Conversion

•  A system for automatic mapping of Arabizi to Arabic script in CODA
•  Evaluation
    –  Transliteration correct: 83.6% of Arabic words and names.
•  Examples

| Arabizi | Buckwalter | Arabic |
|---|---|---|
| ana msh 3aref a2ra elly enta katbo | AnA m$ EArf AqrA Ally Ant kAtbh | انا مش عارف اقرا اللي انت كاتبه |
| wfelaa5er tele3 fshenkw mab2raash arabic | wflAxr TlE f$nkw mab2raash ArAbyk | و+فال+اخر طلع فشنكو mab2raash ارابيك |

(Al-Badrashiny et al., CoNLL 2014; Eskander et al., EMNLP Code Switch Workshop 2014)

Page 24: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

3Arrib: http://nlp.ldeo.columbia.edu/arrib/

Page 25: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Roadmap

•  Introduction
•  Orthographic processing
•  Morphological processing
•  Working on a new dialect?
•  Future directions

Page 26: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Dialectal Arabic Morphological Variation

•  Nouns
    –  No case marking
        •  Word order implications
    –  Paradigm reduction
        •  Consolidating masculine & feminine plural
•  Verbs
    –  Paradigm reduction
        •  Loss of dual forms
        •  Consolidating masculine & feminine plural (2nd, 3rd person)
        •  Loss of morphological moods
            –  Subjunctive/jussive form dominates in some dialects
            –  Indicative form dominates in others
•  Other aspects increase in complexity

Page 27: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

DA Morphological Variation: Verb Morphology

Morpheme labels: conj, verb, object, subj, tense, IOBJ, neg, neg

MSA: ولم تكتبوها له
    /walam taktubūhā lahu/
    /wa+lam taktubū+hā la+hu/
    and+not_past write_you+it for+him

EGY: وماكتبتوهالوش
    /wimakatabtuhalūʃ/
    /wi+ma+katab+tu+ha+lū+ʃ/
    and+not+wrote+you+it+for_him+not

‘And you didn’t write it for him’
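The segmented Egyptian form and its gloss line up morpheme by morpheme, which a few lines of code can make explicit. This is only an illustration of the interlinear notation on the slide; `align_gloss` is a hypothetical helper.

```python
def align_gloss(segmented: str, gloss: str) -> list[tuple[str, str]]:
    """Pair each '+'-separated morpheme with its '+'-separated gloss."""
    morphemes = segmented.split("+")
    glosses = gloss.split("+")
    if len(morphemes) != len(glosses):
        raise ValueError("segmentation and gloss have different lengths")
    return list(zip(morphemes, glosses))


egy = align_gloss("wi+ma+katab+tu+ha+lū+ʃ", "and+not+wrote+you+it+for_him+not")
# The two 'not' morphemes wrap the verb complex: the negation circumfix.
negation = [morpheme for morpheme, g in egy if g == "not"]
```

Here `negation` holds the two halves of the Egyptian negation circumfix, where MSA uses a single separate particle.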

Page 28: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

DA Morphological Variation

| | Perfect: Past | Imperfect: Subjunctive | Imperfect: Present (habitual / progressive) | Imperfect: Future |
|---|---|---|---|---|
| MSA | كتب /kataba/ | يكتب /jaktuba/ | يكتب /jaktubu/ | سيكتب /sajaktubu/ |
| LEV | كتب /katab/ | يكتب /jiktob/ | بيكتب /bjoktob/ (habitual); عم بيكتب /ʕam bjoktob/ (progressive) | حيكتب /ħajiktob/ |
| EGY | كتب /katab/ | يكتب /jiktib/ | بيكتب /bjiktib/ | هيكتب /hajiktib/ |
| IRQ | كتب /kitab/ | يكتب /jiktib/ | ديكتب /dajiktib/ | رح يكتب /raħ jiktib/ |
| MOR | كتب /kteb/ | يكتب /jekteb/ | كيكتب /kjekteb/ | غيكتب /ʁajekteb/ |

Page 29: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

DA Morphological Variation: Verb Conjugation

| | Perfect 1S | Perfect 2S♂ | Perfect 2S♀ | Imperfect 1S | Imperfect 1P | Imperfect 2S♀ |
|---|---|---|---|---|---|---|
| MSA | كتبت /katabtu/ | كتبت /katabta/ | كتبت /katabti/ | اكتب /aktubu/ | نكتب /naktubu/ | تكتبين /taktubīna/, تكتبي /taktubī/ |
| LEV | كتبت /katabt/ | كتبت /katabt/ | كتبتي /katabti/ | اكتب /aktob/ | نكتب /noktob/ | تكتبي /toktobi/ |
| IRQ | كتبت /kitabt/ | كتبت /kitabt/ | كتبتي /kitabti/ | اكتب /aktib/ | نكتب /niktib/ | تكتبين /tikitbīn/ |
| MOR | كتبت /ktebt/ | كتبتي /ktebti/ | كتبتي /ktebti/ | نكتب /nekteb/ | نكتبوا /nektebu/ | تكتبي /tektebi/ |

Page 30: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Morphological Ambiguity

•  Morphological richness
    –  Token Arabic/English = 80%
    –  Type Arabic/English = 200%
•  Morphological ambiguity
    –  Each word: 12.3 analyses and 2.7 lemmas
•  Derivational ambiguity: العين
    –  the eye, the water spring, Al-Ain city, the notable
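The token and type ratios can be read as counts over the two sides of a parallel corpus: Arabic uses fewer tokens than English but about twice as many distinct word forms. A minimal sketch of that computation follows, assuming whitespace-tokenized text. The sample sentence pair is invented for illustration and does not reproduce the reported figures.

```python
def token_type_ratios(arabic_tokens, english_tokens):
    """Return (token ratio, type ratio) of Arabic relative to English."""
    token_ratio = len(arabic_tokens) / len(english_tokens)
    type_ratio = len(set(arabic_tokens)) / len(set(english_tokens))
    return token_ratio, type_ratio


# Toy data (hypothetical): two Arabic words against eight English words.
arabic = ["wktbh", "wktbth"]
english = "and he wrote it and she wrote it".split()
ratios = token_type_ratios(arabic, english)
```

On a tiny sample the type ratio is not meaningful. The effect on the slide only emerges at corpus scale, where English function words repeat and Arabic inflected forms do not.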

Page 31: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Analysis vs. Disambiguation

هل سينجح بين أفليك في دور باتمان؟
Will Ben Affleck be a good Batman?

Analyses of بين:

| POS | Diacritized form | Gloss |
|---|---|---|
| PV+PVSUFF_SUBJ:3MS | bay~an+a | He demonstrated |
| PV+PVSUFF_SUBJ:3FP | bay~an+~a | They demonstrated (f.p) |
| NOUN_PROP | biyn | Ben |
| ADJ | bay~in | Clear |
| PREP | bayn | Between, among |

•  Morphological Analysis is out-of-context
•  Morphological Disambiguation is in-context


Page 33: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Input: word window W-4 W-3 W-2 W-1 W0 W1 W2 W3 W4

MORPHOLOGICAL ANALYZER
•  Rule-based
•  Human-created

MORPHOLOGICAL CLASSIFIERS
•  Multiple independent classifiers
•  Corpus-trained

RANKER
•  Heuristic or corpus-trained
•  Orders the candidate analyses: 1st, 2nd, 3rd, 4th, 5th

MADA (Habash & Rambow 2005; Roth et al. 2008)
MADAMIRA (Pasha et al., 2014)
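The analyzer, classifier and ranker pipeline can be illustrated with a toy ranker that scores each out-of-context analysis by how many classifier predictions it agrees with. This is a simplified sketch, not the MADA implementation: the analyses are the five readings of "byn" from the Ben Affleck example earlier in the talk, the single `pos` prediction stands in for the multiple classifiers, and the uniform default weight is an assumption.

```python
# Out-of-context analyses of "byn", as listed earlier in the talk.
ANALYSES = [
    {"pos": "PV+PVSUFF_SUBJ:3MS", "diac": "bay~an+a", "gloss": "he demonstrated"},
    {"pos": "PV+PVSUFF_SUBJ:3FP", "diac": "bay~an+~a", "gloss": "they demonstrated (f.p)"},
    {"pos": "NOUN_PROP", "diac": "biyn", "gloss": "Ben"},
    {"pos": "ADJ", "diac": "bay~in", "gloss": "clear"},
    {"pos": "PREP", "diac": "bayn", "gloss": "between, among"},
]


def rank(analyses, predictions, weights=None):
    """Sort analyses by weighted agreement with the classifiers' predictions."""
    weights = weights or {}

    def score(analysis):
        return sum(
            weights.get(feature, 1.0)
            for feature, value in predictions.items()
            if analysis.get(feature) == value
        )

    return sorted(analyses, key=score, reverse=True)


# Hypothetical in-context prediction: a POS classifier says "proper noun".
best = rank(ANALYSES, {"pos": "NOUN_PROP"})[0]
```

Because the analyzer proposes and the ranker only selects, the output is always a complete, internally consistent analysis rather than a mix of independently predicted features.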

Page 34: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

MADAMIRA

•  Newest tool from the CADIM group (Pasha et al., 2014)
•  Combines MADA (Habash & Rambow, 2005) and AMIRA (Diab et al., 2004)
    –  Morphological disambiguation
    –  Tokenization
    –  Base phrase chunking
    –  Named entity recognition
•  MSA and Egyptian Arabic modes
•  Server mode with XML interface
•  Online demo
    –  http://nlp.ldeo.columbia.edu/madamira/
    –  http://camel.abudhabi.nyu.edu/madamira/

Pipeline: Input Arabic Text → Morphological Disambiguation → Tokenization → Base Phrase Chunking → Named Entity Recognition → User NLP Applications

Page 35: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Morphological Disambiguation

Systems: MDMRA-MSA, MADA-ARZ

| Training data | MSA | MSA | ARZ | MSA+ARZ |
|---|---|---|---|---|
| Test set | MSA | EGY | EGY | EGY |
| All | 84.3% | 27.0% | 75.4% | 64.7% |
| POS + Features | 85.4% | 35.7% | 84.5% | 75.5% |
| Full Diacritization | 86.4% | 32.2% | 83.2% | 72.2% |
| Lemmatization | 96.1% | 67.1% | 86.3% | 82.8% |
| Base POS-tagging | 96.1% | 82.1% | 91.1% | 91.4% |
| ATB Segmentation | 99.1% | 90.5% | 97.4% | 97.5% |

Example: wkAtb وكاتب ‘and (the) writer of’

Tokenization: w+ kAtb

wakAtibu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0
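The analysis line above is a diacritized form, a lemma, and a series of `feature:value` pairs. A short sketch of reading such a line into a dictionary follows. It assumes the whitespace-separated layout exactly as shown on the slide, which may differ from the tool's actual output format, and `parse_analysis` is a hypothetical helper.

```python
def parse_analysis(line: str) -> dict:
    """Split 'diac lemma feat:value ...' into a flat dictionary."""
    diac, lemma, *pairs = line.split()
    features = dict(pair.split(":", 1) for pair in pairs)
    return {"diac": diac, "lemma": lemma, **features}


line = (
    "wakAtibu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 "
    "asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0"
)
analysis = parse_analysis(line)
```

In this example `prc2:wa_conj` records the conjunction proclitic, which is what the tokenization `w+ kAtb` splits off.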

Page 36: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

CALIMA-Egyptian v0.5

•  CALIMA is the Columbia Arabic Language Morphological Analyzer
•  CALIMA-EGY
    –  Extends the Egyptian Colloquial Arabic Lexicon (ECAL) (Kilany et al., 2002) and Standard Arabic Morphological Analyzer (SAMA) (Graff et al., 2009).
    –  Follows the part-of-speech (POS) guidelines used by the LDC for Egyptian Arabic (Maamouri et al., 2012b).
    –  Accepts multiple orthographic variants and normalizes them to CODA (Habash et al., 2012).
    –  Incorporates annotations by the LDC for Egyptian Arabic (~250K words).

Page 37: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

CALIMA-ARZ Example

Word: mktbtlhA$ مكتبتلهاش

Analysis 1
    Lemma: katab_1
    CODA: mA_katabt_lahA$
    POS: mA/NEG_PART+katab/PV+t/PVSUFF_SUBJ:2MS+li/PREP+hA/PRON_3FS+$/NEG_PART
    Gloss: not+write+you+to/for+it/them/her+not

Analysis 2
    Lemma: katab_1
    CODA: mA_katabit_lahA$
    POS: mA/NEG_PART+katab/PV+it/PVSUFF_SUBJ:3FS+li/PREP+hA/PRON_3FS+$/NEG_PART
    Gloss: not+write+she/it/they+to/for+it/them/her+not

Page 38: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

CALIMA-Egyptian v0.5

•  Incorporates LDC ARZ annotations (p1-p6)
    –  251K tokens, 52K types
    –  Annotation cleanup needed
    –  Extends SAMA (Standard Arabic Morph Analyser)

| System | Token Recall | Type Recall |
|---|---|---|
| SAMA v3.1 (Standard Arabic) | 67.7% | 59.7% |
| CALIMA-EGY v0.5 (Egyptian core) | 88.7% | 75.8% |
| CALIMA-EGY v0.5 (++ SAMA dialect extensions) | 92.6% | 81.5% |
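Token recall and type recall weight the same coverage question differently: frequent words dominate the token figure, while every distinct form counts once in the type figure. The sketch below assumes recall means "the analyzer returns at least one analysis". The actual evaluation may also require the analysis to match the gold annotation, which the slide does not state. The toy lexicon and `coverage` function are hypothetical.

```python
def coverage(tokens, can_analyze):
    """Return (token recall, type recall) for an analyzer predicate."""
    types = set(tokens)
    token_recall = sum(can_analyze(t) for t in tokens) / len(tokens)
    type_recall = sum(can_analyze(t) for t in types) / len(types)
    return token_recall, type_recall


# Toy analyzer (hypothetical): a word is "analyzed" if it is in a small lexicon.
lexicon = {"mA", "byn"}
tokens = ["mA", "mA", "mA", "byn", "mktbtlhA$"]
token_recall, type_recall = coverage(tokens, lexicon.__contains__)
```

Token recall exceeding type recall, as in every row of the table, is the expected pattern when the uncovered forms are mostly rare.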


Page 40: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

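Morphological Disambiguation

Systems: MDMRA-MSA, MDMRA-EGY

| Training data | MSA | MSA | EGY | MSA+EGY |
|---|---|---|---|---|
| Test set | MSA | Egyptian Arabic (EGY) | Egyptian Arabic (EGY) | Egyptian Arabic (EGY) |
| All | 84.3% | 27.0% | 75.4% | 64.7% |
| POS + Features | 85.4% | 35.7% | 84.5% | 75.5% |
| Full Diacritization | 86.4% | 32.2% | 83.2% | 72.2% |
| Lemmatization | 96.1% | 67.1% | 86.3% | 82.8% |
| Base POS-tagging | 96.1% | 82.1% | 91.1% | 91.4% |
| ATB Segmentation | 99.1% | 90.5% | 97.4% | 97.5% |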

Page 41: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

MADAMIRA: http://camel.abudhabi.nyu.edu/madamira/


Page 44: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Roadmap

•  Introduction
•  Orthographic processing
•  Morphological processing
•  Working on a new dialect?
•  Future directions

Page 45: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Towards Morphological Tagging of a New Dialect?

•  Review the literature
    –  Hidden gems from previous efforts
•  Data Collection
•  Data Annotation
    –  Guidelines: CODA, POS tags, etc.
    –  Noisy automatic processing: Egyptian MADAMIRA?
    –  Training annotators, quality control
    –  This is necessary, at the very least for benchmarking
•  Building the Morphological Analyzer
    –  Eskander et al. (2013)’s technique for paradigm completion
    –  Salloum and Habash’s (2011) ADAM method for extending MSA
•  Building the Morphological Tagger
    –  MADAMIRA framework, e.g. Egyptian Arabic (Habash et al. 2012)
    –  Other tagging techniques

Page 46: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Towards Morphological Tagging of a New Dialect?

•  Curras Corpus (Jarrar et al., 2014)
•  Gumar Corpus (Khalifa et al., 2016)

Page 47: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

The Gumar Corpus: A Morphologically Annotated Corpus of Gulf Arabic

•  ~100 million words
•  Mainly long conversational novels published anonymously online (روايات النت ‘Internet novels’).
•  Writers of the novels remain anonymous under pen names. Although there is no claim of copyright, it is conventional to credit the writer when the material is copied or transferred, as per the writer's request.

Page 48: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

السالم علیكم

القصه هاذي قطریه روعه أتمنى انها تعجبكم طبعا أهي قصه منقوله من منتدى ثاني

وطبعا مصرحه الكاتبه نقل القصه مع ذكر اسمها وهي الكاتبهتحفه فنیه )) القطریه ((

نبدأ .....

الكاتبة تحفة فنیة

الفصل االول :-

وضحة والتوتر بدا یظهر علیها : الجازي ماتدرین عمي متى بیجي ؟ الجازي : واهللا یختس مدري بس ماهو باطي ،اله انت وشعندس الیوم على ابوي

؟ اخبرس ما تحبین مقعاد معاه ؟توترها : سالمتس بس بغیت اسلم علیه قبل ما وضحة وهي تحاول السیطرة على یجي حمد و نروح البیت ، قدلي كم مرة اجي وال القاه عد مهب عدله من زمان

ماوجهته . الجازي وهي تغمز عینها : ماوجهتي ابوي وال تنطرین ناس ؟

خجل على طول صار وجه وضحة احمر مثل الطماطم ، والجازي اعتبرت انه وتمت تضحك على وضحة ما تدري ان سبب احمرار وضحة هو القهر وجرح

الكرامة الى تحس به من بدت تلمح عن راشد و تقول في نفسها ماتدرین یالجازي، وفي هذه اللحظة انزلت علیهم ام راشد مرت عم ان اتمنه العمى وال اشوفه

وضحة جایه من غرفتها وفي ایدها كیسه كبیره ومدته على وضحة وهي تقول :خلها توزعه كلن وضحة یمس هذي صوغتن لكم من عند راشد عطیها امس

تعطیه حقه .

An example of raw text (Qatari) from a novel

Page 49: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Gumar Corpus Statistics

| | Count |
|---|---|
| Words | 112,410,688 |
| Sentences | 9,335,224 |
| Documents | 1,236 |

•  Words are whitespace tokenized and the counts include punctuation.
•  Number of sentences represents the number of lines.
•  Each document generally represents a single novel.
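These counting conventions are simple enough to state as code. The sketch below follows the three rules above (whitespace tokens, lines as sentences, one text per document). Skipping blank lines is an added assumption, and `corpus_stats` is a hypothetical helper rather than the script used to build the corpus.

```python
def corpus_stats(documents):
    """Count words, sentences and documents using the conventions above."""
    words = sentences = 0
    for text in documents:
        # Assumption: empty lines are not counted as sentences.
        lines = [line for line in text.splitlines() if line.strip()]
        sentences += len(lines)
        # Whitespace tokenization: punctuation marks count as words.
        words += sum(len(line.split()) for line in lines)
    return {"words": words, "sentences": sentences, "documents": len(documents)}


stats = corpus_stats(["a b c .\n\nd e", "f"])
```

Because punctuation attached to a word is not split off, these counts are an upper-level estimate of corpus size rather than a linguistic word count.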

Page 50: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Gumar Corpus Dialect Distribution (Document level)

| Dialect | Percentage |
|---|---|
| SA | 60.52 |
| AE | 13.35 |
| KW | 5.91 |
| OM | 1.13 |
| QA | 0.65 |
| BH | 0.94 |
| GA (other) | 10.03 |
| Arabic (other) | 7.93 |

•  92% of the corpus is written in GA, with SA being the most dominant.
•  GA (other) covers novels that combine several GA dialects, as well as cases of dialect ambiguity (especially between OM, QA and AE).
•  The rest of the corpus (7.93%) is mostly MSA (original text or translation attempts of existing non-Arabic text) and other DA such as Egyptian, Iraqi, Levantine, etc.
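The 92% figure is the sum of every row except "Arabic (other)". A quick check of that arithmetic, using the percentages exactly as listed in the table:

```python
# Document-level percentages as listed in the table above.
SHARES = {
    "SA": 60.52, "AE": 13.35, "KW": 5.91, "OM": 1.13,
    "QA": 0.65, "BH": 0.94, "GA (other)": 10.03, "Arabic (other)": 7.93,
}

# Gulf Arabic share: every row except the non-Gulf remainder.
gulf_share = sum(v for k, v in SHARES.items() if k != "Arabic (other)")
```

`gulf_share` comes to about 92.5, which matches the rounded 92% quoted in the first bullet.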

Page 51: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Morphological Analysis Evaluation

•  A preliminary investigation into GA annotation was performed.
•  4000 words from the text were annotated manually for:
    –  Orthography (CODA)
    –  Morphology (tokenization)
    –  Part-of-speech
    –  Lemma
•  The same text was given to MADAMIRA (MSA & EGY)
    –  Outputs were then evaluated against the gold standard.

Page 52: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Gulf CODA

•  CODA: Conventional Orthography for Dialectal Arabic (Habash et al. 2012).
•  CODA guidelines exist for both EGY and PAL (Palestinian Arabic).
•  CODA guidelines for different dialects share general rules that apply to all.
•  Exceptional cases differ from one dialect to another.

Page 53: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Gulf CODA

•  One main feature that differs among dialects is the root consonant mapping rules.

| MSA/CODA | Variants | CODA compliant | CODA non-compliant |
|---|---|---|---|
| ق | /q/ or /ɡ/ or /ʤ/ | قدام | جدام |
| ك | /k/ or /ʧ/ or /ts/ | كبد، كذب | جبد، تسذب |
| ج | /ʤ/ or /j/ | جلس | يلس |
| ش | /ʃ/ or /ʧ/ | شاي | چاي |

•  General rules: spelling Al, Ta Marbuta, clitic attachment
•  Other examples of specific spelling…
    –  سيدا، مب، مانيب، +ج\+ك

Page 54: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

CODAfied text examples

Example 1
    Raw: اسمع هالحتسي منتس ياويلتس
    CODA: اسمع هالحكي منج ياويلج
    English:

Example 2
    Raw: عسى الغدى جاهز؟
    CODA: عسى الغدا جاهز؟
    English:

Example 3
    Raw: ساره منيب صغيررونه انا اللحين في الجامعه
    CODA: سارة مانيب صغيرونة انا الحين في الجامعة
    English:

Page 55: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

An Annotation Example


Page 57: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Morphological Analysis Evaluation

•  Accuracy of the automatic output of MADAMIRA, in two modes (MSA and EGY), measured against the manually annotated features
•  MADAMIRA-EGY outperforms MADAMIRA-MSA on all four features, confirming that it is the better baseline for manual annotation.
•  Similar conclusions were reported by Jarrar et al. (2014)

| Feature | MADAMIRA-MSA | MADAMIRA-EGY |
|---|---|---|
| Ortho | 83.81 | 88.34 |
| Morph | 76.16 | 83.62 |
| POS | 72.37 | 80.39 |
| Lemma | 64.03 | 81.51 |

Page 58: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Summary & Future Directions

•  Arabic dialects pose many challenges to NLP
    –  No orthographic standards
    –  Limited resources
    –  Large number of differences from MSA
•  A combination of solutions works best
    –  Exploit similarities between dialects and MSA
    –  Exploit similarities among dialects
    –  Address differences through resource building
•  Our goal is to bring basic support for MSA and the dialects to the level of English
    –  So we can focus more on higher-level applications!

Page 59: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Summary & Future Directions

Although dialect processing may seem daunting, just remember:

•  Breathe! There are rules in the dialects. Just not the same rules as the ones in MSA.
•  All these challenges are amazing opportunities to advance NLP
    –  Not just for Arabic but for all languages.
•  For Arabic native speakers, working with dialects is an eye opener (and can be a lot of fun!)

Page 60: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Announcements

•  Project MADAR
    –  Multi-Arabic Dialect Applications and Resources
    –  QNRF funded project
    –  Collaboration among CMUQ, NYUAD and Columbia
    –  Modeling 25 Arabic city dialects
        •  Lexical resources, parallel data, dialect id, dialect MT
    –  Looking for linguists and postdocs!
•  WARDAT 2016
    –  First Workshop on Arabic Dialect Technologies
    –  Discuss future of collaborations on Arabic Dialect Technologies
    –  Funded by the NYUAD Institute; to be held in NYU Abu Dhabi
    –  By invitation. Limited slots. Contact me if interested.
•  CAMeL Lab
    –  Hiring postdocs!
    –  Funded NYU PhD in Computer Science.
    –  Contact me if interested.


Page 63: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

•  http://nyuad.nyu.edu/en/


Thank You! Questions?