Arabic NLP: Challenges & Opportunities
Dr. Samir Tartir
Scientific Day
Faculty of Information
Philadelphia University
May 15th 2013
General Information
• History
– (Classical) Arabic has remained unchanged, intelligible and functional for more than fifteen centuries.
• Strategically important
– 330 million speakers living in an important region: huge oil reserves, sacred sites.
– 1.4 billion Muslims use it in their prayers.
• Cultural and literary heritage
– Closely associated with Islam
Distribution
Versions
• Classical
• Modern
• Dialects
Arabic Language Characteristics
• Highly structured
• Highly derivational language
– Morphology
• Free word order
• Modern Arabic lacks diacritics (short vowels)
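The derivational character noted above can be illustrated with a small root-and-pattern sketch. It is not taken from the talk: the function name `apply_pattern` and the choice of root and patterns are assumptions made for illustration. It uses the traditional notation in which ف, ع and ل stand for the first, second and third root consonants, and it handles only regular three-consonant roots (weak roots and patterns that contain a literal ف, ع or ل as an affix would need more care).

```python
# Traditional pattern notation: ف = 1st radical, ع = 2nd radical, ل = 3rd radical.
PLACEHOLDERS = ("ف", "ع", "ل")


def apply_pattern(root, pattern):
    """Fill the ف/ع/ل slots of `pattern` with the three consonants of `root`."""
    if len(root) != 3:
        raise ValueError("this sketch handles three-consonant roots only")
    slots = dict(zip(PLACEHOLDERS, root))
    # Letters that are not placeholders (e.g. ا, م, و) are copied unchanged.
    return "".join(slots.get(ch, ch) for ch in pattern)


if __name__ == "__main__":
    root = "كتب"  # k-t-b, the root associated with writing
    for pattern in ("فاعل", "مفعول", "مفعل", "فعال"):
        print(pattern, "->", apply_pattern(root, pattern))
```

With the root كتب, the four patterns yield the undiacritized forms for "writer", "written", "office/desk" and "book", which hints at how many surface words a single root can generate.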
Example*
*Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
Arabic Language Characteristics
• Synonymy and confusion of non-standardized terms
– Thermometer: ميزان حرارة، مقياس حرارة، محرار، محر، ترمومتر
• Technical translation
– Hydrometer: جهاز قياس كثافة السوائل (device for measuring the density of liquids)
• Uncle, parent…
Letters
• One letter, one sound
• Letters change shape
• Hamza
• No capital letters
• Can use normalization
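As a rough illustration of what normalization can mean in practice, the sketch below folds positional letter shapes, strips diacritics and the tatweel, and unifies the hamza-carrying alef variants. The function name `normalize_arabic` and its two optional flags are assumptions for this example, not part of any toolkit mentioned in the talk; which foldings are appropriate depends on the application, since some of them discard real distinctions.

```python
import re
import unicodedata

# Combining marks U+064B..U+0652 (tanween, short vowels, shadda, sukun)
# plus the dagger alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character used to stretch words

# Alef with madda, hamza above, or hamza below -> bare alef.
ALEF_FOLD = str.maketrans({"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"})


def normalize_arabic(text, fold_alef_maqsura=False, fold_ta_marbuta=False):
    """Return a normalized copy of `text` (a sketch; the foldings are lossy)."""
    # NFKC maps Arabic presentation forms (positional shapes) to base letters.
    text = unicodedata.normalize("NFKC", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = text.translate(ALEF_FOLD)
    if fold_alef_maqsura:
        text = text.replace("\u0649", "\u064A")  # alef maqsura -> yeh
    if fold_ta_marbuta:
        text = text.replace("\u0629", "\u0647")  # ta marbuta -> heh
    return text


if __name__ == "__main__":
    for raw in ("أَحْمَد", "إبراهيم", "كـتـاب"):
        print(raw, "->", normalize_arabic(raw))
```

The optional foldings are off by default because merging ى with ي or ة with ه helps matching in search but can conflate different words.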
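Ambiguity
• Homographs
– قدم ("foot", "he presented", "antiquity", depending on the missing vowels)
• Internal word structure ambiguity
– بعقوبة (the city name "Baqubah", or ب + عقوبة "with a punishment")
• Syntactic ambiguity
– قابلت مدير البنك الجديد ("I met the new bank manager": "new" may modify the manager or the bank)
• Semantic ambiguity
– يحب علي احمد اكثر من ابراهيم ("Ali likes Ahmad more than Ibrahim": more than Ibrahim does, or more than he likes Ibrahim)
• Anaphoric ambiguity
– قابل الصحفي الوزير الذي انتقده ("The journalist met the minister who criticized him": unclear who criticized whom)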
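The first two kinds of ambiguity can be reproduced with a few lines of code. The sketch below is illustrative only: the vocalized forms, the one-letter proclitic list and the two-word toy lexicon are assumptions chosen to mirror the examples قدم and بعقوبة, and a real analyzer would rely on a full lexicon and morphological rules.

```python
import re
from collections import defaultdict

DIACRITICS = re.compile("[\u064B-\u0652]")  # tanween, short vowels, shadda, sukun


def strip_diacritics(word):
    """Remove short-vowel marks, as ordinary modern text does."""
    return DIACRITICS.sub("", word)


# Vocalized forms with rough English glosses (illustrative data).
VOCALIZED = {
    "قَدَم": "foot",
    "قَدَّمَ": "he presented",
    "قِدَم": "antiquity",
}


def group_homographs(vocalized):
    """Group glosses by the undiacritized spelling they collapse to."""
    groups = defaultdict(list)
    for form, gloss in vocalized.items():
        groups[strip_diacritics(form)].append(gloss)
    return dict(groups)


# Hypothetical, deliberately tiny resources for the segmentation example.
PROCLITICS = ("ب", "و", "ل", "ك", "ف")
LEXICON = {"بعقوبة", "عقوبة"}


def candidate_analyses(word):
    """List every reading of `word` that the toy lexicon licenses."""
    analyses = []
    if word in LEXICON:
        analyses.append((word,))  # whole word, e.g. a place name
    if word[:1] in PROCLITICS and word[1:] in LEXICON:
        analyses.append((word[0], word[1:]))  # proclitic + stem
    return analyses


if __name__ == "__main__":
    print(group_homographs(VOCALIZED))
    print(candidate_analyses("بعقوبة"))
```

Both functions return more than one reading for a single written form, which is the point: choosing among them requires context that the spelling alone does not carry.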
NLP
• Automatic summarization
• Machine translation
• Named entity recognition (NER)
• Natural language generation
• Natural language understanding
• Optical character recognition (OCR)
• Question answering
• Sentiment analysis
• Speech recognition
• Word sense disambiguation
• Information retrieval (IR)
• Speech processing
• Text-to-speech
• Natural language search
• Automated essay scoring
• etc.
Question Answering**
**Hammo et al., QARAB: A Question Answering System to Support the Arabic Language, Workshop on Computational Approaches to Semitic Languages, ACL 2002
Arabic NLP Issues
• Lack of tools
• Lack of linguistic references
• Lack of training data
Available Tools
• Arabic Treebank
• Arabic WordNet
– MySQL database
– SUMO Ontology
– Java
• Microsoft Arabic Toolkit (ATK)
Summary
• Arabic is difficult to deal with
• Progress has been made
• More work is being done on different parts
• Any progress is valuable
– Business
– Personal
– Governmental
Thank you