Towards A New Arabic Corpus of Dyslexic Texts
Maha Alamri and William John Teahan
School of Computer Science, Bangor University.
THE 2ND WORKSHOP ON ARABIC CORPORA AND PROCESSING TOOLS LREC2016
Outline
Introduction.
Arabic Corpus of Dyslexic Texts.
Towards Automatic Correction of Dyslexic Errors.
Conclusion.
Introduction
The focus of this presentation is the creation of a new Arabic corpus of texts written by dyslexics, and of software for the automatic spelling correction of such texts.
Dyslexia:
The term has its roots in the Greek ‘dys-’, meaning difficulty with, and ‘-lexia’, which means language or word.
An inability to master the use of written language, including issues with comprehension.
1 in 10 people have dyslexia.
The main area of interest lies in the zone of convergence, represented by the overlap area in the illustration:
Dyslexia Arabic Corpus
Automatic spelling correction
The term denotes the way in which a program identifies a misspelled word and then alters it to its correct form.
Spelling Errors
Common Spelling Errors (Damerau, 1964):
Additional letters, e.g. unniverse.
Omitted letters, e.g. univrse.
Substituted letters, e.g. umiverse.
Swapped letters, e.g. uinverse.
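The four error types above can be told apart mechanically when a misspelling differs from the intended word by a single edit. The following is a minimal sketch, not part of the original presentation; the function name `classify_single_edit` and its return labels are illustrative assumptions.

```python
def classify_single_edit(misspelled: str, correct: str) -> str:
    """Name the single-edit error type that turns `correct` into `misspelled`.

    Returns "addition", "omission", "substitution", "transposition",
    "none" (identical words) or "other" (more than one edit).
    """
    m, c = misspelled, correct
    if m == c:
        return "none"
    # Skip the shared prefix to find the first position where the words differ.
    i = 0
    while i < min(len(m), len(c)) and m[i] == c[i]:
        i += 1
    # One extra letter in the misspelling.
    if len(m) == len(c) + 1 and m[:i] + m[i + 1:] == c:
        return "addition"
    # One letter missing from the misspelling.
    if len(m) + 1 == len(c) and c[:i] + c[i + 1:] == m:
        return "omission"
    if len(m) == len(c):
        # Exactly one letter replaced by another.
        if m[i + 1:] == c[i + 1:]:
            return "substitution"
        # Two adjacent letters swapped.
        if (i + 1 < len(m) and m[i] == c[i + 1] and m[i + 1] == c[i]
                and m[i + 2:] == c[i + 2:]):
            return "transposition"
    return "other"


for wrong in ["unniverse", "univrse", "umiverse", "uinverse"]:
    print(wrong, "->", classify_single_edit(wrong, "universe"))
```

Errors that combine several edits fall into "other"; handling those would need a full edit-distance alignment rather than this prefix comparison.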
Dyslexia Spelling Errors
Words that contain silent letters (e.g. knife).
Morphemes that change when affixes are added (e.g. explain – explanation).
Dyslexic writers struggle with the relationship between the sound of a word and how it is spelt.
The inability to retain orthographic symbols in memory makes it difficult for dyslexics to remember the right order of letters in a word.
Spelling errors by Arabic writers with dyslexia
Phonetic errors.
Irregular spelling rules.
Word omission.
Hamza.
Long vowel.
Exchanging consonants.
Difficulty in writing the letters in the correct shape.
Arabic words are spelt according to how the writer hears them in the local spoken dialect.
Arabic Corpus of Dyslexic Texts
The rate of misspellings is noticeably higher in texts written by children. Therefore, the texts were collected from female primary school students who had been professionally diagnosed with dyslexia and were being taught in resource rooms.
BDAC information:
Text: Writing exercises (Homework).
Size: 1067 words containing 694 errors.
Year: 2013.
Language: Arabic.
Country of production: Saudi Arabia (Riyadh).
The Bangor Dyslexic Arabic Corpus (BDAC) is a preliminary version, which aims to investigate the possibility of a corpus being used as an aid for Arabic dyslexic writers.
Example Dyslexic Text
Screenshot of a scanned image of one of the texts written by a dyslexic female child (nine years old).
This example includes basic errors as below:
Analysis of the BDAC errors
3. Substitution (47 times), commonly found in: replacing (Heh - ه) with (Teh Marbuta - ة) or vice versa; changing (Heh - ه) or (Teh Marbuta - ة) to the letter (Teh - ت) or vice versa; and exchanging the letter (Dad - ض) with (Zah - ظ) or vice versa.
4. Transposition (19 times).
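The substitution patterns listed above can be turned into a simple candidate generator for a spelling corrector. The sketch below is an illustration only, not the method used in the presentation: the `CONFUSABLE` table and the function `substitution_candidates` are assumed names, and the table covers only the letter pairs named on this slide. Letters are written as Unicode escapes so the code is not affected by right-to-left rendering.

```python
HEH = "\u0647"          # ه
TEH_MARBUTA = "\u0629"  # ة
TEH = "\u062a"          # ت
DAD = "\u0636"          # ض
ZAH = "\u0638"          # ظ

# Hypothetical confusion table built from the substitutions listed above.
CONFUSABLE = {
    HEH: [TEH_MARBUTA, TEH],
    TEH_MARBUTA: [HEH, TEH],
    TEH: [HEH, TEH_MARBUTA],
    DAD: [ZAH],
    ZAH: [DAD],
}


def substitution_candidates(word: str) -> list[str]:
    """Return every word obtained by replacing one letter with a confusable one."""
    candidates = []
    for i, ch in enumerate(word):
        for alt in CONFUSABLE.get(ch, []):
            candidates.append(word[:i] + alt + word[i + 1:])
    return candidates


# "school" written with a final Heh instead of Teh Marbuta.
misspelled = "\u0645\u062f\u0631\u0633\u0647"
intended = "\u0645\u062f\u0631\u0633\u0629"
print(intended in substitution_candidates(misspelled))
```

A generator like this only proposes candidates; choosing among them still needs a dictionary or a language model.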
Towards Automatic Correction of Dyslexic Errors
The main tool employed was the Text Mining Toolkit (TMT). TMT is a software package designed specifically for tasks revolving around compression-based language modelling, text categorisation and correction, and text segmentation.
The toolkit was used to correct a small number of the dyslexic errors using a method similar to the one that Alhawiti (2014) found effective for correcting errors in Arabic OCR text.
First, it was crucial to choose a large training corpus of Arabic text to train the compression-based language model created by the toolkit. After researching suitable corpora, the Bangor Arabic Compression Corpus (BACC) created by Dr. Khaled Alhawiti was chosen.
Due to the current limitations of the TMT software, correction of the dyslexic texts was applied only to one-to-one character errors, using the toolkit’s markup correction capabilities to find the most probable corrected sequence given the compression-based language model.
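The idea of choosing the most probable corrected sequence under a language model can be illustrated with a toy example. This is a rough sketch under stated assumptions and does not use TMT or its API: a smoothed character bigram model stands in for the toolkit's compression-based model, the names `CharBigramModel` and `correct_one_to_one` are hypothetical, and an English confusion pair (b/d) is used purely for readability. Each candidate is scored by its code length in bits, and the shortest encoding wins.

```python
import math
from collections import Counter
from itertools import product


class CharBigramModel:
    """Add-one-smoothed character bigram model (a stand-in, not TMT's model)."""

    def __init__(self, training_text: str, extra_chars: str = ""):
        padded = f" {training_text} "
        self.pair_counts = Counter(zip(padded, padded[1:]))
        self.context_counts = Counter(padded[:-1])
        # Vocabulary: characters seen in training plus any extra ones supplied.
        self.vocab_size = len(set(padded) | set(extra_chars))

    def codelength(self, text: str) -> float:
        """Bits needed to encode `text`; lower means more probable."""
        padded = f" {text} "
        bits = 0.0
        for prev, cur in zip(padded, padded[1:]):
            p = (self.pair_counts[(prev, cur)] + 1) / (
                self.context_counts[prev] + self.vocab_size
            )
            bits -= math.log2(p)
        return bits


def correct_one_to_one(word: str, model: CharBigramModel,
                       confusable: dict[str, list[str]]) -> str:
    """Try every one-to-one substitution and keep the cheapest encoding."""
    options = [[ch] + confusable.get(ch, []) for ch in word]
    candidates = ("".join(chars) for chars in product(*options))
    return min(candidates, key=model.codelength)


model = CharBigramModel("the dog and the dog", extra_chars="b")
print(correct_one_to_one("bog", model, {"b": ["d"], "d": ["b"]}))  # dog
```

Enumerating all combinations grows exponentially with the number of confusable letters, so a realistic system would use a dynamic-programming or beam search instead.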
Experimental Results
All errors containing more than one character were removed.
BDAC Corpus:
Text: 1067 words.
Errors: 694.
One-to-one character errors: 280.
Word: 153 errors, 99 corrected.
Sentences: 80 errors, 49 corrected.
Paragraphs: 47 errors, 39 corrected.
Total: 280 errors, 187 corrected.
The TMT software was able to correct 187 of the 280 one-to-one character errors, or about two thirds.
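The correction rates implied by these counts can be worked out directly. The short sketch below is an addition for illustration; it reads the paragraph figures as 47 errors and 39 corrected, which is the only split consistent with the reported totals of 280 and 187. The variable names are assumptions.

```python
# (errors, corrected) per text type, as reported on the slide.
results = {
    "Word": (153, 99),
    "Sentences": (80, 49),
    "Paragraphs": (47, 39),
}

total_errors = sum(errors for errors, _ in results.values())
total_corrected = sum(corrected for _, corrected in results.values())

for name, (errors, corrected) in results.items():
    print(f"{name}: {corrected}/{errors} = {corrected / errors:.1%}")

overall_rate = total_corrected / total_errors
print(f"Total: {total_corrected}/{total_errors} = {overall_rate:.1%}")
```

On these figures the rate is highest for paragraphs and lowest for sentences, although the counts are small enough that the differences should be read with caution.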
Conclusion
The corpus used in this study offers a useful platform for analysing dyslexic text.
It provides a better understanding of which errors occur and of the factors that determine them, and is therefore suitable for assisting dyslexic writers.
This corpus can serve as a platform for other researchers to build upon.
A preliminary investigation was undertaken into using automatic processing techniques as a form of assistance for Arabic dyslexic writers, and some initial success was achieved in the automatic correction of dyslexic errors in Arabic text.
In future work, extending the corpus to include more text for analysis will require considerably more resources and effort.
Thank you. Any questions?