readability assessment and text simplification for basque
TRANSCRIPT
Ixa Group
Basque language
Readability Assessment
Text Simplification
Current and near future work
Readability Assessment and Text Simplification for Basque in the Ixa Group
Itziar Gonzalez-Dios
Supervisors: María Jesús Aranzabe and Arantza Díaz de Ilarraza
IXA NLP Group, University of the Basque Country (UPV/EHU)
ixa.eus/Ixa
@IxaGroup
Pisa, 2015
Itziar Gonzalez-Dios Readability Assessment and Text Simplification for Basque 1/38
Outline
1 Ixa Group
2 Basque language
3 Readability Assessment
4 Text Simplification
  Lexical Simplification
  Syntactic Simplification
5 Current and near future work
Ixa Group
Research group at the University of the Basque Country (UPV/EHU)
Since 1988
64 members
10 subgroups
Computer Science Faculty of Donostia-San Sebastian
Our philosophy
Bottom-up conception (progressive development)
Reuse of resources and tools
Open source: Ixa pipes http://ixa2.si.ehu.es/ixa-pipes/
Research lines
Creation of basic resources (linguistic resources and processors):
  Corpora, dictionaries, ontologies
  Computational lexicography, morphology, syntax, semantics, pragmatics and discourse
Operational aspects (integration of language tools):
  Corpus processing
  Parallel processing
  Corpus annotation
Language technology applications:
  Information extraction and question answering
  Machine translation
  Language teaching/learning
Research lines
Projects (now):
  European: 4
  National: 4
  Regional: 3
PhD theses:
  In progress: 19
  Done: 38
Languages:
  Mainly Basque
  English, Spanish
  Quechua
Basque language
Origin:
  Pre-Indo-European language
  Language isolate
Today, 5 dialects + standard (+ 2 almost lost, + another one documented)
Geographical domain: (map not reproduced)
Sociolinguistic info
800,000 native speakers; 1 million understand or speak it to some degree
Official: Araba, Bizkaia and Gipuzkoa (the Autonomous Community of the Basque Country); the north of Navarre
Not official: Lapurdi, Behe-Nafarroa and Zuberoa (together with Béarn, they form Pyrénées-Atlantiques); Navarre
Typology
Agglutinative; Case system: ergative-absolutive; 18 case endings
Head final; Free word order at sentence level
6 vowels, 25 consonants
Example sentences
(1) Mutilak  sagarraø   jan      du.
    boy-erg  apple-abs  eat-prf  aux-3sgerg3sgabs.prs.ind
    ‘The boy has eaten an apple.’

(2) Sagardoaø  dastatzeko     prestø     dagoenean                  irekitzen  dira
    cider-abs  taste-ven.adn  ready-abs  stare-3sgabs.prs.comp.loc  open-ipf   be-3plabs.prs

    sagardotegiakø,      normalean   urtarrilaren  20tik   Aste Santura.
    cider-house-pl.abs,  normal-loc  january-gen   20-abl  Easter-adl

    ‘Cider houses open when the cider is ready to taste, usually from the 20th of January to Easter.’
Readability Assessment
IAS: essay scoring system
ErreXail: simple vs. complex
Ion Madrazo’s work: B1, B2, C1, C2
IAS
Idazlanen Autoebaloaziorako Sistema (IAS), a system for the auto-evaluation of essays (Castro-Castro et al., 2008)
Features:
  Number of clauses in a sentence
  Types of sentences (questions, negations...)
  Clause types (temporal, causal...)
  PoS types
  Number of lemmas
ErreXail
Readability assessment system (Gonzalez-Dios et al., 2014)
  Measures 96 ratios based on linguistic information
  Uses machine learning techniques
  Two corpora of popular science texts were collected: one for adults and one for children
ErreXail: Linguistic features
Global features: sentence length, word length, sentence number (3 ratios)
Lexical features: PoS, lemmas, named entities... (39 ratios)
Morphological features: case markers, verb types, verb morphology... (24 ratios)
Morphosyntactic features: noun phrases, verb phrases, appositions (5 ratios)
Syntactic features: subordinate clauses (10 ratios)
Pragmatic features: types of connectors and conjunctions (12 ratios)
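To make the idea of ratio-based features concrete, here is a minimal sketch of how a handful of such ratios could be computed from PoS-tagged text. It is not ErreXail's feature extractor: the function name `ratio_features`, the tag names (`PROPER_NOUN`, `COMMON_NOUN`, ...) and the lemmas in the toy input are assumptions made for illustration.

```python
from collections import Counter

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def ratio_features(sentences):
    """Compute a few illustrative ratios from PoS-tagged sentences.

    `sentences` is a list of sentences; each sentence is a list of
    (word, lemma, tag) triples. The tag names are hypothetical.
    """
    tokens = [tok for sent in sentences for tok in sent]
    tags = Counter(tag for _, _, tag in tokens)
    lemmas = [lemma for _, lemma, _ in tokens]
    return {
        # Global features
        "avg_sentence_length": safe_ratio(len(tokens), len(sentences)),
        "avg_word_length": safe_ratio(sum(len(w) for w, _, _ in tokens), len(tokens)),
        # Lexical features
        "proper_per_common_nouns": safe_ratio(tags["PROPER_NOUN"], tags["COMMON_NOUN"]),
        "unique_lemmas_per_lemmas": safe_ratio(len(set(lemmas)), len(lemmas)),
    }

# Toy input; the analyses are hand-written for illustration only.
example = [
    [("Mutilak", "mutil", "COMMON_NOUN"), ("sagarra", "sagar", "COMMON_NOUN"),
     ("jan", "jan", "VERB"), ("du", "ukan", "AUX")],
    [("Beasain", "Beasain", "PROPER_NOUN"), ("Gipuzkoan", "Gipuzkoa", "PROPER_NOUN"),
     ("dago", "egon", "VERB")],
]
features = ratio_features(example)
```

Each document then becomes a fixed-length vector of such ratios, which is the form a machine learning classifier expects.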
ErreXail: Classification results
Experiments                   Accuracy (%)
All features                  89.50
Lexical features              90.75
Lex+Morph+Morph-synt+Syntax   93.50
Table: Classification results with SMO and 10-fold cross-validation
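The evaluation protocol named in the caption, 10-fold cross-validation, can be sketched as follows. This illustrates the protocol only: the classifier here is a stand-in one-feature threshold rule, not SMO (sequential minimal optimization, an SVM training algorithm), and the data are invented.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validated_accuracy(samples, labels, k, train_fn, predict_fn):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for train, test in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_fn(model, samples[i]) == labels[i] for i in test)
    return 100.0 * correct / len(samples)

# Stand-in classifier: threshold halfway between the two class means.
def train_threshold(values, labels):
    simple = [v for v, lab in zip(values, labels) if lab == "simple"]
    complex_ = [v for v, lab in zip(values, labels) if lab == "complex"]
    return (sum(simple) / len(simple) + sum(complex_) / len(complex_)) / 2

def predict_threshold(threshold, value):
    return "complex" if value > threshold else "simple"

# Invented values of one hypothetical ratio for 20 documents.
values = [0.1 + 0.01 * i if i % 2 == 0 else 0.9 - 0.01 * i for i in range(20)]
labels = ["simple" if i % 2 == 0 else "complex" for i in range(20)]
accuracy = cross_validated_accuracy(values, labels, 10, train_threshold, predict_threshold)
```

Every document is used for testing exactly once, so the reported accuracy is computed over the whole corpus without testing on training data.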
ErreXail: Most predictive features
Features and groups                                             Relevance (InfoGain)
Proper nouns / common nouns ratio (Lex.)                        0.2744
Appositions / noun phrases ratio (Morpho-synt.)                 0.2529
Appositions / all phrases ratio (Morpho-synt.)                  0.2529
Named entities / common nouns ratio (Lex.)                      0.2436
Unique lemmas / all the lemmas ratio (Lex.)                     0.2394
Acronyms / all the words ratio (Lex.)                           0.2376
Causative verbs / all the verbs ratio (Lex.)                    0.2099
Modal-temporal clauses / subordinate clauses ratio (Synt.)      0.2056
Destinative case endings / all the case endings ratio (Morph.)  0.1968
Connectors of clarification / all the connectors ratio (Prag.)  0.1957
Table: Most predictive features
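Information gain measures how much knowing a feature reduces the entropy of the class labels. The sketch below computes it for one numeric feature split at a single threshold. How the ratios were discretised in the original experiments is not stated here, so the fixed threshold, the function names and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting a numeric feature at `threshold`."""
    low = [lab for v, lab in zip(values, labels) if v <= threshold]
    high = [lab for v, lab in zip(values, labels) if v > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder

# Invented values of a hypothetical ratio for four documents.
values = [0.05, 0.10, 0.40, 0.50]
labels = ["simple", "simple", "complex", "complex"]
perfect_split = information_gain(values, labels, 0.25)
poor_split = information_gain(values, labels, 0.07)
```

A threshold that separates the classes perfectly recovers the full label entropy, while a poorly placed one recovers only part of it; ranking features by this score gives a table like the one above.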
Ion Madrazo’s master’s thesis (2014)
More linguistic features:
  Dependencies
  Depth of the syntactic tree
  N-grams at PoS and dependency level
  Use of synonyms
  Latent Semantic Analysis
Other ML techniques:
  Algorithms to choose the features (Information Gain and Correlation Feature Selection)
  Meta-algorithms for classification (Ordinal Classification and Cost-Sensitive Learning)
Ion Madrazo’s master’s thesis: Results
Most significant features for each level (B1, B2, C1, C2)
Best results with multinomial Naive Bayes: 61.69% accuracy
State-of-the-art results
Similar results with meta-algorithms
Lexical simplification
Begun in 2015
Maria Eguimendia’s work for her master’s thesis
Resources:
  A lemma frequency list from the Lexikoaren Behatokia corpus (41,773,391 words)
  Basque WordNet
  UKB (Word Sense Disambiguation)
  NAF as input (multilingual)
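The slides list the resources but not the algorithm. One common way to combine a synonym inventory with a frequency list is to replace a lemma with its most frequent synonym, sketched below. The dictionaries stand in for Basque WordNet synsets (after sense disambiguation) and for the lemma frequency list. The function name, the synonym set and all counts are invented for illustration.

```python
def simplify_lemma(lemma, synonyms, frequency):
    """Return the most frequent candidate among the lemma and its synonyms.

    `synonyms` maps a lemma to a set of same-sense synonyms and
    `frequency` maps a lemma to a corpus count; both are toy stand-ins.
    """
    candidates = synonyms.get(lemma, set()) | {lemma}
    # Sorting first makes ties resolve deterministically (alphabetical order).
    return max(sorted(candidates), key=lambda cand: frequency.get(cand, 0))

# Hypothetical data: three words for "car" with made-up counts.
synonyms = {"beribil": {"auto", "automobil"}}
frequency = {"auto": 900, "automobil": 300, "beribil": 12, "etxe": 5000}

replacement = simplify_lemma("beribil", synonyms, frequency)
unchanged = simplify_lemma("etxe", synonyms, frequency)
```

A lemma with no listed synonyms is returned as is, so the substitution never removes a word it cannot replace.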
Syntactic Simplification
Begun in 2011
Two main lines:
  Linguistic analysis of complex sentences to propose simplification rules
  Developing or adapting the tools to perform the automatic simplification (architecture of the EuTS system)
Linguistic analysis: Resources
Corpora:
  Reference Corpus for the Processing of Basque
  Consumer Corpus (used in Machine Translation)
  Wikipedia
  Elhuyar Corpus (popular science)
Grammar:
  Descriptive Grammar of Basque by Euskaltzaindia (Academy)
Linguistic analysis: Tasks
Analysis of each clause type to propose simplification rules
Define a simplification process
Analysis of the frequency and position of each adverbial structure found in the grammar
Check if the proposed rules are also valid in other domains
Linguistic analysis: Simplification process
Splitting: Make as many new sentences as there are clauses in the original
Reconstruction: Two operations take place:
  Removing morphological features that are no longer needed
  Adding adverbs or phrases to maintain the meaning
Reordering: Reorder the elements in the new sentences, and order the sentences in the text
Correction: Correct possible grammar and spelling mistakes, and fix punctuation and capitalisation
Example
Simplification proposal of concessive clauses
(3) a. Hasiera batean aste honetan partidurik ez jokatzea aurreikusita zegoen arren, azken orduan ostiralean partidu bat jokatu nahi izan du Athleticek. (Although it was not foreseen to play a match this week, at the last moment Athletic Bilbao has decided to play one on Friday.)
    b. i. Hasiera batean aste honetan partidarik ez jokatzea aurreikusita zegoen. (It was not foreseen to play a match this week.)
       ii. Hala ere, azken orduan ostiralean partida bat jokatu nahi izan du Athleticek. (However, at the last moment Athletic Bilbao has decided to play one on Friday.)
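The splitting and reconstruction steps for this concessive example can be imitated with a toy string rule: cut at the concessive marker `arren`, drop it, and open the second sentence with the connector `Hala ere` ('however'). This is only a surface-level sketch. An actual system would work on syntactic analyses rather than raw strings, and the pattern and function name below are assumptions.

```python
import re

# Toy pattern: "<subordinate clause> arren, <main clause>"
CONCESSIVE = re.compile(r"^(?P<subordinate>.+?) arren, (?P<main>.+)$")

def split_concessive(sentence):
    """Split a sentence with an 'arren' concessive clause into two sentences."""
    match = CONCESSIVE.match(sentence)
    if match is None:
        return [sentence]  # no concessive clause found: leave untouched
    # Splitting + reconstruction: drop the marker, close the first sentence.
    first = match.group("subordinate") + "."
    # Reconstruction: add a connector that preserves the concessive meaning.
    second = "Hala ere, " + match.group("main")
    return [first, second]

original = ("Hasiera batean aste honetan partidurik ez jokatzea aurreikusita "
            "zegoen arren, azken orduan ostiralean partidu bat jokatu nahi "
            "izan du Athleticek.")
simplified = split_concessive(original)
```

The sketch covers splitting and reconstruction only. The change from `partidu` to `partida` seen in (3b) belongs to the correction step and is not reproduced.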
Linguistic analysis: Simplification levels
1 Syntactic Substitution Simplification (SSS): frequency-based simplification of syntactic structures
2 Natural Simplification (NS): compound and complex sentences with finite verbs are simplified following the simplification process, together with the SSS
3 Strong or Absolute Simplification (AS): everything is simplified (finite and non-finite verbs + SSS)
4 Tailored or Customised Simplification (CS): only the needed or required phenomena are simplified
Syntactic Substitution Simplification (SSS)
-tzearren ‘(in order) to’: 1.86% (15 instances)
-tzeko ‘(in order) to’: 88.38% (791 instances)
SSS of non-finite purpose clauses
(4) a. Abuztuaren amaieran beste goi bilera bat egitea aztertzen ari dira Israel eta PAN Palestinako Aginte Nazionala, Ekialde Erdiko bake prozesua suspertzearren. (Israel and the PNA, the Palestinian National Authority, are considering holding another summit at the end of August to promote the peace process in the Middle East.)
    b. Abuztuaren amaieran beste goi bilera bat egitea aztertzen ari dira Israel eta PAN Palestinako Aginte Nazionala, Ekialde Erdiko bake prozesua suspertzeko.
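The frequency-based substitution can be sketched as follows: among equivalent endings, pick the one with the most corpus instances and rewrite words carrying a rarer one. The instance counts are those given above. Matching on surface suffixes is a deliberate simplification, since a real implementation would rely on morphological analysis, and the function names are assumptions.

```python
# Instance counts of two equivalent purpose-clause endings (from the slide).
PURPOSE_SUFFIX_COUNTS = {"tzearren": 15, "tzeko": 791}

def substitute_suffix(word, suffix_counts):
    """Replace a rarer suffix on `word` with the most frequent equivalent."""
    target = max(suffix_counts, key=suffix_counts.get)
    for suffix in suffix_counts:
        if suffix != target and word.endswith(suffix):
            return word[: -len(suffix)] + target
    return word

def apply_sss(sentence, suffix_counts):
    """Apply the suffix substitution to every whitespace-separated token."""
    out = []
    for token in sentence.split():
        core = token.rstrip(".,;:!?")   # keep trailing punctuation aside
        trailing = token[len(core):]
        out.append(substitute_suffix(core, suffix_counts) + trailing)
    return " ".join(out)

result = apply_sss("Ekialde Erdiko bake prozesua suspertzearren.",
                   PURPOSE_SUFFIX_COUNTS)
```

Because the target is chosen from the counts, adding further equivalent endings with their frequencies extends the rule without changing the code.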
Architecture of the EuTS system
Architecture of the EuTS system: Developed or adapted tools
Improved the clause boundary detection grammar
Developed an apposition detector
Developed a readability assessment system, ErreXail
Implemented a splitting algorithm (and reconstruction for relative clauses)
Developed Biografix, a multilingual tool that simplifies biographical data
Syntactic Substitution Simplification (SSS)
Biografix: Example
Living people (original)
(5) Karlos Argiñano Urkiola, nazioartean Karlos Arguiñano grafiaz ezagunagoa, (Beasain, Gipuzkoa, 1948ko irailaren 6a) sukaldari, aktore eta enpresaburu euskalduna da.
    ‘Karlos Argiñano Urkiola, internationally known with the Karlos Arguiñano spelling, (Beasain, Gipuzkoa, 6th September, 1948) is a Basque chef, actor and businessman.’
Biografix: Example
Living people (simplified)
(6) a. Karlos Argiñano Urkiola, nazioartean Karlos Arguiñano grafiaz ezagunagoa, sukaldari, aktore eta enpresaburu euskalduna da.
       ‘Karlos Argiñano Urkiola, internationally known with the Karlos Arguiñano spelling, is a Basque chef, actor and businessman.’
    b. Karlos Argiñano 1948ko irailaren 6an Beasainen jaio zen.
       ‘Karlos Argiñano was born on the 6th of September, 1948 in Beasain.’
    c. Beasain Gipuzkoan dago.
       ‘Beasain is in Gipuzkoa.’
Available at:
  http://ixa.si.ehu.es/Ixa/Produktuak/1403535629
  https://github.com/itziargd/Biografix
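The first step of the example, pulling the parenthetical birth data out of the main sentence, can be approximated with a regular expression. This is a hedged sketch and not Biografix's code: the pattern assumes the exact "(place, region, date)" layout of the example, and the function name is hypothetical.

```python
import re

# Assumed layout of the parenthetical: "(place, region, date)".
PARENTHETICAL = re.compile(
    r"\s*\((?P<place>[^,()]+), (?P<region>[^,()]+), (?P<date>[^()]+)\)"
)

def extract_biodata(sentence):
    """Remove the biographical parenthetical and return it as structured data."""
    match = PARENTHETICAL.search(sentence)
    if match is None:
        return sentence, None
    simplified = sentence[: match.start()] + sentence[match.end():]
    return simplified, match.groupdict()

original = ("Karlos Argiñano Urkiola, nazioartean Karlos Arguiñano grafiaz "
            "ezagunagoa, (Beasain, Gipuzkoa, 1948ko irailaren 6a) sukaldari, "
            "aktore eta enpresaburu euskalduna da.")
simplified, data = extract_biodata(original)
```

Turning the extracted fields into new sentences such as (6b) and (6c) requires inflected forms (`Beasainen`, `Gipuzkoan`, `6an`), which needs a morphological generator and is left out of this sketch.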
Evaluation
Extrinsic and manual evaluation of Biografix
Manual evaluation of SSS
Planned evaluations:
  Compare our rules to various approaches to simplification (Corpus of Simplified Text)
  Extrinsic evaluation through machine translation (which translator?)
  Comprehension tests (crowdsourcing platforms)
Corpus of Simplified Text
First phase:
  3 popular science texts (medicine, technology and history)
  3 annotators (different backgrounds):
    A court translator with no background in simplification
    A teacher of Basque as a foreign language
    A philosopher/writer who writes literature in easy Basque (intuitive)
  Which operations do they perform?
  Do they perform operations in common?
  Are those operations similar to ours?
Second phase: other domains
Current and near future work
Implementation of EuTS:
  Adaptation of the analysis output for the morphology generator
  Formalisation of the rules written after the linguistic analysis
Waiting for the annotators of the Corpus of Simplified Text; then, analysis of their operations
Exploring the other evaluation possibilities