
Elena Volodina, elena.volodina@svenska.gu.se

Intelligent Computer-Assisted Language Learning

Self-presentation: Elena Volodina

● 1998 PhD in Linguistics (Moscow, Russia)
● 2008 MA in Language Technologies (Gothenburg, Sweden)
● 2010 - ... Research Engineer (Språkbanken) → 2017 - ... Researcher (SB-Text)
● Lärka development
● ICALL research
● Second language resources and algorithms
● L2 Swedish infrastructure
● L2 profiles
● ...

“Teachers never tell you their first names because they don't want you to Google them”

https://spraakbanken.gu.se/eng/personal/elena


Focus on literacy
● Dutch study: → average reading comprehension ~B1 level

Velleman, E., van der Geest, T. 2014. Online test tool to determine the CEFR reading comprehension level of text. Procedia Computer Science 27.


Literacy: Sweden
● PIAAC study focusing on literacy
● Sweden among the 5 “best” of 23 countries on average
● Largest discrepancy between native-born and non-native-born citizens
● → high unemployment rate
● → higher risk of deteriorated health

OECD. 2013. OECD Skills Outlook 2013: First Results from the Survey of Adult Skills.
PIAAC. 2013. Survey of Adult Skills (PIAAC).
SCB. 2013. Tema utbildning, rapport 2013:2, Den internationella undersökningen av vuxnas färdigheter. Statistiska centralbyrån.


Societal need

2015: out of 9.9 million citizens, 2.2 million have a foreign background, i.e. 22.2% (Statistiska centralbyrån)


What can we do? Cause versus symptoms

Natural Language Processing (+ technical competence) and Computer-Assisted Language Learning (+ pedagogical competence):

NLP + CALL = ICALL

ICALL development cycle
1. Defining target group
2. Defining language skill
3. Developing resources
4. Developing tools & algorithms
5. Developing prototype
6. Evaluating prototype
7. Maintenance

ICALL development cycle: 1. Defining target group
● Adults vs kids
● Healthy vs special needs

ICALL development cycle: 2. Defining language skill
● Writing, speaking, reading, listening, vocabulary, grammar…

ICALL development cycle (steps annotated)
● Research
● Not research?
● Not research

Why do we need resources (data)?

L2 exercises: context-free, understandable, level-appropriate sentences
https://spraakbanken.gu.se/larkalabb/infl-mc
● Sentence selection need
● Target vocabulary and grammar need
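To make the sentence selection need concrete, here is a naive sketch (invented thresholds and pronoun list, not the project's actual criteria) of filtering corpus sentences down to candidates that are short and do not open with a context-dependent pronoun:

```python
# Naive sentence-selection filter: keep sentences that are short and do not
# start with a third-person pronoun pointing outside the sentence.
# Thresholds and the pronoun list are invented; HitEx (later slides)
# does this with a full feature framework.
PRONOUNS = {"han", "hon", "den", "det", "de"}  # Swedish 3rd-person openers

def is_candidate(sentence: str, max_tokens: int = 15) -> bool:
    tokens = sentence.rstrip(".!?").lower().split()
    if not tokens or len(tokens) > max_tokens:
        return False
    return tokens[0] not in PRONOUNS  # opener likely needs prior context

print(is_candidate("Han sa att det var bra."))           # False
print(is_candidate("Stockholm är Sveriges huvudstad."))  # True
```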

• Vocabulary exercise (L2)
• Inflection exercise (L2)
• Bundled gaps (L2)
• Word-based exercises (L2: e.g., listening)
• …
• Exercises for students of linguistics (L1)

→ Corpus of course books

Produced by experts FOR L2 learners
→ reading comprehension texts
→ exercises
→ recordings of listening excerpts

COCTAILL corpus

Elena Volodina, Ildikó Pilán, Stian Rødven Eide and Hannes Heidarsson 2014. You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language. NEALT Proceedings Series 22

COCTAILL

COCTAILL ingredients

How it looks: text topics

https://spraakbanken.gu.se/larkalabb/editor

COCTAILL qualitative explorations: topics across levels

COCTAILL explorations: target skills across levels

From COCTAILL to a graded L2 receptive vocabulary: SVALex

Thomas François, Elena Volodina, Ildikó Pilán, Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. Proceedings of LREC 2016, Slovenia.

http://cental.uclouvain.be/cefrlex/svalex/

From course books to automatic CEFR level assessment in texts & sentences: HitEx

Ildikó Pilán, Elena Volodina, Lars Borin. 2016. Candidate sentence selection for language learning exercises: from a comprehensive framework to an empirical evaluation. TAL Journal, special issue on NLP for Learning and Teaching, Volume 57, Number 3.

HitEx

Machine learning(supervised training)

features trainedclassifier

POStags lexicons

Readabilitystudies

dependencyrelations

80% correct (texts)63% correct (sentences)

Course books (training data)

Languagelearningresources

Features

https://spraakbanken.gu.se/larkalabb/hitex
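A minimal sketch of the supervised setup described above, assuming scikit-learn; the surface features and training sentences are toy stand-ins for the POS, lexicon, and dependency features HitEx actually extracts:

```python
# Supervised CEFR-level classification in the spirit of HitEx: sentences
# from graded coursebooks become feature vectors for a standard classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(sentence: str) -> dict:
    tokens = sentence.split()
    return {
        "n_tokens": len(tokens),                        # sentence length
        "avg_word_len": sum(map(len, tokens)) / len(tokens),
        "long_words": sum(len(t) > 6 for t in tokens),  # LIX-style signal
    }

# Toy (sentence, CEFR label) pairs standing in for coursebook data.
train = [
    ("Jag heter Anna.", "A1"),
    ("Han bor i en liten stad nära havet.", "A2"),
    ("Regeringen beslutade att utreda frågan ytterligare.", "B2"),
]
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit([features(s) for s, _ in train], [label for _, label in train])
print(clf.predict([features("Hon läser en bok.")]))
```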

From SVALex & text classification experiments to text evaluation: TextEval
• https://spraakbanken.gu.se/larkalabb/texteval
• Text analysis platform
• Assessment of learner-written language and expert-written texts
• CEFR level (machine learning)
• Highlighting vocabulary by CEFR level (based on graded word lists)
• Out-of-vocabulary items are a challenge

Ildikó Pilán, Elena Volodina and David Alfter. 2016. Coursebook texts as a helping hand for classifying linguistic complexity in language learners' writings. Proceedings of the workshop on Computational Linguistics for Linguistic Complexity (CL4LC), COLING 2016, Osaka, Japan.

https://spraakbanken.gu.se/larkalabb/texteval
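A minimal sketch of the graded-list highlighting idea, with an invented four-word lexicon standing in for SVALex; anything missing from the list surfaces as the out-of-vocabulary challenge mentioned above:

```python
# Tag each token with its CEFR level from a graded word list (SVALex-like);
# the mapping below is invented for illustration.
GRADED_LEXICON = {"hund": "A1", "bo": "A1", "regering": "B1", "utreda": "B2"}

def highlight(tokens):
    for token in tokens:
        yield (token, GRADED_LEXICON.get(token.lower(), "OOV"))

print(list(highlight(["Regering", "utreda", "hund", "dessförinnan"])))
# [('Regering', 'B1'), ('utreda', 'B2'), ('hund', 'A1'), ('dessförinnan', 'OOV')]
```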

Example text (translated from Swedish)

• June 6 was named Sweden's official national day only in 1983, and became a public holiday in 2005. Before 1983 the day was known as Swedish Flag Day, but it had been celebrated as an unofficial national day since 1916; before that it was known as Gustafsdagen (“Gustaf's Day”). The main reason for the celebration is that the then 27-year-old Gustav Vasa was elected King of Sweden on June 6, 1523, whereupon the Kalmar Union was dissolved and Sweden became independent. The Instruments of Government of 1809 and 1974, both signed on June 6, are also given as reasons to observe the day. A common stereotype is that Swedes' national day celebrations are neither particularly extensive nor patriotic, usually in comparison with the Norwegians' May 17 celebrations. Jonas Engman, an expert on questions of tradition at Nordiska museet, believes that the Norwegians are rather the exception. “We readily look at Norway and ask ourselves why we don't do as the Norwegians do. But Norway is probably more unusual in its celebration, compared with the rest of the Nordic countries. They have been through war, but so have the Finns and the Danes. National identity probably played a large role during the dissolution of the union with Sweden,” Engman tells TT. A divided day. Jonas Engman points out that the Swedish attitude to June 6 is marked by a certain ambivalence. “Among other things, the labor movement, which emphasized international solidarity over national patriotism, has been influential here. Nationalism also broke through late in our country, in the late 1800s, whereas many of the other European nation states came into being right after the Napoleonic Wars. We are proud of Sweden in different ways; there is a lot of patriotism in sports, for example,” he says. Correction: an earlier version of the text gave the wrong number of years since the first official celebration.

SweLL pilot

Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia.

SweLL pilot
[Charts: essay distributions by topic, L1, and age; non-lemmatized items]

From SweLL-pilot to productive vocabulary: SweLLex

Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, Thomas François. 2016. SweLLex: second language learners' productive vocabulary. Proceedings of the workshop on NLP4CALL&LA. NEALT Proceedings Series / LiUP.

From SweLLex and SVALex to level classification of new vocabulary: Siwoco

• https://spraakbanken.gu.se/larkalabb/siwoco
• Automatic prediction of single-word lexical complexity
• SVM, logistic regression, MLP classifiers
• Features: word length, syllables, suffix length, gender, homonymy, polysemy, compounds, n-grams, topic distribution
• Validation through crowdsourcing

David Alfter, Elena Volodina. 2018. Towards Single Word Lexical Complexity Prediction. Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, NAACL 2018.
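A toy rendering of the Siwoco setup, assuming scikit-learn: a single word becomes a small feature vector and one of the listed classifiers (here an SVM) predicts its level. Features and training pairs are crude invented examples; vowel counting only roughly approximates Swedish syllables:

```python
# Single-word lexical complexity prediction, Siwoco-style.
from sklearn.svm import SVC

SWEDISH_VOWELS = set("aeiouyåäö")

def word_features(word: str) -> list:
    return [
        len(word),                                         # word length
        sum(ch in SWEDISH_VOWELS for ch in word.lower()),  # ~syllable count
        int(len(word) >= 10),                              # long-compound hint
    ]

words = ["hus", "katt", "samhälle", "förutsättning"]   # invented training set
levels = ["A1", "A1", "B1", "C1"]

clf = SVC(kernel="linear")
clf.fit([word_features(w) for w in words], levels)
print(clf.predict([word_features("myndighet")]))
```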

Second chance: starting NOT from scratch

Grant information

Elena Volodina, Beata Megyesi, Mats Wirén, Lena Granstedt, Julia Prentice, Monica Reichenberg, Gunlög Sundberg. 2016. A Friend in Need? Research agenda for electronic Second Language infrastructure. Proceedings of SLTC 2016, Umeå, Sweden

SweLL promises (main)

1. Deliver a well-annotated (gold standard) corpus of L2 essays
• 600 essays, approx. 100 per CEFR level A1-C1 + 100 for a control L1 learner corpus
• Incl. manual error annotation & manually checked linguistic annotation
• Make available for research (and the public?)

SweLL promises (main)

2. Set up a platform (and workflow) for
• Continuous upload of new essays
• Manual error annotation
• Automatic linguistic annotation

SweLL promises (main)

3. Set up a platform for browsing L2 essays
• In concordance fashion (+ parallel view)
• In full-text fashion

SweLL focus (main)

• Adult learners (16+ years)
• Healthy learners
• Written essays (no speech data)
• Where possible, longitudinal data

SweLL promises (side path, rather experimental)

• Design a set of exercises
• To elicit (structured) responses that would answer some interesting research questions
• To thereby create a database that could be used for research
• Develop the Lärka platform further for
• Deploying the above exercises
• Linking user answers to their individual ”profiles” (age, gender, L1s, …)

Data
Curious “time & effort” fact: data vs experiments
[Chart: number of articles per year, 2013-2017, for SweLL essay corpus creation vs experiments; SweLL corpus creation and SweLL-based publications]

Lifetime of corpora vs tools
• Corpus creation costs both time and money, but:
• Well-documented, representative, reliably annotated and available corpora are used far beyond their initial research purpose
• Penn TreeBank (Marcus et al., 1993; cited 6813 times) is still used for research (e.g. Pawar, A., & Mago, V., 2018)
• ICLE (Granger, 1998; cited 358 times) → modern research (e.g. Möller, 2017)
• Whereas tools trained on corpora get outdated as research makes progress

Lifetime of tools
https://spraakbanken.gu.se/larka/archive (2012-2016)
https://spraakbanken.gu.se/larkalabb/ (2016-...)

Tools decay, data stay

Available data

Corpus availability (and the legal hassle)
• Necessary step according to the GDPR (EU General Data Protection Regulation)
• Names and identities cannot be revealed or traced to the real person
• Everyone has the right to know which databases they are represented in
• Everyone has the right to withdraw from the database
• Hence, we cannot destroy the ”Name ↔ ID” mapping keys if we want to have (longitudinal) data
• Anyone can demand access to the data (according to the Principle of Public Access to Official Records, Swedish law)
• → however, no right to use the information!

SweLL agreement form: https://goo.gl/5hKuew

GDPR

• Restrictions on the use of personal information to protect ”subjects”, i.e. physical persons
• Important consequences for learner corpora (L2) projects, IF you want data to be available for research!
• Metadata precautions
• Text de-identification and pseudonymization
• Name-ID mapping key handling

SweLL L2 infrastructure project
• No information on the country of birth
• Birth year: 5-year spans, e.g. 2000-2004
• No exact date for entering the L2 country
• No information on school or teacher
• Pseudonymization of text data: names, cities, ages, professions, etc.
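In code, these metadata precautions amount to coarsening or dropping identifying fields before an essay enters the corpus; a sketch with invented field names:

```python
# Coarsen birth year into a 5-year span and drop fields that must not
# be stored (country of birth, exact arrival date, school, teacher).
def coarsen_metadata(record: dict) -> dict:
    safe = {}
    if "birthyear" in record:
        start = (record["birthyear"] // 5) * 5
        safe["birthyear_span"] = f"{start}-{start + 4}"  # e.g. 2000-2004
    return safe

print(coarsen_metadata({"birthyear": 2003, "country_of_birth": "Syria",
                        "school": "ABF Göteborg"}))
# {'birthyear_span': '2000-2004'}
```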

SVALA pseudonymization tool: demo

Example essay (translated into English + mock errors)

• I live in Guntorp on apartement . I live with my boyfriend . His name is Hans . The apartement mine has a pattio and tree room . Jag enjoy there in Guntorp but a lot of time to goto shop , fortifive minut . I have the bus and the Guntop train . Jag lived in Norway bifore , in Tromsö . It was less than Gunntorp . I enjoy their too becaus I had more friends. I think it is hard to have friends here . But I enjoy better job here . In Tromsö jobbe I only on one website . In Guntorp I work on many website . I am webdevelooper . But Guntorp is closser to Spain than Tromsö . It is important how one lives because I am not in my country . I mess my mother and my father but I live her with my boyfriend .

To-dos (1)
• Test NER on original learner texts (see the sketch after this list):
→ Can NER speed up the process? Noise? What about essays reviewing books and films, political events, etc.?
• Automate pseudonymization for English (partly done for Swedish): lists, consistency of geographical name replacements, etc.
→ Assess the risks of introducing errors that were not in the original text and find ways of avoiding them
→ Add a possibility of setting the whole text into a ”cultural” context, e.g. Astrid Lindgren's or Hungarian, etc.
• Test replicating grammatical forms (and errors?) in pseudonymized segments
→ e.g. Stadsbibliotekets --> The Volvo's
→ Assess the possibility of projecting MSDs from the original text and evaluate their reliability
• Link to Lärka and crowdsource pseudo-tag corrections by essay writers; learn from ”correction reports”
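One way to test the NER to-do, assuming spaCy and its small English model are installed: surface entity spans as pseudonymization candidates for a human to confirm. Learner errors will degrade recognition, which is exactly the noise question raised above:

```python
# Run off-the-shelf NER over a learner text and list candidate spans.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded
text = "I live in Guntorp with my boyfriend . His name is Hans ."

for ent in nlp(text).ents:
    print(ent.text, ent.label_)  # e.g. Guntorp GPE, Hans PERSON
```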

Reliable and interesting data

Annotation makes data interesting/useful (you get what you annotate)

Annotation…

• …is now the place where linguistics hides in NLP (Fort, 2016)
• Parts of speech
• Base forms of words (lemmas)
• Syntactic and semantic information
• …

Karën Fort. 2016. Collaborative Annotation for Reliable Natural Language Processing. Wiley.

Annotation…

• …can ”hide” disciplines other than linguistics
• (e.g. so-called) Error annotation
• Target skills
• Receptive vs productive skills
• Level of proficiency in a (second/foreign) language
• Text genres
• …

Implications (for L2 corpora)

• Take other disciplines' perspectives into account, at least
• NLP interests
• Second Language Acquisition research questions (or a minor share of those)
• It is worth investing time and money into a resource, and working on:
• Corpus design (representativity, balance, availability)
• Corpus metadata
• Corpus annotation & annotation reliability

NLP needs

• NLP often
• is ”applied” to other research disciplines and
• seeks to assist with other disciplines' research questions
• but there is a range of (traditional) questions:
• (automatic) error detection
• (automatic) error correction
• (automatic) essay grading
• (automatic) essay classification (e.g. by level of proficiency, genre, topic, grade…)
• L1 identification
• Linguistic complexity studies (syntax, vocabulary, etc.)
• (semi-automatic) anonymization (pseudonymization)
• Writing support / feedback
• …

SLA needs

• Longitudinal L2 data on underlying mental representations and developmental processes (e.g. Myles, 2005)

• Speech data (e.g. Myles, 2005)

• Task-based data (e.g. Alexopoulou et al., 2017)

• Individual cognitive processes (scores from intelligence tests, motivation test, aptitude tests; Granger & Paquot, 2017)

• …

SweLL corpus design principles

• Representativeness
• (most popular) immigrant languages
• age and gender
• levels of proficiency
• various tasks?
• L2 vs L1 learners/writers
• Balance
• Annotation
• Documentation

Hovy, E.H., Lavid, J.M. 2010. Towards a ”Science” of Corpus Annotation: A New Methodological Challenge for Corpus Linguistics.

Pre-annotation decisions

Post-annotation work

Representative data

Corpus design

L1s            A1    A2    B1    B2    C1    Control group  Total
               M  F  M  F  M  F  M  F  M  F  M  F
Arabic         5  5  5  5  5  5  5  5  5  5  X  X           50
Dari/Persian   5  5  5  5  5  5  5  5  5  5  X  X           50
English        5  5  5  5  5  5  5  5  5  5  X  X           50
Greek          5  5  5  5  5  5  5  5  5  5  X  X           50
Croatian/BKS   5  5  5  5  5  5  5  5  5  5  X  X           50
Sorani         5  5  5  5  5  5  5  5  5  5  X  X           50
Kurmanji       5  5  5  5  5  5  5  5  5  5  X  X           50
Somali         5  5  5  5  5  5  5  5  5  5  X  X           50
Spanish        5  5  5  5  5  5  5  5  5  5  X  X           50
Tigrinya       5  5  5  5  5  5  5  5  5  5  X  X           50
Total          50 50 50 50 50 50 50 50 50 50 50 50          600
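A quick sanity check of the design: 10 L1s × 5 CEFR levels × 2 genders × 5 essays, plus a 100-essay control group, gives the 600-essay total in the table:

```python
# Verify the corpus design totals.
l1s, levels, genders, per_cell, control = 10, 5, 2, 5, 100
print(l1s * levels * genders * per_cell + control)  # 600
```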

Annotation campaign management

Adriane Boyd

1. Building a corpus (data, metadata)
2. Tagset, guidelines, tool
3. Pilot with a corpus sample
4. Qualitative analysis (comparing annotators' decisions)
5. Quantitative analysis (inter-annotator agreement)
6. Annotating corpus (biweekly meetings)
7. Post-campaign: delivery, maintenance

Checkpoints: after step 1 – Representative? Balanced? Accessible? (no → revise the corpus; yes → continue). After step 5 – Reliable annotation? Stable annotation? Appropriate tagset? Guidelines? (no → revise; yes → annotate).

Hovy & Lavid. 2010. Towards a ”Science” of corpus annotation…

The same cycle, extended (Fort, 2016):

1. Building a corpus (data, metadata)
2. Tagset, guidelines, tool
3. Pilot with a corpus sample
4. Qualitative analysis (comparing annotators' decisions)
5. Quantitative analysis (inter-annotator agreement)
6. Mini-reference corpus for annotator training
7. Annotator training (collective, individual): learning curves, checks, updates to tagset and guidelines
8. Annotating corpus (regular checks)
9. Random manual checks by experts
10. Corpus publication, or reviewing/correction; delivery, maintenance

Checkpoints as above: Representative? Balanced? Accessible? – Reliable annotation? Stable annotation? Appropriate tagset? Guidelines?

Fort. 2016. Collaborative annotation…


Annotation quality
• Reliability & stability → through inter-annotator agreement checks
• Reproducibility → agreement of an annotator with themselves (intra-annotator agreement)
• Random manual checks of the annotations by experts or evaluators
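A common way to operationalize the inter-annotator agreement check is Cohen's kappa; a minimal sketch with scikit-learn and invented ASK-style error codes assigned to the same segments:

```python
# Cohen's kappa between two annotators' error codes (labels invented).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["ORT", "ORT", "AGR", "WO", "AGR"]
annotator_b = ["ORT", "AGR", "AGR", "WO", "AGR"]

print(cohen_kappa_score(annotator_a, annotator_b))  # ~0.69; 1.0 = perfect
```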

Error taxonomy

Error annotation

• Don’t say the ”E-word”! (Julia Prentice, EuroSLA, submitted)
• Negative connotation (SLA)
• Norm deviations – not better, though
• Interlanguage phenomenon (Díaz-Negrillo et al., 2009)
• Practice-oriented view as a ”non-norm adequate form” (Dobric, 2015)
• Unexpected uses (Gaillat et al., 2014)
• Cross-disciplinary misunderstanding?
• Ideal to counter-balance error annotation with so-called ”can-do” annotation
• → would allow for e.g. CAF analysis (Complexity, Accuracy, Fluency) (Wolfe-Quintero et al., 1998)
• → would probably help (a bit) to cloze the gap between SLA, LCR & NLP


”What’s in a name? That which we call a rose by any other name would smell as sweet.” (Shakespeare)

Error → Correction annotation

Ideal picture (errors + can-do’s)

Phenomenon                                          → Annotation
Linguistic element absent                           → No annotation
Linguistic element present, but in a deviating form → Error-annotated segment / Can-do annotated segment
Linguistic element present in a correct form        → Can-do annotated segment

Taxonomy

Taxonomies are like underwear; everyone needs them, but no one wants someone else’s

Anon

Standards are like tooth brushes; everyone likes the idea of them, but no one wants someone else’s

Anon

Egon Stemle, EURAC, Italy

SweLL pre-pilot experiment

• ASK versus Merlin taxonomy
• …was used by project researchers on 2 essays (i.e. producing 4 files each)
• …time was taken
• …experiences were recorded

SweLL pre-pilot experiment

• Summary
• It takes twice as long to use the Merlin taxonomy
• The ASK taxonomy (L2 Norwegian) is closer to L2 Swedish
• ASK lacks some useful tags
• Decision: enrich the ASK taxonomy with a few Merlin tags

Taxonomy ambiguity


Normalization

* I has was

• Re-writing the L2 learner original in a normative way, creating a so-called target hypothesis (Lüdeling et al., 2005)

Normalization

* I has was → I have been? I was? I had?

Normalization: basic principles

• Minimal change
• Positive assumption
• Lexical and grammatical competence prior to functional and structural correctness

Minimal change…

Example: * Jag trivs mycket bor med dem. (Eng: I enjoy much live with them.)

Potential target hypotheses:
• Jag trivs mycket bra med dem (I get on very well with them) → minimal change (seemingly) → error: wrong word / spelling?
• Jag trivs mycket med att bo med dem (I very much enjoy living with them) → lexical competence of BO, verb → errors: idiomaticity error (trivs) + wrong verb form (bo)

Why normalization as a separate step?

• It helps to build a better understanding of a learner's linguistic competence
• It can be outsourced to SLA researchers
• Error annotation depends on the change applied to the original text, and as such it is not ERROR annotation but CORRECTION annotation
• Inter-annotator agreement with respect to error codes can be objectively measured only if the annotators are working on the same normalized version

SweLL normalization tool
• Transformation-based
• String matching & calculating diffs (see the sketch below)
• Linking on the fly (original – normalized versions)
• Parallel text
• Coming (if ever):
• Drop-down menus for error codes
• Drag-and-drop (spaghetti view)
• Three-tier representation (original, spell-corrected, normalized)
• Desired:
• Support for automatic spelling error detection

Dan Rosén, developer

Arild Matsson, research engineer
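The ”string matching & calculating diff” step can be approximated with Python's standard difflib, which stands in here for the project's own alignment; the learner sentence is an invented example:

```python
# Align a learner original with its target hypothesis and keep the links,
# so error codes can later attach to (original, normalized) pairs.
from difflib import SequenceMatcher

original = "I has was in Tromsö bifore".split()
normalized = "I have been in Tromsö before".split()

for op, i1, i2, j1, j2 in SequenceMatcher(a=original, b=normalized).get_opcodes():
    if op != "equal":
        print(op, original[i1:i2], "->", normalized[j1:j2])
# replace ['has', 'was'] -> ['have', 'been']
# replace ['bifore'] -> ['before']
```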

SweLL normalization & error-annotation tool: hands-on demo

• https://spraakbanken.gu.se/swell/dev/

Inter-annotator agreement (IAA), pilot 1

What to compare?


COCTAILL “ingredients”


IAA: How it looks: text example

Freja looks into Jonas's horoscope: You are playful, and if you can choose, you'd spend the day getting to know better somebody you are acquainted with. The evening will be romantic.

And then into her own: The love life is a mess, but otherwise the day will be funny, sensual and entertaining. Don't work yourself up. You will receive compliments from somebody in your surroundings.


IAA: How it looks: text topics to choose from


IAA: How it looks: result: (1) culture and traditions, (2) daily life, (3) relations with other people, (4) religion: myths and legends

Intra- & inter-annotator agreement…

• ”…if humans can agree on something at N%, systems will achieve (N-10)%…” (Hovy & Lavid, 2010)
• ”In Składnica, a Polish treebank, 20% of the agreed annotations were in fact wrong.” (Fort, 2016; Woliński et al., 2011)
• ”Whatever measure(s) is/are employed, the annotation manager has to determine the tolerances: when is the agreement good enough?” (Hovy & Lavid, 2010)
• ”…perhaps it doesn't matter what the agreement level is, as long as poor agreements are seriously investigated.” (Hovy & Lavid, 2010)

Finally

• Central question in manual annotation: how to obtain reliable, useful and consistent annotations?
• Annotation in corpora has a theoretical impact: empirical observations → extension/redefinition of theory
• Annotation in corpora has a practical impact: application within teaching, tool and algorithm building

”The NLP community generally is not very concerned with the theoretical linguistic soundness. The Corpus Linguistics community does not seem to seek ”reliability” in the annotation process and results.” (Hovy and Lavid, 2010)

Lesson 1

● Do not underestimate the time it takes to collect and prepare data
● Preparing a resource can be a research & development project in itself (e.g. structured input from exercises for VPs and NPs + providing feedback on that)


Time-effort ratio consequences

● Researchers skip compiling their own data → use what is available → in the end, often targeting English

Lesson 2

● Take time to study legal regulations, so as not to waste previously collected data → there are “loopholes”, but not without information loss


Question to you

● For your Master's thesis → do you plan to collect & prepare data yourself? → or is data already available?
