Rodney D. Nielsen1,2, Wayne Ward1,2 and James H. Martin1
1 Center for Computational Language and Education Research, CU, Boulder
2 Boulder Language Technologies
Reference Answer: A long string produces a low pitch.
(Lawrence Hall of Science 2006, Assessing Science Knowledge)
Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.
Learner answer: When the string gets longer it makes the pitch lower.
Classification Errors in a Domain-Independent Assessment System
ACL-BEA, Jun 19, 2008, Rodney D. Nielsen 2
Tailoring the Tutor’s Response
Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.
Reference answer: A long string produces a low pitch.
Learner answers:
- When the string gets longer it makes the pitch lower.
- A long string produces a pitch.
- It makes a loud pitch.
- It makes a high pitch.
- If the string is tighter, the pitch is higher.
Necessity of Finer-Grained Analysis
Imagine a tutor only knowing that there is some unspecified part of the reference answer that we are not sure the student understands.
Reference Answer: A long string produces a low pitch.
Break the reference answer down into low-level facets derived from a dependency parse and thematic roles:
- NMod(string, long): The string is long.
- Agent(produces, string): A string is producing something.
- Product(produces, pitch): A pitch is being produced.
- NMod(pitch, low): The pitch is low.
Assess whether an understanding of each facet is implied by the student's response.
[Figure: dependency parse of "A long string produces a low pitch", with det, nmod, subject, and object arcs]
Follow-up Question: Does a long string produce a higher or lower pitch?
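The facet decomposition above can be sketched as plain relation tuples; this is an illustrative data structure, not the authors' implementation.

```python
# A reference-answer facet: (relation, governor, modifier) plus a gloss.
# Relations mix dependency labels (NMod) and thematic roles (Agent, Product).
from typing import NamedTuple

class Facet(NamedTuple):
    relation: str   # e.g. "NMod", "Agent", "Product"
    governor: str
    modifier: str
    gloss: str      # human-readable paraphrase of the facet

# Facets of "A long string produces a low pitch."
reference_facets = [
    Facet("NMod", "string", "long", "The string is long."),
    Facet("Agent", "produces", "string", "A string is producing something."),
    Facet("Product", "produces", "pitch", "A pitch is being produced."),
    Facet("NMod", "pitch", "low", "The pitch is low."),
]

for f in reference_facets:
    print(f"{f.relation}({f.governor}, {f.modifier}) -> {f.gloss}")
```

The tutor then asks about exactly the facet that is in doubt (here, NMod(pitch, low)) rather than re-asking the whole question.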
Representing Fine-Grained Semantics
Assess the relationship between the student's answer and the reference answer facets at a finer grain.
Reference Answer: A long string produces a low pitch.
Facets: NMod(string, long) | Agent(produces, string) | Product(produces, pitch) | NMod(pitch, low)

Learner answer: "A long string produces a pitch."
  Facet labels: Expressed | Expressed | Expressed | Unaddressed
  Understood?   Yes       | Yes       | Yes       | No
Learner answer: "It produces a loud pitch."
  Facet labels: Assumed | Expressed | Expressed | Diff-Arg (different argument)
Learner answer: "It produces a high pitch."
  Facet labels: Assumed | Expressed | Expressed | Contra-Expr (expressed contradiction)
Answer Annotation Labels
Understood: facets that are understood by the student
  Assumed: assumed to be understood a priori based on the question
  Expressed: directly expressed or inferred by simple reasoning
  Inferred: inferred by pragmatics or nontrivial logical reasoning
Contradicted: facets contradicted by the learner answer
  Contra-Expr: directly contradicted by negation, antonymous expressions, and their paraphrases
  Contra-Infr: contradicted by pragmatics or complex reasoning
Self-Contra: facets that are both contradicted and implied (self-contradictions)
Diff-Arg: the core relation is expressed, but it has a different modifier or argument
Unaddressed: facets that are not addressed at all by the student's answer
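The label taxonomy above can be written down as a flat mapping from each fine-grained label to its top-level category, which makes checks like "does the student understand this facet?" one-liners. The mapping mirrors the slide; the helper name is illustrative.

```python
# Annotation label hierarchy: fine-grained label -> top-level category.
LABEL_CATEGORY = {
    "Assumed": "Understood",
    "Expressed": "Understood",
    "Inferred": "Understood",
    "Contra-Expr": "Contradicted",
    "Contra-Infr": "Contradicted",
    "Self-Contra": "Self-Contra",
    "Diff-Arg": "Diff-Arg",
    "Unaddressed": "Unaddressed",
}

def is_understood(label: str) -> bool:
    """True when the facet counts as understood by the student."""
    return LABEL_CATEGORY[label] == "Understood"

print(is_understood("Expressed"))  # True
print(is_understood("Diff-Arg"))   # False
```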
Assessment Technology Overview
- Start with hand-generated reference answer facets
- Automatically parse the reference and learner answers and automatically extract the representation
- Extract a feature vector for each reference answer (RA) facet, indicative of the student's understanding of that facet, from the answers, their automatic parses, the relations between these, and external corpus co-occurrence statistics
- Train a machine learning classifier on the training set feature vectors
- Use the classifier to assess the test set answers, assigning one of five Tutor-Labels to each RA facet
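The train-and-classify steps can be sketched with scikit-learn's decision tree standing in for C4.5 (the classifier named in the results slide; sklearn implements CART, not C4.5). The feature values below are toy placeholders, not the paper's actual feature set.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in: one row per reference-answer facet, a few numeric
# features (e.g. a lexical entailment probability, a stem-match flag,
# a dependency-path edit distance), and a label to predict.
X_train = [
    [0.90, 1, 0],   # facet strongly covered by the learner answer
    [0.10, 0, 4],   # facet not addressed
    [0.85, 1, 1],
    [0.05, 0, 5],
]
y_train = ["Expressed", "Unaddressed", "Expressed", "Unaddressed"]

clf = DecisionTreeClassifier(random_state=0)  # CART, standing in for C4.5
clf.fit(X_train, y_train)

# Classify a new facet's feature vector.
print(clf.predict([[0.95, 1, 0]])[0])  # "Expressed" on this toy data
```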
Machine Learning Features

Lexical:
- The lexical entailment probabilities for the reference answer facet's governor and modifier, following Glickman, Dagan, and Koppel (2005; see also Turney 2001)
- Indicators of whether the reference answer governor's (modifier's) stem has an exact match in the learner answer
- The lexical entailment probabilities for the primary constituent facets' governors and modifiers, when the facet in question represents a relation between propositions

Syntactic:
- The part-of-speech (POS) tags for the facet's governor and modifier
- The dependency or role type labels of the facet and the aligned learner answer dependency
- The edit distance between the dependency path connecting the facet's governor and modifier and the path connecting the aligned terms in the learner answer
- True if the facet has a negation and the aligned learner dependency path has a single negation, or if neither has a negation

Other:
- The number of content words in the reference answer, motivated by the fact that longer answers were more likely to result in spurious alignments
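The first feature can be sketched as follows: in the spirit of Glickman, Dagan, and Koppel (2005), the lexical entailment probability of a reference-answer word is approximated by the maximum co-occurrence-based conditional probability over the learner-answer words. The counts below are made-up toy values, not real corpus statistics.

```python
# Lexical entailment probability, sketched after Glickman, Dagan &
# Koppel (2005): P(v entailed | learner answer) is approximated by
#   max over learner-answer words u of  n(u, v) / n(u)
# where n(u) counts documents containing u, and n(u, v) counts
# documents containing both u and v. Toy counts:
doc_count = {"string": 1000, "pitch": 800, "sound": 1200}
cooc_count = {("string", "pitch"): 40, ("sound", "pitch"): 240}

def lexical_entailment_prob(ref_word, learner_words):
    """Max conditional co-occurrence probability of ref_word given any
    word in the learner answer (0.0 when no counts are available)."""
    best = 0.0
    for u in learner_words:
        n_u = doc_count.get(u, 0)
        n_uv = cooc_count.get((u, ref_word), 0)
        if n_u:
            best = max(best, n_uv / n_u)
    return best

# "sound" co-occurs with "pitch" in 240 of 1200 documents -> 0.2
print(lexical_entailment_prob("pitch", ["string", "sound"]))  # 0.2
```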
Results (C4.5 decision tree)
Results on Tutor-Labels are:
- 24.4% and 15.4% over the majority class baseline (Unseen Answers and Unseen Modules, respectively)
- 19.4% and 5.9% over the lexical baseline

                     # non-Assumed Facets   Majority Class   Lexical Baseline   All Features
Training Set 10xCV   54,967                 54.6             59.7               77.1
Unseen Answers       3,159                  51.1             56.1               75.5
Unseen Modules       30,514                 53.4             62.9               68.8
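The headline gains can be recomputed directly from the accuracy table (all figures in percent):

```python
# Accuracy on non-Assumed facets, from the results table (percent).
results = {
    "Unseen Answers": {"majority": 51.1, "lexical": 56.1, "all": 75.5},
    "Unseen Modules": {"majority": 53.4, "lexical": 62.9, "all": 68.8},
}

for name, r in results.items():
    over_majority = round(r["all"] - r["majority"], 1)
    over_lexical = round(r["all"] - r["lexical"], 1)
    print(f"{name}: +{over_majority} over majority, +{over_lexical} over lexical")
# Unseen Answers: +24.4 over majority, +19.4 over lexical
# Unseen Modules: +15.4 over majority, +5.9 over lexical
```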
Error Analysis of Domain-Independent Assessment
- Leave-one-module-out cross-validation on the 13 training set science modules: train on 12 modules, test on the held-out module; do this for each of the 13 modules
- Simulates the Unseen Modules (domain-independent) test set
- Trained and tested on all non-Assumed facets
- Analyzed a random selection of a subset of the errors: 100 Expressed and 100 Unaddressed, all consistently annotated by all annotators
- Considered the factors involved in the decision by humans
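The leave-one-module-out protocol above can be sketched as a simple loop; the module names and the `evaluate` callback are hypothetical stand-ins for the real modules and the train-plus-score step.

```python
# Leave-one-module-out cross-validation: for each science module, train
# on the other modules' facets and evaluate on the held-out module.

def leave_one_module_out(facets_by_module, evaluate):
    scores = {}
    for held_out in facets_by_module:
        train = [f for m, fs in facets_by_module.items()
                 if m != held_out for f in fs]
        test = facets_by_module[held_out]
        scores[held_out] = evaluate(train, test)
    return scores

# Toy demo with made-up modules: "evaluate" just reports split sizes.
modules = {"Magnetism": [1, 2, 3], "Water": [4, 5], "Sound": [6]}
sizes = leave_one_module_out(
    modules, lambda train, test: (len(train), len(test)))
print(sizes["Sound"])  # (5, 1): train on 5 facets, test on the held-out 1
```

Because the held-out module's vocabulary and concepts never appear in training, each fold approximates assessment in an unseen domain.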
Errors in Expressed Facets
Four main error factors, by frequency:
- 72% Paraphrases
  - 43% phrase-based paraphrasing
  - 35% lexical substitution
  - 26% coreference
  - 1% syntactic alternation (Vanderwende et al. 2005)
- 22% Logical inference
- 22% Pragmatics
- 6% Preprocessing errors
Errors in Expressed Facets
43% phrase-based paraphrasing:
- 32 typical paraphrase occurrences: "in the middle" versus "halfway between"; "one mineral will leave a scratch" versus "one will scratch the other"
- 14 uses of concept definitions: "circuit" versus "electrical pathway"
- 6 negations of antonyms: "not a lot" for "a little"; "no one has the same fingerprint" for "everyone has a different print"
Errors in Expressed Facets
35% lexical substitution:
- Synonymy, hypernymy, hyponymy, meronymy, derivational changes, and other lexical paraphrases
- Half detectable by a broad-coverage resource: "tiny" for "small", "CO2" for "gas", "put" for "place", "pen" for "ink", and "push" for "carry"
- Many not easily detectable in lexical resources: "put the pennies" for "distribute the pennies", and "have" for "contain"
Errors in Expressed Facets
26% coreference resolution:
- 15 pronouns (11 "it", 3 "she", 1 "one")
- 6 NP term substitutions, e.g. Reference Answer: "clay particles are light"; Learner Answer: "clay is the lightest"
- 6 other common noun coreference issues
Errors in Expressed Facets
22% logical inference:
- "no, cup 1 would be a plastic cup 25 ml water and cup 2 paper cup 25 ml and 10 g sugar" => the two cups have the same amount of water
- "... it is easy to discriminate ..." => the two sounds are very different
22% pragmatics:
- "Because the vibrations" => the rubber band is vibrating
- "... the fulcrum is too close to the earth" => the earth is the load in the system
Errors in Expressed Facets
6% preprocessing errors:
- Normalization issues
- Parser errors
Errors in Expressed Facets
Over half of the errors involved more than one of the fine-grained factors:
"There is a shadow there because the sun is behind it and light cannot go through solid objects. Note, I think that question was kind of dumb." => the tree blocks the light
Errors in Unaddressed Facets
Many are questionable annotations:
- "You could take a couple of cardboard houses and ... 1 with thick glazed insulation. ..." =/> installing the insulation in the houses
- "Because the darker the color the faster it will heat up" =/> darkest color
Errors in Unaddressed Facets
Biggest source of error: lexical similarity
- Ignorance of context: "[the electromagnet] has to be iron ..." => steel is made from iron
- Antonyms: "closer" versus "greater distance", and "absorbs energy" versus "reflects energy"
- Misguided trust: "I learned it in class"
Conclusion
- New assessment paradigm: fine-grained facets and labels
- Corpus of 146K fine-grained inference annotations
- Answer assessment system: 24.4% and 15.4% over baseline results for in-domain and out-of-domain, respectively
- First successful assessment of Grade 3-6 constructed responses
- Error analysis provides insight into where future work is most appropriate