Natural Language Processing to Improve Student Engagement, featuring Dr. Rebecca Passonneau
TRANSCRIPT
Collaborators
Pyramid content evaluation: Ani Nenkova (Columbia, U Penn; 2004)
Automated scoring by unigram overlap: Ani Nenkova, Aaron Harnly, Owen Rambow (Columbia; 2005)
Automated scoring by distributional semantics: Emily Chen, Ananya Poddar, Gaurav Gite (Columbia; 2013 - 2016)
Comparison to educational rubric (main ideas): Dolores Perin (Teachers College; 2013 - 2016)
Automated pyramid and scoring by triple extraction and similarity graphs based on WordNet: Qian Yang (Tsinghua; PSU; 2016), Alisa Krivokapic (Columbia; 2016)
Automated pyramid and scoring by parsing, distributional semantics, and novel bin packing algorithm: Yanjun Gao (Penn State; 2017)
Psychologists have posited three cognitive processes involved
in summarization:
● selection of important ideas
● generalization to omit detail
● inference of implicit connections
Summaries that are equally good will have some ideas in common, and some content differences. Very much like a Venn diagram.
[Figure: Venn diagram of overlapping summaries containing Ideas 1 through 12]
Designing a reliable rubric to measure how many important ideas each
summary contains is labor intensive and potentially subjective
Summaries are concise
● Each idea is expressed once
○ Selection of important ideas
○ Omission of unnecessary detail
● Content evaluation task has two steps
○ Define a standard from expert summaries -- the distinct ideas weighted by importance
○ Compare the summaries to the standard -- quantify the proportion of important ideas
Pyramid summary content annotation builds a content model of distinct ideas
from summaries written by a wise crowd (size N)
[Figure: pyramid of Content Units CU 1 to CU 12, layered by weight]
What is pyramid content analysis?
Importance of ideas (content units, or CUs)
● Emerges from the wise crowd
● Distinguishes quality of ideas by quantity of occurrence
● Simple but effective
Pyramid summary content annotation builds a content model of distinct ideas from N reference summaries written by a wise crowd
A list of all the distinct ideas or Content Units (CUs), and their weights, i.e., how many summaries each occurs in
TEXT: WHAT IS MATTER
CU 1: Matter is classified by physical and chemical properties (W=3)
CU 3: All matter has energy (W=2)
. . .
CU 12: Matter can be a solid, liquid or gas (W=1)
What is pyramid content analysis?
Application of Pyramid Content Model
● In a new summary, find all the phrases that mention a model CU
● Sum the weights of the mentioned CUs
● Normalize the sum
Example: a new summary mentions three CUs with weights 5, 4, and 2. Raw sum = 5 + 4 + 2 = 11
What is wise crowd content analysis?
Normalization
● A summary can express each CU once at most
● Sum the weights of the identified CUs
● Normalize the sum in one of two ways:
○ QUALITY: The maximum sum of weights for the same number of CUs
Did the summary mention mostly important ideas?
○ COVERAGE: The maximum sum of weights for the average number of CUs in
the reference summaries
Did the summary mention most of the important ideas?
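The two normalizations can be sketched in a few lines of Python. This is an illustrative sketch, not the released scoring code; the content model and matched CU weights below are made-up values.

```python
def quality_score(matched_weights, all_cu_weights):
    """Raw sum divided by the max possible sum for the SAME number of CUs."""
    raw = sum(matched_weights)
    best = sum(sorted(all_cu_weights, reverse=True)[:len(matched_weights)])
    return raw / best if best else 0.0

def coverage_score(matched_weights, all_cu_weights, avg_ref_cu_count):
    """Raw sum divided by the max possible sum for the AVERAGE number of CUs
    found in the reference summaries."""
    raw = sum(matched_weights)
    best = sum(sorted(all_cu_weights, reverse=True)[:avg_ref_cu_count])
    return raw / best if best else 0.0

# Content model: CU id -> weight (number of reference summaries containing it)
model = {1: 5, 2: 4, 3: 4, 4: 3, 5: 2, 6: 1}
matched = [5, 4, 2]  # weights of the CUs mentioned by the new summary
print(quality_score(matched, list(model.values())))      # 11/13 ≈ 0.846
print(coverage_score(matched, list(model.values()), 4))  # 11/16 = 0.6875
```

Quality asks "of the ideas it chose, were they important?", while coverage asks "did it choose enough of the important ideas?", so a short summary of only top-weighted CUs scores high on quality but lower on coverage.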
● 9 reference summaries
● All content models with m summaries, for m ∈ [1,9]
● All pairs of summaries A, B where A > B using 9 reference summaries
● Result
○ The variance around scores for A and B diverges given 4 to 5 references
● Conclusion
○ No misranking with 5 references
How reliable is wise crowd content analysis?
How reliable is it? Can misranking errors occur?
Five additional reliability tests
Number of reference summaries for probability of misranking to be ≤ 0.1: 5
Number of reference summaries for scores to correlate with gold standard scores: 5
Interannotator agreement on content model, 10 different pairs of models: 0.71 to 0.89
Interannotator agreement on application of content model to new summaries, 5 models: 0.77 to 0.81
Correlation of scores of 16 systems using different content models: 0.71 to 0.96
Key differences between manual and automated methods:
Humans
● Consider a few alternative segmentations
● Sameness of meaning is a binary (yes-no) judgement
Automated methods
● Consider many possible segmentations
○ Simpler decisions
○ Many more of them
● Metric for similarity of meaning is graded from 0 to 1
● Must select the optimal segmentations and meaning similarities
How did we automate wise crowd content analysis?
Human segmentation into “ideas” and similarity
Sentence: Matter can be measured because it contains volume and mass
CU106: Matter has volume and mass (W=4)
Ref Sum 1: because it contains both volume and mass
Ref Sum 2: it takes up space defined as volume and contains . . . mass
Ref Sum 3: Matter is anything that has mass and takes up space (volume)
Ref Sum 4: Matter contains volume and mass
Three Automated Methods
● No large scale machine learning required
● All components are pre-trained
● Requires only 5 wise-crowd summaries on the same summarization task
Three Automated Methods
● PyrScore: Requires existing manual content model
○ Brute force segmentation -- considers all possibilities
○ Distributional (statistical) semantics
● PEAK:
○ Open Information Extraction tools extract subj-pred-obj triples
○ Symbolic semantics (WordNet)
● PyrEval:
○ Sentence decomposition into clauses
○ Distributional (statistical) semantics
PyrScore Segmentation: Brute Force
● Calculates all ngram segmentations of each sentence in a new summary
All | matter | has | energy | volume | and | mass 7 unigrams
All | matter | has | energy | volume | and | mass 5 unigrams + 1 bigram
All | matter | has | energy | volume | and | mass 5 unigrams + 1 bigram
. . .
All matter has | energy | volume | and | mass 4 unigrams + 1 trigram
. . .
All matter has energy volume and mass 1 7-gram
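The brute-force enumeration above amounts to choosing break points between tokens: with n tokens there are 2^(n-1) contiguous segmentations. A minimal sketch (illustrative, not the PyrScore implementation):

```python
def all_segmentations(tokens):
    """Yield every way to split a token list into contiguous ngrams.

    Each of the 2**(n-1) bit patterns over the n-1 gaps between tokens
    selects one segmentation: a set bit means 'break here'."""
    n = len(tokens)
    for breaks in range(2 ** (n - 1)):
        segs, start = [], 0
        for i in range(1, n):
            if breaks & (1 << (i - 1)):
                segs.append(tokens[start:i])
                start = i
        segs.append(tokens[start:])
        yield [" ".join(s) for s in segs]

tokens = "All matter has energy volume and mass".split()
segs = list(all_segmentations(tokens))
print(len(segs))  # 2**6 = 64 segmentations for 7 tokens
```

The all-unigrams split and the single 7-gram from the slide are the two extremes of this enumeration.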
PyrScore Semantics
● Generates a latent vector representation of each phrase in a CU
CU106: Matter has volume and mass (W=4)
because it contains both volume and mass
it takes up space defined as volume and contains a certain amount of material defined as mass
Matter is anything that has mass and takes up space (volume)
Matter contains volume and mass
● Latent semantics:
○ Weighted Text Matrix Factorization (WTMF;
Guo and Diab, 2012)
○ Assigns small weight to unseen words
○ Word vectors trained offline
PyrScore Scoring
● Generates a WTMF vector representation of each CU phrase
● Generates a WTMF vector representation of each segment in a new
summary
● Similarity to CU is the average cosine similarity to all phrases in the CU
● Optimal assignment of candidate ngrams to CUs
○ A maximum weighted independent set problem
○ Applies a greedy algorithm (WMIN; Sakai et al 2003)
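A minimal sketch of the scoring step, with plain lists standing in for WTMF vectors and a simple pick-the-heaviest greedy standing in for WMIN; the `conflicts` predicate is a hypothetical helper marking pairs of candidates that cannot co-occur (overlapping segments, or two segments claiming the same CU).

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_sim_to_cu(seg_vec, cu_phrase_vecs):
    """Similarity of a segment to a CU: average cosine over the CU's phrases."""
    return sum(cosine(seg_vec, p) for p in cu_phrase_vecs) / len(cu_phrase_vecs)

def greedy_independent_set(candidates, conflicts):
    """candidates: list of (weight, id) pairs; conflicts(a, b) -> True when the
    two candidates cannot both be selected. Greedily keeps the heaviest
    candidate compatible with everything chosen so far."""
    chosen = []
    for w, c in sorted(candidates, reverse=True):
        if all(not conflicts(c, d) for _, d in chosen):
            chosen.append((w, c))
    return chosen
```

For example, two candidates built from overlapping ngrams of the same sentence conflict, so only the one with the higher average-cosine similarity to its CU survives.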
PEAK (Pyramid Evaluation by Automated Knowledge Extraction)
● Segmentation: Applies Open Information Extraction tools to extract Subj-Pred-Obj (SPO) triples from sentences
Matter can be detected and measured because it contains volume and mass
Subj(Matter) Pred(Detected and measured) Obj(because it contains volume and
mass)
Subj(Matter) Pred(contains) Obj(volume and mass)
. . .
● Semantics: Uses explicit representation of meaning (random walks over
WordNet)
PEAK Aligns SPO Triples
● From different reference summaries to construct the model
● Uses a hypergraph
○ Triples are hyperedges of SPO nodes
○ Edges between nodes are semantic similarity
● Each CU is a weighted triple
PEAK Aligns SPO Triples
● Each CU is a weighted triple
● New summary is a list of triples
● Edges in bipartite graph added from CUs to SPOs
if semantic similarity ≥ 0.50
● Uses the Kuhn-Munkres (Hungarian) algorithm with CU weights as edge costs
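The alignment can be sketched as an assignment problem: match each model CU to at most one SPO triple from the new summary, allowing only pairs whose similarity is at least 0.50, and maximize the total matched CU weight. PEAK uses the Hungarian algorithm for this; the sketch below brute-forces the optimum over all assignments instead, which gives the same answer for a handful of triples and keeps the example self-contained. The weights and similarity table are made-up values.

```python
from itertools import permutations

def align(cu_weights, sim):
    """cu_weights: one weight per CU; sim[i][j]: similarity of CU i to triple j.
    Assumes no more CUs than triples. Returns (best total weight, pairs)."""
    n_t = len(sim[0])
    best = (0, [])
    for perm in permutations(range(n_t), len(cu_weights)):
        # CU i is tentatively assigned triple perm[i]; keep only pairs
        # that clear the 0.50 similarity threshold.
        pairs = [(i, j) for i, j in enumerate(perm) if sim[i][j] >= 0.50]
        total = sum(cu_weights[i] for i, _ in pairs)
        if total > best[0]:
            best = (total, pairs)
    return best

weights = [4, 2, 1]                 # CU weights from the model
sim = [[0.9, 0.3, 0.6],            # similarity of each CU to each SPO triple
       [0.4, 0.7, 0.2],
       [0.55, 0.1, 0.8]]
total, pairs = align(weights, sim)
print(total)  # 7: CU0->triple0, CU1->triple1, CU2->triple2
```

The raw score is then the matched total (7 here), normalized as in the manual method.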
PyrEval extends PyrScore
● Builds full pyramid using new weighted independent set algorithm
● Decomposes sentences into syntactically meaningful units (roughly clauses)
● Uses the same distributional semantics
○ WTMF performs better than Word2Vec
○ WTMF performs better than GloVe
● Uses the same scoring algorithm
PyrEval constructs a pyramid by a novel set allocation method
● Nested sets
○ Every sentence has a set of segmentations, only one of which can be
selected
○ Every CU is a set of segments, each from a different summary
○ Every pyramid layer is a set of CUs
EDUA: Emergent discovery of units of attraction
● Constructs a graph
○ Nodes are segments
○ Edges weighted by force of “attraction” (e.g.,
semantic similarity)
● Edge types
○ Dashed edges: attraction(n_i, n_j) > α
○ Solid edges: connect segments into CUs
Assignment of segments to a CU obeys constraints
● Maximize the weighted average similarity within each pyramid layer
● Capacity of each layer y given segments x
● Relative size of each layer
● No empty layers
● One segmentation per sentence; at most one CU per segment
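One way to picture the attraction graph is as a threshold graph over segments, where connected groups become candidate CUs. The sketch below shows only that core idea; it omits the layer-capacity, relative-size, and one-segmentation-per-sentence constraints that EDUA actually enforces, and the `sim` function and its similarity values are hypothetical.

```python
def candidate_cus(segments, sim, alpha=0.6):
    """Group segments whose pairwise 'attraction' exceeds alpha (union-find)."""
    parent = list(range(len(segments)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if sim(i, j) > alpha:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, seg in enumerate(segments):
        groups.setdefault(find(i), []).append(seg)
    return list(groups.values())

segs = ["matter has volume and mass",
        "it contains volume and mass",
        "all matter has energy"]
pair_sim = {(0, 1): 0.8}  # made-up similarities; unlisted pairs default to 0.1
cus = candidate_cus(segs, lambda i, j: pair_sim.get((i, j), 0.1))
print(len(cus))  # 2 candidate CUs: {seg0, seg1} and {seg2}
```

In the full algorithm the choice among a sentence's competing segmentations and the layer constraints are optimized jointly, not decided by thresholding alone.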
PyrEval and humans construct similar pyramids
● CUs: 69 (PyrEval) versus 60 (Annotator 1) or 46 (Annotator 2)
● Similar distribution
○ PyrEval: 1 w5, 2 w4, 7 w3, 22 w2, 37 w1
○ Annotator 1: 3 w5, 7 w4, 13 w3, 15 w2, 22 w1
● Example, same weight
○ PyrEval (w5): Physical props can occur without changing the identity or nature of the matter
○ Annotator 1 (w5): Physical props can be observed without changing the identity of the matter
● Example, different weight
○ PyrEval (w4): Unlike physical change, chemical change occurs when the chemical properties of the matter have changed and a new substance is produced
○ Manual (w3): The difference between a physical change and a chemical change is that a chemical change creates a new substance
A Rubric for Contextualized Curricular Support
● From a study of 16 community college classrooms
● 120 students wrote summaries of a middle school text, What is Matter?
○ Read the passage
○ Answered main ideas questions
○ Wrote the summary
● Researchers identified 14 main ideas
● Main ideas score of a summary: % of main ideas
○ Included partial credit
○ Interrater reliability: Pearson correlation: 0.92
What assessment rubric did we compare it to?
Pearson correlations of automated and manual methods
PyrScore: 0.95
PEAK: 0.82
PyrEval: 0.87
What were the results?
Pearson correlations of 120 Main Ideas scores and automated methods
PyrScore: 0.83
PEAK: 0.70
Content scores are transparent, can support feedback
● Does the summary have enough important ideas, given its length? (Quality score)
● Does the summary have enough important ideas, given the set of possible important ideas? (Coverage score)
● Does the summary have a good balance of both? (Comprehensive score)
● Which important ideas were expressed?
● Which important ideas were missed?
Conclusion
● Wise Crowd Content Analysis
○ Works well to identify important ideas
○ Importance emerges from the wise crowd
○ Correlates with an independently developed main ideas rubric
○ Requires only 5 reference summaries
● Fully automated methods: PyrEval and PEAK
○ Pretrained methods, and parameter tuning on small development set
○ Perform less well if sentences are very complex (e.g., automatic
summarizers on newswire)
○ Potential to inform revision
What’s Next? Content assessment of essays
● Same ideas are referenced multiple times in the same essay, through
multiple means
○ Paraphrase, definite descriptions (“the evidence shown
here”), deictic pronouns (“This indicates . . .”)
○ Will require more complex methods to detect “the same” idea
● Discourse structure and function
○ Interrelations among ideas within the text
○ Discursive versus argumentative