TRANSCRIPT
Ground Truth for Behavior Detection
Week 3 Video 1
Welcome to Week 3
Over the last two weeks, we’ve discussed prediction models
This week, we focus on a type of prediction model called behavior detectors
Behavior Detectors
Automated models that can infer from log files whether a student is behaving in a certain way
We discussed examples of these: the off-task behavior and gaming detectors
in the San Pedro et al. case study in Week 1
Behaviors people have detected
Disengaged Behaviors
Gaming the System (Baker et al., 2004, 2008, 2010; Cheng & Vassileva, 2005; Walonoski & Heffernan, 2006; Beal, Qu, & Lee, 2007)
Off-Task Behavior (Baker, 2007; Cetintas et al., 2010)
Carelessness (San Pedro et al., 2011; Hershkovitz et al., 2011)
WTF Behavior (Rowe et al., 2009; Wixon et al., 2012)
Meta-Cognitive Behaviors
Help Avoidance (Aleven et al., 2004, 2006)
Unscaffolded Self-Explanation (Shih et al., 2008; Baker, Gowda, & Corbett, 2011)
Exploration Behaviors (Amershi & Conati, 2009)
If you’re not interested in behavior detectors…
Feel free to take the week off and rejoin us in a week
Although there may still be stuff that’s interesting to you this week
Ground Truth
The first big issue with developing behavior detectors is…
Where do you get the prediction labels?
Behavior Labels are Noisy
No perfect way to get indicators of student behavior
It’s not truth
It’s ground truth
(Truthiness?)
Behavior Labels are Noisy
Another way to think of it
In some areas, there are “gold-standard” measures
As good as gold
With behavior detection, we have to work with “bronze-standard” measures
Gold’s less expensive cousin
Is this a problem?
Not really
It does limit how good we can realistically expect our models to be
If your training labels have inter-rater agreement of Kappa = 0.62
You probably should not expect (or want) your detector to have Kappa = 0.75
A detector that agrees with noisy labels better than the human coders agree with each other is probably fitting the noise in those labels, not the behavior
Ultimately
“Anything worth doing is worth doing badly.” – Herb Simon
Picture taken from CMU tribute page
Sources of Ground Truth for Behavior
Self-report
Field observations
Text replays
Video coding
Self-report
Fairly common for constructs like affect and self-efficacy
Not as common for labeling behavior
Are you gaming the system right now?
Yes No
Was recommended to me by an IRB compliance officer in 2003
Field Observations
One or more observers watch students and take systematic notes on student behavior
Takes some training to do right (Ocumpaugh et al., 2012) http://www.columbia.edu/~rsb2162/bromp.html
Free Android App developed by my group (Baker et al., 2011) http://www.columbia.edu/~rsb2162/HART8.6.zip
Text replays
Pretty-prints of student interaction behavior from the logs (a small code sketch follows at the end of this section)
Examples
Major Advantages
Blazing fast to conduct
8 to 40 seconds per observation
Notes
Decent inter-rater reliability is possible
Agree with other measures of constructs
Can be used to train behavior detectors
Major Limitations
Limited range of constructs you can code
Gaming the System – yes
Collaboration in online chat – yes (Prata et al., 2008)
Frustration, Boredom – sometimes
Off-Task Behavior outside of software – no
Collaborative Behavior outside of software – no
Major Limitations
Lower precision (because of the lower bandwidth of observation: you only see what was logged)
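To make the idea of a text replay concrete, here is a minimal Python sketch of what a replay generator might look like. The log schema, field names, and sample rows are invented for illustration; real tutor logs will have their own format.

```python
# Minimal sketch of a text replay: pretty-printing a short clip of
# logged actions so a human coder can label it (e.g., GAMING vs. NOT GAMING).
# The schema and sample rows below are hypothetical.

sample_clip = [
    {"time": 0.0, "problem": "eq-12", "step": "x+3=7", "action": "ATTEMPT", "input": "10", "outcome": "WRONG"},
    {"time": 2.1, "problem": "eq-12", "step": "x+3=7", "action": "ATTEMPT", "input": "11", "outcome": "WRONG"},
    {"time": 3.4, "problem": "eq-12", "step": "x+3=7", "action": "HELP",    "input": "",   "outcome": "HINT-1"},
    {"time": 4.0, "problem": "eq-12", "step": "x+3=7", "action": "HELP",    "input": "",   "outcome": "HINT-2"},
    {"time": 4.5, "problem": "eq-12", "step": "x+3=7", "action": "ATTEMPT", "input": "4",  "outcome": "RIGHT"},
]

def text_replay(clip):
    """Render one clip of logged actions as a compact, human-readable transcript."""
    lines = []
    for row in clip:
        lines.append(
            f"{row['time']:6.1f}s  {row['problem']:>6}  {row['step']:<8}"
            f"{row['action']:<8} {row['input']:<4} -> {row['outcome']}"
        )
    return "\n".join(lines)

print(text_replay(sample_clip))
# A coder reads a clip like this in seconds and assigns a label,
# which becomes one training observation for the detector.
```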
Video Coding
Can be videos of live behavior in classrooms or screen replay videos
Slowest method
Replicable and precise
Some challenges in camera positioning
If you don’t get this right, it can be difficult to code your data
Inter-Rater Reliability
Good to check for inter-rater reliability when using expert coding
Researchers typically try to shoot for Kappa = 0.6 or higher for the expert coding
But ignore magic numbers…
Any real signal is something you can try to re-capture
I’d rather have 1,000 data points with Kappa = 0.5
Than 100 data points with Kappa = 0.7
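For concreteness, Cohen’s Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two coders and p_e is the agreement expected by chance from each coder’s base rates. Below is a minimal, self-contained Python sketch; the coder labels are hypothetical.

```python
# Cohen's kappa between two coders' behavior labels.
# Kappa = (p_o - p_e) / (1 - p_e).

def cohens_kappa(coder_a, coder_b):
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    labels = set(coder_a) | set(coder_b)

    # Observed agreement: fraction of observations where the coders match.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: product of the coders' marginal rates, per label.
    p_e = sum(
        (coder_a.count(lab) / n) * (coder_b.count(lab) / n)
        for lab in labels
    )
    return (p_o - p_e) / (1 - p_e)

# Hypothetical field-observation labels for ten students.
a = ["GAMING", "ON-TASK", "ON-TASK", "GAMING", "OFF-TASK",
     "ON-TASK", "ON-TASK", "GAMING", "ON-TASK", "OFF-TASK"]
b = ["GAMING", "ON-TASK", "OFF-TASK", "GAMING", "OFF-TASK",
     "ON-TASK", "ON-TASK", "ON-TASK", "ON-TASK", "OFF-TASK"]
print(round(cohens_kappa(a, b), 2))  # roughly 0.68 for this sample
```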
Once you have ground truth…
You can build your detector!
Once you’ve synchronized your ground truth to the log data…
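As a preview, here is a hedged sketch of one simple synchronization scheme: attaching each observation’s label to the log actions that fall inside its observation window. The field layout and the 20-second window are assumptions for illustration; the next lecture treats synchronization and grain-size properly.

```python
# Attaching ground-truth labels to log data by time window.
# Data layout and the 20-second window are hypothetical.

observations = [  # (start_time_seconds, label) from a human coder
    (0.0,  "ON-TASK"),
    (20.0, "GAMING"),
]
log_actions = [  # (time_seconds, action) from the tutor logs
    (3.2, "ATTEMPT"), (18.9, "HELP"), (21.5, "ATTEMPT"), (24.0, "ATTEMPT"),
]

WINDOW = 20.0  # assumed length of one observation

# Each log action inherits the label of the observation window it falls in.
labeled = []
for start, label in observations:
    for t, action in log_actions:
        if start <= t < start + WINDOW:
            labeled.append((t, action, label))

for row in labeled:
    print(row)
```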
Next Lecture
Data Synchronization and Grain-Sizes