
Introduction to Feature-Aware Student Knowledge Tracing (FAST) Model and Toolkit

José P. González-Brenes, Pearson
Yun Huang, University of Pittsburgh
Acknowledging: Peter Brusilovsky, University of Pittsburgh

Outline
•  Introduction
•  FAST – Feature-Aware Student Knowledge Tracing
•  Toolkit 1-2-3
•  Walk-through examples
   1. Item difficulty
   2. Temporal Item Response Theory
•  Conclusion

Motivation
•  Personalize learning of students – for example, teach students new material as they learn, so we don't teach students material they already know
•  How? Typically with Knowledge Tracing

[Figure: example per-student practice sequences, where ✗ marks an incorrect response and ✓ a correct one; from each sequence we infer whether the student learns a skill or not.]

•  Knowledge Tracing fits a two-state HMM per skill
•  Binary latent variables indicate the student's knowledge of the skill
•  Four parameters:
   1. Initial Knowledge
   2. Learning (the transition probability)
   3. Guess (emission)
   4. Slip (emission)
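As a minimal sketch of what those four parameters do (illustrative values and names, not the toolkit's implementation), here is the standard forward update that tracks P(learned) over a response sequence:

```python
# Minimal sketch of the two-state HMM update behind Knowledge Tracing.
# Parameter values are illustrative, not fitted.

def bkt_trace(responses, p_init=0.3, p_learn=0.1, p_guess=0.2, p_slip=0.1):
    """Track P(learned) over a sequence of responses (1 = correct, 0 = incorrect)."""
    p_l = p_init
    trace = []
    for correct in responses:
        # Emission: condition the mastery estimate on the observed outcome.
        if correct:
            evidence_l = p_l * (1 - p_slip)           # knew it and didn't slip
            evidence_u = (1 - p_l) * p_guess          # didn't know it but guessed
        else:
            evidence_l = p_l * p_slip                 # knew it but slipped
            evidence_u = (1 - p_l) * (1 - p_guess)    # didn't know, didn't guess
        p_l = evidence_l / (evidence_l + evidence_u)
        # Transition: the student may learn between practice opportunities.
        p_l = p_l + (1 - p_l) * p_learn
        trace.append(round(p_l, 3))
    return trace

print(bkt_trace([0, 0, 1, 1, 1]))  # mastery estimate rises as answers turn correct
```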

What’s wrong?

•  Only uses performance data (correct or incorrect) •  We are now able to capture feature rich data

– MOOCs & intelligent tutoring systems are able to log fine-grained data

– Used a hint, watched video, after hours practice…

•  … these features can carry information or intervene on learning

What’s a researcher gotta do?

•  Modify Knowledge Tracing algorithm •  For example, just on a small-scale

literature survey, we find at least nine different flavors of Knowledge Tracing

Are all of those models sooooo different? •  No! we identify three main variants •  We call them the “Knowledge Tracing

Family”

Knowledge Tracing Family
[Diagrams: k = knowledge node, y = observed response, f = feature nodes; features can attach to the emission, the transition, or both.]
•  No features: the original Knowledge Tracing
•  Features on the emission (guess/slip):
   – Item difficulty (Gowda et al. '11; Pardos et al. '11)
   – Student ability (Pardos et al. '10)
   – Subskills (Xu et al. '12)
   – Help (Sao Pedro et al. '13)
•  Features on the transition (learning):
   – Student ability (Lee et al. '12; Yudelson et al. '13)
   – Item difficulty (Schultz et al. '13)
   – Help (Becker et al. '08)
•  Features on both (guess/slip and learning)

•  Each model is successful for an ad hoc purpose only
   – Hard to compare models
   – Doesn't help to build a cognition theory
•  Learning scientists have to worry about both features and modeling
•  These models are not scalable:
   – They rely on Bayes nets' conditional probability tables
   – Memory use grows exponentially with the number of features
   – Runtime grows exponentially with the number of features (with exact inference)

Example: emission probabilities with no features:

Knowledge    p(Correct)
False    (1) 0.10 (guess)
True     (2) 0.85 (1 − slip)

2^(0+1) = 2 parameters!

Example: emission probabilities with 1 binary feature:

Knowledge  Hint    p(Correct)
False      False   (1) 0.06
True       False   (2) 0.75
False      True    (3) 0.25
True       True    (4) 0.99

2^(1+1) = 4 parameters!

Example: emission probabilities with 10 binary features:

Knowledge  F1 … F10          p(Correct)
False      False … False     (1) 0.06
…          …                 …
True       True … True       (2048) 0.90

2^(10+1) = 2048 parameters!

Outline
•  Introduction
•  FAST – Feature-Aware Student Knowledge Tracing
   1. Low Complexity with Many Features
   2. Flexible Feature Engineering
   3. Flexible Parameterization
   4. High Predictive Performance, Plausibility and Consistency
•  Toolkit 1-2-3
•  Walk-through Examples
   1. Item difficulty
   2. Temporal Item Response Theory
•  Conclusion

Something old… [diagram: knowledge node k, observation y, feature nodes f]
•  Uses the most general model in the Knowledge Tracing Family
•  Parameterizes learning or emission (guess and slip) probabilities

Something new… [diagram: knowledge node k, observation y, feature nodes f]
•  Instead of using inefficient conditional probability tables, we use logistic regression [Berg-Kirkpatrick et al. '10]
•  Exponential complexity → linear complexity

Example (guess and slip):

# of features   # of parameters in KT   # of parameters in FAST*
0               2                       2
1               4                       4
10              2,048                   22
25              67,108,864              52

52 parameters are not that many, and yet the same 25 features become intractable with the Knowledge Tracing Family.

* Parameterizing guess and slip probabilities without sharing features.
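To make the complexity claim concrete, here is a small sketch (assumed weight layout, not the toolkit's code) of the logistic emission and of how the two parameter counts grow:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_correct(knows, features, w_learned, w_unlearned):
    """Logistic emission: P(correct | state, features) = sigmoid(w . features + bias)."""
    w = w_learned if knows else w_unlearned
    return sigmoid(w[-1] + sum(wi * fi for wi, fi in zip(w[:-1], features)))

# One binary "hint" feature: weights are [w_hint, bias] per knowledge state.
print(p_correct(False, [1.0], w_learned=[0.5, 1.7], w_unlearned=[1.3, -2.7]))

# Parameter growth: a CPT needs one cell per feature combination and state;
# a logistic regression needs one weight per feature plus a bias, per state.
for n in (0, 1, 10, 25):
    print(n, 2 ** (n + 1), 2 * (n + 1))   # matches the table above
```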

Something blue? [diagram: knowledge node k, observation y, feature nodes f]
•  Not a lot of changes are needed to implement prediction
•  Training requires quite a few changes
   – We use a recent modification of the Expectation-Maximization algorithm proposed for computational linguistics problems [Berg-Kirkpatrick et al. '10]

KT uses Expectation-Maximization:
[Diagram: a loop between Latent Knowledge Estimation (E-step: the Forward-Backward algorithm) and Conditional Probability Table lookup / update (M-step: Maximum Likelihood).]

FAST uses a recent E-M algorithm [Berg-Kirkpatrick et al. '10]:
[Diagram: the loop gains a third box: Latent Knowledge Estimation (E-step) → Logistic Regression Weights Estimation (M-step) → "Conditional Probability Table" lookup / update.]

E-step
Slip/guess lookup:

Mastery    p(Correct)
False (1)  (filled from the logistic regression)
True  (2)  (filled from the logistic regression)

Use the multiple parameters of the logistic regression to fill in the values of a conditional probability table! [Berg-Kirkpatrick et al. '10]

FAST uses a recent E-M algorithm [Berg-Kirkpatrick et al. '10]

M-step: Latent Knowledge Estimation → instance weights for the logistic regression. Train a weighted logistic regression!

P(hidden | observed), i.e., P(Learned_t | O): the probability of being in the Learned state at the t-th practice, given the student's practice sequence O.

[Figure: feature design matrix for the slip/guess logistic regression. Every observation appears twice: once as the original data for the Learned state, whose features (feature 1 … feature k) are active when mastered, with instance weight P(Learned_t | O); and once as a copy for the Unlearned state, whose features are active when not mastered, with instance weight P(Unlearned_t | O). Always-active features are shared between the two states.]
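A rough sketch of this construction (illustrative names, not the toolkit's API): every observation is emitted twice, once per hidden state, and weighted by its forward-backward posterior.

```python
# Sketch: build the doubled, weighted training set for the M-step's
# slip/guess logistic regression. Shared always-active features are
# omitted here for simplicity.

def build_weighted_data(feats, outcomes, posteriors):
    """feats[t]: feature vector (incl. bias); outcomes[t]: 0/1 response;
    posteriors[t]: P(Learned_t | O) from the forward-backward pass."""
    X, y, w = [], [], []
    zeros = [0.0] * len(feats[0])
    for f, o, p in zip(feats, outcomes, posteriors):
        X.append(list(f) + zeros); y.append(o); w.append(p)        # Learned copy
        X.append(zeros + list(f)); y.append(o); w.append(1.0 - p)  # Unlearned copy
    return X, y, w
```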

Parameterization example
•  To model the impact of example usage, we construct a binary example feature E_t: whether the student clicked an example before the current practice.
•  This feature affects the guess and slip probabilities: when a student has checked an example, does he/she have a higher probability of guessing correctly (and a lower probability of slipping)?

Parameterization example
•  Mary attempted a problem twice.
•  On the 1st attempt, she failed.
•  She then checked an example. On the 2nd attempt, she succeeded.

Each observation is copied once per hidden state, and a weighted logistic regression is trained to obtain the coefficients s2 (feature E_t for slip), s1 (bias for slip), g2 (feature E_t for guess) and g1 (bias for guess):

Outcome     E_t (slip)  bias (slip)  E_t (guess)  bias (guess)  instance weight
incorrect   0           1            0            0             P(Learned_1 | O) = 0.3
correct     1           1            0            0             P(Learned_2 | O) = 0.6
incorrect   0           0            0            1             P(Unlearned_1 | O) = 0.7
correct     0           0            1            1             P(Unlearned_2 | O) = 0.4

The first two rows are the original data for the Learned state; the last two are a copy of the data for the Unlearned state. The instance weight of each row is the standard forward-backward posterior for that observation.
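As a toy illustration of the M-step on Mary's four rows (using scikit-learn's weighted fit rather than the toolkit's own trainer):

```python
from sklearn.linear_model import LogisticRegression

# Columns follow the table above: [E_t for slip, bias for slip,
#                                  E_t for guess, bias for guess].
X = [[0, 1, 0, 0],   # Learned copy, attempt 1 (incorrect)
     [1, 1, 0, 0],   # Learned copy, attempt 2 (correct, example checked)
     [0, 0, 0, 1],   # Unlearned copy, attempt 1
     [0, 0, 1, 1]]   # Unlearned copy, attempt 2
y = [0, 1, 0, 1]                 # observed outcomes
w = [0.3, 0.6, 0.7, 0.4]         # instance weights from the E-step posteriors

clf = LogisticRegression(fit_intercept=False)  # biases are explicit columns
clf.fit(X, y, sample_weight=w)
print(clf.coef_)                 # coefficients in the order [s2, s1, g2, g1]
```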

Parameterization example
[Figure: the same feature design matrix as before, highlighting the slip/guess logistic regression: features active when mastered, features active when not mastered, and always-active features, with instance weights P(Learned_t | O) and P(Unlearned_t | O).]

When FAST uses only intercept terms as features for the two levels of mastery, it is equivalent to Knowledge Tracing!
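A quick numeric check of that equivalence, with illustrative weights: intercept-only logistic emissions are two constants, which play exactly the roles of KT's guess and 1 − slip.

```python
import math

sigmoid = lambda z: 1 / (1 + math.exp(-z))

# Only the two per-state intercepts are active; the weights are illustrative.
w_bias_unlearned, w_bias_learned = -2.197, 1.735
print(sigmoid(w_bias_unlearned))  # ~0.10 = guess (cf. the CPT example earlier)
print(sigmoid(w_bias_learned))    # ~0.85 = 1 - slip
```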


[Figure: execution time (min.) vs. number of observations (7,100; 11,300; 15,500; 19,800). BNT-SM (no features): 23, 28, 46 and 54 minutes; FAST (no features): 0.08, 0.10, 0.12 and 0.15 minutes.]

FAST is 300x faster than BNT-SM!
(On an old laptop, no parallelization, nothing fancy)

BNT-SM vs FAST
•  BNT-SM contains other functionality that FAST doesn't have. For example, it allows different ways to learn parameters.
•  We recommend exploring different tools to find the best fit.

Outline
•  Introduction
•  FAST – Feature-Aware Student Knowledge Tracing
   1. Low Complexity with Many Features
   2. Flexible Feature Engineering
   3. Flexible Parameterization
   4. High Predictive Performance, Plausibility and Consistency
•  Toolkit 1-2-3
•  Walk-through Examples
   1. Item difficulty
   2. Temporal Item Response Theory
•  Conclusion

What kind of features can we use?
•  Item dummies → incorporating item difficulties
•  Student dummies → incorporating student abilities
•  Item and student dummies → Temporal Item Response Theory, incorporating both item difficulties and student abilities
•  Subskill dummies → incorporating subskill difficulties

What kind of features can we use?
•  Binary hint features → whether a student requested a hint or not
•  Binary example features → whether a student checked an example or not
•  …

Outline
•  Introduction
•  FAST – Feature-Aware Student Knowledge Tracing
   1. Low Complexity with Many Features
   2. Flexible Feature Engineering
   3. Flexible Parameterization
   4. High Predictive Performance, Plausibility and Consistency
•  Toolkit Setup
•  Examples
   1. Item difficulty
   2. Multiple subskills
   3. Temporal Item Response Theory
•  Conclusion

What can we parameterize?
•  For example, to model the impact of example usage, we can consider the following parameterizations with a binary example feature (Huang et al. '15).

Outline
•  Introduction
•  FAST – Feature-Aware Student Knowledge Tracing
   1. Low Complexity with Many Features
   2. Flexible Feature Engineering
   3. Flexible Parameterization
   4. High Predictive Performance, Plausibility and Consistency
•  Toolkit 1-2-3
•  Walk-through Examples
   1. Item difficulty
   2. Temporal Item Response Theory
•  Conclusion

Beyond higher predictive performance…
•  FAST promises higher predictive performance than Knowledge Tracing with proper feature engineering.
•  Moreover, it increases model plausibility and consistency.
•  Details are in our EDM 2015 paper (http://www.educationaldatamining.org/EDM2015/uploads/papers/paper_164.pdf). A quick introduction to how the FAST toolkit addresses these issues:
   – Given a specified number of random restarts, it automatically picks the run with the maximum log-likelihood on the training set,
   – It outputs plausibility evaluation metrics.

Outline
•  Introduction
•  FAST – Feature-Aware Student Knowledge Tracing
   1. Low Complexity with Many Features
   2. Flexible Feature Engineering
   3. Flexible Parameterization
   4. High Predictive Performance, Plausibility and Consistency
•  Toolkit 1-2-3
•  Walk-through Examples
   1. Item difficulty
   2. Temporal Item Response Theory
•  Conclusion

Toolkit Setup -- input
1. Download the latest release from https://github.com/ml-smores/fast/releases
2. Decompress the file (fast-2.1.0-release.zip). The main files to start with are:
   •  fast-2.1.0-final.jar
   •  data/item_exp/FAST+item1.conf (configuration file)
   •  data/item_exp/train0.csv, test0.csv (data)
3. Open a terminal, go to the directory where fast-2.1.0-final.jar is located, and type:
   java -jar fast-2.1.0-final.jar ++data/item_exp/FAST+item1.conf

Details can be found in our wiki: https://github.com/ml-smores/fast/wiki

Toolkit Setup -- output (written to data/item_exp/)
•  XXX_Prediction.csv
   – P(Correct)
   – Knowledge estimation: P(Learned | O) …
•  XXX_Evaluation.csv
   – Overall AUC, mean AUC …
•  XXX_Parameters.csv
   – Non-parameterized probabilities
   – Parameterized: feature weights
•  Runtime.log

Outline
•  Introduction
•  FAST – Feature-Aware Student Knowledge Tracing
   1. Low Complexity with Many Features
   2. Flexible Feature Engineering
   3. Flexible Parameterization
   4. High Predictive Performance, Plausibility and Consistency
•  Toolkit 1-2-3
•  Walk-through Examples
   1. Item difficulty
   2. Temporal Item Response Theory
•  Conclusion

Modeling item difficulty
•  Within the same skill, students may perform well on easier items (problems) and worse on harder items.
•  Presumably, harder items have a lower guess and a higher slip probability?
•  We use binary item dummies (indicators) as features to parameterize the guess and slip probabilities.
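If a dataset doesn't already carry item dummies, something like the following pandas sketch could generate them (the "problem" column name follows the slides; the "features_item" prefix and output file name are assumptions, see the wiki for the exact format the toolkit expects):

```python
import pandas as pd

# Turn the "problem" column into one binary indicator column per item.
df = pd.read_csv("data/item_exp/train0.csv")
item_dummies = pd.get_dummies(df["problem"], prefix="features_item").astype(int)
df = pd.concat([df, item_dummies], axis=1)
df.to_csv("data/item_exp/train0_with_item_dummies.csv", index=False)
```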

Results on the Java dataset
•  Java tutoring system QuizJET (Hsiao et al. '10)
•  20,808 observations, 19 skills, 110 students, 70% correct.
•  Randomly selected 80% of the data for training, 20% for testing
•  Parameterizing emission

                    Overall AUC   Mean AUC
Knowledge Tracing   .71 ± .01     .58
FAST+item           .75 ± .01     .68

6% improvement in overall AUC; 17% improvement in mean AUC (± are 95% confidence intervals).

Current experiment
•  Here we experiment on a public dataset from PSLC DataShop, the Geometry dataset (Koedinger et al. '10).
•  5,055 observations, 18 skills, 59 students, 75% correct.
•  Randomly selected 80% of students for training, the remaining for testing.

Model        #Random restarts   Parameterization
KT1          1                  /
KT2          20                 /
FAST+item1   1                  emission
FAST+item2   20                 emission
FAST+item3   1                  initial, transition, emission

Have a look at the input data… (data/item_exp/train0.csv)
•  Required columns for KT and FAST
•  Feature columns for FAST models

Have a look at the configuration file KT1.conf (data/item_exp/KT1.conf):

modelName KT1
parameterizing false
parameterizingInit false
parameterizingTran false
parameterizingEmit false
forceUsingAllInputFeatures false
nbRandomRestart 1
inDir ./data/item_exp/
outDir ./data/item_exp/
trainInFilePrefix train
testInFilePrefix test
inFileSuffix .csv
EMMaxIters 500
LBFGSMaxIters 50
EMTolerance 1.0E-6
LBFGSTolerance 1.0E-6

Let’s run Knowledge Tracing baseline first … •  java -jar fast-2.1.0-final.jar ++data/item_exp/KT1.conf •  Open KT1_Evaluation.csv (data/item_exp/)

Model #restart Overall AUC Mean AUC Time(s)

KT1 1 .71 .55 1

•  KT2.conf only changes nbRandomRestarts to 20 •  java -jar fast-2.1.0-final.jar ++data/item_exp/KT2.conf •  Open KT2_Evaluation.csv

Model #restart Overall AUC Mean AUC Time(s)

KT2 20 .71 .56 11

Let’s run FAST with item features …

modelName FAST+item1 parameterizing true parameterizingInit false parameterizingTran false parameterizingEmit true forceUsingAllInputFeatures true nbRandomRestart 1

inDir ./data/item_exp/ outDir ./data/item_exp/ trainInFilePrefix train testInFilePrefix test inFileSuffix .csv EMMaxIters 500 LBFGSMaxIters 50 EMTolerance 1.0E-6 LBFGSTolerance 1.0E-6

data/item_exp/FAST+item1.conf

Let’s run FAST+item parameterizing emission probabilities …

•  Run FAST+item1: java -jar fast-2.1.0-final.jar ++data/item_exp/FAST+item1.conf

•  Run FAST+item2 (nbRandomRestart=20): java -jar fast-2.1.0-final.jar ++data/item_exp/FAST+item2.conf

•  Open FAST+item1_Evaluation.csv, FAST+item2_Evaluation.csv

Model #restart Overall AUC Mean AUC Time(s)

KT2 20 .71 .56 11

FAST+item1 1 .71 .58 10

FAST+item2 20 .72 .60 145

7% improvement

What about parameterizing all the probabilities? (data/item_exp/FAST+item3.conf)

modelName FAST+item3
parameterizing true
parameterizingInit true
parameterizingTran true
parameterizingEmit true
forceUsingAllInputFeatures true
nbRandomRestart 1
inDir ./data/item_exp/
outDir ./data/item_exp/
trainInFilePrefix train
testInFilePrefix test
inFileSuffix .csv
EMMaxIters 500
LBFGSMaxIters 50
EMTolerance 1.0E-6
LBFGSTolerance 1.0E-6

What about parameterizing all the probabilities?
•  java -jar fast-2.1.0-final.jar ++data/item_exp/FAST+item3.conf
•  Open FAST+item3_Evaluation.csv
•  Running with all probabilities parameterized and 20 restarts takes more than 7 minutes, yet we get the same result with only 1 restart (FAST+item3).

Model        #restarts   Parameterization                 Overall AUC   Mean AUC   Time (s)
KT2          20          /                                .71           .56        11
FAST+item3   1           initial, transition, emission    .72           .62        27

11% improvement (mean AUC: .56 → .62)

Outline
•  Introduction
•  FAST – Feature-Aware Student Knowledge Tracing
   1. Low Complexity with Many Features
   2. Flexible Feature Engineering
   3. Flexible Parameterization
   4. High Predictive Performance, Plausibility and Consistency
•  Toolkit 1-2-3
•  Walk-through Examples
   1. Item difficulty
   2. Temporal Item Response Theory
•  Conclusion

Two paradigms (50 years of research in 1 slide)
•  Knowledge Tracing
   – Allows learning
   – Every item = same difficulty
   – Every student = same ability
•  Item Response Theory
   – NO learning
   – Models item difficulties
   – Models student abilities

Can FAST help merge the paradigms?

Item Response Theory
•  In its simplest form, it's the Rasch model
•  The Rasch model can be formulated in many ways:
   – Typically using latent variables
   – As a logistic regression with:
      •  a feature per student
      •  a feature per item
•  We end up with a lot of features! Good thing we are using FAST ;-)
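As a standalone sketch of that logistic-regression view of the Rasch model (the "student"/"problem"/"outcome" column names are assumptions; this is an illustration, not part of the toolkit):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Rasch as logistic regression: one dummy per student (ability) and one per
# item (difficulty).
df = pd.read_csv("data/IRT_exp/train0.csv")
X = pd.concat([pd.get_dummies(df["student"], prefix="ability"),
               pd.get_dummies(df["problem"], prefix="difficulty")], axis=1)
y = df["outcome"]                  # 1 = correct, 0 = incorrect
rasch = LogisticRegression(C=1e6)  # near-zero regularization ~ plain Rasch
rasch.fit(X, y)                    # coefficients = abilities and difficulties
```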

Results on the Java dataset
•  Java tutoring system QuizJET (Hsiao et al. '10)
•  6,549 observations (first attempts), 60% correct.
•  Randomly selected 50% of students for training; for each remaining student, the first half of their practices goes in training and we predict the rest
•  Only parameterizing emission

                    Overall AUC   Mean AUC
Knowledge Tracing   .67 ± .03     .56
FAST + IRT          .76 ± .03     .70

13% improvement in overall AUC; 25% improvement in mean AUC (± are 95% confidence intervals).

Current experiment
•  We choose one skill, "Nested Loops", from the Java dataset (caution: this is a private dataset, please don't distribute this subset).
•  Randomly selected 50% of students for training; for each remaining student, the first half of their practices goes in training and we predict the rest
•  Only parameterizing emission
•  The toolkit can automatically generate student and item dummy features from the "student" and "problem" columns of the train and test sets.
•  Here, we force both hidden states to share features, which means the student ability or item difficulty stays the same whether the student is in the learned or unlearned state.

Have a look at the input data…
[Figure: screenshots of the train and test sets, marking which datapoints are used for training and which for testing.]
•  We need to put each entire skill-student sequence in the test set (using the "fold" column to differentiate train from test datapoints). This allows the toolkit to update its knowledge estimates from the historical practices.
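A pandas sketch of how such a fold column could be produced for the held-out students (the file name and fold labels are assumptions; check the wiki for the exact values the toolkit expects):

```python
import pandas as pd

# Mark the first half of each held-out student's sequence as train and the
# rest as test, while keeping the whole sequence together in one file.
df = pd.read_csv("data/IRT_exp/heldout_students.csv")  # hypothetical file
df["fold"] = "train"
for _, idx in df.groupby("student").groups.items():
    half = len(idx) // 2
    df.loc[idx[half:], "fold"] = "test"  # first half stays train, rest is test
```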

Have a look at the configuration file for FAST… (data/IRT_exp/FAST+IRT1.conf)

modelName FAST+IRT1
parameterizing true
parameterizingInit false
parameterizingTran false
parameterizingEmit true
forceUsingAllInputFeatures true
generateStudentDummy true
generateItemDummy true
nbRandomRestart 1
inDir ./data/IRT_exp/
outDir ./data/IRT_exp/
trainInFilePrefix train
testInFilePrefix test
inFileSuffix .csv
EMMaxIters 500
LBFGSMaxIters 50
EMTolerance 1.0E-6
LBFGSTolerance 1.0E-6

Let’s run baselines and FAST+IRT models …

•  Type the following command consecutively for 1) KT1.conf, 2) KT2.conf, 3) FAST+item1.conf, 4) FAST+item2.conf

java -jar fast-2.1.0-final.jar ++data/IRT_exp/XXX.conf •  Open XXX_Evaluation.csv

Model #restart AUC Time(s)

KT1 1 .60 <1

KT2 20 .59 1

FAST+IRT1 1 .71 3

FAST+IRT2 20 .73 39

24% improvement

Outline
•  Introduction
•  FAST – Feature-Aware Student Knowledge Tracing
   1. Low Complexity with Many Features
   2. Flexible Feature Engineering
   3. Flexible Parameterization
   4. High Predictive Performance, Plausibility and Consistency
•  Toolkit 1-2-3
•  Walk-through Examples
   1. Item difficulty
   2. Temporal Item Response Theory
•  Conclusion

Comparison of existing techniques

                                             allows features   slip/guess   recency/ordering   learning
FAST                                         ✓                 ✓            ✓                  ✓
PFA (Pavlik et al. '09)                      ✓                 ✗            ✗                  ✓
Knowledge Tracing (Corbett & Anderson '95)   ✗                 ✓            ✓                  ✓
Rasch Model (Rasch '60)                      ✓                 ✗            ✗                  ✗

•  FAST lives up to its name
•  FAST provides high flexibility in utilizing features, and as our studies show, even simple features improve significantly over Knowledge Tracing
•  The effect of features depends on how smartly they are designed, and on the dataset
•  We look forward to more clever uses of feature engineering for FAST in the community.

Thank you!

Multiple subskills
•  Experts annotated items (questions) with a single skill and multiple subskills

Multiple subskills & Knowledge Tracing
•  Original Knowledge Tracing cannot model multiple subskills
•  Most Knowledge Tracing variants assume equal importance of subskills during training (and then adjust it during testing)
•  The state-of-the-art method, LR-DBN [Xu and Mostow '11], assigns importance in both training and testing

FAST can handle multiple subskills
•  Parameterize learning
•  Parameterize slip and guess
•  Features: binary variables that indicate the presence of subskills
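A sketch of building such subskill indicators from an annotation column (the "subskills" column name and comma separator are assumptions about the data layout):

```python
import pandas as pd

# Expand a comma-separated subskill annotation (e.g. "loops,arrays") into
# one binary indicator column per subskill.
df = pd.read_csv("data/others/FAST+subskill_train0.csv")
indicators = df["subskills"].str.get_dummies(sep=",").add_prefix("features_")
df = pd.concat([df, indicators], axis=1)
```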

FAST vs Knowledge Tracing: slip parameters of subskills
•  Conventional Knowledge Tracing assumes that all subskills have the same difficulty (red line)
•  FAST can identify different difficulties between subskills
•  Does it matter?
[Figure: estimated slip parameters for the subskills within a skill.]

State of the art (Xu & Mostow’11)

•  The 95% of confidence intervals are within +/- .01 points

Model AUC

LR-DBN .71

KT - Weakest .69 KT - Multiply .62

Benchmark Model AUC

LR-DBN .71 Single-skill KT .71 KT - Weakest .69 KT - Multiply .62

•  The 95% of confidence intervals are within +/- .01 points •  We are testing on non-overlapping students, LR-DBN was

designed/tested in overlapping students and didn’t compare to single skill KT


Benchmark
•  The 95% confidence intervals are within ±.01 points

Model             AUC
FAST              .74
LR-DBN            .71
Single-skill KT   .71
KT - Weakest      .69
KT - Multiply     .62

Have a look at the input data… (data/others/FAST+subskill_train0.csv)

Let's run the FAST+subskill models…
•  Move to the "data/others/" folder
•  Copy fast-2.1.0-final.jar into this folder
•  Create an "input" folder under it, and put FAST+subskill_train0.txt and FAST+subskill_test0.txt inside
•  Type the following command:
   java -jar fast-2.1.0-final.jar ++FAST+subskill.conf
