
Page 1: Rohan's Masters presentation

"What do you know?" Latent feature approach for the

Kaggle's GrockIt challenge

Rohan Anil
Advised by Prof. Charles Elkan
In collaboration with Aditya Menon

UC San Diego
March 19, 2012

Page 2: Rohan's Masters presentation

Outline
● Introduction
● Kaggle.com
● GrockIt
● "What do you know?" Challenge
● Latent Feature Log-Linear (LFL)
● Ensemble Learning
● Our Results
● Q/A

Page 3: Rohan's Masters presentation

Kaggle.com

Page 4: Rohan's Masters presentation

"What do you know?" - Competition

1st Prize: $3,000
2nd Prize: $1,500
3rd Prize: $500

Page 5: Rohan's Masters presentation

GrockIt.com

Page 6: Rohan's Masters presentation

Dataset

Training Set

4,851,476 outcomes of students answering various questions

Outcomes

Four types: i) correct, ii) incorrect, iii) skipped, iv) timed-out.

Students practicing for competitive exams: i) GMAT, ii) ACT, and iii) SAT.

Page 7: Rohan's Masters presentation

Dataset

Page 8: Rohan's Masters presentation

Dataset

Differences between the training set and the test set:

Bias: Biased towards users who have answered more questions.

#Responses: Only one response per student.

Temporal: Outcomes are later in time than the training responses and validation responses of that student.

Outcomes: The test-set distribution is different from the training set; it does not include timed-out or skipped outcomes.

Page 9: Rohan's Masters presentation

Baseline

Rasch Baseline: A baseline based on the Rasch model (Rasch, 1960) was provided by Kaggle for the dataset.

Page 10: Rohan's Masters presentation

β_s - ability of the student s

δ_q - difficulty of question q

For a given student s (fixed β_s):

– The probability of answering a question correctly depends only on the difficulty of the question q.

– A consequence is that for every student, the ranking of questions in terms of the probability of answering them correctly is the same.
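In its standard form (Rasch, 1960), the model gives the probability of a correct response as a logistic function of ability minus difficulty:

$$p(\text{correct} \mid s, q) = \frac{\exp(\beta_s - \delta_q)}{1 + \exp(\beta_s - \delta_q)}$$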


Page 11: Rohan's Masters presentation

Dataset

Validation set: Grockit created a validation set which contains responses of 80,075 students on different questions.

Test set: The test set was used for ranking the teams; it contains responses of 93,100 users on different questions.

Page 12: Rohan's Masters presentation

Dyadic Prediction

A dyadic prediction task is a learning task which involves predicting a class label for a pair of items (Hofmann et al., 1999).

Page 13: Rohan's Masters presentation

Side-Information

Sometimes there is more information in the dataset:

1. side-information associated with user u

2. side-information associated with item i

3. interaction side-information for the pair (u, i)

Page 14: Rohan's Masters presentation

Interpreting the task as a collaborative filtering problem

The dataset contains student responses for various questions.

179,107 students and 6,046 questions

Page 15: Rohan's Masters presentation

(Figure: student-question outcomes, including skipped and timed-out responses)

Page 16: Rohan's Masters presentation

Nominal Outcomes
● Correct
● Incorrect
● Timed-Out
● Skipped

Page 17: Rohan's Masters presentation

Dyadic Prediction

(Figure: the training set as (student, question) dyads with observed outcomes)

Page 18: Rohan's Masters presentation

Dyadic Prediction

(Figure: a query in the test set is a (student, question) dyad whose outcome must be predicted)

Page 19: Rohan's Masters presentation

Side Information in the dataset

Associated with a student: Not Available

Associated with a question: Question Type, Group, Track, Subtrack, Tags

Associated with a (student, question) dyad: Game, Number of Players, Started at, Answered at, Deactivated at, Question set

Page 20: Rohan's Masters presentation

Side Information

Question Type: Multiple Choice, Free Response

Group: ACT, GMAT, SAT

Subtrack: Critical Reasoning, Data Sufficiency, English, Identifying Sentence Errors, Improving Paragraphs, Improving Sentences, Math, Multiple Choice, Passage Based Reading, Problem Solving, Reading, Reading Comprehension, Science, Sentence Completion, Sentence Correction, Student Produced Response

Tags: describe the skill that is needed to solve the question.

Page 21: Rohan's Masters presentation

Dataset

Page 22: Rohan's Masters presentation

Dataset

Page 23: Rohan's Masters presentation

Dataset

Page 24: Rohan's Masters presentation

The dataset is similar to the typical dyadic dataset, with a couple of key differences:

● Duplicate Dyads: There can exist duplicate dyad pairs in the training set with different outcomes, since a student can answer a question many times.

● Collaborative or Competitive Answering: In some game types, students can collaboratively answer questions.

Page 25: Rohan's Masters presentation

Motivation for Latent feature approach

Latent feature models were highly successful in winning the $1M Netflix Prize challenge (Töscher et al., 2009), where the problem was to predict ratings for movies.

Page 26: Rohan's Masters presentation

Metric used to rank the teams

Binomial Capped Deviance (BCD), similar to the log-likelihood.

p̂: estimated probability of a correct response, capped to the range [0.01, 0.99].

y: true label of the dyad.
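One standard form of this metric, averaging the log loss over the N test dyads using the capped probabilities p̂_i, is:

$$\text{BCD} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\Big], \qquad \hat{p}_i \in [0.01, 0.99]$$

(Kaggle's variant may use a different logarithm base; the capping keeps the deviance finite.)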

Page 27: Rohan's Masters presentation

Leaderboard

Page 28: Rohan's Masters presentation

Latent feature log-linear

Motivations for Latent Feature Log-Linear (LFL) (Menon & Elkan, 2010):

Well-calibrated Probabilities: We need to predict the probability of a correct outcome for the dyadic pairs in the test set.

Leverage Side-Information: Most collaborative filtering algorithms do not have any principled way of including side-information.

Scale Well: To be used in industry, the method has to scale well to large datasets.

Page 29: Rohan's Masters presentation

Multiclass LFL model

Page 30: Rohan's Masters presentation

Multiclass LFL model

Case |Y| = 3. Each class c has its own latent vectors U^c_user and I^c_item:

$$p(y = 3 \mid (\text{user}, \text{item})) = \frac{\exp(U^3_{\text{user}} \cdot I^3_{\text{item}})}{Z}$$

$$Z = \exp(U^1_{\text{user}} \cdot I^1_{\text{item}}) + \exp(U^2_{\text{user}} \cdot I^2_{\text{item}}) + \exp(U^3_{\text{user}} \cdot I^3_{\text{item}})$$

Page 31: Rohan's Masters presentation

Binary LFL on the dataset

The test set contains only two types of outcomes: i) correct, ii) incorrect.

y = 1 (correct response)

y = 0 (incorrect response)

The binary LFL model has appeared in the literature before (Schein et al., 2003; Agarwal & Chen, 2009).
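With two classes, the softmax reduces (up to bias terms) to a logistic function of the inner product of the student and question latent vectors:

$$p(y = 1 \mid (s, q)) = \sigma(U_s \cdot I_q) = \frac{1}{1 + \exp(-U_s \cdot I_q)}$$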

Page 32: Rohan's Masters presentation

Training

We minimize the negative log-likelihood.

We can optimize this objective function using the stochastic gradient descent method.

Regularization terms are added to the objective.
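With ℓ2 penalties on the latent matrices, a standard choice, the regularized objective has the form:

$$J(U, I) = -\sum_{(s, q, y)} \log p(y \mid (s, q)) + \lambda_U \|U\|_F^2 + \lambda_I \|I\|_F^2$$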

Page 33: Rohan's Masters presentation

Stochastic Gradient Descent
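For the binary model, the per-example SGD updates follow from the gradient of the objective above, with learning rate η and prediction error (p̂ − y):

$$U_s \leftarrow U_s - \eta\big[(\hat{p} - y)\, I_q + \lambda_U U_s\big], \qquad I_q \leftarrow I_q - \eta\big[(\hat{p} - y)\, U_s + \lambda_I I_q\big]$$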

Page 34: Rohan's Masters presentation

LFL on GrockIt

Page 35: Rohan's Masters presentation

Stochastic Gradient Descent
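As a concrete illustration of the training loop, here is a minimal NumPy sketch of binary LFL trained by SGD. The function name and hyperparameter values are illustrative, not the competition code:

import random
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary_lfl(dyads, n_students, n_questions, k=10,
                     lr=0.05, lam=0.01, epochs=5, seed=0):
    """Binary LFL trained by per-example SGD.

    dyads: list of (student_id, question_id, y) with y in {0, 1}.
    Returns latent matrices U (n_students x k) and I (n_questions x k).
    """
    rng = np.random.default_rng(seed)
    U = 0.01 * rng.standard_normal((n_students, k))
    I = 0.01 * rng.standard_normal((n_questions, k))
    for _ in range(epochs):
        random.shuffle(dyads)
        for s, q, y in dyads:
            p = sigmoid(U[s] @ I[q])          # predicted p(correct)
            err = p - y                       # gradient of the log loss
            # simultaneous update of both latent vectors
            U[s], I[q] = (U[s] - lr * (err * I[q] + lam * U[s]),
                          I[q] - lr * (err * U[s] + lam * I[q]))
    return U, I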

Page 36: Rohan's Masters presentation

Grid Search

Hyperparameters (e.g., the latent dimension k, the learning rate η, and the regularization strength λ) were selected by grid search.
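A sketch of such a search, reusing the train_binary_lfl function above and assuming a capped_binomial_deviance helper that scores a model on the validation set (the search ranges and helper are illustrative):

from itertools import product

# Hypothetical search ranges; the actual grid is not given in the slides.
grid = {
    "k":   [5, 10, 20],       # latent dimension
    "lr":  [0.01, 0.05],      # SGD learning rate
    "lam": [0.001, 0.01],     # l2 regularization strength
}

best = None
for k, lr, lam in product(grid["k"], grid["lr"], grid["lam"]):
    U, I = train_binary_lfl(train_dyads, n_students, n_questions,
                            k=k, lr=lr, lam=lam)
    score = capped_binomial_deviance(U, I, validation_dyads)
    if best is None or score < best[0]:
        best = (score, {"k": k, "lr": lr, "lam": lam})

print("best hyperparameters:", best[1])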

Page 37: Rohan's Masters presentation

Parallel SGD Training

The same scheme was independently formulated by Gemulla et al. (2011).
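The key idea in that scheme: partition students and questions into P groups so that, within each sub-epoch, the selected blocks touch disjoint rows of U and I and can therefore be updated in parallel without locking. A sketch follows; the modulo partitioning and worker setup are illustrative, and in CPython the threads illustrate the scheduling more than they deliver real speedup:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def dsgd_epoch(dyads, U, I, P=4, lr=0.05, lam=0.01):
    """One epoch of block-parallel SGD in the style of Gemulla et al. (2011)."""
    # bucket the training dyads by (student group, question group)
    blocks = {(i, j): [] for i in range(P) for j in range(P)}
    for s, q, y in dyads:
        blocks[(s % P, q % P)].append((s, q, y))

    def run_block(block):
        for s, q, y in blocks[block]:
            p = 1.0 / (1.0 + np.exp(-(U[s] @ I[q])))
            err = p - y
            U[s], I[q] = (U[s] - lr * (err * I[q] + lam * U[s]),
                          I[q] - lr * (err * U[s] + lam * I[q]))

    with ThreadPoolExecutor(max_workers=P) as pool:
        for shift in range(P):
            # blocks (i, (i + shift) % P) touch disjoint student rows and
            # disjoint question rows, so they can run concurrently
            stratum = [(i, (i + shift) % P) for i in range(P)]
            list(pool.map(run_block, stratum))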

Page 38: Rohan's Masters presentation

KDD CUP, Spring, 2011

This is us!!! =)

Page 39: Rohan's Masters presentation

Parallelism

Page 40: Rohan's Masters presentation

Side-Information

For a question q, let g = group(q). We can add a latent vector for each group, i.e. ACT, GMAT, SAT.

The prediction equation after adding side information is:
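The slide's formula is an image; one plausible form, following the LFL pattern of adding a latent vector G_g per group value, is:

$$p(y = 1 \mid s, q) = \sigma\big(U_s \cdot I_q + U_s \cdot G_{\text{group}(q)}\big)$$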

Page 41: Rohan's Masters presentation

Categorical Features

Group – G

Track – T

Subtrack – ST

Game Type – GT

Question Type – QT

Page 42: Rohan's Masters presentation

LFL Models

Page 43: Rohan's Masters presentation

Training Set

The training set contains four types of outcomes:

i) correct, ii) incorrect, iii) skipped and iv) timed-out.

The test set contains two types of outcomes:

i) correct, ii) incorrect.

We create two training sets:

a) Training set with skipped and timed-out responses excluded.

b) Training set with skipped and timed-out responses treated as incorrect outcomes.

Page 44: Rohan's Masters presentation

Results from LFL Models (a)

Page 45: Rohan's Masters presentation

Results from LFL Models (b)

Page 46: Rohan's Masters presentation

Observation

Throwing away data helps! Removing skipped and timed-out responses from the training set improved the BCD (binomial capped deviance).

This motivates adapting the model to the test-set distribution to win the competition.

Page 47: Rohan's Masters presentation

Ensemble Learning

No single model works well on every dyad. Combining predictions from multiple models can outperform each of the individual models (Takács et al., 2009).

The $1M Netflix Prize was won by a blend of multiple models.

Page 48: Rohan's Masters presentation

Intuition for Ensemble Learning

True labels for four samples: (1, 1, 0, 0)

Predictions from four different models:
(0, 1, 0, 0) – accuracy 75%
(1, 0, 0, 0) – accuracy 75%
(1, 1, 1, 0) – accuracy 75%
(1, 1, 0, 1) – accuracy 75%

Average of the different models: (.75, .75, .25, .25)

Threshold the average at 0.5: (1, 1, 0, 0) – accuracy 100%

Page 49: Rohan's Masters presentation

Using Linear Regression for Combining Predictions

For a set with known labels, {(s, q) → y(s, q)}, where y can take the value 0 or 1.

p_i = p_i(y = 1 | (s, q)) is the estimated probability of a correct response from the i-th model.

Define a matrix P and a column vector Y, where each row of P contains the predictions from the n models, (p_1, ..., p_i, ..., p_n), and Y contains the target value y(s, q).

Using the predictions for every dyad in the set, we fill P and Y, and then solve

Pw = Y

Page 50: Rohan's Masters presentation

To predict the probability of a correct response for an example in the test set, we combine the predictions from the n models using the weight vector w:

$$p_{\text{estimated}} = \sum_j w_j\, p_j$$
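A minimal NumPy sketch of this blending step (the array names are illustrative):

import numpy as np

# P_val: (num_dyads x n_models) validation predictions; y_val: 0/1 labels.
# Solve the least-squares problem P w = Y for the blending weights.
w, *_ = np.linalg.lstsq(P_val, y_val, rcond=None)

# Blend the test-set predictions with the learned weights, then cap to
# the metric's allowed range [0.01, 0.99].
p_blend = np.clip(P_test @ w, 0.01, 0.99)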


Page 51: Rohan's Masters presentation

Which set to use?

Step 1: For each of the n models: train on the training set, predict on the validation set, save the parameters.

Step 2: Estimate w using linear regression on the validation-set predictions.

Step 3: For each of the n models: train on the training set + validation set, predict on the test set.

Step 4: Combine the predictions on the test set using w.

Page 52: Rohan's Masters presentation

Results

Page 53: Rohan's Masters presentation

After combining predictions using linear regression

Page 54: Rohan's Masters presentation

2 weeks later

Page 55: Rohan's Masters presentation

some weeks later..

Page 56: Rohan's Masters presentation

Gradient Boosted Decision Trees

Leverage Side-Information in Ensemble learning

The Gradient Boosted Decision Trees (GBDT) algorithm (Friedman, 1999) can be used to combine predictions and side information together.

Popular algorithm

GBDT is a powerful learning algorithm that is widely used (see Li & Xu, 2009, chap. 6)

The core of the algorithm is a decision tree learner

Page 57: Rohan's Masters presentation

Decision Tree

Decision trees can handle both i) numeric and ii) categorical variables.

They can also handle missing information.

Page 58: Rohan's Masters presentation

Decision Tree

(Figure: a decision tree; each internal node applies a decision function, and each leaf predicts the average of the training targets that reach it, e.g. (Y6 + Y7 + Y9) / 3 or (Y1 + Y3) / 2.)

Page 59: Rohan's Masters presentation

Gradient Boosting

Select the base learner and the loss function:
● Decision tree as the base learner, and squared loss as the loss function.

Gradient boosting is an iterative procedure:
● Iteratively fit a base learner to the negative gradient of the loss from the previous iteration (for squared loss, the residuals), as sketched below.
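A minimal sketch of this loop for squared loss, using scikit-learn's regression tree as the base learner (illustrative, not the competition code):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, lr=0.1, max_depth=3):
    """Gradient boosting with squared loss: each tree is fit to the
    residuals (the negative gradient) of the current ensemble."""
    base = y.mean()                    # initial constant prediction
    f = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - f              # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        f += lr * tree.predict(X)      # lr is the shrinkage parameter
        trees.append(tree)
    return base, lr, trees

def gbdt_predict(model, X):
    base, lr, trees = model
    return base + lr * sum(t.predict(X) for t in trees)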

Page 60: Rohan's Masters presentation

Gradient Boosting

We can add a regularization (shrinkage) parameter as follows:
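A standard form of this shrinkage step is:

$$F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \qquad 0 < \nu \le 1$$

where h_m is the tree fit at iteration m, and a smaller ν regularizes more strongly.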

Page 61: Rohan's Masters presentation

Side-Information for GBDT

Page 62: Rohan's Masters presentation

Meta-Features

Page 63: Rohan's Masters presentation

Preprocessing Tags

Each question has a set of tags associated with it. Some are listed below:

Statistics (incl. mean median mode),259

Strengthen Hypothesis,260

Student Produced Response,261

System of Linear Equations,262

Systems of Linear Equations,263

Systems of linear equations and inequalities,264

We manually merge the tags that we feel are very similar.

We cluster the tags into 40 clusters using spectral clustering (Ng et al., 2001), with the normalized co-occurrence of tags as the similarity measure used to generate the affinity matrix A.
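A sketch of this clustering step with scikit-learn; the co-occurrence counts C and the exact normalization are illustrative:

import numpy as np
from sklearn.cluster import SpectralClustering

# C: (n_tags x n_tags) matrix of tag co-occurrence counts, with C[i, i]
# the number of questions carrying tag i (assumed nonzero). Normalize
# the counts to build the affinity matrix A.
A = C / np.sqrt(np.outer(np.diag(C), np.diag(C)))

labels = SpectralClustering(n_clusters=40,
                            affinity="precomputed").fit_predict(A)
# labels[i] is the cluster index (0..39) assigned to tag i.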

Page 64: Rohan's Masters presentation

Results from GBDT

● GBDT only improved the BCD marginally.

Page 65: Rohan's Masters presentation

Including Temporal Features

Page 66: Rohan's Masters presentation

...

Page 67: Rohan's Masters presentation

GBDT Results after including temporal features

Page 68: Rohan's Masters presentation

Week of Feb 23, competition end

Page 69: Rohan's Masters presentation

Last day

Combining predictions from the GBDT models using linear regression improved the score slightly.

Page 70: Rohan's Masters presentation

Last day of competition

Page 71: Rohan's Masters presentation

Final Private set ranks

Page 72: Rohan's Masters presentation

Post competition analysis

The latent feature approach is a good approach for this dataset.

LFL performs really well on the dataset.

Code will be available soon at http://code.google.com/p/latent-feature-log-linear/

Page 73: Rohan's Masters presentation

Questions

Page 74: Rohan's Masters presentation

References

Agarwal, Deepak and Chen, Bee-Chung. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pp. 19–28, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.

Friedman, Jerome H. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38:367–378, 1999.

Gemulla, Rainer, Nijkamp, Erik, Haas, Peter J., and Sismanis, Yannis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0813-7.

Hofmann, Thomas, Puzicha, Jan, and Jordan, Michael I. Learning from dyadic data. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pp. 466–472, Cambridge, MA, USA, 1999. MIT Press. ISBN 0-262-11245-0.

Li, Xiaochun and Xu, Ronghui (eds.). High Dimensional Data Analysis in Cancer Research. Springer, CA, USA, 2009.

Menon, Aditya Krishna and Elkan, Charles. A log-linear model with latent features for dyadic prediction. In ICDM '10, pp. 364–373, 2010.

Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856. MIT Press, 2001.

Page 75: Rohan's Masters presentation

References

Rasch, Georg. Estimation of parameters and control of the model for two response categories, 1960.

Schein, Andrew I., Saul, Lawrence K., and Ungar, Lyle H. A generalized linear model for principal component analysis of binary data, 2003.

Takács, Gábor, Pilászy, István, Németh, Bottyán, and Tikk, Domonkos. Scalable collaborative filtering approaches for large recommender systems. J. Mach. Learn. Res., 10:623–656, June 2009. ISSN 1532-4435.

Töscher, Andreas, Jahrer, Michael, and Bell, Robert M. The BigChaos solution to the Netflix Grand Prize, 2009.