Featurizing log data before XGBoost

Xavier Conort

Thursday, August 20, 2015

Upload: datarobot

Post on 06-Jan-2017


Page 1: Featurizing log data before XGBoost

Featurizing log data before XGBoost

Xavier Conort

Thursday, August 20, 2015

Page 2: Featurizing log data before XGBoost

The competition host

● XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University
● launched online on Oct 10th, 2013
● more than 100 Chinese courses and over 260 international courses
● high dropout rate

Page 3: Featurizing log data before XGBoost

Problem to solve

● challenge: predict whether a user will drop a course within the next 10 days based on his or her prior activities.
● data:
  ○ enrollment_train (120K rows) / enrollment_test (80K rows)
    ■ Columns: enrollment_id, username, course_id
  ○ log_train / log_test
    ■ Columns: enrollment_id, time, source, event, object
  ○ object
    ■ Columns: course_id, module_id, category, children, start
  ○ truth_train
    ■ Columns: enrollment_id, dropped_out

Page 4: Featurizing log data before XGBoost

Log data

5890 objects

Page 5: Featurizing log data before XGBoost

Team

Chief Product Officer Chief Data Scientist

Data Scientist Data Scientist

(O. Zhang)

Page 6: Featurizing log data before XGBoost

How we worked as a Team

● worked separately on feature engineering. 90% of our time was spent here.
● delegated the modeling part to DataRobot to:
  ○ find the best algorithm (with XGBoost as a winner!)
  ○ model text features
  ○ tune hyperparameters
  ○ experiment with different feature sets and blend 8 XGBoost models using different sets
  ○ communicate results

Page 7: Featurizing log data before XGBoost

Feature engineering techniques used

● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text on which we ran
  ○ SVD on 3-grams
  ○ DataRobot text mining solution
● first 20 components of SVD on user x object

NB: removed duplicated log info and used training + test sets to build most features
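The NB above (drop duplicated log rows, build features on training + test logs together) can be sketched in a few lines of Python. This is only an illustration: the toy rows, the function name `combine_and_dedupe`, and the choice to treat a "duplicate" as an exact repeat of the whole row are assumptions, not details given in the slides.

```python
# Toy rows in the log_train / log_test column order:
# (enrollment_id, time, source, event, object). Values are made up.
log_train = [
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),  # exact duplicate
    (1, "2014-06-01T10:05:00", "browser", "video", "obj_b"),
]
log_test = [
    (7, "2014-06-02T09:00:00", "server", "problem", "obj_c"),
]

def combine_and_dedupe(*log_sets):
    """Concatenate log sets and drop exact duplicate rows, keeping first-seen order.

    Assumption: a duplicate is a row identical in every column.
    """
    seen = set()
    unique_rows = []
    for rows in log_sets:
        for row in rows:
            if row not in seen:
                seen.add(row)
                unique_rows.append(row)
    return unique_rows

all_logs = combine_and_dedupe(log_train, log_test)
```

Features that do not need the label (counts, time stats, entropy, sequences) could then be computed on `all_logs` rather than on the training logs alone.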

Page 8: Featurizing log data before XGBoost

How to build efficient features in R

Page 9: Featurizing log data before XGBoost

Key course features

● course_id
● first log time
● enrollment counts
● unique log counts
● mean time interval
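A minimal Python sketch of how course-level features like these might be computed. The toy data, the function name `course_features`, and two interpretations are assumptions: "unique log count" is taken to mean distinct log rows, and "mean time interval" is taken to mean the average gap between consecutive logs of the course.

```python
from collections import defaultdict
from datetime import datetime

# Toy rows; column order follows the enrollment and log files, values are made up.
enrollments = [(1, "alice", "course_x"), (2, "bob", "course_x"), (3, "alice", "course_y")]
logs = [
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),  # exact duplicate
    (2, "2014-06-01T11:00:00", "server", "video", "obj_b"),
    (2, "2014-06-01T13:00:00", "server", "problem", "obj_c"),
    (3, "2014-07-01T08:00:00", "browser", "access", "obj_d"),
]

def course_features(enrollments, logs):
    course_of = {eid: course for eid, _user, course in enrollments}
    enrollment_count = defaultdict(int)
    for _eid, _user, course in enrollments:
        enrollment_count[course] += 1

    unique_logs = defaultdict(set)  # course_id -> distinct log rows
    for row in logs:
        unique_logs[course_of[row[0]]].add(row)

    features = {}
    for course, rows in unique_logs.items():
        times = sorted(datetime.fromisoformat(row[1]) for row in rows)
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        features[course] = {
            "first_log_time": times[0],
            "enrollment_count": enrollment_count[course],
            "unique_log_count": len(rows),
            # Assumed meaning: mean gap between consecutive logs of the course.
            "mean_interval_seconds": sum(gaps) / len(gaps) if gaps else None,
        }
    return features

features = course_features(enrollments, logs)
```

Each enrollment row would then pick up the features of its course through a join on course_id.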

Page 10: Featurizing log data before XGBoost

Key enrollment count features

● log counts
● unique log counts
● ratio of unique log counts to log counts
● unique log counts by event (nagivate, access, problem, video, page_close, discussion, wiki)
● unique log counts before end of course (5 days, 10 days and 30 days before)
● sequence number of enrollment in that course
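The first four count features in this list could look roughly like the Python sketch below. The toy rows and the function name `enrollment_count_features` are hypothetical, and "unique log" is assumed to mean a distinct row after exact duplicates are collapsed. The time-windowed counts and the enrollment sequence number are not covered here.

```python
from collections import Counter, defaultdict

# Toy log rows: (enrollment_id, time, source, event, object). Values are made up.
logs = [
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),  # exact duplicate
    (1, "2014-06-01T10:05:00", "browser", "video", "obj_b"),
    (1, "2014-06-02T09:00:00", "server", "problem", "obj_c"),
    (2, "2014-06-03T12:00:00", "server", "access", "obj_a"),
]

def enrollment_count_features(rows):
    by_enrollment = defaultdict(list)
    for row in rows:
        by_enrollment[row[0]].append(row)

    features = {}
    for eid, group in by_enrollment.items():
        unique = set(group)  # assumption: unique log = distinct row
        by_event = Counter(event for (_eid, _time, _source, event, _obj) in unique)
        features[eid] = {
            "log_count": len(group),
            "unique_log_count": len(unique),
            "unique_ratio": len(unique) / len(group),
            **{f"unique_{event}_count": n for event, n in by_event.items()},
        }
    return features

features = enrollment_count_features(logs)
```

Events that never occur for an enrollment simply have no key here; a real feature table would fill those with 0.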

Page 11: Featurizing log data before XGBoost

Key enrollment time stats

● log time stats (min, mean, max)
● gap between first and last log of enrollment
● gap between enrollment first log and course first log
● gap between enrollment last log and course last log
● difference between mean log time and midpoint between first and last log
● log interval stats (mean, 90th, 99th and 100th percentiles)
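A rough Python sketch of the per-enrollment time statistics, for a single enrollment's list of log timestamps. The names `enrollment_time_stats` and `percentile`, the nearest-rank percentile rule, and the use of seconds as the unit are all assumptions. The two gaps relative to the course's first and last log are left out because they need the course-level values.

```python
from datetime import datetime
from math import ceil

EPOCH = datetime(1970, 1, 1)

def to_seconds(ts):
    """Seconds since 1970-01-01 for a naive ISO timestamp (no time zone handling)."""
    return (datetime.fromisoformat(ts) - EPOCH).total_seconds()

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def enrollment_time_stats(timestamps):
    times = sorted(to_seconds(t) for t in timestamps)
    first, last = times[0], times[-1]
    mean_time = sum(times) / len(times)
    gaps = sorted(b - a for a, b in zip(times, times[1:]))
    stats = {
        "first": first,
        "mean": mean_time,
        "last": last,
        "span": last - first,  # gap between first and last log
        "mean_minus_midpoint": mean_time - (first + last) / 2,
    }
    if gaps:  # interval stats need at least two logs
        stats.update({
            "gap_mean": sum(gaps) / len(gaps),
            "gap_p90": percentile(gaps, 90),
            "gap_p99": percentile(gaps, 99),
            "gap_max": gaps[-1],  # the 100th percentile
        })
    return stats

stats = enrollment_time_stats([
    "2014-06-01T10:00:00",
    "2014-06-01T10:10:00",
    "2014-06-01T12:00:00",
])
```

A negative `mean_minus_midpoint` means activity was concentrated early in the enrollment's active window; a positive value means it was concentrated late.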

Page 12: Featurizing log data before XGBoost

Enrollment entropy features

enrollment entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
● hours of the day for the last 1/3/7 days before last logs
● object (when event == problem)
● chapter ids

Page 13: Featurizing log data before XGBoost

Example of entropy feature

For each weekday: -log(weekday_log_count / enrollment_log_count) * weekday_log_count / enrollment_log_count

Sum => weekday_entropy[enrollment_id==1] = 1.589988
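The weekday entropy above is the Shannon entropy of an enrollment's log counts per weekday. A small Python sketch, assuming the natural logarithm (the slides do not state the base) and hypothetical function names:

```python
from collections import Counter
from datetime import datetime
from math import log

def entropy(counts):
    """Shannon entropy of a collection of counts, using the natural log (assumed base)."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

def weekday_entropy(timestamps):
    """Entropy of one enrollment's logs over the days of the week."""
    weekday_counts = Counter(datetime.fromisoformat(t).weekday() for t in timestamps)
    return entropy(weekday_counts.values())

# Toy timestamps for one enrollment: two logs on each of two different weekdays.
spread = weekday_entropy([
    "2014-06-02T10:00:00",
    "2014-06-02T11:00:00",
    "2014-06-03T10:00:00",
    "2014-06-03T12:00:00",
])
# All logs on a single weekday.
concentrated = weekday_entropy(["2014-06-02T10:00:00", "2014-06-02T18:00:00"])
```

Activity spread evenly over many weekdays gives a high value; activity concentrated on one weekday gives zero. The same `entropy` helper would apply to the other groupings (days, hours of the day, objects, chapter ids) by changing what is counted.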

Page 14: Featurizing log data before XGBoost

Enrollment sequence features

● for each enrollment_id, built sequences of
  ○ weekdays
  ○ objects
    ■ all objects / 'problem' and 'video' objects only
  ○ events
● treated sequences as 4 text variables. Ran for each
  ○ SVD on 3-grams => first 10 components
  ○ DataRobot stacked predictions from logistic regression & Nystroem SVM on (tuned) n-grams
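To illustrate the "sequence as text" idea, the Python sketch below builds the event sequence of each enrollment and counts its 3-grams. The toy rows and function names are hypothetical, and it assumes token-level n-grams (one token per log event); the slides do not say whether word or character n-grams were used. The SVD and the DataRobot text models that consume these counts are not shown.

```python
from collections import Counter, defaultdict

# Toy log rows: (enrollment_id, time, source, event, object). Values are made up.
logs = [
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),
    (1, "2014-06-01T10:05:00", "browser", "video", "obj_b"),
    (1, "2014-06-01T10:09:00", "server", "problem", "obj_c"),
    (1, "2014-06-01T10:20:00", "browser", "video", "obj_b"),
    (2, "2014-06-03T12:00:00", "server", "access", "obj_a"),
]

def event_sequences(rows):
    """Time-ordered list of events per enrollment_id."""
    by_enrollment = defaultdict(list)
    # ISO timestamps in a single format sort correctly as strings.
    for eid, _time, _source, event, _obj in sorted(rows, key=lambda r: (r[0], r[1])):
        by_enrollment[eid].append(event)
    return dict(by_enrollment)

def ngram_counts(tokens, n=3):
    """Counts of consecutive n-token windows, each joined into one string."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

sequences = event_sequences(logs)
trigrams = {eid: ngram_counts(seq) for eid, seq in sequences.items()}
```

Stacking the per-enrollment counters into a sparse enrollment x 3-gram matrix gives the input on which a truncated SVD could be run to keep the first few components.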

Page 15: Featurizing log data before XGBoost

Extract of enrollment object sequences

Page 16: Featurizing log data before XGBoost

1/2-grams from Object sequences

Page 17: Featurizing log data before XGBoost

DataRobot on Object 1-2 grams

Page 18: Featurizing log data before XGBoost

Key user count features and time stats

● enrollment count
● binary indicator whether user signed up for each of the 38 courses
● unique log count
● mean log time interval
● sequence number of enrollment for that user
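A possible Python sketch of three of these user-level features: the enrollment count, the per-course sign-up indicators, and the sequence number of each enrollment for its user. The toy data and names are hypothetical. Ordering a user's enrollments by their first log time is an assumption, since the slides do not say how the sequence number was defined.

```python
from collections import defaultdict

# Toy enrollment rows: (enrollment_id, username, course_id). Values are made up.
enrollments = [(1, "alice", "course_x"), (2, "bob", "course_x"), (3, "alice", "course_y")]
# Hypothetical helper input: first log time of each enrollment.
first_log_time = {
    1: "2014-06-05T10:00:00",
    2: "2014-06-01T09:00:00",
    3: "2014-05-20T08:00:00",
}

def user_features(enrollments, first_log_time):
    courses = sorted({course for _eid, _user, course in enrollments})
    by_user = defaultdict(list)
    for eid, user, course in enrollments:
        by_user[user].append((first_log_time[eid], eid, course))

    features = {}
    for user, items in by_user.items():
        items.sort()  # assumption: order enrollments by their first log time
        taken = {course for _time, _eid, course in items}
        for seq_no, (_time, eid, _course) in enumerate(items, start=1):
            features[eid] = {
                "user_enrollment_count": len(items),
                "user_enrollment_seq": seq_no,
                **{f"signed_up_{c}": int(c in taken) for c in courses},
            }
    return features

features = user_features(enrollments, first_log_time)
```

The result is keyed by enrollment_id, so every enrollment of the same user shares the count and indicator values and differs only in its sequence number.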

Page 19: Featurizing log data before XGBoost

User entropy features

user entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day

Page 20: Featurizing log data before XGBoost

User sequence features

● for each user, built sequences of
  ○ weekdays
  ○ chapter_ids
  ○ events
● treated them as 3 text variables. Ran
  ○ SVD on 3-grams => first 10 components
  ○ DataRobot stacked predictions from logistic regression + Nystroem SVM on (tuned) n-grams

Page 21: Featurizing log data before XGBoost

How we got to the TOP3

● entropy features mentioned before
● exploited info in
  ○ log count in the 5 / 10 / 20 days after end of course
  ○ log counts by event, sign_up counts and day entropy in the next 10 days after end of course
  ○ time to sign up for new course
  ○ time until the next log for same user

added ~0.001 to AUC (vs less powerful features)

added ~0.002 to AUC
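Two of the "exploited info" features can be sketched in Python as follows. Both functions and their names are hypothetical. The sketch assumes the relevant logs are the same user's logs across all of his or her enrollments (training + test), which is one reading of the slide rather than something it states.

```python
from bisect import bisect_right
from datetime import datetime

def time_to_next_user_log(enrollment_last_log, user_log_times):
    """Seconds from an enrollment's last log to the same user's next log, or None.

    Assumption: user_log_times holds all log times of that user, over every enrollment.
    """
    times = sorted(datetime.fromisoformat(t) for t in user_log_times)
    last = datetime.fromisoformat(enrollment_last_log)
    idx = bisect_right(times, last)  # first log strictly after `last`
    if idx == len(times):
        return None
    return (times[idx] - last).total_seconds()

def logs_after(course_end, user_log_times, days):
    """Number of the user's logs falling within `days` days after the course end."""
    end = datetime.fromisoformat(course_end)
    window = days * 86400
    return sum(
        0 < (datetime.fromisoformat(t) - end).total_seconds() <= window
        for t in user_log_times
    )

# Toy log times for one user; values are made up.
user_times = [
    "2014-05-30T10:00:00",
    "2014-06-03T10:00:00",
    "2014-06-08T10:00:00",
    "2014-06-20T00:00:00",
]
next_gap = time_to_next_user_log("2014-05-30T10:00:00", user_times)
counts = [logs_after("2014-06-01T00:00:00", user_times, d) for d in (5, 10, 20)]
```

These features look at activity after the period being predicted, which is only possible because the logs of a user's other enrollments are available in the competition data.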

Page 22: Featurizing log data before XGBoost

XGBoost

Page 23: Featurizing log data before XGBoost

Thank you!