
Featurizing log data before XGBoost

Xavier Conort
Thursday, August 20, 2015

The competition host

● XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University
● launched online on Oct 10th, 2013
● more than 100 Chinese courses and over 260 international courses
● high dropout rate

http://www.xuetangx.com/

Problem to solve

● challenge: predict whether a user will drop a course within the next 10 days based on his or her prior activities.
● data:
  ○ enrollment_train (120K rows) / enrollment_test (80K rows)
    ■ Columns: enrollment_id, username, course_id
  ○ log_train / log_test
    ■ Columns: enrollment_id, time, source, event, object
  ○ object
    ■ Columns: course_id, module_id, category, children, start
  ○ truth_train
    ■ Columns: enrollment_id, dropped_out

Log data

5890 objects

Team

Chief Product Officer Chief Data Scientist

Data Scientist Data Scientist

(O. Zhang)

How we worked as a Team

● worked separately on feature engineering. 90% of our time was spent here.
● delegated the modeling part to DataRobot to:
  ○ find the best algorithm (with XGBoost as the winner!)
  ○ model text features
  ○ tune hyperparameters
  ○ experiment with different feature sets and blend 8 XGBoost models built on different sets
  ○ communicate results

Feature engineering techniques used

● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text on which we ran
  ○ SVD on 3-grams
  ○ DataRobot text mining solution
● first 20 components of SVD on user x object

NB: removed duplicated log info and used training + test sets to build most features

How to build efficient features in R

Key course features

● course_id
● first log time
● enrollment counts
● unique log counts
● mean time interval

Key enrollment count features

● log counts
● unique log counts
● ratio of unique log counts to log counts
● unique log counts by event (nagivate, access, problem, video, page_close, discussion, wiki)
● unique log counts before end of course (5 days, 10 days and 30 days before)
● sequence number of enrollment in that course
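The count features above could be sketched roughly as follows. This is an illustrative Python version, not the team's R code: the sample rows and the `count_features` helper are hypothetical, and "unique log" is assumed here to mean a distinct (enrollment_id, time, source, event, object) row.

```python
from collections import Counter, defaultdict

# Hypothetical log rows: (enrollment_id, time, source, event, object)
logs = [
    (1, "2014-06-14T09:38:29", "server", "access", "obj_a"),
    (1, "2014-06-14T09:38:29", "server", "access", "obj_a"),  # exact duplicate
    (1, "2014-06-14T09:40:02", "browser", "video", "obj_b"),
    (1, "2014-06-15T10:00:00", "browser", "problem", "obj_c"),
    (2, "2014-06-20T08:00:00", "server", "access", "obj_a"),
]

def count_features(rows):
    """Per-enrollment log count, unique log count, their ratio,
    and unique log counts by event type."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)

    features = {}
    for enrollment_id, group in grouped.items():
        unique_rows = set(group)  # assumption: "unique" means distinct rows
        feats = {
            "log_count": len(group),
            "unique_log_count": len(unique_rows),
            "unique_ratio": len(unique_rows) / len(group),
        }
        # One count column per event type seen for this enrollment
        for event, n in Counter(r[3] for r in unique_rows).items():
            feats["unique_" + event + "_count"] = n
        features[enrollment_id] = feats
    return features

features = count_features(logs)
```

Event types that never occur for an enrollment are simply absent from its dictionary; a real pipeline would fill them with zero before handing the table to a model.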

Key enrollment time stats

● log time stats (min, mean, max)
● gap between first and last log of enrollment
● gap between enrollment first log and course first log
● gap between enrollment last log and course last log
● difference between mean log time and midpoint between first and last log
● log interval stats (mean, 90th, 99th and 100th percentiles)
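As a rough illustration of the enrollment-level time statistics listed above, the sketch below works on the timestamps of a single enrollment. The `time_stats` and `quantile` helpers and the timestamp format are assumptions made for this example; the course-level gaps are omitted because they need the course's first and last log times as extra inputs.

```python
from datetime import datetime
from statistics import mean

EPOCH = datetime(1970, 1, 1)

def to_seconds(ts):
    """Parse an ISO-like timestamp (assumed format) into seconds since 1970."""
    return (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S") - EPOCH).total_seconds()

def quantile(sorted_values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def time_stats(timestamps):
    """Time features for the logs of one enrollment."""
    times = sorted(to_seconds(t) for t in timestamps)
    # Intervals between consecutive logs; a single log yields one zero gap
    gaps = sorted(b - a for a, b in zip(times, times[1:])) or [0.0]
    return {
        "time_min": times[0],
        "time_mean": mean(times),
        "time_max": times[-1],
        "first_to_last_gap": times[-1] - times[0],
        "mean_minus_midpoint": mean(times) - (times[0] + times[-1]) / 2,
        "interval_mean": mean(gaps),
        "interval_q90": quantile(gaps, 0.90),
        "interval_q99": quantile(gaps, 0.99),
        "interval_max": gaps[-1],  # the 100th percentile
    }

stats = time_stats([
    "2014-06-14T00:00:00",
    "2014-06-14T00:10:00",
    "2014-06-14T01:00:00",
])
```

A negative `mean_minus_midpoint` means activity was concentrated early in the enrollment's active period; a positive value means it was concentrated late.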

Enrollment entropy features

enrollment entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
● hours of the day for the last 1/3/7 days before last logs
● object (when event == problem)
● chapter ids

Example of entropy feature

For each weekday: - log(weekday_log_count / enrollment_log_count) * weekday_log_count / enrollment_log_count

Sum over weekdays => weekday_entropy[enrollment_id == 1] = 1.589988
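The entropy formula above can be written as a small function. This is a sketch under assumptions: the natural logarithm is used (the default of R's `log()`), and the `entropy` and `weekday_entropy` helpers and the timestamp format are hypothetical names chosen for illustration.

```python
import math
from collections import Counter
from datetime import datetime

def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values:
    the sum over categories of -log(count / total) * count / total."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def weekday_entropy(timestamps):
    """Entropy of one enrollment's logs over the days of the week."""
    weekdays = [
        datetime.strptime(t, "%Y-%m-%dT%H:%M:%S").weekday() for t in timestamps
    ]
    return entropy(weekdays)

# Logs spread evenly over two weekdays give log(2); logs on one weekday give 0
spread = weekday_entropy(["2014-06-14T09:00:00", "2014-06-15T09:00:00"])
focused = weekday_entropy(["2014-06-14T09:00:00", "2014-06-14T18:00:00"])
```

The same `entropy` function applies unchanged to the other groupings listed earlier (days, hours of the day, objects, chapter ids): only the values passed in change.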

Enrollment sequence features

● for each enrollment_id, built sequences of
  ○ weekdays
  ○ objects
    ■ all objects / 'problem' and 'video' objects only
  ○ events
● treated sequences as 4 text variables. Ran for each
  ○ SVD on 3-grams => first 10 components
  ○ DataRobot stacked predictions from logistic regression & Nystroem SVM on (tuned) n-grams
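To make the "sequences as text" idea concrete, here is a minimal sketch that turns event sequences into a 3-gram count matrix. It stops before the SVD step, which would normally be done with a numerical library's truncated SVD on this matrix; the `ngrams` and `ngram_count_matrix` helpers and the sample sequences are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n=3):
    """All contiguous n-token windows of a sequence, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_count_matrix(sequences, n=3):
    """Return (vocabulary, rows), where rows[i][j] is the number of times
    n-gram vocabulary[j] occurs in sequences[i]."""
    counts = [Counter(ngrams(seq, n)) for seq in sequences]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[gram] for gram in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical event sequences for two enrollments, in time order
event_sequences = [
    ["access", "video", "problem", "video"],
    ["access", "video", "problem"],
]
vocabulary, matrix = ngram_count_matrix(event_sequences, n=3)
```

Each row of `matrix` is one enrollment; keeping the first 10 components of an SVD of this matrix would give 10 dense features per enrollment, as described in the list above.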

Extract of enrollment object sequences

1/2-grams from Object sequences

DataRobot on Object 1-2 grams

Key user count features and time stats

● enrollment count
● binary indicator whether user signed up for each of the 38 courses
● unique log count
● mean log time interval
● sequence number of enrollment for that user
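A possible way to derive the user-level enrollment count, course indicators and enrollment sequence number is sketched below. The row layout, the `user_features` helper, and the choice to order a user's enrollments by first log time are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical rows: (enrollment_id, username, course_id, first_log_time_in_seconds)
enrollments = [
    (1, "u1", "c1", 100),
    (2, "u1", "c2", 50),
    (3, "u2", "c1", 70),
]

def user_features(rows, courses):
    """Attach user-level features to every enrollment of that user."""
    by_user = defaultdict(list)
    for enrollment_id, user, course, first_log in rows:
        by_user[user].append((first_log, enrollment_id, course))

    features = {}
    for user, items in by_user.items():
        items.sort()  # assumption: sequence number follows first log time
        taken = {course for _, _, course in items}
        # One 0/1 column per course in the catalogue
        indicators = {"signed_up_" + c: int(c in taken) for c in courses}
        for seq, (_, enrollment_id, _) in enumerate(items, start=1):
            feats = {
                "user_enrollment_count": len(items),
                "user_enrollment_seq": seq,
            }
            feats.update(indicators)
            features[enrollment_id] = feats
    return features

user_feats = user_features(enrollments, courses=["c1", "c2", "c3"])
```

With the competition's 38 courses, the `courses` list would produce 38 indicator columns shared by all enrollments of the same user.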

User entropy features

user entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day

User sequence features

● for each user, built sequences of
  ○ weekdays
  ○ chapter_ids
  ○ events
● treated them as 3 text variables. Ran
  ○ SVD on 3-grams => first 10 components
  ○ DataRobot stacked predictions from logistic regression + Nystroem SVM on (tuned) n-grams

How we got to the TOP3

● entropy features mentioned before
● exploited info in
  ○ log count in the 5 / 10 / 20 days after end of course
  ○ log counts by event, sign_up counts and day entropy in the next 10 days after end of course
  ○ time to sign up for new course
  ○ time until the next log for same user

added ~0.001 to AUC (vs less powerful features)

added ~0.002 to AUC
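The "after end of course" features rely on the same user's activity in other enrollments. A minimal sketch, assuming all timestamps are already converted to seconds and that the hypothetical `future_activity` helper receives every log time of the user across all courses:

```python
from bisect import bisect_right

DAY = 86400  # seconds

def future_activity(user_log_times, course_end, last_enrollment_log,
                    windows=(5, 10, 20)):
    """Count a user's logs (across all courses) in windows after the course
    end, and the time from this enrollment's last log to the user's next log."""
    times = sorted(user_log_times)
    feats = {}
    start = bisect_right(times, course_end)  # first log strictly after the end
    for days in windows:
        stop = bisect_right(times, course_end + days * DAY)
        feats["logs_%dd_after_end" % days] = stop - start
    nxt = bisect_right(times, last_enrollment_log)
    # None when the user has no later log anywhere
    feats["time_to_next_log"] = (
        times[nxt] - last_enrollment_log if nxt < len(times) else None
    )
    return feats

# Hypothetical user: two logs in this course, three later logs in other courses
feats = future_activity(
    user_log_times=[0, 100, 31 * DAY, 36 * DAY, 45 * DAY],
    course_end=30 * DAY,
    last_enrollment_log=100,
)
```

Missing values such as `None` for `time_to_next_log` can be left as missing, since XGBoost handles missing feature values natively.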

XGBoost

Thank you!
