play with kaggle

Upload: matthieu-scordia

Post on 27-Jun-2015


DESCRIPTION

Slides from my presentation on December 11 at the Machine Learning Meetup.

TRANSCRIPT

Page 1: Play with Kaggle

Play with Kaggle

Matthieu Scordia

Page 2: Play with Kaggle

“We're making data science into a sport.”

Kaggle?

Kaggle is a platform for predictive modeling competitions.

Page 3: Play with Kaggle

Let's enter a challenge!

Page 4: Play with Kaggle

The Data

Noteworthy characteristics of the dataset:

● Unique queries: 21,073,569

● Unique URLs: 70,348,426

● Unique users: 5,736,333

● Training sessions: 34,573,630

● Test sessions: 797,867

● Clicks in the training data: 64,693,054

Total records in the log: 167,413,039 (= 15 GB!)

Page 5: Play with Kaggle

Let's shake the data

Page 6: Play with Kaggle

Logs format.

[Figure: raw log lines, grouped into Session 0 through Session 4.]

Page 7: Play with Kaggle

Sessions format.

[Figure: one session annotated with its parts: session metadata (session 5, day 15, user 1), term IDs, ranked (Url, Domain) pairs, Url clicked, time passed.]

Page 8: Play with Kaggle

Evaluation.

The URLs are labeled using 3 grades of relevance: {0, 1, 2}

The labeling is done automatically, based on dwell-time.

0 : irrelevant - no clicks, or clicks with dwell time < 50 time units.

1 : relevant - clicks with dwell time from 50 to 399 time units.

2 : highly relevant - clicks with dwell time of 400 time units or more.
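The grading rule above can be sketched as a small function. This is only an illustration: the function name and the use of `None` for "no click" are assumptions, and boundary values are treated as 50 to 399 giving grade 1 and 400 or more giving grade 2.

```python
def relevance_grade(dwell_time):
    """Map a click's dwell time (in time units) to a relevance grade.

    dwell_time is None when the URL was not clicked (an assumed convention).
    """
    if dwell_time is None or dwell_time < 50:
        return 0  # irrelevant: no click, or a very short visit
    if dwell_time < 400:
        return 1  # relevant
    return 2      # highly relevant


grades = [relevance_grade(t) for t in (None, 10, 50, 399, 400)]
print(grades)
```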

Page 9: Play with Kaggle

How to beat Yandex?

[Figure: click count by rank.]

So we have to sort better than that!

Page 10: Play with Kaggle

Step 1 : Reshape it!

For each user, we want to estimate the probability that he clicks on a URL.

Page 11: Play with Kaggle

Step 2 : Cross validate

Goal: evaluate our model ourselves.

Split your dataset like Yandex did:

- On the last 3 days.

- Only one session per user.
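A minimal sketch of such a split, assuming each session is a dict with hypothetical `session`, `user` and `day` keys. Which session to hold out for each user is not stated on the slide; here the most recent one inside the last 3 days is taken, and everything else stays in the training set.

```python
def split_like_yandex(sessions, holdout_days=3):
    """Hold out one session per user from the last `holdout_days` days."""
    last_day = max(s["day"] for s in sessions)
    first_holdout_day = last_day - holdout_days + 1

    train, validation = [], []
    users_in_validation = set()
    # Visit the most recent sessions first, so that the session held out
    # for a user is the latest one inside the holdout window.
    for s in sorted(sessions, key=lambda s: s["day"], reverse=True):
        in_holdout = s["day"] >= first_holdout_day
        if in_holdout and s["user"] not in users_in_validation:
            validation.append(s)
            users_in_validation.add(s["user"])
        else:
            train.append(s)
    return train, validation


sessions = [
    {"session": 0, "user": 1, "day": 1},
    {"session": 1, "user": 1, "day": 29},
    {"session": 2, "user": 1, "day": 30},
    {"session": 3, "user": 2, "day": 28},
    {"session": 4, "user": 3, "day": 10},
]
train, validation = split_like_yandex(sessions)
print(sorted(s["session"] for s in validation))
```

In this toy example user 3 has no session in the last 3 days, so all of his sessions remain in the training set.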

Page 12: Play with Kaggle

Step 3 : Add new features

We add some information about each user:

- Did he see this url in the past?

- Did he click on it?

- How many times?

- Did he skip it?

- Has he ever clicked on a rank 9 URL in the past?
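The questions above could be turned into features along these lines. The data layout is an assumption: each past session is a list of `(url, rank, clicked)` tuples. The definition of a "skip" (shown above a clicked result without being clicked) is also an assumption, since the slide does not define it.

```python
def history_features(past_sessions, url):
    """Summarize one user's past interactions with `url`."""
    times_shown = times_clicked = times_skipped = 0
    clicked_rank_9 = False
    for results in past_sessions:
        clicked_ranks = [rank for _, rank, clicked in results if clicked]
        lowest_click = max(clicked_ranks, default=None)
        for shown_url, rank, clicked in results:
            if clicked and rank == 9:
                clicked_rank_9 = True  # about the user, whatever the URL
            if shown_url != url:
                continue
            times_shown += 1
            if clicked:
                times_clicked += 1
            elif lowest_click is not None and rank < lowest_click:
                times_skipped += 1  # shown above a click, but not clicked
    return {
        "seen": times_shown > 0,
        "clicked": times_clicked > 0,
        "times_clicked": times_clicked,
        "times_skipped": times_skipped,
        "clicked_rank_9": clicked_rank_9,
    }


past = [
    [("a", 1, False), ("b", 2, True)],
    [("a", 1, True), ("c", 9, True)],
]
features = history_features(past, "a")
print(features)
```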

Page 13: Play with Kaggle

Step 3 : Add new features

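So we add click entropy. For each query:

click entropy = - Σ p(x) log p(x), summed over the documents x clicked for that query,

where p(x) is the percentage of clicks on document x among all clicks for that query.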

Example:

Small click entropy query: yahoo, youtube.

Large click entropy query: photos, jobs.

The thing is, we don't want to re-rank all...
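As an illustration, click entropy for one query can be computed from the list of documents clicked for it. The function name and the input layout are assumptions, and so is the base-2 logarithm, since the slide's formula did not survive extraction.

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Entropy (in bits) of the click distribution for one query."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) is the same as -p * log2(p), written to avoid -0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


navigational = click_entropy(["youtube.com"] * 4)
ambiguous = click_entropy(["a.com", "b.com", "c.com", "d.com"])
print(navigational, ambiguous)
```

A query whose clicks all go to one document gets an entropy of 0, while clicks spread evenly over four documents give 2 bits.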

Page 14: Play with Kaggle

Step 4 : the model.

We use logistic regression and random forest.

Goal: predict the probability that a user clicks on a URL.

Our training set:

session | url | features... | target
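Once a model outputs a click probability for each URL of a session, one way to use it is to sort the session's URLs by that probability. The sketch below assumes the probabilities are already available in a dict; the names and the values are made up for illustration and do not come from the slides' models.

```python
def rerank(session_urls, click_probability):
    """Order a session's URLs by predicted click probability, highest first.

    sorted() is stable, so URLs with equal scores keep their original order.
    """
    return sorted(session_urls, key=lambda url: -click_probability.get(url, 0.0))


original_order = ["u1", "u2", "u3"]
predicted = {"u1": 0.1, "u2": 0.7, "u3": 0.1}  # hypothetical model output
new_order = rerank(original_order, predicted)
print(new_order)
```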

Page 15: Play with Kaggle

The leaderboard.

Page 16: Play with Kaggle

If you want to join us in a future challenge, get in touch.

Thanks!