Play with Kaggle
DESCRIPTION
Slides from my December 11 presentation at the Machine Learning Meetup.
TRANSCRIPT
![Page 1: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/1.jpg)
Play with Kaggle
Matthieu Scordia
![Page 2: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/2.jpg)
“We're making data science into a sport.”
Kaggle?
Kaggle is a platform for predictive modeling competitions.
![Page 3: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/3.jpg)
Let's enter a challenge!
![Page 4: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/4.jpg)
The Data
Noteworthy characteristics of the dataset:
● Unique queries: 21,073,569
● Unique urls: 70,348,426
● Unique users: 5,736,333
● Training sessions: 34,573,630
● Test sessions: 797,867
● Clicks in the training data: 64,693,054
Total records in the log: 167,413,039 (= 15 GB!)
![Page 5: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/5.jpg)
Let's shake the data
![Page 6: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/6.jpg)
Logs format.
[Figure: excerpt of the raw log, grouped into Session 0 to Session 4]
![Page 7: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/7.jpg)
Sessions format.
[Figure: one annotated session (session 5, day 15, user 1). Labels: session metadata; term IDs; ranked (URL, domain) pairs; URL clicked; time passed]
![Page 8: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/8.jpg)
Evaluation.
The URLs are labeled using 3 grades of relevance: {0, 1, 2}.
The labeling is done automatically, based on dwell time.
0 : irrelevant - no clicks, or clicks with dwell time < 50 time units.
1 : relevant - clicks with dwell time from 50 to 399 time units.
2 : highly relevant - clicks with dwell time >= 400 time units.
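The grading rule above can be sketched as a small function. This is only an illustration: the name `relevance_grade` and the convention that `None` means "no click" are assumptions, and the boundary values 50 and 400 are assumed to belong to the higher grade.

```python
def relevance_grade(dwell_time=None):
    """Map a click's dwell time (in time units) to a relevance grade.

    dwell_time=None stands for "the URL was not clicked" (assumed convention).
    """
    if dwell_time is None or dwell_time < 50:
        return 0  # irrelevant: no click, or a very short visit
    if dwell_time < 400:
        return 1  # relevant: dwell time from 50 to 399
    return 2      # highly relevant: dwell time of 400 or more


# Grade every shown URL of a hypothetical result page.
dwell_times = {"url_a": None, "url_b": 120, "url_c": 900}
grades = {url: relevance_grade(t) for url, t in dwell_times.items()}
```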
![Page 9: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/9.jpg)
How to beat Yandex?
[Chart: click count by rank]
So we have to sort better than that!
![Page 10: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/10.jpg)
Step 1 : Reshape it!
For each user, we want to estimate the probability of a click on a URL.
![Page 11: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/11.jpg)
Step 2 : Cross validate
Goal: evaluate our model ourselves.
Split your dataset like Yandex did:
- Hold out the last 3 days.
- Keep only one session per user.
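A local validation split along these lines might look like the sketch below. The record layout (dicts with `session_id`, `user_id` and `day`) is hypothetical, and so is the choice of keeping each user's earliest session in the held-out window; the slides do not say which session is kept.

```python
def split_like_yandex(sessions, holdout_days=3):
    """Split sessions into (train, holdout).

    sessions: list of dicts with 'session_id', 'user_id', 'day' (assumed layout).
    The holdout covers the last `holdout_days` days, one session per user.
    """
    last_day = max(s["day"] for s in sessions)
    first_holdout_day = last_day - holdout_days + 1

    train = []
    holdout_by_user = {}
    for s in sorted(sessions, key=lambda s: (s["day"], s["session_id"])):
        if s["day"] < first_holdout_day:
            train.append(s)
        else:
            # Assumption: keep the user's earliest session in the window
            # and drop any later ones.
            holdout_by_user.setdefault(s["user_id"], s)
    return train, list(holdout_by_user.values())


sessions = [
    {"session_id": 0, "user_id": 1, "day": 1},
    {"session_id": 1, "user_id": 1, "day": 27},
    {"session_id": 2, "user_id": 1, "day": 28},
    {"session_id": 3, "user_id": 2, "day": 30},
    {"session_id": 4, "user_id": 1, "day": 29},
]
train, holdout = split_like_yandex(sessions)
```

Splitting by time rather than at random keeps the validation setup close to the real test set, where only past behaviour is available for each user.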
![Page 12: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/12.jpg)
Step 3 : Add new features
We add some information about each user:
- Did he see this URL in the past?
- Did he click on it?
- How many times?
- Did he skip it?
- Has he ever clicked on a rank 9 URL in the past?
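These history questions could be turned into features roughly as follows. The input layout (one list of `(url, rank, was_clicked)` tuples per past result page) and the definition of "skip" (shown but not clicked, while a lower-ranked result on the same page was clicked) are assumptions made for this sketch, not details taken from the slides.

```python
def history_features(past_result_pages, url):
    """Compute per-user history features for one candidate URL.

    past_result_pages: list of pages, each a list of (url, rank, was_clicked)
    tuples from the user's earlier sessions (hypothetical layout).
    """
    seen = clicked = skipped = clicked_rank_9 = False
    click_count = 0
    for page in past_result_pages:
        clicked_ranks = [rank for (_, rank, was_clicked) in page if was_clicked]
        deepest_click = max(clicked_ranks, default=None)
        if 9 in clicked_ranks:
            clicked_rank_9 = True
        for page_url, rank, was_clicked in page:
            if page_url != url:
                continue
            seen = True
            if was_clicked:
                clicked = True
                click_count += 1
            elif deepest_click is not None and rank < deepest_click:
                # Shown above a clicked result but not clicked itself.
                skipped = True
    return {
        "seen": seen,
        "clicked": clicked,
        "click_count": click_count,
        "skipped": skipped,
        "clicked_rank_9": clicked_rank_9,
    }


pages = [
    [("a", 1, False), ("b", 2, True)],
    [("a", 1, True), ("c", 9, True)],
]
features_a = history_features(pages, "a")
```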
![Page 13: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/13.jpg)
Step 3 : Add new features
The thing is, we don't want to re-rank all...
So we add click entropy, computed for each query:
ClickEntropy(q) = -Σ p(x) log p(x), summed over the documents x clicked for query q,
where p(x) is the share of clicks on document x among all clicks for that query.
Example:
Small click entropy query: yahoo, youtube.
Large click entropy query: photos, jobs.
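Click entropy for a single query can be computed with a few lines of code. In this sketch the input is simply the list of URLs clicked for that query across all users, and the logarithm is taken in base 2; both choices are assumptions, since the slides do not specify them.

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Entropy of the click distribution over documents for one query.

    clicked_urls: one entry per click recorded for the query (assumed input).
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no clicks: treat the query as having no entropy
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A navigational query: everybody clicks the same result.
navigational = click_entropy(["youtube.com"] * 4)
# An ambiguous query: clicks are spread evenly over four results.
ambiguous = click_entropy(["site1", "site2", "site3", "site4"])
```

A query with low entropy already has one dominant answer, so personalised re-ranking has little to gain there; a high value signals that different users want different results.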
![Page 14: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/14.jpg)
Step 4 : the model.
We use logistic regression and random forest.
Goal: predict the probability that a user clicks on a URL.
Our training set:
session | url | features... | target
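To make the modelling step concrete, here is a from-scratch sketch of the logistic regression half only (the random forest is not shown). It is not the code used for the challenge: the function names, the two toy features and the training settings are all hypothetical, and a real run would rely on a library implementation. The last function shows how predicted click probabilities can be used to re-rank the URLs of a result page.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, targets, epochs=200, lr=0.1):
    """Fit weights and bias with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = p - y  # gradient of the log loss w.r.t. the linear score
            bias -= lr * error
            weights = [w - lr * error * v for w, v in zip(weights, x)]
    return weights, bias


def predict_click_probability(model, x):
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def rerank(model, candidates):
    """Sort (url, features) pairs by predicted click probability, highest first."""
    return sorted(
        candidates,
        key=lambda c: predict_click_probability(model, c[1]),
        reverse=True,
    )


# Toy training set: features = [clicked_before, skipped_before], target = click.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
targets = [1, 1, 0, 0]
model = train_logistic_regression(rows, targets)
ranking = rerank(model, [("url_1", [0, 0]), ("url_2", [1, 0])])
```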
![Page 15: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/15.jpg)
The leaderboard.
![Page 16: Play with Kaggle](https://reader038.vdocuments.us/reader038/viewer/2022100602/558ec3031a28aba2468b458d/html5/thumbnails/16.jpg)
If you want to join us for a future challenge:
Thanks!