Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC
TRANSCRIPT
![Page 1: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/1.jpg)
All the data and still not enough!
Claudia Perlich Chief Scientist
@claudia_perlich
![Page 2: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/2.jpg)
Predictive Modeling: Algorithms that Learn Functions
![Page 3: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/3.jpg)
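Data for Predictive Modeling

| Income | Age | Buy |
|---|---|---|
| 123,000 | 30 | yes |
| 51,100 | 40 | yes |
| 68,000 | 55 | no |
| 74,000 | 46 | no |
| 23,000 | 47 | yes |
| 100,000 | 49 | no |

Rows are Examples; the Income and Age columns are Features; Buy is the Target.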
![Page 4: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/4.jpg)
Rules for Predictive Modeling
[Table: rows are Examples, columns are Features; Target column "?": yes, yes, no, no, yes, no]
Data should be:
Large enough
Independent and identically distributed (i.i.d.)
![Page 5: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/5.jpg)
Paradox of Big Data: “You never have the data you want”
The art of making do with second best
![Page 6: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/6.jpg)
IBM: Sales Force Optimization
![Page 7: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/7.jpg)
Wallet is NEVER observed
We observe this in the data
But we do not observe this
[Diagram labels: IBM Sales to this Company; Company Revenue (D&B); Wallet/Opportunity]
How can we make this a predictive modeling problem?
![Page 8: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/8.jpg)
Data for Wallet Estimation?
[Table: rows are Examples, columns are Features; Target column "Wallet": 10, 5, 31, 17, 39, 4]
![Page 9: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/9.jpg)
REALISTIC Wallets as quantiles
Motivation
Imagine 100 identical firms with identical IT needs
Consider the distribution of the IBM sales to these firms
Bottom firms should spend as much as the top
Define wallet as a high percentile of spending conditional on the customer attributes
[Figure: distribution of IBM Sales (x-axis) by Frequency (y-axis), with the Wallet Estimate marked]
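
The quantile definition of wallet can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used at IBM: the sales figures are made up, the 0.9 quantile level is arbitrary, and `wallet_estimate` is a hypothetical helper name. It also treats all firms as one peer group, whereas the slide conditions on customer attributes.

```python
import math

def wallet_estimate(peer_sales, q=0.9):
    """Return the q-th quantile of sales to a group of similar firms.

    Uses the nearest-rank rule: the smallest observed value such that
    at least a fraction q of the observations are <= that value.
    """
    if not peer_sales:
        raise ValueError("need at least one peer firm")
    ordered = sorted(peer_sales)
    rank = max(1, math.ceil(q * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical sales to ten comparable firms (arbitrary units).
sales = [4, 5, 10, 17, 31, 39, 12, 8, 22, 15]
wallet = wallet_estimate(sales, q=0.9)

# Opportunity per firm: the gap between the wallet and what it spends today.
opportunity = [max(0, wallet - s) for s in sales]
```

Firms already spending at or above the chosen percentile get an opportunity of zero, which matches the intuition that the top spenders define what the others could spend.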
![Page 10: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/10.jpg)
Data for Wallet Estimation
[Table: rows are Examples, columns are Features; Target column "Revenue": 10, 5, 31, 17, 39, 4]
![Page 11: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/11.jpg)
Quantile Regression optimizing weighted absolute loss
[Figure: scatter plots of IBM Revenue (y-axis) against Company Sales / Firm Sales (x-axis, 10 to 80), marking the Opportunity for two example companies, C1 and C2]
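
The weighted absolute loss named in the slide title is often called the pinball loss. The sketch below shows the loss and checks that minimizing it over a constant recovers a quantile. It is an illustration only: it fits a single constant rather than a regression on firm features, the data are made up, and `pinball_loss` and `best_constant` are hypothetical names.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean weighted absolute loss for quantile level q (0 < q < 1).

    Under-predictions cost q per unit, over-predictions cost (1 - q).
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

def best_constant(y, q):
    """Return the observed value that minimizes the pinball loss."""
    candidates = sorted(set(y))
    return min(candidates, key=lambda c: pinball_loss(y, [c] * len(y), q))

# Hypothetical sales figures (arbitrary units).
sales = [4, 5, 10, 17, 31, 39, 12]
median_fit = best_constant(sales, 0.5)  # minimizer at q = 0.5 is the median
high_fit = best_constant(sales, 0.8)    # a higher q pulls the fit upward
```

With q above 0.5 the loss penalizes under-prediction more than over-prediction, so the fitted value sits near the top of the distribution, which is the behavior a wallet estimate needs.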
![Page 12: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/12.jpg)
Medical Diagnosis: Breast Cancer
![Page 13: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/13.jpg)
© IBM Corporation 2008 Slide 13
Siemens: Computer-Aided Detection of Breast Cancer in Mammograms
1712 Patients 6816 Images
105,000 Candidates
[ x1 , x2 , … , x117 ] Image feature vector
Malignant?
[Figure: mammogram images labeled MLO, CC, MLO, CC]
![Page 14: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/14.jpg)
Siemens Medical: fMRI breast cancer data
245 Patients: 36% Cancer
414 Patients: 1% Cancer
1027 Patients: 0% Cancer
18 Patients: 85% Cancer
[Figure: Model score (y-axis) vs. Log of Patient ID (x-axis); every point is a candidate]
In essence, the most predictive variable is the patient ID
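
A quick check for this kind of leakage is to bucket an identifier that should carry no signal and compare the target rate across buckets. The sketch below is a hypothetical illustration: the records are invented, and bucketing by digit count is only a crude stand-in for the log of the patient ID shown in the plot.

```python
from collections import defaultdict

def target_rate_by_bucket(records, bucket_of):
    """Return {bucket: positive rate} for (patient_id, label) records."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for patient_id, label in records:
        bucket = bucket_of(patient_id)
        totals[bucket] += 1
        positives[bucket] += 1 if label else 0
    return {b: positives[b] / totals[b] for b in totals}

def id_bucket(patient_id):
    """Bucket an ID by its number of digits (roughly log10 of the ID)."""
    return len(str(patient_id))

# Invented records: short IDs are mostly positive, long IDs never are.
records = [(12, 1), (48, 1), (77, 0), (5021, 0), (5300, 1), (6100, 0),
           (6555, 0), (100234, 0), (100571, 0), (104999, 0)]
rates = target_rate_by_bucket(records, id_bucket)

# A large spread suggests the ID encodes the data source, not the patient.
spread = max(rates.values()) - min(rates.values())
```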
![Page 15: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/15.jpg)
Data for Diagnosis from Multiple Sources
[Table: rows are Examples, columns are Features; Target column "Cancer": yes, no, no, no, no, no]
![Page 16: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/16.jpg)
Modeling the Sources …

| Source | Cancer |
|---|---|
| 1 | yes |
| 2 | no |
| 1 | no |
| 1 | no |
| 4 | no |
| 3 | no |

Rows are Examples; Source joins the Features; Cancer is the Target.
![Page 17: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/17.jpg)
Digital Advertising
![Page 18: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/18.jpg)
Online Display Advertising
Do people buy stuff after seeing an ad?
![Page 19: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/19.jpg)
Data collection for post-view purchase conversion
[Diagram labels: Time; Cohort of random prospects; ?]
![Page 20: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/20.jpg)
Data For Advertising
[Table: rows are Examples, columns are Features; Target column "PV Buy": no, no, no, no, yes, yes]
![Page 21: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/21.jpg)
Multi-Armed Bandit: Exploration vs. exploitation
Show some random ads to learn a good model
Tradeoff between learning and using
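
One simple way to trade off learning against using is an epsilon-greedy policy: show a random ad a small fraction of the time and the best-looking ad otherwise. The simulation below is a sketch under assumptions, not the approach described in the talk. The conversion rates, the 10% exploration rate, and the function name `epsilon_greedy` are all made up for illustration.

```python
import random

def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate ad selection; return (shows per ad, conversions per ad)."""
    rng = random.Random(seed)
    n = len(true_rates)
    shows = [0] * n
    wins = [0] * n
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in shows:
            # Explore: show a random ad to keep learning.
            arm = rng.randrange(n)
        else:
            # Exploit: show the ad with the best observed conversion rate.
            arm = max(range(n), key=lambda i: wins[i] / shows[i])
        shows[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return shows, wins

# Hypothetical true conversion rates for three ads.
shows, wins = epsilon_greedy([0.02, 0.05, 0.20])
```

A larger `epsilon` spends more impressions on learning; a smaller one uses the current model more but adapts more slowly.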
![Page 22: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/22.jpg)
Size of the Training Sample?
[Table: rows are Examples, columns are Features; Target column "Buy": no, no, no, no, yes, yes]
![Page 23: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/23.jpg)
Very few luxury cars are bought online
Maserati: $128,000
![Page 24: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/24.jpg)
Reality of Online Purchases
[Table: rows are Examples, columns are Features; Target column "Buy": no, no, no, no, no, yes]
![Page 25: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/25.jpg)
Online Display Advertising
Proxy for purchase? How about click?
![Page 26: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/26.jpg)
Optimizing Clicks in Advertising?
[Table: Target column "Click?": yes, yes, no, no, yes, no]
![Page 27: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/27.jpg)
Click Optimization: Fumbling in the Dark
Top 10 Apps by CTR
![Page 28: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/28.jpg)
How Big Data and Optimization are killing Metrics
90% of clicks are accidental (non-intentional)
10% are meaningful, and changes can be measured
Optimization can find structure in the other 90%
You will end up with only non-intentional …
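
The effect can be illustrated with simple arithmetic. In the sketch below every number and placement name is hypothetical; the only point is that ranking by raw click-through rate favors placements where accidental clicks dominate.

```python
def rank_by_ctr(placements):
    """Sort placements by total click-through rate, highest first."""
    return sorted(placements, key=lambda p: p[1] + p[2], reverse=True)

def intentional_share(placement):
    """Fraction of a placement's clicks that are intentional."""
    _, intentional, accidental = placement
    return intentional / (intentional + accidental)

# Hypothetical placements: (name, intentional click rate, accidental click rate).
placements = [
    ("placement A", 0.0010, 0.0005),
    ("placement B", 0.0008, 0.0010),
    ("placement C", 0.0001, 0.0200),
    ("placement D", 0.0002, 0.0150),
]

# The CTR optimizer would pick these two first.
top = rank_by_ctr(placements)[:2]
top_names = [p[0] for p in top]
top_shares = [intentional_share(p) for p in top]
```

Here the two highest-CTR placements are the ones where nearly all clicks are accidental, while the placement with the most intentional clicks ranks last.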
![Page 29: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/29.jpg)
Online Display Advertising
Who cares about the ad anyway?
![Page 30: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/30.jpg)
Predict other indicators: search or brand site visit / schedule test drive
[Table: rows are Examples, columns are Features; Target column "Site Visit": no, no, no, yes, yes, yes]
![Page 31: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/31.jpg)
Advertising Fraud
![Page 32: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/32.jpg)
Is there really a person on the other end wanting to see the site?
![Page 33: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/33.jpg)
Data for Fraud Detection
[Table: rows are Examples, columns are Features; Target column "Human?": yes, no, no, yes, yes, no]
![Page 34: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/34.jpg)
Telling the difference between an algorithm and a human
Turing test, CAPTCHA
![Page 35: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/35.jpg)
Bot traffic networks
![Page 36: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/36.jpg)
Online Display Advertising
Who should you really advertise to???
![Page 37: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/37.jpg)
Data for Advertising Impact
[Table: rows are Examples, columns are Features; Target column "Impact": 1, 0.3, 0.5, 0, 0, 0.1]
![Page 38: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/38.jpg)
Alternative Histories (Counterfactual)
![Page 39: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/39.jpg)
Fundamentally Impossible!
[Table: rows are Examples, columns are Features; Target column "Impact": 1, 0.3, 0.5, 0, 0, 0.1]
![Page 40: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/40.jpg)
Build two separate models and calculate impact as the difference
[Table, Examples 1 (seen ad): Target column "Site Visit": yes, no, no, yes, no, no]
[Table, Examples 2 (not seen ad): Target column "Site Visit": yes, no, no, yes, no, no]
Expected Impact: p(SV|Ad) - p(SV|no ad)
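
The two-model difference can be sketched as follows. This is a minimal illustration, assuming each "model" is just a per-segment frequency estimate standing in for whatever classifier is actually trained; the segments, the data, and the function names are hypothetical.

```python
from collections import defaultdict

def visit_rate_by_segment(rows):
    """Estimate p(site visit | segment) from (segment, visited) rows."""
    visits = defaultdict(int)
    totals = defaultdict(int)
    for segment, visited in rows:
        totals[segment] += 1
        visits[segment] += int(visited)
    return {s: visits[s] / totals[s] for s in totals}

def expected_impact(seen_ad, not_seen_ad):
    """Return p(SV | ad) - p(SV | no ad) for segments present in both sets."""
    p_ad = visit_rate_by_segment(seen_ad)
    p_no_ad = visit_rate_by_segment(not_seen_ad)
    return {s: p_ad[s] - p_no_ad[s] for s in p_ad if s in p_no_ad}

# Invented data: (segment, visited the site?) for each user.
seen_ad = [("A", 1), ("A", 1), ("A", 0), ("A", 0),
           ("B", 1), ("B", 0), ("B", 0), ("B", 0)]
not_seen_ad = [("A", 1), ("A", 0), ("A", 0), ("A", 0),
               ("B", 1), ("B", 0), ("B", 0), ("B", 0)]

impact = expected_impact(seen_ad, not_seen_ad)
```

The difference is only a causal estimate if the two groups are comparable, for example through randomized assignment; otherwise it mixes the ad effect with selection effects.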
![Page 41: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/41.jpg)
Use predictive models to measure impact
Negative Test: wrong ad
Positive Test: A/B comparison
![Page 42: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/42.jpg)
Relationship of organic conversion rate and causal impact
[Figure: Additive causal impact (y-axis, -0.001 to 0.006) vs. Organic conversion propensity (x-axis, 0.40% to 1.40%)]
![Page 43: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/43.jpg)
Audiences in Video Advertising
![Page 44: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/44.jpg)
![Page 45: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/45.jpg)
Pleasing the advertising oracle …
Audience reports from matched populations in Facebook
68% of the ads were shown to females
Makeup for 32% of ads
The Oracle
![Page 46: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/46.jpg)
Data for Audience Optimization
[Table: rows are Examples, columns are Features; Target column "Gender": male, female, female, male, male, female]
![Page 47: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/47.jpg)
Weighted Logistic Regression on aggregated

| Weight | Gender |
|---|---|
| 0.32 | male |
| 0.68 | female |
| 0.32 | male |
| 0.68 | female |
| 0.73 | male |
| 0.27 | female |

Rows are Examples; Gender is the Target; each row also has Features.
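
One way to read this table: each example appears twice, once per gender, weighted by the share from the aggregate report. The sketch below fits a weighted logistic regression by plain gradient descent on such rows. It is an assumption-laden illustration, not the production method: the single binary feature, the learning rate, and the function names are hypothetical, and only the weights come from the slide.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_logistic(rows, n_features, lr=0.5, epochs=2000):
    """Gradient descent on the weighted log loss.

    rows: list of (features, label, weight); label 1 = female, 0 = male.
    Returns [bias, w_1, ..., w_n].
    """
    w = [0.0] * (n_features + 1)
    total_weight = sum(weight for _, _, weight in rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y, weight in rows:
            xs = [1.0] + list(x)  # prepend the bias input
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, xs)))
            for j, xj in enumerate(xs):
                grad[j] += weight * (p - y) * xj
        w = [wi - lr * g / total_weight for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    """Return the predicted probability of label 1 (female)."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

# Hypothetical feature: 1 if the impression belongs to a report that was
# 68% female, 0 if it belongs to a report that was 27% female.
rows = [
    ((1,), 1, 0.68), ((1,), 0, 0.32),
    ((1,), 1, 0.68), ((1,), 0, 0.32),
    ((0,), 1, 0.27), ((0,), 0, 0.73),
]
w = fit_weighted_logistic(rows, n_features=1)
p_female_seen = predict(w, (1,))
p_female_other = predict(w, (0,))
```

With one binary feature the fitted probabilities simply reproduce the report shares; with real features the same weighting lets individual-level attributes explain the aggregate proportions.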
![Page 48: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/48.jpg)
Hyperlocal Targeting?
Foursquare locations: very noisy…
![Page 49: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/49.jpg)
Data for Location Reliability in Auction
[Table: rows are Examples, columns are Features; Target column "Reliable?": yes, no, no, yes, yes, no]
![Page 50: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/50.jpg)
30% of smartphone users travel faster than the speed of sound …
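
A check of this kind can be sketched by computing the speed implied by consecutive location events for one device. The code below is a hypothetical illustration: the event format, the coordinates, and the function names are assumptions, and 343 m/s is only an approximate speed of sound in air.

```python
import math

SPEED_OF_SOUND_M_S = 343.0   # approximate, in air at about 20 degrees C
EARTH_RADIUS_M = 6_371_000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def implied_speeds(events):
    """events: time-sorted list of (unix_seconds, lat, lon) -> speeds in m/s."""
    speeds = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(events, events[1:]):
        dt = t1 - t0
        if dt > 0:  # skip simultaneous events to avoid dividing by zero
            speeds.append(haversine_m(la0, lo0, la1, lo1) / dt)
    return speeds

def is_supersonic(events):
    """True if any consecutive pair of events implies supersonic travel."""
    return any(v > SPEED_OF_SOUND_M_S for v in implied_speeds(events))

# Hypothetical devices: one jumps from New York to Los Angeles in 10 minutes,
# the other moves about 100 meters in the same time.
jumper = [(0, 40.7128, -74.0060), (600, 34.0522, -118.2437)]
walker = [(0, 40.7128, -74.0060), (600, 40.7138, -74.0060)]
```

Devices flagged this way are candidates for an unreliable location signal, which is one possible source of labels for the reliability target.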
![Page 51: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/51.jpg)
Catalan traditions pop up everywhere ….
![Page 52: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/52.jpg)
Data for Location Reliability in Auction
[Table: rows are Examples, columns are Features; Target column "Reliable?": maybe, no, no, maybe, maybe, no]
![Page 53: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/53.jpg)
Paradox of Big Data: “You never have the data you want”
The art of making do with second best
![Page 54: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC](https://reader030.vdocuments.us/reader030/viewer/2022032421/55a6883b1a28ab341e8b46c1/html5/thumbnails/54.jpg)
It is all a matter of how creative you are at cheating …