Infovision 2011: Data to Decisions, Shailesh Kumar, Google
http://informationexcellence.wordpress.com/category/knowledge-share-sessions/
http://informationexcellence.wordpress.com/2011/10/28/infovision2011-presentations/
![Page 1: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/1.jpg)
From Data to Decisions: Learnings from Real-World
Data Mining
Dr. Shailesh Kumar Google, Inc.
InfoVision 2011
![Page 2: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/2.jpg)
Welcome to the Information Age … drowning in data and starving for knowledge
ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGCTGCATGCCTAGTGCACCTACGCAGTATAAACAATAATAAATTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCTTATTACGCGACCGTAAGCTAC…
![Page 3: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/3.jpg)
This data explosion is enabled by…
Better “Sensors” – Higher Resolution, More Spectral Bands, Quick Experimental Turnaround, Crowd Sourcing…
Higher Bandwidth Communication – Faster Networks and Routers, Better Compression technologies…
Larger Warehouses – Cheaper Storage, Multi-Level Caching, Scalable Database/Data warehousing technologies…
Massive Crunching Power – Faster Multi-core processors, Parallel Distributed Computing, MapReduce paradigms…
Advances in Machine Learning and Data Mining – Sophisticated Learning frameworks, Distributed Data Mining…
![Page 4: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/4.jpg)
From “Data” to “Decision”
[Figure: pipeline from Data to Decision through Insights, Features, Models, and Predictions, shaped by Domain Knowledge, Business Objectives, and Business Constraints, with a Feedback loop]
![Page 5: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/5.jpg)
Observation → Prediction → Decision
Credit Card Fraud – Input: Past card usage behavior; Predict: Fraudulent transaction?; Decision: Approve Transaction?
Credit Scoring – Input: Past payment behavior; Predict: Probability of Default; Decision: Approve Loan?
Retail Cross Sell – Input: Past purchase behavior; Predict: Response to a coupon; Decision: Send Coupon?
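The observation, prediction, decision chain in these examples can be illustrated with a small cost-based rule. This is only a sketch: the function name `approve_transaction`, the cost figures, and the idea of comparing expected losses are assumptions made for illustration and are not taken from the slides.

```python
def approve_transaction(p_fraud, amount, decline_cost=5.0):
    """Approve when the expected fraud loss is below the cost of declining.

    p_fraud      : model prediction, probability that the transaction is fraud
    amount       : value that would be lost if the transaction is fraudulent
    decline_cost : hypothetical cost of wrongly declining a genuine customer
    """
    expected_loss_if_approved = p_fraud * amount
    expected_loss_if_declined = (1.0 - p_fraud) * decline_cost
    return expected_loss_if_approved < expected_loss_if_declined


# Same predicted probability, different decisions depending on the amount.
small = approve_transaction(p_fraud=0.02, amount=50.0)
large = approve_transaction(p_fraud=0.02, amount=5000.0)
```

The point of the sketch is that the prediction alone does not fix the decision; the business costs around it do.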
![Page 6: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/6.jpg)
Building Machine Learning Models: The Process, the Art, and the Science
Collect Raw (Input) Data
Collect Target (Output) Labels (“ground truth”) – Can be Costly!!
Choose: “Model Type” & “Model Complexity” – Too Simple: Under-Learn; Too Complex: Over-Learn (Bias-Variance Tradeoff)
Engineer and Select “Predictive” features • Use Domain Knowledge • Keep variability that matters • Remove Redundancy
“Train” a model using Feature-Label training data set
“Evaluate” the trained model on “validation” data and iterate until satisfied
“Deploy” the model: Predict class label of all the “un-labeled” data
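The train, evaluate, deploy loop can be sketched end to end with a deliberately tiny model. Everything here is an assumption for illustration: the nearest-centroid classifier, the helper names (`split`, `train_centroids`, `predict`, `accuracy`), and the toy data are not from the talk.

```python
import random


def split(rows, validation_fraction=0.25, seed=0):
    """Shuffle labeled rows and hold out a validation set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - validation_fraction))
    return rows[:cut], rows[cut:]


def train_centroids(rows):
    """'Train': compute the mean feature vector of each label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist2(model[label]))


def accuracy(model, rows):
    """'Evaluate': fraction of rows whose predicted label matches the true one."""
    return sum(predict(model, f) == y for f, y in rows) / len(rows)


# Toy labeled data: two well-separated groups.
labeled = [((float(x), x + 1.0), "low") for x in range(10)]
labeled += [((x + 100.0, x + 101.0), "high") for x in range(10)]

train_rows, validation_rows = split(labeled)
model = train_centroids(train_rows)
validation_accuracy = accuracy(model, validation_rows)

# 'Deploy': predict labels for unlabeled feature vectors.
deployed = [predict(model, f) for f in [(2.0, 3.0), (105.0, 106.0)]]
```

In a real project the iteration happens between the feature, model, and evaluation steps; the sketch runs the loop only once.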
![Page 7: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/7.jpg)
Lessons from Real-world Data Mining
Insights
Features
Labels
Models
Decisions
![Page 8: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/8.jpg)
Looking for a Needle in a Haystack?
What is the nature of my haystack (data)? What process generated the data? What assumptions am I making about the data?
Is it the right needle (insight) to look for? Is it “actionable”? Is it “useful”? Is it “novel”? Does it tell me something I didn’t know?
Insight Discovery ≠ Hypothesis Testing
![Page 9: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/9.jpg)
The Traditional Market Basket Analysis Wrong needle in a mysterious haystack!
[Figure: FREQUENT ITEM-SETS (Size = 1) → CANDIDATE ITEM-SETS (Size = 2) → FREQUENT ITEM-SETS (Size = 2) → CANDIDATE ITEM-SETS (Size = 3) → FREQUENT ITEM-SETS (Size = 3)]
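The level-wise flow on this slide (frequent sets of size k produce candidates of size k+1, which are then filtered by support) matches the usual Apriori-style procedure. A minimal sketch follows; the function name `frequent_itemsets`, the `min_support` threshold, and the toy baskets are illustrative assumptions.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support, max_size=3):
    """Level-wise search: frequent k-sets -> candidate (k+1)-sets -> frequent (k+1)-sets."""
    baskets = [frozenset(b) for b in baskets]

    def support(itemset):
        # Number of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets)

    items = sorted({i for b in baskets for i in b})
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {s: support(s) for s in frequent}

    size = 1
    while frequent and size < max_size:
        size += 1
        # Candidate generation: unions of frequent sets that have the target size.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == size}
        # Pruning: every (size - 1)-subset of a candidate must itself be frequent.
        previous = set(frequent)
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, size - 1))
        }
        frequent = [c for c in candidates if support(c) >= min_support]
        result.update({c: support(c) for c in frequent})
    return result


baskets = [
    {"tent", "airbed", "lantern"},
    {"tent", "airbed"},
    {"tent", "lantern"},
    {"airbed", "lantern"},
    {"tent", "airbed", "lantern", "grill"},
]
found = frequent_itemsets(baskets, min_support=3)
```

In this toy data every pair among tent, airbed, and lantern is frequent, yet the full triple falls below the threshold and is dropped, which is the kind of miss the slide title alludes to.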
![Page 10: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/10.jpg)
Lesson: Know your data (Haystack)
What process generated the data?
A basket is a mixture of projections of latent intentions. Customers may:
• already have other products
• buy them from another retailer
• buy them at a different time
• have got them as gifts
• …
Few buy a complete “logical” product group in the same basket
![Page 11: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/11.jpg)
Lesson: Extract the essence, let go of data Pair-wise Co-occurrence Statistics
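One way to read "extract the essence" is to reduce the baskets to item counts and pair counts and work from those alone. The sketch below does that and scores pairs with pointwise mutual information; PMI is an assumed choice of statistic here, since the slide only names pair-wise co-occurrence statistics, and the function names and baskets are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_stats(baskets):
    """Count how often each item and each unordered item pair appears in a basket."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts


def pmi(a, b, item_counts, pair_counts, n_baskets):
    """Pointwise mutual information: log of observed vs. expected co-occurrence."""
    pair = tuple(sorted((a, b)))
    joint = pair_counts[pair] / n_baskets
    if joint == 0:
        return float("-inf")
    expected = (item_counts[a] / n_baskets) * (item_counts[b] / n_baskets)
    return math.log(joint / expected)


baskets = [["tv", "speakers"], ["tv", "speakers"], ["tv", "milk"], ["milk", "bread"]]
item_counts, pair_counts = cooccurrence_stats(baskets)
pmi_tv_speakers = pmi("tv", "speakers", item_counts, pair_counts, len(baskets))
pmi_tv_milk = pmi("tv", "milk", item_counts, pair_counts, len(baskets))
```

Once these counts are stored, the raw baskets are no longer needed for pair-level analysis.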
![Page 12: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/12.jpg)
Lesson: Look for the right Insight “Frequent” vs. “Logical” Itemset
Novel – Not obvious from the data (support = 0)
Useful – product bundling, recommendations, layout
Exhaustive – “No insight left behind!” – however “rare”
[Figure: example logical itemsets. Labels in one group: Airbeds, Lighting, Folding Furniture, Camping Accessories, Grill Accessories, Inflatables, Water Sports, Lighting, Patio Accessories, Furniture. Labels in another group: Projection TV, Flat Panel TV, Home Theatre Services, Digital Cable TV, Home Components, Speakers]
![Page 13: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/13.jpg)
Lessons from Real-world Data Mining
Insights
Features
Labels
Models
Decisions
![Page 14: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/14.jpg)
Two Mindsets to Modeling
Model-Centric (Simple Features, Complex Model) • Throw all features in! • Have enough data • Build Complex models
Feature-centric (Complex Features, Simple Model) • Carefully craft features • Use Domain Knowledge • Build Simpler Models
The Law of Conservation of Complexity
![Page 15: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/15.jpg)
Lesson: Distribute Complexity well
Simplify Models with complex features
[Figure: Simple Features + Complex Model vs. Complex Features + Simple Model]
![Page 16: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/16.jpg)
Lesson: Overcome model limitations
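[Figure, first panel: axis-parallel splits plotted as Income vs. Age: Age < 60, Income < Rs. 32, Education < 20]
[Figure, second panel: the same problem with an engineered feature, plotted as log (Income) vs. Age: Education < 20, log (Income) - B x Age < 12]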
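The limitation in question is that a tree splits on one raw feature at a time, so a diagonal boundary such as log (Income) - B x Age < 12 needs many splits unless it is supplied as a feature. A minimal sketch with a one-split "stump" follows; the value of `B`, the grid of ages and incomes, and the helper `best_stump_accuracy` are assumptions chosen for illustration.

```python
import math

B = 0.02  # hypothetical coefficient; the slide does not give its value

# Small grid of (age, income) points labeled by the diagonal rule from the slide.
ages = [20, 40, 60]
log_incomes = [12.2, 12.6, 13.0, 13.4]
rows = [(age, math.exp(v)) for age in ages for v in log_incomes]
labels = [math.log(income) - B * age < 12 for age, income in rows]


def best_stump_accuracy(values, labels):
    """Best accuracy of a single threshold split on one feature (either polarity)."""
    best = 0.0
    for threshold in sorted(set(values)):
        for polarity in (True, False):
            predictions = [(x < threshold) == polarity for x in values]
            correct = sum(p == y for p, y in zip(predictions, labels))
            best = max(best, correct / len(labels))
    return best


acc_age = best_stump_accuracy([age for age, _ in rows], labels)
acc_income = best_stump_accuracy([income for _, income in rows], labels)
acc_engineered = best_stump_accuracy(
    [math.log(income) - B * age for age, income in rows], labels
)
```

On this grid no single split on Age or Income separates the classes, while one split on the engineered feature does.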
![Page 17: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/17.jpg)
Lessons from Real-world Data Mining
Insights Text
Features
Labels
Models
Decisions
![Page 18: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/18.jpg)
Lesson: Things are not what they appear
What is a word in “Bag-of-Words”?
Segmentation: What is a word? Is “New York Stock Exchange” 4 words, 2 phrases (“New York”, “Stock Exchange”), or 1 phrase?
Disambiguation: What does a word mean? ‘rock band’, ‘rock climbing’, ‘rocking chair’, ‘the rock’
Equivalencing: How “similar” are two terms? Comparing Apples to Oranges… Orange Juice, Orange Flag, Orange Blog, Apple store, Apple pie, The Big Apple
![Page 19: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/19.jpg)
Equivalencing (SIMILARITY = 0.995)
we filed a suit charging dell of illegal behavior
they submitted a case accusing apple of unauthorized conduct
Disambiguation (SIMILARITY = 0.171)
i was right to avoid a suit against apple
on my right was a man in a suit drinking apple juice
You shall know a word by the company it keeps -- Firth, J. R. 1957:11
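Firth's idea can be sketched by representing each word by counts of its neighboring words and comparing words by cosine similarity. This is an assumed, simplified scheme: the window size, the toy corpus, and the function names are illustrative, and it does not reproduce the similarity values quoted on the slide.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


corpus = [
    "we filed a suit against the company",
    "we filed a case against the company",
    "he wore a suit to the office",
]
vecs = context_vectors(corpus)
sim_suit_case = cosine(vecs["suit"], vecs["case"])
sim_suit_office = cosine(vecs["suit"], vecs["office"])
```

Here "suit" ends up closer to "case" than to "office" because the first two share more of the same company.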
![Page 20: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/20.jpg)
Lessons from Real-world Data Mining
Insights
Features
Labels
Models
Decisions
![Page 21: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/21.jpg)
Labels are precious – use them well Labeled data vs. Unlabeled data
Lots of input data! (e.g. web pages)
Small fraction is labeled! (e.g. spam/not)
Labels can be:
• Costly – human judgments, costly experiments, rare events
• Noisy – web clicks, crowd sourced, …
How do we use unlabeled data with labeled data? → Semi-supervised Learning
Which unlabeled data point to get labeled next? → Active Learning
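Both questions can be illustrated with two small selection rules. These are common textbook heuristics (uncertainty sampling for active learning, confident pseudo-labeling for semi-supervised self-training), chosen here as assumptions; the slides do not say which methods were used. `toy_spam_probability` is a hypothetical stand-in for a trained model.

```python
import math


def most_uncertain(unlabeled, predict_proba):
    """Active learning: pick the example whose predicted probability is closest to 0.5."""
    return min(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))


def confident_pseudo_labels(unlabeled, predict_proba, threshold=0.95):
    """Self-training: keep only examples the current model is very sure about."""
    pseudo = []
    for x in unlabeled:
        p = predict_proba(x)
        if p >= threshold:
            pseudo.append((x, True))
        elif p <= 1.0 - threshold:
            pseudo.append((x, False))
    return pseudo


def toy_spam_probability(num_links):
    # Hypothetical stand-in for a trained spam model: more links, more spam-like.
    return 1.0 / (1.0 + math.exp(-(num_links - 5)))


pool = [0, 2, 4, 9, 15]  # unlabeled pages, described only by their link count
next_to_label = most_uncertain(pool, toy_spam_probability)
pseudo_labeled = confident_pseudo_labels(pool, toy_spam_probability)
```

The uncertain example goes to a human judge, while the confident ones can be added to the training set with their predicted labels.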
![Page 22: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/22.jpg)
Lessons from Real-world Data Mining
Insights
Features
Labels
Models
Decisions
![Page 23: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/23.jpg)
Lesson: Don’t beat data into submission Model Complexity no more than necessary
• How many hidden units in a neural network?
• How deep a decision tree?
• How much cost for “misclassification elasticity” in SVM?
• How many clusters, or modes in a mixture of densities?
Model is too simple → under-learn
Model is too complex → memorize
Model is just right → generalize
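Choosing complexity on held-out data can be sketched with k-nearest-neighbor regression, where k plays the role of the complexity knob: k = 1 memorizes the training set, and k equal to the training-set size predicts one global average. The data, the candidate values of k, and the function names are assumptions for illustration.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(train, data, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


# Noisy samples of a smooth curve, split alternately into train and validation.
rng = random.Random(0)
points = [(i / 10, math.sin(i / 10) + rng.gauss(0, 0.3)) for i in range(60)]
train, validation = points[0::2], points[1::2]

candidates = [1, 3, 5, 9, 30]
train_error = {k: mean_squared_error(train, train, k) for k in candidates}
validation_error = {k: mean_squared_error(train, validation, k) for k in candidates}

# Pick complexity by validation error, never by training error.
best_k = min(candidates, key=validation_error.get)
```

With k = 1 the training error is exactly zero because every training point is its own nearest neighbor, which is why training error alone cannot be used to choose the model.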
![Page 24: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/24.jpg)
Lesson: Divide and Conquer Many simple models > Single complex model
[Figure: the letter classes A–Z partitioned into groups: M W N U F P; V Y S Z B E I J; A; K R; H Q O G; L D; T; X C]
• Better “localized features” • Simpler “local models” • More interpretable features and models • Higher Accuracy • Faster Modeling Time • Lower Resource Requirements
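A toy regression case shows how several simple local models can beat a single model of the same kind fitted globally. The V-shaped data, the split at zero, and the helper names are assumptions for illustration; the slide's own example appears to concern letter classes and is not reproduced here.

```python
def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def squared_error(points, line):
    slope, intercept = line
    return sum((slope * x + intercept - y) ** 2 for x, y in points)


# V-shaped data: y = |x|. No single straight line fits it.
points = [(float(x), float(abs(x))) for x in range(-5, 6)]
global_error = squared_error(points, fit_line(points))

# Divide the input space, then fit one simple model per region.
left = [p for p in points if p[0] < 0]
right = [p for p in points if p[0] >= 0]
local_error = squared_error(left, fit_line(left)) + squared_error(right, fit_line(right))
```

Each local line is as simple as the global one, but together they fit the data that the single line cannot.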
![Page 25: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/25.jpg)
Lessons from Real-world Data Mining
Insights
Features
Labels
Models
Decisions
![Page 26: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/26.jpg)
Lesson: Interpret Predictions
What is the score? Why is the score that way?
Concept Space Prediction Score Overlay
*This is not what we mean by the “art of data mining”
![Page 27: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/27.jpg)
Lesson: Learn Globally, Decide Locally
“The Ford-Firestone dispute blew up in August 2000 and is still going strong. In response to claims that their 15-inch Wilderness AT, radial ATX and ATX II tire treads were separating from the tire core leading to grisly, spectacular crashes. Bridgestone/Firestone recalled 6.5 million tires….” -- Forbes
Accidents description Density Overlay
![Page 28: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/28.jpg)
Lesson: Prediction is not enough! Different Reasons, Different Decisions
Probability of defaulting Collection Notes
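The idea that two customers with similar default probabilities may need different actions can be sketched with a linear score and its largest contributing feature. Everything here is hypothetical: the weights, the feature names, the `ACTIONS` table, and the use of per-feature contributions as the "reason" are assumptions, not the method shown in the talk.

```python
import math

# Hypothetical logistic-model weights for a default-probability score.
WEIGHTS = {"missed_payments": 0.9, "utilization": 1.5, "months_unemployed": 0.4}
BIAS = -4.0

# Hypothetical mapping from the main reason to a collections action.
ACTIONS = {
    "missed_payments": "send payment reminder",
    "utilization": "review credit limit",
    "months_unemployed": "offer hardship plan",
}


def score_with_reason(features):
    """Return (probability of default, name of the feature contributing most)."""
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    logit = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    top_reason = max(contributions, key=contributions.get)
    return probability, top_reason


customer_a = {"missed_payments": 4, "utilization": 0.5, "months_unemployed": 0}
customer_b = {"missed_payments": 0, "utilization": 0.9, "months_unemployed": 8}

p_a, reason_a = score_with_reason(customer_a)
p_b, reason_b = score_with_reason(customer_b)
action_a, action_b = ACTIONS[reason_a], ACTIONS[reason_b]
```

The two scores are close, but the reasons differ, so the chosen actions differ as well.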
![Page 29: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/29.jpg)
Summary
Decisions driven more by data than by “gut feeling”
Converting data to decisions is Art + Science + Engineering
Insights: Right needles in a well understood Haystack
Features: Garbage In, Garbage Out
Models: Generalize, don’t Memorize
Labels: Explore thoroughly, Exploit efficiently
Decisions: Right decision for the right reason
Feedback: Adapt features, models, scores, decisions
![Page 30: Infovision 2011 Data to Decisions Shailesh Kumar, Google](https://reader033.vdocuments.us/reader033/viewer/2022052600/5587a5c4d8b42af9678b45d3/html5/thumbnails/30.jpg)
In theory, theory and practice are the same.
In practice, they are not.
-- Lawrence Peter Berra
Questions?