More Data vs. Better Models
Really?
Anand Rajaraman: Former Stanford Prof. & Senior VP at Walmart
Sometimes, it’s not about more data
Norvig: “Google does not have better algorithms, only more data”
Many features/low-bias models
Sometimes, it’s not about more data
You might not need all your “Big Data”
Sometimes you do need a Complex Model
It pays off to be smart about Hyperparameters
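As a minimal sketch of "being smart about hyperparameters": even a plain grid search makes the tuning explicit and reproducible. The `evaluate` function here is a hypothetical stand-in for a real cross-validation score.

```python
from itertools import product

def evaluate(lr, reg):
    # Toy objective standing in for a cross-validation score;
    # it has a known optimum at lr=0.1, reg=0.01.
    return -((lr - 0.1) ** 2 + (reg - 0.01) ** 2)

def grid_search(lrs, regs):
    # Exhaustively try every (learning rate, regularization) pair
    # and keep the best-scoring combination.
    best_params, best_score = None, float("-inf")
    for lr, reg in product(lrs, regs):
        score = evaluate(lr, reg)
        if score > best_score:
            best_params, best_score = (lr, reg), score
    return best_params, best_score

best, score = grid_search([0.01, 0.1, 1.0], [0.001, 0.01, 0.1])
```

In practice the same loop structure underlies smarter strategies (random search, Bayesian optimization); only the candidate-generation step changes.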
Supervised plus Unsupervised Learning
Everything is an ensemble
The output of your model will be the input of another one
(and other system design problems)
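One hypothetical sketch of "the output of your model is the input of another one": a two-stage (stacked) pipeline, where the second model consumes the first model's prediction as a feature. The linear models here are illustrative placeholders, not the talk's actual system.

```python
def base_model(x):
    # Stand-in first-stage model: a fixed linear score.
    return 2.0 * x + 1.0

def meta_model(x, base_score):
    # Second-stage model consumes the raw feature plus the
    # first stage's prediction as an extra input.
    return 0.5 * x + 0.5 * base_score

def pipeline(x):
    # The two models must be trained, versioned, and deployed
    # together: changing base_model silently shifts the input
    # distribution that meta_model was trained on.
    return meta_model(x, base_model(x))

result = pipeline(2.0)
```

This coupling is exactly the "system design problem": retraining one stage in isolation can break the stage downstream of it.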
The pains & gains of Feature Engineering
Implicit signals beat explicit ones
(almost always)
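A hypothetical sketch of putting implicit signals to work: turn raw play counts into a binary preference plus a confidence weight, in the spirit of implicit-feedback collaborative filtering. The `alpha` scaling is an assumed illustrative choice, not a value from the talk.

```python
import math

def implicit_to_labels(play_counts, alpha=40.0):
    # Each implicit observation becomes (preference, confidence):
    # any interaction implies preference 1, and more interactions
    # imply higher confidence (log-damped to tame heavy users).
    labels = []
    for count in play_counts:
        preference = 1 if count > 0 else 0
        confidence = 1.0 + alpha * math.log1p(count)
        labels.append((preference, confidence))
    return labels

labels = implicit_to_labels([0, 1, 10])
```

Unlike explicit ratings, these labels exist for every user-item interaction, which is a big part of why implicit signals usually win.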
Be thoughtful about your Training Data
Your Model will learn what you teach it to learn
Learn to deal with Presentation Bias
More likely to be seen
Less likely to be seen
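One standard way to deal with presentation bias, sketched here as an assumption rather than the talk's method, is inverse-propensity weighting: clicks on rarely shown positions are up-weighted by how unlikely they were to be examined at all.

```python
def ipw_click_rate(clicks, propensities):
    # clicks: 1 if the item was clicked, 0 otherwise.
    # propensities: estimated probability the item was examined
    # (e.g. from position-bias estimates).
    weighted = sum(c / p for c, p in zip(clicks, propensities))
    return weighted / len(clicks)

# Two clicks: one at a prominent slot (p=1.0) and one at a rarely
# examined slot (p=0.25); the second click counts four times as much.
rate = ipw_click_rate([1, 1, 0, 0], [1.0, 0.25, 0.5, 0.5])
```

Without the correction, the model simply learns to re-recommend whatever was already shown at the top.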
Data and Models are great. You know what’s even better?
The right evaluation approach!
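As one concrete piece of "the right evaluation approach", a temporal split (train on the past, test on the future) is usually a more honest offline evaluation for recommendation than a random split, which leaks future behavior into training. This is an illustrative sketch, not the talk's specific protocol.

```python
def temporal_split(events, train_frac=0.8):
    # Sort interactions by timestamp, then cut once: everything
    # before the cut is training data, everything after is test.
    events = sorted(events, key=lambda e: e["ts"])
    cut = int(len(events) * train_frac)
    return events[:cut], events[cut:]

# Toy event log: timestamps 0..9 for a handful of users.
events = [{"ts": t, "user": t % 3} for t in range(10)]
train, test = temporal_split(events)
```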
You don’t need to distribute your ML algorithm
but, If you do, you should understand at what level to do it
The three levels of Distribution/Parallelization
● For each subset of the population (e.g. region)
● For each combination of the hyperparameters
● For each subset of the training data
Each level has different requirements
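The middle level is the easiest to sketch: each hyperparameter combination is an independent training run, so the runs can be dispatched to separate workers with no coordination between them. `train_and_score` is a hypothetical stand-in for a full training job.

```python
from concurrent.futures import ThreadPoolExecutor

def train_and_score(params):
    # Stand-in for a complete training run that returns a
    # validation score for one hyperparameter combination.
    lr, reg = params
    return -((lr - 0.1) ** 2 + reg)

# Build the grid of (learning rate, regularization) combinations.
grid = [(lr, reg) for lr in (0.01, 0.1, 1.0) for reg in (0.0, 0.01)]

# Embarrassingly parallel: no shared state between runs.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(train_and_score, grid))

best = grid[max(range(len(grid)), key=scores.__getitem__)]
```

The other two levels (by population subset, by training-data shard) need progressively more coordination, e.g. gradient aggregation for data-parallel training.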
ANN Training over distributed GPUs
Some things are better done Online and others Offline… and there is Nearline for everything in between
System Overview
● Blueprint for multiple personalization algorithm services
  ● Ranking
  ● Row selection
  ● Ratings
  ● …
● Recommendation involving multi-layered Machine Learning
Matrix Factorization Example
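A minimal matrix-factorization sketch via SGD on observed ratings, assuming the usual model where a rating is approximated by the dot product of a user factor and an item factor. The hyperparameter values are illustrative, not the talk's.

```python
import random

def factorize(ratings, n_users, n_items, k=2,
              lr=0.05, reg=0.02, epochs=200, seed=0):
    # P: user factors, Q: item factors, initialized near zero.
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            # Prediction error for this (user, item, rating) triple.
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            # SGD step with L2 regularization on both factors.
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# Toy data: both users like item 0 and dislike item 1.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
P, Q = factorize(ratings, n_users=2, n_items=2)
```

After training, `sum(P[u][f] * Q[i][f] for f in range(k))` reconstructs the observed ratings approximately, and also scores unobserved user-item pairs.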
The two faces of your ML infrastructure
Why you should care about answering questions (about your model)
The untold story of Data Science vs. ML Engineering