Data Science, Machine Learning, and H2O

Scalable Machine Learning For Smarter Applications


Scalable Machine Learning For Smarter Applications

Agenda

Data Science

Machine Learning

Trees and Power of Algorithmic Methods

Examples using H2O Scalable Machine Learning Engine

Who am I?

Hank Roark, Data Scientist & Hacker @ H2O.ai

Lecturer in Systems Thinking, UIUC
13 years at John Deere: research, new product development, new high-tech ventures
Previously at startups and consulting

Physics, Georgia Tech; Systems Design & Management, MIT

Data Science

Data Science

Interdisciplinary

Data is an electronic commodity, so a data scientist must speak 'hacker'

Extract insights from data

Discovery and building knowledge

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Data Science

Jeff Hammerbacher (Facebook, Cloudera)
• Identify problem
• Instrument data sources
• Collect data
• Prepare data (integrate, transform, clean, impute, filter, aggregate)
• Build model
• Evaluate model
• Communicate results
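As an illustration of the "prepare data" step, here is a minimal pandas sketch; the file names and columns are hypothetical, not from the deck.

```python
# A hypothetical sketch of the "prepare data" step above: integrate, transform,
# clean, impute, filter, aggregate. File names and columns are assumptions.
import pandas as pd

orders = pd.read_csv("orders.csv")        # hypothetical source
customers = pd.read_csv("customers.csv")  # hypothetical source

df = orders.merge(customers, on="customer_id")                            # integrate
df["order_month"] = pd.to_datetime(df["order_date"]).dt.to_period("M")    # transform
df = df.drop_duplicates()                                                 # clean
df["age"] = df["age"].fillna(df["age"].median())                          # impute
df = df[df["amount"] > 0]                                                 # filter
monthly_spend = df.groupby(["customer_id", "order_month"])["amount"].sum()  # aggregate
```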

Data Science

Ben Fry (data visualization expert)
• Acquire
• Parse
• Filter
• Mine
• Represent
• Refine
• Interact

Agenda

Data Science

Machine Learning

Trees and Power of Algorithmic Methods

Examples using H2O Scalable Machine Learning Engine

WHAT IS MACHINE LEARNING?

Field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959


A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Tom Mitchell, 1998


Types of Learning

• Supervised Learning
− Inferring a function from labeled data
− Classification
− Regression

• Unsupervised Learning
− Finding hidden structure in unlabeled data
− Clustering
− Anomaly detection

• Reinforcement Learning
− Learning from delayed feedback
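To make the supervised/unsupervised distinction concrete, here is a minimal sketch using H2O's Python API; the CSV path, feature columns, and label column are hypothetical.

```python
# A minimal sketch contrasting supervised and unsupervised learning with H2O's
# Python API; the dataset, feature columns, and label column are hypothetical.
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.kmeans import H2OKMeansEstimator

h2o.init()
frame = h2o.import_file("customers.csv")       # hypothetical dataset
x = ["age", "tenure", "monthly_spend"]         # hypothetical feature columns
y = "churned"                                  # hypothetical label column
frame[y] = frame[y].asfactor()                 # treat the label as categorical

# Supervised: infer a function from labeled data (here, classification).
clf = H2ORandomForestEstimator(ntrees=50)
clf.train(x=x, y=y, training_frame=frame)

# Unsupervised: find hidden structure in unlabeled data (here, clustering).
km = H2OKMeansEstimator(k=3)
km.train(x=x, training_frame=frame)
```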

Isn’t this just statistics repackaged?

[Diagram: x → nature → y]

Shared goals of data analysis:
• Prediction
• Information extraction

L. Breiman

Statistical Analysis

[Diagram: x → linear regression, logistic regression, Cox models → y]

Assume some process that creates the observed data

Model validation: yes/no using goodness-of-fit tests and residual examination

L. Breiman

Algorithmic Analysis (aka ML)

[Diagram: x → unknown → y, modeled with decision trees and neural networks]

The process that creates the observed data is unknowable

Model validation: measured by predictive accuracy

L. Breiman

Why Big Data + Machine Learning


Agenda

Data Science

Machine Learning

Trees and Power of Algorithmic Methods

Examples using H2O Scalable Machine Learning Engine

Trees

Short exploration of one algorithmic method

Can be used for regression and classification

Segments the predictor space into a number of simple regions

Often referred to as decision trees

Baseball Salary

Salary is color coded from low (blue) to high (red)

Tibshirani and Hastie
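The regions behind this figure come from a single regression tree. As a rough illustration only (not from the deck, and using scikit-learn rather than H2O), the sketch below fits a shallow tree on Years and Hits to predict log Salary; the CSV path and column names are assumptions about a local copy of the Hitters data.

```python
# A rough sketch (not from the deck, using scikit-learn rather than H2O) of the
# kind of regression tree behind this figure; the CSV path and column names
# are assumptions about a local copy of the Hitters data.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

hitters = pd.read_csv("Hitters.csv").dropna()   # hypothetical local copy of the data
X = hitters[["Years", "Hits"]]
y = np.log(hitters["Salary"])                   # salary is usually modeled on a log scale

# Each split carves the predictor space into simple rectangular regions.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Years", "Hits"]))
```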


Pros and Cons

Simple, thought to mirror human decision making

Not competitive with the best supervised learning approaches in terms of predictive accuracy

Combining a large number of trees results in dramatic improvements in predictive accuracy, with some loss of interpretability

Methods to Improve Predictive Performance of Trees

• Bagging
• Random Forest
• Boosting

Bagging is short for bootstrap aggregation.

Averaging a set of observations reduces variance.

Individual trees are built on samples of the data drawn with replacement (bootstrap)

Many trees are built and the results ‘averaged’ (Aggregation)

Random forest builds on bagging, by considering a random subset of the predictors at each tree split

This further decorrelates the trees, resulting in improved predictive performance.

Implemented in H2O as Random Forest.

Builds multiple models sequentially, using information from prior trees.

Each new model slowly fits the residuals of the prior models.

Is a general method, not limited to trees.

Implemented in H2O as GBM (Gradient Boosted Models); first ever parallel, distributed GBM.
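A minimal sketch of both ensembles using H2O's Python API follows; the dataset path, response column, and parameter values are assumptions, not settings from the deck.

```python
# A minimal sketch of Random Forest and GBM in H2O's Python API; the dataset
# path, response column, and parameter values are assumptions.
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
data = h2o.import_file("train.csv")             # hypothetical dataset
train, test = data.split_frame(ratios=[0.8], seed=42)
y = "response"                                  # hypothetical response column
x = [c for c in data.columns if c != y]

# Random Forest: bagging plus a random subset of predictors at each split.
rf = H2ORandomForestEstimator(ntrees=200)
rf.train(x=x, y=y, training_frame=train)

# GBM: trees built sequentially, each slowly fitting the residuals of the
# ensemble so far (learn_rate controls how slowly).
gbm = H2OGradientBoostingEstimator(ntrees=200, learn_rate=0.05, max_depth=5)
gbm.train(x=x, y=y, training_frame=train)

print(rf.model_performance(test))
print(gbm.model_performance(test))
```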

Which Algorithm Is Best?

[Figure: linear models vs. decision trees — Tibshirani and Hastie]

Which Algorithm Is Best?


We have dubbed the associated results No Free Lunch theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems. (Wolpert and Macready)

Agenda

Data Science

Machine Learning

Trees and Power of Algorithmic Methods

Examples using H2O Scalable Machine Learning Engine

H2O.ai Overview

• Founded: 2011, venture-backed; debuted in 2012
• Product: H2O, open source in-memory prediction engine
• Team: 37 distributed systems engineers doing ML
• HQ: Mountain View, CA
• 25,000 commits / 3 yrs

H2O World Conference 2014

Team Work @ H2O.ai

Join H2O World, Nov 9-11, 2015!

What is H2O? Open source in-memory prediction engine

Math Platform
• Parallelized and distributed algorithms making the most use out of multithreaded systems
• GLM, Random Forest, GBM, Deep Learning, etc.

API: Easy to use and adopt
• Written in Java – perfect for Java programmers
• REST API (JSON) – drives H2O from R, Python, Excel, Tableau

Big Data: More data? Or better models? BOTH
• Use all of your data – model without down-sampling
• Run a simple GLM or a more complex GBM to find the best fit for the data
• More Data + Better Models = Better Predictions
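A minimal sketch of that "simple GLM or a more complex GBM" comparison, using H2O's Python API; the dataset, binary response column, and parameters are assumptions.

```python
# A minimal sketch comparing a simple GLM against a more complex GBM;
# the dataset, binary response column, and parameters are assumptions.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
data = h2o.import_file("events.csv")            # hypothetical dataset, no down-sampling
train, valid = data.split_frame(ratios=[0.8], seed=1)
y = "converted"                                 # hypothetical binary response
x = [c for c in data.columns if c != y]
train[y] = train[y].asfactor()
valid[y] = valid[y].asfactor()

glm = H2OGeneralizedLinearEstimator(family="binomial")
glm.train(x=x, y=y, training_frame=train, validation_frame=valid)

gbm = H2OGradientBoostingEstimator(ntrees=300, learn_rate=0.05)
gbm.train(x=x, y=y, training_frame=train, validation_frame=valid)

# Compare validation AUC to find the better fit for this data.
print("GLM AUC:", glm.auc(valid=True))
print("GBM AUC:", gbm.auc(valid=True))
```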


Accuracy with Speed and Scale


Customer Use Cases

• Ad Optimization (200% CPA lift with H2O)
• P2B Model Factory (60k models, 15x faster with H2O than before)
• Fraud Detection (11% higher accuracy with H2O Deep Learning – saves millions)
• Real-time marketing (H2O is 10x faster than anything else)
• …and many large insurance, financial services, and manufacturing companies!

Customer Stories
• Propensity to Buy model
• AdTech
• Fraud prevention

Propensity to Buy modeling factory

Cisco Predictive Modeling Factories

Problem
• Need to predict whether a company will buy a certain product at a given time
• Spend a lot of time preparing models
• Less time for scoring and less time left for using the scores in the sales activities

Why H2O?
• P2B factory is 15x faster with H2O
• Newer buying patterns incorporated immediately into models
• Scores are published sooner
• More time for planning and executing activities
• R + H2O is a robust and powerful combination

Who uses it?
• Lou Carvalheira, advanced analytics manager
• Customer Intelligence data scientists
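A hypothetical sketch of what a propensity-to-buy "modeling factory" loop could look like with H2O's Python API; the account data layout, product list, and label columns are all assumptions, not Cisco's actual pipeline.

```python
# Hypothetical sketch of a propensity-to-buy modeling factory: one model per
# product line, refreshed each quarter. Data layout, product list, and label
# columns are assumptions, not Cisco's actual pipeline.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
accounts = h2o.import_file("accounts.csv")      # hypothetical account-level features
features = [c for c in accounts.columns if not c.startswith("bought_")]

models = {}
for product in ["router", "switch", "security"]:    # hypothetical product lines
    label = "bought_" + product                     # hypothetical 0/1 purchase label
    accounts[label] = accounts[label].asfactor()
    m = H2OGradientBoostingEstimator(ntrees=100, model_id="p2b_" + product)
    m.train(x=features, y=label, training_frame=accounts)
    models[product] = m

# Re-running the loop on a fresh data pull refreshes every model; the predicted
# propensities below are what gets published to sales.
scores = {p: m.predict(accounts) for p, m in models.items()}
```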

P2B factory is 15x faster with H2O

[Timeline figure. Before, without H2O: across Q1 and Q2, one long cycle of data refresh, P2B training, scoring models, then preparing and executing marketing & sales activities. Now, with H2O: each quarter fits a full cycle of data refresh, train & score, and preparing and executing marketing & sales activities.]

Modeling conversion rate on multiple campaigns

ShareThis AdTech Optimization

Problem
• ShareThis ONLY targets users within 24 hours to ensure ads reach them at the most relevant moment for maximum ROI

Why H2O?
• Maximized ROI by optimizing campaign performance and budget allocation
• Increased accuracy and better anomaly removal
• Reduced R&D time significantly
• Used all data and built models faster, with faster scoring
• Smooth model building pipeline with R and Spark API

Who uses it?
• Prasanta Behera, VP of Engineering
• Ad Products team
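A hypothetical sketch of modeling conversion rate across multiple campaigns with a single binomial GBM in H2O's Python API; the event data, columns, and parameters are assumptions, not ShareThis's actual pipeline.

```python
# Hypothetical sketch: one binomial GBM over impression-level data spanning
# multiple campaigns, used to rank users by predicted conversion probability.
# File path, columns, and parameters are assumptions.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
events = h2o.import_file("campaign_events.csv")    # hypothetical impression-level data
y = "converted"                                    # hypothetical 0/1 conversion label
x = ["campaign_id", "interest_segment", "hours_since_share", "device"]
events[y] = events[y].asfactor()
events["campaign_id"] = events["campaign_id"].asfactor()

gbm = H2OGradientBoostingEstimator(ntrees=200, learn_rate=0.05)
gbm.train(x=x, y=y, training_frame=events)

# Predicted conversion probability per impression; only the likeliest
# converters inside the 24-hour relevance window get targeted.
conversion_prob = gbm.predict(events)["p1"]
```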

[Diagram: interest over time for an example user, "Dan" (male 25-45, tech enthusiast, $HHI $75K+): a trigger sparks excitement, interest peaks at readiness for engagement, then fades; the ShareThis messaging trigger fires above the standard targeting threshold. Real-time messaging reaches users during peak interest. ShareThis ONLY targets users within 24 hours to ensure ads reach them at the most relevant moment.]

Live Tests on Different Campaigns: observed CPA lift using H2O

Fraud prevention using Deep Learning

PayPal Fraud Prevention

Problem
• Flag fraudulent behavior upfront
• Monitor account activity and account-to-account transactions for suspicious behavior and changes
• Need to model new and complex attack patterns quickly

Why H2O?
• Fast, scalable, and accurate
• Flexible deployment
• Works seamlessly with Hadoop
• Simple interface
• 11% improvement in accuracy with Deep Learning

Who uses it?
• Fraud Prevention data science team

Fraud Prevention at PayPal: Experiment

• Dataset
− 160 million records
− 1,500 features (150 categorical)
− 0.6 TB compressed in HDFS
• Infrastructure
− 800-node Hadoop (CDH3) cluster
• Decision
− Fraud / not fraud

Results
• Network architecture: 6 layers with 600 neurons each performed the best
• Activation function: RectifierWithDropout performed the best
• 11% accuracy improvement with a limited feature set and a deep network: a third of the original feature set, 6 hidden layers, 600 neurons each
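A minimal sketch of the configuration described above (6 hidden layers of 600 neurons, RectifierWithDropout activation) using H2O's Python API; the HDFS path, label column, split, and epochs are assumptions.

```python
# A minimal sketch of a deep network with 6 hidden layers of 600 neurons and
# RectifierWithDropout activation; HDFS path, label column, split, and epochs
# are assumptions, not PayPal's actual configuration.
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
txns = h2o.import_file("hdfs://namenode/fraud/transactions")  # hypothetical HDFS path
y = "is_fraud"                                                 # hypothetical label
x = [c for c in txns.columns if c != y]
txns[y] = txns[y].asfactor()
train, valid = txns.split_frame(ratios=[0.8], seed=7)

dl = H2ODeepLearningEstimator(
    hidden=[600] * 6,                      # 6 layers, 600 neurons each
    activation="RectifierWithDropout",     # best-performing activation in the experiment
    epochs=10,
)
dl.train(x=x, y=y, training_frame=train, validation_frame=valid)
print(dl.auc(valid=True))
```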

Fraud Prevention with Random Forest

[Flow: customer selects song to purchase → payment information entered → data collected → comparison with past consumer behavior → Random Forest determines fraud / not fraud → take steps to stop fraud or prevent future fraud]
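A hypothetical sketch of the scoring side of this flow: a previously trained Random Forest scores an incoming purchase against features derived from past consumer behavior. The model id, feature columns, and threshold are assumptions.

```python
# Hypothetical scoring sketch: a previously trained Random Forest scores an
# incoming purchase; model id, feature columns, and threshold are assumptions.
import h2o

h2o.init()
rf = h2o.get_model("fraud_rf")             # hypothetical previously trained model
purchase = h2o.H2OFrame({
    "amount": [0.99],                      # price of the selected song
    "account_age_days": [12],
    "purchases_last_24h": [7],
    "avg_past_purchase": [1.10],
})                                         # hypothetical behavioral features

pred = rf.predict(purchase)
if pred["p1"].flatten() > 0.9:             # hypothetical fraud threshold
    print("flag: stop the transaction / prevent future fraud")
else:
    print("approve the purchase")
```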

Live Demonstration

Agenda

Data Science

Machine Learning

Trees and Power of Algorithmic Methods

Examples using H2O Scalable Machine Learning Engine

Thank You