Intro to Data Science for Non-Data Scientists

H2O.ai Machine Intelligence. Data Science for Non-Data Scientists. Erin LeDell, Ph.D. Silicon Valley Big Data Science, August 2015

TRANSCRIPT

Page 1: Intro to Data Science for Non-Data Scientists

H2O.ai Machine Intelligence

Data Science for Non-Data Scientists

Erin LeDell Ph.D.

Silicon Valley Big Data Science August 2015

Page 2: Intro to Data Science for Non-Data Scientists

H2O.ai

H2O Company

• Team: 35. Founded in 2012, Mountain View, CA
• Stanford Math & Systems Engineers

H2O Software

• Open Source Software
• Ease of Use via Web Interface
• R, Python, Scala, Spark & Hadoop Interfaces
• Distributed Algorithms Scale to Big Data

Page 3: Intro to Data Science for Non-Data Scientists

Scientific Advisory Council

Dr. Trevor Hastie

• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)

Dr. Rob Tibshirani

• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPSS Presidents’ Award recipient
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap

Dr. Stephen Boyd

• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

Page 4: Intro to Data Science for Non-Data Scientists

What is Data Science?

Problem Formulation

• Identify an outcome of interest and the type of task: classification / regression / clustering
• Identify the potential predictor variables
• Identify the independent sampling units

Collect & Process Data

• Conduct research experiment (e.g. Clinical Trial)
• Collect examples / randomly sample the population
• Transform, clean, impute, filter, aggregate data
• Prepare the data for machine learning (X, Y)

Machine Learning

• Modeling using a machine learning algorithm (training)
• Model evaluation and comparison
• Sensitivity & Cost Analysis

Insights & Action

• Translate results into action items
• Feed results into research pipeline

Page 5: Intro to Data Science for Non-Data Scientists

Source: marketingdistillery.com

Page 6: Intro to Data Science for Non-Data Scientists

What is Machine Learning?

What it is:

✤ “Field of study that gives computers the ability to learn without being explicitly programmed.” (Samuel, 1959)

✤ “Machine learning and statistics are closely related fields. The ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.” (Jordan, 2014)

✤ M.I. Jordan also suggested the term data science as a placeholder name for the overall field.

What it’s not:

Unlike rules-based systems, which require a human expert to hard-code domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone.

Page 7: Intro to Data Science for Non-Data Scientists

Machine Learning Overview

Regression

• Predict a real-valued response (viral load, weight)
• Gaussian, Gamma, Poisson and Tweedie
• MSE and R^2

Classification

• Multi-class or Binary classification
• Ranking
• Accuracy and AUC

Clustering

• Unsupervised learning (no training labels)
• Partition the data / identify clusters
• AIC and BIC
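As a rough illustration of the two regression metrics named on this slide, here is a minimal Python sketch of MSE and R^2. The function names and the toy weights are made up for illustration; they are not from the talk or from H2O.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average squared gap between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: share of variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical weights (kg) and a model's predictions for them.
y_true = [60.0, 72.0, 85.0, 90.0]
y_pred = [62.0, 70.0, 83.0, 93.0]

print("MSE:", mse(y_true, y_pred))        # squared errors 4, 4, 4, 9 -> 5.25
print("R^2:", r_squared(y_true, y_pred))
```

A lower MSE is better, while an R^2 close to 1 means the predictions track the response closely.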

Page 8: Intro to Data Science for Non-Data Scientists

Machine Learning Workflow

Source: NLTK

Example of a supervised machine learning workflow.

Page 9: Intro to Data Science for Non-Data Scientists

ML Model Performance

Test & Train

• Partition the original data (randomly) into a training set and a test set. (e.g. 70/30)
• Train a model using the “training set” and evaluate performance on the “test set” or “validation set.”

K-fold Cross-validation

• Train & test K models as shown.
• Average the model performance over the K test sets.
• Report cross-validated metrics.

Performance Metrics

• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure
• Ranking (Binary Outcome): AUC, Partial AUC
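The train/test split and K-fold procedure on this slide can be sketched in plain Python. This is a simplified illustration, not how H2O implements it: the helper names, the "predict the training mean" model and the toy data are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly partition rows into (train, test), e.g. 70/30."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


def cross_validate(xs, ys, fit, score, k=5):
    """Train and test K models, then average the K test-set scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / k


# Hypothetical toy "model": always predict the mean of the training responses.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mse_score(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)


xs = list(range(10))
ys = [2.0 * x for x in xs]

train, test = train_test_split(list(zip(xs, ys)))
cv_mse = cross_validate(xs, ys, fit_mean, mse_score, k=5)
print(len(train), "training rows,", len(test), "test rows")
print("5-fold cross-validated MSE:", cv_mse)
```

The key point is that a model is always scored on rows it did not see during training, so the reported metric estimates performance on new data.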

Page 10: Intro to Data Science for Non-Data Scientists

What is Deep Learning?

What it is:

✤ “A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015)

✤ Deep neural networks have more than one hidden layer in their architecture. That’s what’s “deep.”

✤ Very useful for complex input data such as images, video, audio.

What it’s not:

Deep learning architectures, specifically artificial neural networks (ANNs), have been around since at least the 1980s, so they are not new. However, there were breakthroughs in training techniques that led to their recent resurgence (mid-2000s). Combined with modern computing power, they are quite effective.

Page 11: Intro to Data Science for Non-Data Scientists

Deep Learning Architecture

Example of a deep neural net architecture.

Page 12: Intro to Data Science for Non-Data Scientists

What is Ensemble Learning?

What it is:

✤ “Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms.” (Wikipedia, 2015)

✤ Random Forests and Gradient Boosting Machines (GBM) are both ensembles of decision trees.

✤ Stacking, or Super Learning, is a technique for combining various learners into a single, powerful learner using a second-level metalearning algorithm.

What it’s not:

Ensembles typically achieve superior model performance over singular methods. However, this comes at a price: computation time.

Page 13: Intro to Data Science for Non-Data Scientists

Where to learn more?

• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Learning & Data Science courses: http://coursebuffet.com

Page 14: Intro to Data Science for Non-Data Scientists

Customers, Community, Evangelists

November 9, 10, 11, Computer History Museum

H2OWORLD.H2O.AI

20% off registration using code: h2ocommunity

Page 15: Intro to Data Science for Non-Data Scientists

Questions?

@ledell on Twitter, GitHub

http://www.stat.berkeley.edu/~ledell

Page 16: Intro to Data Science for Non-Data Scientists

Data Science for Non-Data Scientists

aka. How the Business Views Data Science

Chen Huang

August 20, 2015

Page 17: Intro to Data Science for Non-Data Scientists
Page 18: Intro to Data Science for Non-Data Scientists

Agenda

•  Introduction
•  Data Science Primer
•  Working with Data Scientists
•  Decoding the Data Science Lingo
•  Q&A

Page 19: Intro to Data Science for Non-Data Scientists

Introduction

•  Who am I?
•  Why am I giving this talk?

Page 20: Intro to Data Science for Non-Data Scientists

Who am I?

•  Data Strategist
•  Career in Business Intelligence, Analytics, and Big Data
•  Various roles
    •  Consultant
    •  Developer
    •  Business and Data Analyst
    •  Product Manager
    •  Functional and Technical Trainer
    •  Client Services
•  Worked in various industries
    •  Health care, pharmaceutics, communications and high tech, consumer products, automotive, finance, government contracting

August, 2015 – San Francisco, CA

Page 21: Intro to Data Science for Non-Data Scientists

Why am I giving this talk?

July, 2011 – Beijing, China

Page 22: Intro to Data Science for Non-Data Scientists

Data Science Primer

•  What can Data Science do for the Business?
•  Applications of Data Science
•  Data-Driven Decisions
•  What does a Data Scientist do?
•  Data Science Skills

Page 23: Intro to Data Science for Non-Data Scientists

What can Data Science do for the Business?

A: Data science! It extracts useful information and knowledge from large volumes of data to improve business decision-making, giving the business the insights it needs to make data-driven decisions.

Diagram labels: Data, Business

Page 24: Intro to Data Science for Non-Data Scientists

What can Data do?

Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science

Page 25: Intro to Data Science for Non-Data Scientists

Applications of Data Science

Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science

Page 26: Intro to Data Science for Non-Data Scientists

Data-Driven Decisions

•  Practice of basing decisions on data, rather than purely on intuition

•  There is evidence that data-driven decision making and big data technologies substantially improve business performance

Page 27: Intro to Data Science for Non-Data Scientists

The Art and Science of Data Science

•  Discover unknowns in data
•  Obtain predictive, actionable insights
•  Communicate business data stories
•  Build confidence in decision making
•  Create valuable Data Products that have business impact

http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do

Page 28: Intro to Data Science for Non-Data Scientists
Page 29: Intro to Data Science for Non-Data Scientists

What does a Data Scientist do?

•  Data curiosity. Explore data. Discover unknowns
•  Understand data relationships
•  Understands the business, has domain knowledge
•  Can tell relevant stories with data
•  Holistic view of the business
•  Knows machine learning, statistics, probability
•  Can hack and code
•  Defines and tests a hypothesis, runs experiments
•  Asks good questions

http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science

Page 30: Intro to Data Science for Non-Data Scientists

Data Science Skills

Image: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 31: Intro to Data Science for Non-Data Scientists

Image: http://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize

Page 32: Intro to Data Science for Non-Data Scientists

Image: http://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize

Page 33: Intro to Data Science for Non-Data Scientists

Working with Data Scientists

•  Collaboration
•  Data Science Cycle
•  Organizational Models for Data Science Teams

Page 34: Intro to Data Science for Non-Data Scientists
Page 35: Intro to Data Science for Non-Data Scientists

Working with Data Scientists

Diagram labels: Data Science, Business, Data Engineering

Page 36: Intro to Data Science for Non-Data Scientists

Data Science Cycle

Image: https://en.wikipedia.org/wiki/Data_science

Page 37: Intro to Data Science for Non-Data Scientists

Organizational Models for Data Science Teams

Image: http://www.slideshare.net/emcacademics/building-data-science-teams-31057129

Page 38: Intro to Data Science for Non-Data Scientists

Decoding the Data Science Lingo

Page 39: Intro to Data Science for Non-Data Scientists

Machine Learning

Data Science Definition:

•  A subfield of computer science and artificial intelligence (AI) that focuses on the design of systems that can learn from and make decisions and predictions based on data.
•  Machine learning enables computers to act and make data-driven decisions rather than being explicitly programmed to carry out a certain task.
•  Machine Learning programs are also designed to learn and improve over time when exposed to new data.

Business Application:

•  Everything!

Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple

Page 40: Intro to Data Science for Non-Data Scientists

Unsupervised Learning

Data Science Definition:

•  Where a program, given a dataset, can automatically find patterns and relationships within the dataset.
•  The business decides how fine-grained the grouping is, or how many categories there are.
•  Clustering or grouping of like data.
•  Examples: k-means clustering, hierarchical clustering

Business Application:

•  Customer segmentation
•  Understanding users and behaviors
•  Classifying unknown and pre-defined images into categories

Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
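To make the k-means example concrete, here is a minimal sketch of the algorithm in plain Python, applied to customer segmentation. The customer numbers are invented, and starting from the first k points is a simplification chosen to keep the example deterministic; real libraries pick starting centroids more carefully.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of numbers.

    Assumption for this sketch: the first k points serve as initial centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(c)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customers: (visits per month, spend per visit in tens of dollars).
customers = [(1, 1), (9, 8), (1.5, 2), (8, 9), (1, 0.5), (9, 10)]
centroids, segments = kmeans(customers, k=2)

for centroid, members in zip(centroids, segments):
    print("segment centre", centroid, "->", members)
```

Note that no labels are supplied: the only choice made up front is k, the number of segments, which is the decision the slide leaves to the business.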

Page 41: Intro to Data Science for Non-Data Scientists

Supervised Learning

Data Science Definition:

•  Where a program is “trained” on a pre-defined dataset.
•  Based on its training data, the program can make accurate decisions when given new data.

Business Application:

•  Classifying Twitter sentiments
•  Recommender systems

Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple

Page 42: Intro to Data Science for Non-Data Scientists

Score

Data Science Definition:

•  Any of a number of ways to evaluate how well the model assigns the correct class value to the test instances.

Business Application:

•  Confidence gauge

Definition: https://mlcorner.wordpress.com/tag/scoring/

Page 43: Intro to Data Science for Non-Data Scientists

Score Cont.

•  True Positive (TP): the instance is positive and it is classified as positive
•  False Negative (FN): the instance is positive but it is classified as negative
•  True Negative (TN): the instance is negative and it is classified as negative
•  False Positive (FP): the instance is negative but it is classified as positive

•  Classification problems:
    •  Precision = proportion of instances classified as positive that are actually positive = TP/(TP+FP)
    •  Accuracy = proportion of correctly classified instances = (TP+TN)/(TP+TN+FP+FN)
    •  Recall or Sensitivity = proportion of actual positives that are correctly classified = TP/(TP+FN)
    •  Specificity = proportion of actual negatives that are correctly classified = TN/(TN+FP)
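The four counts and the four formulas above can be worked through in a few lines of Python. This is an illustrative sketch: the function names and the ten example labels are made up, with 1 standing for "positive" and 0 for "negative".

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # positive, classified as positive
        elif a and not p:
            fn += 1      # positive, classified as negative
        elif not a and p:
            fp += 1      # negative, classified as positive
        else:
            tn += 1      # negative, classified as negative
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Apply the slide's formulas to the four counts."""
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test set: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
metrics = classification_metrics(tp, fp, tn, fn)
print("TP, FP, TN, FN =", tp, fp, tn, fn)   # 3 1 5 1
for name, value in metrics.items():
    print(name, "=", round(value, 3))
```

Here the model is right 8 times out of 10 (accuracy 0.8), yet it finds only 3 of the 4 actual positives (recall 0.75), which is why a single score rarely tells the whole story.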

Page 44: Intro to Data Science for Non-Data Scientists

Classification

Data Science Definition:

•  Sub-category of Supervised Learning
•  Classification is the process of taking some sort of input and assigning a label to it. The predictions are discrete: categories, or answers of a “yes or no” nature.
•  Examples: Logistic Regression, Random Forest

Business Application:

•  Which customers should a company target with its marketing campaigns?
•  Is this Nigerian prince committing fraud? (Spam classification)
•  Is this actually Barack Obama’s Facebook profile and review on Amazon? (Fraud detection)

Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple

Page 45: Intro to Data Science for Non-Data Scientists

Regression

Data Science Definition:

•  Sub-category of Supervised Learning
•  Regression is a type of algorithm that predicts continuous values.

Business Application:

•  How much would a user spend on a mobile game like CandyCrush?
•  How much would someone spend on healthcare out of pocket?
•  How many attendees will come to this event based on past registration?

Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple

Page 46: Intro to Data Science for Non-Data Scientists

Decision Trees

Data Science Definition:

•  Uses a tree-like graph or model of decisions and their possible consequences.

Business Application:

•  Medical Testing (e.g. health incidences, etc.)
•  Genealogy breakdowns (e.g. eye color, blood type, etc.)

Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple

Page 47: Intro to Data Science for Non-Data Scientists

Deep Learning

Data Science Definition:

•  A category of machine learning algorithms that often use Artificial Neural Networks to generate models.

Business Application:

•  Image classification
•  Language processing
•  Audio processing
•  Outlier and fraud detection

Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple

Page 48: Intro to Data Science for Non-Data Scientists

Questions?