intro to data science for non-data scientists

H2O.ai Machine Intelligence

Data Science for Non-Data Scientists

Erin LeDell Ph.D.

Silicon Valley Big Data Science August 2015


H2O.ai

H2O Company

• Team: 35. Founded in 2012, Mountain View, CA
• Stanford Math & Systems Engineers

H2O Software

• Open Source Software
• Ease of Use via Web Interface
• R, Python, Scala, Spark & Hadoop Interfaces
• Distributed Algorithms Scale to Big Data


Scientific Advisory Council

Dr. Trevor Hastie

• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)

Dr. Rob Tibshirani

• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPPS Presidents’ Award recipient
• Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap

Dr. Stephen Boyd

• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers


What is Data Science?

Problem Formulation

• Identify an outcome of interest and the type of task: classification / regression / clustering
• Identify the potential predictor variables
• Identify the independent sampling units

Collect & Process Data

• Conduct research experiment (e.g. Clinical Trial)
• Collect examples / randomly sample the population
• Transform, clean, impute, filter, aggregate data
• Prepare the data for machine learning: X, Y

Machine Learning

• Modeling using a machine learning algorithm (training)
• Model evaluation and comparison
• Sensitivity & Cost Analysis

Insights & Action

• Translate results into action items
• Feed results into research pipeline

Source: http://marketingdistillery.com


What is Machine Learning?

What it is:

✤ “Field of study that gives computers the ability to learn without being explicitly programmed.” (Samuel, 1959)

✤ “Machine learning and statistics are closely related fields. The ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.” (Jordan, 2014)

✤ M.I. Jordan also suggested the term data science as a placeholder name for the overall field.

What it’s not:

Unlike rules-based systems, which require a human expert to hard-code domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone.


Machine Learning Overview

Regression

• Predict a real-valued response (viral load, weight)
• Gaussian, Gamma, Poisson and Tweedie
• MSE and R^2

Classification

• Multi-class or Binary classification
• Ranking
• Accuracy and AUC

Clustering

• Unsupervised learning (no training labels)
• Partition the data / identify clusters
• AIC and BIC
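The two regression metrics named above, MSE and R^2, can be computed by hand in a few lines. The following is a minimal sketch in plain Python; the function names and the weight values are made up for illustration and do not come from H2O.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (share of variance explained by the model)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed weights (kg) and a model's predictions for them.
observed = [60.0, 72.0, 81.0, 95.0]
predicted = [62.0, 70.0, 83.0, 93.0]

print("MSE:", mse(observed, predicted))
print("R^2:", r_squared(observed, predicted))
```

A lower MSE and an R^2 closer to 1 both indicate predictions that sit closer to the observed values.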


Machine Learning Workflow

Source: NLTK

Example of a supervised machine learning workflow.


ML Model Performance

Test & Train

• Partition the original data (randomly) into a training set and a test set. (e.g. 70/30)
• Train a model using the “training set” and evaluate performance on the “test set” or “validation set.”

K-fold Cross-validation

• Train & test K models as shown.
• Average the model performance over the K test sets.
• Report cross-validated metrics.

Performance Metrics

• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure
• Ranking (Binary Outcome): AUC, Partial AUC
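The K-fold procedure (train and test K models, then average the K test-set scores) can be sketched as follows. This is an illustrative plain-Python version, not H2O code: the helper names are invented, and the "model" is a deliberately trivial one that always predicts the training mean, so that the cross-validation loop itself stays in focus.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean(y_train):
    """Toy 'model': always predict the mean of the training outcomes."""
    return sum(y_train) / len(y_train)


def cross_validated_mse(y, k=5, seed=0):
    """Train and test k models, then average the k test-set MSEs."""
    fold_mses = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        # Train on every row that is NOT in the current test fold.
        y_train = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = fit_mean(y_train)
        # Evaluate on the held-out fold only.
        fold_mses.append(
            sum((y[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        )
    return sum(fold_mses) / k


# Hypothetical outcome values.
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 7.0]
print("5-fold cross-validated MSE:", cross_validated_mse(y, k=5))
```

Every row is used for testing exactly once, which is why the averaged score is a more stable estimate than a single 70/30 split.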


What is Deep Learning?

What it is:

✤ “A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015)

✤ Deep neural networks have more than one hidden layer in their architecture. That’s what’s “deep.”

✤ Very useful for complex input data such as images, video, audio.

What it’s not:

Deep learning architectures, specifically artificial neural networks (ANNs), have been around since 1980, so they are not new. However, breakthroughs in training techniques led to their recent resurgence (mid-2000s). Combined with modern computing power, they are quite effective.


Deep Learning Architecture

Example of a deep neural net architecture.
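To make "multiple non-linear transformations" concrete, here is a minimal sketch of a forward pass through a network with two hidden layers. The weights are hand-picked, hypothetical numbers (a real network learns them during training), and the layer sizes and the choice of ReLU and sigmoid activations are assumptions made for this example.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum of the inputs plus a bias per unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Non-linear activation: negative values become zero."""
    return [max(0.0, v) for v in values]


def sigmoid(v):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, layers):
    """Pass x through each hidden layer (dense + ReLU), then a sigmoid output unit."""
    *hidden, (out_w, out_b) = layers
    activation = x
    for weights, biases in hidden:
        activation = relu(dense(activation, weights, biases))
    return sigmoid(dense(activation, out_w, out_b)[0])


# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 hidden units -> 1 output.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.6, -0.1, 0.2], [-0.4, 0.3, 0.5]], [0.05, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]

print("Predicted probability:", forward([1.0, 2.0], layers))
```

Having more than one entry in the hidden part of `layers` is what makes this network "deep" in the sense used above.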


What is Ensemble Learning?

What it is:

✤ “Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms.” (Wikipedia, 2015)

✤ Random Forests and Gradient Boosting Machines (GBM) are both ensembles of decision trees.

✤ Stacking, or Super Learning, is a technique for combining various learners into a single, powerful learner using a second-level metalearning algorithm.

What it’s not:

Ensembles typically achieve superior model performance over singular methods. However, this comes at a price: computation time.
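The simplest way to see why combining learners can help is majority voting. The sketch below is not Random Forest, GBM, or stacking; it is a toy vote among three hypothetical base learners, each of which looks at a single binary feature. The data are hand-made so that each learner is right 75% of the time while the vote is always right.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the base learners' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Proportion of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical base learners: learner i predicts the value of feature i.
base_learners = [lambda row, i=i: row[i] for i in range(3)]

# Hand-made toy data: three binary features per row and the true label.
rows = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (0, 0, 1),
        (0, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 0)]
labels = [1, 1, 1, 0, 0, 0, 1, 0]

for i, learner in enumerate(base_learners):
    print("learner", i, "accuracy:", accuracy([learner(r) for r in rows], labels))

ensemble = [majority_vote([learner(r) for learner in base_learners]) for r in rows]
print("ensemble accuracy:", accuracy(ensemble, labels))
```

The cost mentioned above is visible even here: every prediction requires running all three learners instead of one.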


Where to learn more?

• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Learning & Data Science courses: http://coursebuffet.com

Customers, Community, Evangelists

November 9, 10, 11, Computer History Museum

h2oworld.h2o.ai

20% off registration using code: h2ocommunity


Questions?

@ledell on Twitter, GitHub [email protected]

http://www.stat.berkeley.edu/~ledell

Data Science for Non-Data Scientists

aka. How the Business Views Data Science

Chen Huang

August 20, 2015

Agenda

•  Introduction
•  Data Science Primer
•  Working with Data Scientists
•  Decoding the Data Science Lingo
•  Q&A

Introduction

•  Who am I?
•  Why am I giving this talk?

Who am I?

•  Data Strategist
•  Career in Business Intelligence, Analytics, and Big Data
•  Various roles
    •  Consultant
    •  Developer
    •  Business and Data Analyst
    •  Product Manager
    •  Functional and Technical Trainer
    •  Client Services
•  Worked in various industries
    •  Health care, pharmaceutics, communications and high tech, consumer products, automotive, finance, government contracting

August, 2015 – San Francisco, CA

Why am I giving this talk?

July, 2011 – Beijing, China

Data Science Primer

•  What can Data Science do for the Business?
•  Applications of Data Science
•  Data-Driven Decisions
•  What does a Data Scientist do?
•  Data Science Skills

What can Data Science do for the Business?

A: Data science! Extracting useful information and knowledge from large volumes of data to improve business decision-making, or giving the business the insights it needs to make data-driven decisions.

Data Business

What can Data do?

Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science

Applications of Data Science

Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science

Data-Driven Decisions

•  Practice of basing decisions on data, rather than purely on intuition

•  There is evidence that data-driven decision making and big data technologies substantially improve business performance

The Art and Science of Data Science

•  Discover unknowns in data
•  Obtain predictive, actionable insights
•  Communicate business data stories
•  Build confidence in decision making
•  Create valuable Data Products that have business impact

http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do

What does a Data Scientist do?

•  Data curiosity. Explore data. Discover unknowns
•  Understand data relationships
•  Understand the business, has domain knowledge
•  Can tell relevant stories with data
•  Holistic view of the business
•  Knows machine learning, statistics, probability
•  Can hack and code
•  Define and test a hypothesis, run experiments
•  Asks good questions

http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science

Data Science Skills

Image: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Image: http://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize

Working with Data Scientists

•  Collaboration
•  Data Science Cycle
•  Organizational Models for Data Science Teams

Working with Data Scientists

Data Science

Business

Data Engineering

Data Science Cycle

Image: https://en.wikipedia.org/wiki/Data_science

Organizational Models for Data Science Teams

Image: http://www.slideshare.net/emcacademics/building-data-science-teams-31057129

Decoding the Data Science Lingo

Machine Learning

Data Science Definition:

•  A subfield of computer science and artificial intelligence (AI) that focuses on the design of systems that can learn from and make decisions and predictions based on data.
•  Machine learning enables computers to act and make data-driven decisions rather than being explicitly programmed to carry out a certain task.
•  Machine Learning programs are also designed to learn and improve over time when exposed to new data.

Business Application:

•  Everything!

Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple

Unsupervised Learning

Data Science Definition:

•  Where a program, given a dataset, can automatically find patterns and relationships within the dataset.
•  The business will decide how deep the grouping goes or how many categories there are.
•  Clustering or grouping of like data.
•  Examples: k-means clustering, hierarchical clustering

Business Application:

•  Customer segmentation
•  Understanding users and behaviors
•  Classifying unknown and pre-defined images into categories
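As an illustration of the k-means example named above, here is a minimal plain-Python sketch applied to customer segmentation. The customer data, the two features, and the choice of starting centroids are all hypothetical; production libraries add smarter initialization and other refinements that are left out here.

```python
def assign(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
        for point in points
    ]


def update(points, labels, old_centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(old_centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical customers: (visits per month, average basket in dollars).
customers = [(1, 20), (2, 25), (1, 22), (9, 80), (10, 85), (8, 90)]
labels, centroids = k_means(customers, initial_centroids=[customers[0], customers[3]])

print("segment per customer:", labels)
print("segment centers:", centroids)
```

The number of starting centroids is the number of segments, which is the choice the slide says the business has to make.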


Supervised Learning

Data Science Definition:

•  Where a program is “trained” on a pre-defined dataset.
•  Based on its training data, the program can make accurate decisions when given new data.

Business Application:

•  Classifying Twitter sentiments
•  Recommender systems
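The "train on labeled data, then decide on new data" pattern can be sketched with a toy sentiment classifier. This is an assumed, deliberately naive word-counting approach written for illustration; the tweets and labels are invented, and real sentiment models are considerably more sophisticated.

```python
from collections import Counter


def train(labeled_texts):
    """Count how often each word appears under each label."""
    counts = {}
    for text, label in labeled_texts:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def predict(counts, text):
    """Pick the label whose training words overlap most with the new text."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)


# Hypothetical labeled tweets: this is the "pre-defined dataset".
training_data = [
    ("love this product", "positive"),
    ("great service and great price", "positive"),
    ("terrible support", "negative"),
    ("hate the long wait", "negative"),
]

model = train(training_data)

# New, unlabeled tweets the model has not seen before.
print(predict(model, "great product"))
print(predict(model, "terrible wait"))
```

The labels in `training_data` are what make this supervised: without them, the program would have nothing to learn the mapping from.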


Score

Data Science Definition:

•  A number of ways to evaluate how well the model assigns the correct class value to the test instances.

Business Application:

•  Confidence gauge

Definition: https://mlcorner.wordpress.com/tag/scoring/

Score Cont.

•  True Positive (TP): If the instance is positive and it is classified as positive
•  False Negative (FN): If the instance is positive but it is classified as negative
•  True Negative (TN): If the instance is negative and it is classified as negative
•  False Positive (FP): If the instance is negative but it is classified as positive
•  Classification problems:
    •  Precision = proportion of instances classified as positive that are actually positive = TP/(TP+FP)
    •  Accuracy = proportion of correctly classified instances = (TP+TN)/(TP+TN+FP+FN)
    •  Recall or Sensitivity = proportion of the actual positives that you correctly classify = TP/(TP+FN)
    •  Specificity = classifier’s ability to identify negative results = TN/(TN+FP)
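The four counts and the four formulas above translate directly into code. The sketch below uses plain Python with made-up labels (1 = positive, 0 = negative); the function names are chosen for this example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true positives, false negatives, true negatives and false positives."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    return tp, fn, tn, fp


def classification_scores(tp, fn, tn, fp):
    """Apply the formulas from the slide.

    Assumes every denominator is non-zero, i.e. the data contain both
    classes and the model predicts at least one positive.
    """
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test instances and a model's predictions for them.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, tn, fp = confusion_counts(actual, predicted)
print("TP, FN, TN, FP:", tp, fn, tn, fp)
for name, value in classification_scores(tp, fn, tn, fp).items():
    print(name, round(value, 3))
```

Here the model misses one real positive (a false negative) and raises one false alarm (a false positive), so precision and recall both come out at 0.75 while accuracy is 0.8.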

Classification

•  Sub-category of Supervised Learning

•  Classification is the process of taking some sort of input and assigning a label to it. The predictions are discrete, categorical, or “yes or no” in nature.

•  Examples: Logistic Regression, Random Forest

•  What customers should a company target with its marketing campaigns?

•  Is this Nigerian prince committing fraud? (Spam classification)

•  Is this actually Barack Obama’s Facebook profile and review on Amazon? (Fraud detection)



Regression

•  Sub-category of Supervised Learning

•  Regression is a type of algorithm that predicts continuous values.

•  How much would a user spend on a mobile game like CandyCrush?

•  How much would someone spend on healthcare out of pocket?

•  How many attendees will come to this event based on past registration?
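The attendance question above can be sketched as a one-variable linear regression fitted by ordinary least squares. The registration and attendance figures are hypothetical, and a straight line is only one of many possible regression models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical past events: number of registrations and actual attendees.
registrations = [100, 200, 300, 400]
attendees = [60, 110, 160, 210]

intercept, slope = fit_line(registrations, attendees)

# Predict a continuous value for a new event with 250 registrations.
expected = intercept + slope * 250
print("expected attendees:", expected)
```

The output is a number on a continuous scale rather than a category, which is what separates regression from classification.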



Decision Trees

•  Uses a tree-like graph or model of decisions and their possible consequences.

•  Medical Testing (e.g. health incidences, etc.)

•  Genealogy breakdowns (e.g. eye color, blood type, etc.)
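A decision tree can be represented as nested questions that end in outcomes. The tree below is hand-written and entirely hypothetical (the features, thresholds, and outcomes are invented for illustration); a learning algorithm would instead choose the splits from training data.

```python
# Hypothetical hand-written tree. Each internal node asks
# "is feature > threshold?" and each leaf is a plain string.
tree = {
    "question": ("temperature_c", 38.0),
    "yes": {
        "question": ("cough", 0.5),
        "yes": "order flu test",
        "no": "monitor fever",
    },
    "no": "no test needed",
}


def decide(node, patient):
    """Walk from the root to a leaf, following the yes/no branch at each node."""
    while isinstance(node, dict):
        feature, threshold = node["question"]
        node = node["yes"] if patient[feature] > threshold else node["no"]
    return node


print(decide(tree, {"temperature_c": 39.1, "cough": 1}))
print(decide(tree, {"temperature_c": 36.8, "cough": 1}))
```

Because every prediction is just a path of yes/no answers, the reasoning behind a result is easy to read back to a business audience.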



Deep Learning

•  A category of machine learning algorithms that often use Artificial Neural Networks to generate models.

•  Image classification
•  Language processing
•  Audio processing
•  Outlier and fraud detection



Questions?