Data Driven Action: A Primer on Data Science

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

SPRINGONE2GX, WASHINGTON, DC

Sarah Aerni (@iTweetSarah), Srivatsan Ramanujam (@being_bayesian), Jarrod Vawdrey (@jjvawdrey)

Upload: srivatsan-ramanujam

Post on 12-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Data Driven Action : A Primer on Data Science

SPRINGONE2GX
WASHINGTON, DC

Data Driven Action: A Primer on Data Science

Sarah Aerni (@iTweetSarah)
Srivatsan Ramanujam (@being_bayesian)
Jarrod Vawdrey (@jjvawdrey)

Page 2: Data Driven Action : A Primer on Data Science

Agenda

• Approaches and Open Source Tools for Wrangling and Modeling Massive Datasets
  • Sarah Aerni
• Text Analytics at Scale on MPP
  • Srivatsan Ramanujam
• A Scalable Framework For Real Time Monitoring & Prediction Of Sensor Data
  • Jarrod Vawdrey

Page 3: Data Driven Action : A Primer on Data Science

Our everyday devices are smart and talk to us

Page 5: Data Driven Action : A Primer on Data Science

Connected devices take action to make daily life easier.

But what else?

Page 6: Data Driven Action : A Primer on Data Science

How can IoT help prevent accidents like the Macondo disaster?

Page 7: Data Driven Action : A Primer on Data Science

In all industries, billions of data points represent opportunities for the Internet of Things

• Gene Sequencing: cost to sequence one genome has fallen from $100M in 2001 to $10K in 2011 to $1K in 2014
• Smart Grids: reading smart meters every 15 minutes is 3000x more data intensive
• Stock Market
• Social Media: Facebook uploads 250 million photos each day
• Oil Exploration: oil rigs generate 25000 data points per second
• Video Surveillance
• Medical Imaging
• Mobile Sensors

Page 8: Data Driven Action : A Primer on Data Science

Smart Systems = Sensors + Digital Brain + Actuators

[Diagram labels: Problem Formulation; Modeling Step; Data Step; Application Step; Data Science for Building Models; Sensors & Actuators; Data Lake]

Page 9: Data Driven Action : A Primer on Data Science

How can data drive true, automated action?

How does this…

Page 10: Data Driven Action : A Primer on Data Science

How can data drive true, automated action?

How does this…

…become this?

Page 11: Data Driven Action : A Primer on Data Science

How can data drive true, automated action?

• How is data collected?
• Where is it stored and processed?
• Is there real signal or just noise?
• How can we build a predictive model?
• When is the right time to take action?

Page 12: Data Driven Action : A Primer on Data Science

Critical considerations for successful modeling

How to build a predictive model at scale

Data-driven paradigms, data cleansing and feature engineering

Page 13: Data Driven Action : A Primer on Data Science

Critical considerations for successful modeling

How to build a predictive model at scale

Data-driven paradigms, data cleansing and feature engineering

Tradeoffs between model accuracy and timeliness

Page 14: Data Driven Action : A Primer on Data Science

Critical considerations for successful modeling

How to build a predictive model at scale

Data-driven paradigms, data cleansing and feature engineering

Derive insight from models to change processes

Tradeoffs between model accuracy and timeliness

Page 15: Data Driven Action : A Primer on Data Science

Critical considerations for successful modeling

How to build a predictive model at scale

Data-driven paradigms, data cleansing and feature engineering

Derive insight from models to change processes

Tradeoffs between model accuracy and timeliness

Use Cases
• Oil Drilling
• Vaccine Manufacturing
• Treating Patients

Page 16: Data Driven Action : A Primer on Data Science

Data: The New Oil

• Oil & gas generates large amounts of data from sensors, enabling data-driven approaches to improve operations

Predictive maintenance

• Motivation: Failure costs estimated at $150,000/incident (billions annually)*
• Goals
  – Early warning system
  – Insights into prominent features impacting operation and failure
  – Reduction of non-productive drill time
  – Reduced incidents

[Image: Drilling into the San Andreas Fault at Parkfield, California. Credit: Stephen H. Hickman, USGS]

*http://blog.pivotal.io/pivotal/case-studies-2/data-as-the-new-oil-producing-value-for-the-oil-gas-industry

Page 17: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing → Feature Building → Modeling

Page 18: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing → Feature Building → Modeling

Primary data sources

Operator Data (thousands of records)
• Failure details
• Component details
• Drill Bit details

Drill Rig Sensor Data (billions of records)
• Rate of Penetration (ROP)
• RPM
• Weight on Bit (WOB)

Integrated Data

Page 19: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing

[Plot: ROP over Time]

Page 20: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing

[Plot: ROP over Time, annotated with drill bit changes]

Page 22: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing

[Plot: WOB over Time]

Page 27: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing → Feature Building → Modeling

• A failure occurred at the end of this run

[Plots for one run: Bit position, RPM, ROP, WOB]

Page 28: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing → Feature Building → Modeling

• A failure occurred at the end of this run
• Taking a window of time prior to failure, what features should we extract (e.g. variance of RPM, max bit position velocity)?

[Plot: Bit position]
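The windowed features named on this slide (variance of RPM, maximum bit position velocity) could be computed roughly as follows. This is a minimal sketch, not the presenters' implementation: the record layout (`t`, `rpm`, `bit_position`), the function name `window_features`, and the sample values are all hypothetical, and readings are assumed to be sorted by time.

```python
from statistics import pvariance


def window_features(samples, failure_time, window):
    """Summarize the readings in [failure_time - window, failure_time).

    `samples` is a time-sorted list of dicts with hypothetical keys
    "t", "rpm" and "bit_position".
    """
    in_window = [s for s in samples
                 if failure_time - window <= s["t"] < failure_time]
    if len(in_window) < 2:
        raise ValueError("need at least two readings in the window")

    rpm = [s["rpm"] for s in in_window]
    # Bit position velocity approximated by finite differences
    # between consecutive readings.
    velocities = [
        (b["bit_position"] - a["bit_position"]) / (b["t"] - a["t"])
        for a, b in zip(in_window, in_window[1:])
    ]
    return {
        "rpm_variance": pvariance(rpm),
        "max_bit_velocity": max(velocities),
    }


# Made-up readings for illustration; the failure is taken to occur at t = 4.
samples = [
    {"t": 0, "rpm": 100.0, "bit_position": 0.0},
    {"t": 1, "rpm": 110.0, "bit_position": 1.0},
    {"t": 2, "rpm": 90.0, "bit_position": 3.0},
    {"t": 3, "rpm": 100.0, "bit_position": 4.0},
    {"t": 4, "rpm": 120.0, "bit_position": 4.5},
]
features = window_features(samples, failure_time=4, window=3)
```

Each run that ended in failure would contribute one such feature row; the window length is a modeling choice the slide leaves open.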

Page 29: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing → Feature Building → Modeling

• A failure occurred at the end of this run
• Taking a window of time prior to failure, what features should we extract (e.g. variance of RPM, max bit position velocity)?

[Plot: RPM]

Page 30: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing → Feature Building → Modeling

Predict occurrence of equipment failure in a chosen future time window

Predict remaining life of equipment

Predict Rate-of-Penetration

Page 31: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing → Feature Building → Modeling

Predict occurrence of equipment failure in a chosen future time window
• Logistic Regression
• Elastic Net Regularized Regression (Binomial)
• Support Vector Machines

Predict remaining life of equipment

Predict Rate-of-Penetration

Page 32: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing → Feature Building → Modeling

Predict occurrence of equipment failure in a chosen future time window
• Logistic Regression
• Elastic Net Regularized Regression (Binomial)
• Support Vector Machines

Predict remaining life of equipment
• Cox Proportional Hazards Regression

Predict Rate-of-Penetration

Page 33: Data Driven Action : A Primer on Data Science

How are models built using sensor data?

Integrating & Cleansing → Feature Building → Modeling

Predict occurrence of equipment failure in a chosen future time window
• Logistic Regression
• Elastic Net Regularized Regression (Binomial)
• Support Vector Machines

Predict remaining life of equipment
• Cox Proportional Hazards Regression

Predict Rate-of-Penetration
• Linear Regression
• Elastic Net Regularized Regression (Gaussian)
• Support Vector Machines

Page 34: Data Driven Action : A Primer on Data Science

Linear Regression: Streaming Algorithm

Finding linear dependencies between variables

ROP = c0 + WOB * cWOB

[Plot: Rate of Penetration (ROP) vs. Weight on Bit (WOB)]

Page 35: Data Driven Action : A Primer on Data Science

Linear Regression: Streaming Algorithm

Finding linear dependencies between variables

[Plot: Rate of Penetration (ROP) vs. Weight on Bit (WOB)]

Page 36: Data Driven Action : A Primer on Data Science

Linear Regression: Streaming Algorithm

Finding linear dependencies between variables

How to compute with a single scan?

[Plot: Rate of Penetration (ROP) vs. Weight on Bit (WOB)]
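One common way to answer the single-scan question is to keep running sums while reading each row once, then solve for the two coefficients of ROP = c0 + WOB * cWOB at the end. The sketch below illustrates that idea under stated assumptions: the class name `StreamingLinearFit` and the sample readings are hypothetical, it handles one independent variable only, and it is not taken from the slides or from MADlib's source.

```python
class StreamingLinearFit:
    """Least-squares fit of y = c0 + c1 * x from running sums (one pass)."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def update(self, x, y):
        # Each row is seen exactly once; only five numbers are kept in memory.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def coefficients(self):
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if denom == 0:
            raise ValueError("x values are constant or too few rows seen")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return intercept, slope


# Made-up (WOB, ROP) readings that lie exactly on ROP = 5 + 2 * WOB.
fit = StreamingLinearFit()
for wob, rop in [(0.0, 5.0), (1.0, 7.0), (2.0, 9.0), (3.0, 11.0)]:
    fit.update(wob, rop)
c0, c_wob = fit.coefficients()
```

With several independent variables the running sums generalize to the matrix X^T X and the vector X^T y, which are likewise accumulated row by row.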

Page 37: Data Driven Action : A Primer on Data Science

Linear Regression: Parallel Computation

Page 38: Data Driven Action : A Primer on Data Science

Linear Regression: Parallel Computation

Segment 1 Segment 2
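A rough sketch of how the computation might be split across two segments: each segment summarizes its own rows into a small tuple of sums, the tuples are added together, and the coefficients are solved once from the merged sums. The function names and the sample rows are hypothetical, and the code assumes a single independent variable; it illustrates the general pattern rather than the specific implementation shown in the talk.

```python
def segment_stats(rows):
    """Summarize one segment's (x, y) rows as (n, sum_x, sum_y, sum_xx, sum_xy)."""
    n = len(rows)
    sum_x = sum(x for x, _ in rows)
    sum_y = sum(y for _, y in rows)
    sum_xx = sum(x * x for x, _ in rows)
    sum_xy = sum(x * y for x, y in rows)
    return (n, sum_x, sum_y, sum_xx, sum_xy)


def merge_stats(a, b):
    """Combine two partial summaries; addition makes the merge order irrelevant."""
    return tuple(u + v for u, v in zip(a, b))


def solve(stats):
    """Return (intercept, slope) of the least-squares line from merged sums."""
    n, sum_x, sum_y, sum_xx, sum_xy = stats
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("x values are constant or too few rows")
    slope = (n * sum_xy - sum_x * sum_y) / denom
    return (sum_y - slope * sum_x) / n, slope


# Made-up rows, split across two segments as on the slide.
segment_1 = [(0.0, 5.0), (1.0, 7.0)]
segment_2 = [(2.0, 9.0), (3.0, 11.0)]
merged = merge_stats(segment_stats(segment_1), segment_stats(segment_2))
intercept, slope = solve(merged)
```

Because only the small summaries travel between segments, adding segments spreads the scan of the data without changing the result.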

Page 40: Data Driven Action : A Primer on Data Science

Linear regression on 10 million rows in seconds

[Plot: Execution time (s) vs. # independent variables, with one curve each for 6, 12, 18 and 24 segments]

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Page 41: Data Driven Action : A Primer on Data Science

Big Data Machine Learning in SQL
http://madlib.net/

Predictive Modeling Library

Linear Systems
• Sparse and Dense Solvers

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation

Descriptive Statistics

Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)

Correlation
Summary

Support Modules

Array Operations
Sparse Vectors
Random Sampling
Probability Functions
PMML Export

Page 42: Data Driven Action : A Primer on Data Science

Data Science Toolkit

[Diagram labels: PLATFORM; KEY TOOLS; KEY LANGUAGES; SQL]

Page 44: Data Driven Action : A Primer on Data Science

Opportunities for Data-Driven Decisions in Pharma

Page 45: Data Driven Action : A Primer on Data Science

A pipeline of sensors and opportunities for optimizing output

Internet of Things in Manufacturing

Input materials → Mix → Incubate → Filter → Centrifuge → Final Product

Page 46: Data Driven Action : A Primer on Data Science

A pipeline of sensors and opportunities for optimizing output

Internet of Things in Manufacturing

Input materials → Mix → Incubate → Filter → Centrifuge → Final Product

Sensors

[Plots: Temp over Time; Absorbance vs. Elution volume; Velocity over Time]

Page 47: Data Driven Action : A Primer on Data Science

A pipeline of sensors and opportunities for optimizing output

Internet of Things in Manufacturing

Input materials → Mix → Incubate → Filter → Centrifuge → Final Product

[Plots: Temp over Time; Absorbance vs. Elution volume; Velocity over Time]

• What opportunities exist for intervention, correction?
• Which attributes should be used as features in a model?
• When is the appropriate time to take action?

Page 48: Data Driven Action : A Primer on Data Science

A pipeline of sensors and opportunities for optimizing output

Internet of Things in Manufacturing

Input materials → Mix → Incubate → Filter → Centrifuge → Final Product

[Plots: Temp over Time; Absorbance vs. Elution volume; Velocity over Time]

• What opportunities exist for intervention, correction?
• Which attributes should be used as features in a model?
• When is the appropriate time to take action?

>6 months

Page 49: Data Driven Action : A Primer on Data Science

Predicting vaccine potency using manufacturing data

Model generation and evaluation

Input materials → Mix → Incubate → Filter → Centrifuge → Final Product

>6 months

[Plot: Predicted Potency vs. True Potency]

Page 50: Data Driven Action : A Primer on Data Science

Predicting vaccine potency using manufacturing data

Model generation and evaluation

Input materials → Mix → Incubate → Filter → Centrifuge → Final Product

>6 months

Data Integration → Feature Building → Modeling

[Plot: Predicted Potency vs. True Potency]

Page 51: Data Driven Action : A Primer on Data Science

Predicting vaccine potency using manufacturing data

Model generation and evaluation

Input materials → Mix → Incubate → Filter → Centrifuge → Final Product

>6 months

Data Integration → Feature Building → Modeling

• Tracing product through pipeline
• Integrating manual and automated data collection
• Missing data and outliers

[Plot: Predicted Potency vs. True Potency]

Page 52: Data Driven Action : A Primer on Data Science

Predicting vaccine potency using manufacturing data

Model generation and evaluation

Input materials → Mix → Incubate → Filter → Centrifuge → Final Product

>6 months

Data Integration → Feature Building → Modeling

• Extract multiple features from particular steps (duration, mean, median, etc.)
• Considerations
  • Tunable vs. measures
  • Step in pipeline

[Plot: Predicted Potency vs. True Potency]

[Plots: Temp over Time; Absorbance vs. Elution volume; Velocity over Time]
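The per-step features listed on this slide (duration, mean, median) could be derived from the raw readings of a single step along these lines. This is a minimal sketch: the function name `step_features`, the `(timestamp, value)` layout, and the incubation temperatures are made up for illustration and do not come from the talk.

```python
from statistics import mean, median


def step_features(step_name, readings):
    """Summarize one pipeline step of one batch.

    `readings` is a non-empty list of (timestamp, value) pairs; the
    layout is an assumption made for this example.
    """
    times = [t for t, _ in readings]
    values = [v for _, v in readings]
    return {
        f"{step_name}_duration": max(times) - min(times),
        f"{step_name}_mean": mean(values),
        f"{step_name}_median": median(values),
    }


# Hypothetical temperature readings (minutes, degrees) during incubation.
incubate = [(0, 36.5), (10, 37.0), (20, 37.5), (30, 39.0)]
features = step_features("incubate", incubate)
```

Prefixing each feature with the step name keeps track of where in the pipeline it was measured, which matters for the "step in pipeline" consideration above.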

Page 53: Data Driven Action : A Primer on Data Science

Predicting vaccine potency using manufacturing data

Model generation and evaluation

Input materials → Mix → Incubate → Filter → Centrifuge → Final Product

>6 months

Data Integration → Feature Building → Modeling

• Partial least squares
• Random forest
• Regularized regression

[Plot: Predicted Potency vs. True Potency]

Page 54: Data Driven Action : A Primer on Data Science

Interpreting the utility of a measure obtained during manufacturing based on model outcomes

Building insights from models

Some features may reveal tunable parameters to alter potency, others may simply be markers

Opportunities to provide real-time feedback on data entry errors and predicted potency outcomes

[Plot: Potency vs. Assayed value, Correlation=0.45]
[Plot: Potency vs. Duration of a step, Correlation=0.38]
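The correlation values quoted for these plots are presumably Pearson correlations between a feature and potency, although the slide does not name the measure. Under that assumption, a minimal sketch of the calculation looks like this; the function name and the sample numbers are hypothetical and are not the data behind the 0.45 and 0.38 figures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up batches: duration of one step and the potency measured later.
step_duration = [1.0, 2.0, 3.0, 4.0, 5.0]
potency = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson(step_duration, potency)
```

A correlation alone does not say whether a feature is a tunable parameter or merely a marker; that distinction, as the slide notes, has to come from knowledge of the process.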

Page 55: Data Driven Action : A Primer on Data Science

Internet of Things in Healthcare

Improving Patient Outcomes and Increasing Efficiency

Page 57: Data Driven Action : A Primer on Data Science

Beyond monitor alerts for crashing patients: prediction means prevention

Powering the Connected Hospital

Page 58: Data Driven Action : A Primer on Data Science

CDC, 2011: Number of Health Care Visits Per Year, Age Adjusted

[Figure: monthly calendars for September, October, November and December 2013, annotated "A Snapshot" and "Another Snapshot"]

Page 59: Data Driven Action : A Primer on Data Science

What happens between doctor visits?

[Figure: monthly calendars for September, October, November and December 2013, annotated "A Snapshot" and "Another Snapshot"; labels: Blood Glucose, Target Zone]

Page 60: Data Driven Action : A Primer on Data Science

The Promise of Internet of Humans

• Smart contact lenses and sensors to identify and alert patients before catastrophic events (e.g. blood sugar drop for diabetics)
• Wearables to track patient disease progression using objective measures
• Track patient adherence
• Detect disease outbreaks using sequencing in sewer system samples
• ECG monitoring on mobile phones for early alerting of stroke

Page 61: Data Driven Action : A Primer on Data Science

Scaling Java Applications for NLP on MPP through PL/Java

Srivatsan Ramanujam (@being_bayesian)

Page 62: Data Driven Action : A Primer on Data Science

Text Analysis at Scale: Business Use Cases

Page 63: Data Driven Action : A Primer on Data Science

Sentiment Analysis for Churn Prediction

Customer

A major telecom company

Business Problem

Reducing churn through more accurate models

Challenges

• Existing models used only structured features

• Call center memos were poorly structured and contained many typos

Solution

• Built sentiment analysis models to predict churn and topic models to understand topics of conversation in call center memos

• Achieved a 16% improvement in the ROC curve for churn prediction

Page 64: Data Driven Action : A Primer on Data Science

Predicting Commodity Futures using Twitter

Customer

A major agri-business cooperative

Business Problem

Predict price of commodity futures through Twitter

Challenges

• Language on Twitter does not adhere to rules of grammar and has poor structure

• No domain-specific labeled corpus of tweet sentiment, so the problem is semi-supervised

Solution

• Built Sentiment Analysis and Text Regression algorithms to predict commodity futures from Tweets

• Established the foundation for blending the structured data (market fundamentals) with unstructured data (tweets)

http://www.slideshare.net/SrivatsanRamanujam/sramanujam-taw-2014

Page 65: Data Driven Action : A Primer on Data Science

Platform and Tools

Page 66: Data Driven Action : A Primer on Data Science

Pipeline of a Data Science Driven App

[Diagram labels: Data Feeds; Spring XD (Ingest, Filter, Enrich, Sink); Data Lake; Greenplum; MLlib; PL/X; Model Building; Model Tuning; Continuous Model Improvement; Apps; Business Levers]

Page 67: Data Driven Action : A Primer on Data Science

Pivotal Greenplum MPP DB

Think of it as multiple PostgreSQL servers

[Diagram labels: Master; Segments/Workers]

Rows are distributed across segments by a particular field (or randomly)

Page 68: Data Driven Action : A Primer on Data Science

Data Parallelism through PL/X

• For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R or C/C++

• The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment

[Diagram: SQL enters at the Master Host (with a Standby Master); an Interconnect links it to four Segment Hosts, each running two Segments]

User Defined Functions

CREATE FUNCTION pymax (a integer, b integer)   -- SQL wrapper
RETURNS integer
AS $$
  if a > b:                                    # source language code
    return a
  return b
$$ LANGUAGE plpythonu;                         -- source language declaration
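To make the data-parallel idea concrete, here is a minimal pure-Python sketch that mimics what happens when a UDF such as pymax is called in a query: rows are assigned to segments by hashing a distribution key, and each segment applies the function to its own rows independently. This is not Greenplum code. The segment count, the `distribute` and `run_on_segments` helpers, the CRC32 hash, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict
import zlib


def pymax(a, b):
    """Same logic as the PL/Python UDF body on the slide."""
    if a > b:
        return a
    return b


def distribute(rows, key_index, num_segments):
    """Assign each row to a segment by hashing its distribution key.

    Hypothetical helper: a real MPP database uses its own hashing scheme.
    """
    segments = defaultdict(list)
    for row in rows:
        key = str(row[key_index]).encode("utf-8")
        segments[zlib.crc32(key) % num_segments].append(row)
    return segments


def run_on_segments(segments, func):
    """Apply func to every row; each segment could run on its own worker."""
    return {
        seg_id: [func(a, b) for (_key, a, b) in seg_rows]
        for seg_id, seg_rows in segments.items()
    }


# Made-up rows of (id, a, b).
rows = [(1, 3, 7), (2, 10, 4), (3, 5, 5), (4, -1, 0)]
segments = distribute(rows, key_index=0, num_segments=2)
results = run_on_segments(segments, pymax)
all_results = sorted(value for seg in results.values() for value in seg)
print(all_results)
```

Because no row depends on any other row, the per-segment work needs no coordination, which is what makes the task embarrassingly parallel.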

Page 69: Data Driven Action : A Primer on Data Science

PL/X : Libraries

We are able to tap into the vast collection of libraries from the open source ecosystem in languages like Python, R and Java and apply those for data parallel problems

[Diagram labels: PL/X; CoreNLP]

http://www.slideshare.net/SrivatsanRamanujam/pivotal-data-labs-technology-and-tools-in-our-data-scientists-arsenal

Page 70: Data Driven Action : A Primer on Data Science

Model Parallelism

• Data parallel computation via PL/X libraries only allows us to run ‘n’ models in parallel.

• This works great when we are building one model for each value of the group by column, but we need parallelized algorithms to be able to build a single model on all the available data.

• For this, we use MADlib, an open source library of parallel in-database machine learning algorithms.
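One common way to build a single model over data that is spread across segments is to have each segment compute small partial aggregates and then merge them. The sketch below does this for simple linear regression in plain Python. It illustrates the general pattern only and is not MADlib's API or implementation; the function names and the toy data are assumptions.

```python
def partial_stats(rows):
    """Per-segment sufficient statistics for y = slope * x + intercept."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)


def merge(stats_a, stats_b):
    """Merging is element-wise addition, so segments can be combined in any order."""
    return tuple(a + b for a, b in zip(stats_a, stats_b))


def solve(stats):
    """Closed-form least squares from the merged statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Toy data on two hypothetical segments, generated from y = 2x + 1.
segment_1 = [(0.0, 1.0), (1.0, 3.0)]
segment_2 = [(2.0, 5.0), (3.0, 7.0)]

merged = merge(partial_stats(segment_1), partial_stats(segment_2))
slope, intercept = solve(merged)
print(slope, intercept)
```

Only five numbers per segment cross the network, regardless of how many rows each segment holds.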

Page 71: Data Driven Action : A Primer on Data Science

MADlib: Scalable, in-database ML

Page 72: Data Driven Action : A Primer on Data Science

GPText

[Diagram labels: SQL; GPText; Master; Standby; Segment, Segment, Segment, … Segment]

• Scalable: one text processor instance per segment (“MPP Text”), can scale linearly

• High Availability: replicated

• Database management features
  • Backup/Restore
  • Online Expansion
  • Data Recovery
  • Performance Monitoring

• Full Text Search: flexible indexing and search (stemming, phonetic search, multi-lingual search, etc.)

• Join structured & text in one query

• Advanced Text Analytics Platform
  • Can be run ad-hoc
  • Supports multiple machine learning algorithms
  • Extensible

Page 73: Data Driven Action : A Primer on Data Science

Sentiment Analysis of Tweets

Page 74: Data Driven Action : A Primer on Data Science

Sentiment Analysis – Challenges

• Language on Twitter doesn’t adhere to rules of grammar, syntax or spelling

• We don’t have labeled data for our problem. The tweets aren’t tagged with sentiment

• Semi-supervised sentiment prediction can be achieved by dictionary look-ups of tokens in a tweet, but without context, sentiment prediction is futile!

“Cool”

Page 75: Data Driven Action : A Primer on Data Science

Extracting Context – Part-of-speech tagging

Part-of-speech tagging (POS tagging) helps us extract contexts associated with sentiment words (for example, from an adjectives dictionary you may have access to).

This simple approach was first described in a classic paper by Peter D. Turney in 2002 and it is as follows:

1. Apply POS tagging to sentences to tag each word with its part of speech.

2. Extract 2-token phrases that provide context (e.g. ADJECTIVE followed by NOUN, NOUN followed by ADJECTIVE, ADVERB followed by a VERB).

3. Use a reference corpus to count co-occurrences of your extracted phrases with a strongly positive word like “excellent” compared to a strongly negative word like “poor”, and use that to compute a “polarity score”.

4. Sentiment associated with your sentence can be the average polarity score of all phrases in your sentence.
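Steps 3 and 4 can be sketched in a few lines of Python. Turney's paper scores a phrase with a pointwise-mutual-information ratio of its co-occurrence with “excellent” versus “poor”; the version below follows that shape, but the function names, the smoothing constant, and all counts are assumptions chosen for illustration rather than values from the talk.

```python
import math


def polarity_score(near_pos, near_neg, total_pos, total_neg, smoothing=0.01):
    """PMI-style polarity of a phrase.

    near_pos / near_neg: co-occurrence counts of the phrase with the positive
    and negative reference words. total_pos / total_neg: corpus counts of the
    reference words themselves. The smoothing term avoids log(0) and division
    by zero when a phrase never co-occurs with one of the words.
    """
    numerator = (near_pos + smoothing) * total_neg
    denominator = (near_neg + smoothing) * total_pos
    return math.log2(numerator / denominator)


def sentence_sentiment(phrase_scores):
    """Average polarity of the phrases extracted from one sentence."""
    if not phrase_scores:
        return 0.0
    return sum(phrase_scores) / len(phrase_scores)


# Made-up counts from a hypothetical reference corpus.
TOTAL_EXCELLENT = 1000
TOTAL_POOR = 1000
scores = [
    polarity_score(80, 10, TOTAL_EXCELLENT, TOTAL_POOR),  # mostly near "excellent"
    polarity_score(5, 40, TOTAL_EXCELLENT, TOTAL_POOR),   # mostly near "poor"
]
print(scores, sentence_sentiment(scores))
```

A positive score means the phrase appears more often near the positive reference word than near the negative one, and a score of zero means no preference.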

Page 76: Data Driven Action : A Primer on Data Science

Sentiment Analysis – Approach

[Diagram labels: Part-of-speech tagger (1) (break up tweets into tokens and tag their parts of speech); Phrase Extraction; Semi-Supervised Sentiment Classification; Phrasal Polarity Scoring; Sentiment Scored Tweets (use learned phrasal polarities to score sentiment of new tweets)]

• Parallelized ArkTweetNLP to achieve fast parts-of-speech tagging on Tweets

• Custom algorithm to extract contextual cues & score sentiment of tweets

(1) Parts-of-speech Tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
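The phrase extraction and scoring stages might look roughly like the sketch below. It assumes the tagger has already produced (token, tag) pairs. The tag names ADJ, NOUN, ADV, VERB and CONJ, the helper names, and the learned polarities are hypothetical and are not the ArkTweetNLP tag set or the custom algorithm used in the talk.

```python
# Tag pairs that form a two-token context phrase
# (adjective + noun, noun + adjective, adverb + verb).
PHRASE_PATTERNS = {("ADJ", "NOUN"), ("NOUN", "ADJ"), ("ADV", "VERB")}


def extract_phrases(tagged_tokens):
    """Return adjacent token pairs whose tags match one of the patterns."""
    phrases = []
    for (word_1, tag_1), (word_2, tag_2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (tag_1, tag_2) in PHRASE_PATTERNS:
            phrases.append(f"{word_1.lower()} {word_2.lower()}")
    return phrases


def score_tweet(tagged_tokens, phrase_polarity):
    """Average the learned polarity of known phrases; None if no phrase is known."""
    known = [
        phrase_polarity[phrase]
        for phrase in extract_phrases(tagged_tokens)
        if phrase in phrase_polarity
    ]
    if not known:
        return None
    return sum(known) / len(known)


# Hypothetical tagger output and learned phrasal polarities.
tweet = [
    ("Great", "ADJ"), ("harvest", "NOUN"), ("but", "CONJ"),
    ("prices", "NOUN"), ("really", "ADV"), ("dropped", "VERB"),
]
learned = {"great harvest": 2.0, "really dropped": -1.0}

phrases = extract_phrases(tweet)
tweet_score = score_tweet(tweet, learned)
print(phrases, tweet_score)
```

Returning None for tweets with no known phrase keeps "no evidence" distinct from a neutral score of zero.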

Page 77: Data Driven Action : A Primer on Data Science

DS Pipeline: Topic and Sentiment Analysis of Tweets

Page 78: Data Driven Action : A Primer on Data Science

Topic and Sentiment Analysis of Tweets

[Diagram labels: Tweet Stream; 55 million tweets/day; Stored on Data Lake; Loaded as external tables (PXF/gpfdist); Parallel Parsing of JSON and extraction of fields using PXF; Topic Analysis through MADlib pLDA; Sentiment Analysis through custom PL/Python functions; Pivotal Cloud Foundry]

http://www.slideshare.net/SrivatsanRamanujam/a-pipeline-for-distributed-topic-and-sentiment-analysis-of-tweets-on-pivotal-greenplum-database

Page 79: Data Driven Action : A Primer on Data Science

Spring XD Components

Spring XD SNE

[user@smdw ~]$ xd-singlenode --hadoopDistro phd20

Spring XD Shell

[user@smdw ~]$ xd-shell --hadoopDistro phd20

Create Stream: HTTP Source, HDFS Sink

xd:> stream create --name gnipdecahose --definition "http --port=9009 | hdfs --directory=/user/decahose/ --partitionPath=dateFormat('yyyy/MM/dd')" --deploy

Destroy stream

xd:> stream destroy --name gnipdecahose

Page 80: Data Driven Action : A Primer on Data Science

Parallel Parsing of JSON using PXF

[Screenshot: Raw JSON]

Page 81: Data Driven Action : A Primer on Data Science

Parallel Parsing of JSON using PXF

http://blog.pivotal.io/pivotal/products/analyzing-raw-twitter-data-using-hawq-and-pxf

Natively parse JSON from external table

Page 82: Data Driven Action : A Primer on Data Science

PL/Java: In-database parallel POS-tagging

UDT and UDF Usage

Page 83: Data Driven Action : A Primer on Data Science

PL/Java: In-database parallel POS-tagging

Page 84: Data Driven Action : A Primer on Data Science

Demo: Topic and Sentiment Analysis of Tweets

Page 85: Data Driven Action : A Primer on Data Science

Topic and Sentiment Analysis Engine (Demo)

http://www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013

Page 86: Data Driven Action : A Primer on Data Science

A Scalable Framework For Real Time Monitoring & Prediction Of Sensor Data

Jarrod Vawdrey (@jjvawdrey)

Page 87: Data Driven Action : A Primer on Data Science

The Explosion Of Sensor Data

Organizations have started to apply sensors to all kinds of operational equipment in order to gain added visibility into their day to day activities

Examples

• A GE jet engine produces ~1 terabyte of data on a single cross country flight

• Connected cars can produce more than 25GB of data per hour

• 18x F1 cars (100+ sensors each) generated 243 terabytes of data from their vehicles at the 2014 U.S. Grand Prix

http://www.bloomberg.com/bw/articles/2012-12-06/ge-tries-to-make-its-machines-cool-and-connected

https://www.hds.com/assets/pdf/hitachi-point-of-view-internet-on-wheels-and-hitachi-ltd.pdf

http://www.forbes.com/sites/frankbi/2014/11/13/how-formula-one-teams-are-using-big-data-to-get-the-inside-edge/

Page 88: Data Driven Action : A Primer on Data Science

The Value Of Collecting Sensor Data

When used effectively, the data streaming off of sensors can be a huge source of value, providing insights that generate additional revenue and reduce operating costs.

Examples

UPS

• Through the use of telematics, UPS has been able to optimize delivery schedules and reduce gas usage by 25 gallons per driver per year. In the US this will reduce fuel consumption by 1.4 million gallons annually.

US Government

• Using data collected from sensors across data centers, the General Services Administration was able to reduce total data center power usage by 17% (~$30k) in its USDA facility.

Dundee Precious Metals

• Outfitting miners and machinery with internet enabled sensors helped DPM lower production costs from $60 a ton to $40.

http://www.automotive-fleet.com/article/story/2010/07/green-fleet-telematics-sensor-equipped-trucks-help-ups-control-costs/page/2.aspx

http://energy.gov/eere/femp/wireless-sensor-networks-data-centers

http://www.wsj.com/articles/mining-sensor-data-to-run-a-better-gold-mine-1424226463

Page 89: Data Driven Action : A Primer on Data Science

Server Room Test Use Case

Problem: A single Supermicro 1U server (2x Xeon 2.66 GHz processors, 64 GB RAM, 4x 2 TB HDD) heats up a 6 ft x 4 ft server room (closet) to above safe operating temperature (95°F) in under two hours … the Hadoop cluster in the test server room had 4 servers + 2 switches!

[Photo: Test Hadoop cluster]

Page 90: Data Driven Action : A Primer on Data Science

Server Room Test Use Case

Solution: Add an air conditioning unit to the server room and keep servers within the 41°F to 95°F safe operating temperature range.

New Problem: The 12 amp AC requires a separate power circuit and breaker. If the AC trips its breaker, there is no guarantee that the servers will also shut off.

Still concerned with overheating & now concerned with power consumption!

[Photo: Tripp Lite Directed AC]

Page 91: Data Driven Action : A Primer on Data Science

Server Room Test Use Case

Solution: Add sensors to server room in order to …

• Monitor temperature remotely

• Predict & alert on potential A/C and server failures which could cause overheating

• Optimize AC temperature setting

USB Temperature Sensor

http://www.amazon.com/gp/product/B002VA813U?psc=1&redirect=true&ref_=oh_aui_detailpage_o04_s00

Arduino LM35 Temperature Sensor

http://www.lightinthebox.com/digital-temperature-sensor-module-ds18b20-for-arduino-55-125_p903326.html

Page 92: Data Driven Action : A Primer on Data Science

Challenges Found With Adding Sensors

Challenges identified in the server room test environment are commonly found in business critical applications

Data Collection

• Sensor failure

• Sensor becomes detached from collection system

• Sensors may not be collocated

• Integrating external sources

Data Volumes

• Handling large data volumes

• Trade off between granularity of data and volume

• Data storage which allows rapid access for analysis and modeling

Measurement

• Handling missing values

• Building aggregate metrics after system failure

Machine Learning

• Feature engineering for real time scoring

• Model performance testing and retraining indicators
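As one illustration of the measurement challenges, the sketch below forward-fills short gaps in a sensor series and computes an aggregate that skips readings that are still missing. The gap limit, the helper names, and the readings are assumptions; the talk does not say which strategy was actually used.

```python
def forward_fill(readings, max_gap=3):
    """Replace None with the last seen value for at most max_gap consecutive gaps.

    Longer gaps stay None so that a failed or detached sensor is not hidden.
    """
    filled = []
    last_value = None
    gap = 0
    for value in readings:
        if value is None:
            gap += 1
            filled.append(last_value if gap <= max_gap else None)
        else:
            gap = 0
            last_value = value
            filled.append(value)
    return filled


def mean_ignoring_missing(readings):
    """Aggregate metric that skips readings that are still missing."""
    present = [value for value in readings if value is not None]
    return sum(present) / len(present) if present else None


# Made-up temperature readings in Fahrenheit; None marks a missed reading.
raw = [72.0, None, 73.0, None, None, None, None, 75.0]
clean = forward_fill(raw, max_gap=2)
average = mean_ignoring_missing(clean)
print(clean, average)
```

Capping the fill length is a design choice: short dropouts are smoothed over, while a long outage remains visible to monitoring.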

Page 93: Data Driven Action : A Primer on Data Science

Planning For Real Time Prediction: Two Primary Approaches

Batch Modeling & Real Time Scoring (Offline Learning)

• Batch Modeling: Model developed on a full training dataset and published to scoring mechanism

• Real Time Scoring: Model applied to new data as it becomes available to generate prediction

Online Learning

• Each data point in a sequence is used to update the model as it becomes available and produce a prediction

Hybrid approaches also exist
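The difference between the two approaches can be shown with a deliberately tiny model that predicts the mean reading. The class names and the readings are hypothetical; a real deployment would use a richer model, but the control flow is the same.

```python
class BatchMeanModel:
    """Offline learning: fit once on a full training set, then only score."""

    def fit(self, training_values):
        self.mean = sum(training_values) / len(training_values)
        return self

    def predict(self):
        return self.mean


class OnlineMeanModel:
    """Online learning: every new point updates the model before predicting."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count  # incremental mean
        return self

    def predict(self):
        return self.mean


history = [70.0, 71.0, 72.0]  # made-up training readings
stream = [80.0, 82.0, 84.0]   # made-up readings arriving later

batch = BatchMeanModel().fit(history)
online = OnlineMeanModel()
for value in history:
    online.update(value)

batch_predictions = []
online_predictions = []
for value in stream:
    batch_predictions.append(batch.predict())                  # model stays fixed
    online_predictions.append(online.update(value).predict())  # model adapts

print(batch_predictions)
print(online_predictions)
```

The batch model keeps predicting the value it learned during training until it is retrained, while the online model drifts toward the new readings as they arrive.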

Page 94: Data Driven Action : A Primer on Data Science

Tools Used: Spring XD

“Spring XD is a unified, distributed, and extensible system for data ingestion, real time analytics, batch processing, and data export.” http://projects.spring.io/spring-xd/

Key Concepts

• Streams: Define a data pipe from source to sink that may pass through multiple processors
  • Source: The data provider (e.g. HTTP, JDBC, RabbitMQ)
  • Processor: Processing tasks that operate on data being passed through a stream
  • Sink: Termination point for data in a stream (e.g. HDFS, JDBC, File)

• Jobs: Batch processors launched from Spring-XD

• Taps: ‘Listen’ to data being passed through a stream
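These concepts map naturally onto chained Python generators. The sketch below is only a conceptual analogy and is not Spring XD's API; the function names, the Celsius-to-Fahrenheit processor, and the in-memory lists standing in for a sink and a monitoring listener are all assumptions.

```python
def source(readings):
    """Source: yields raw messages (here, from an in-memory list)."""
    for reading in readings:
        yield reading


def processor(messages):
    """Processor: transforms each message passing through the stream."""
    for celsius in messages:
        yield celsius * 9 / 5 + 32  # convert Celsius to Fahrenheit


def tap(messages, listener):
    """Tap: lets a listener observe each message without altering the stream."""
    for message in messages:
        listener(message)
        yield message


def sink(messages, store):
    """Sink: termination point that writes every message to a store."""
    for message in messages:
        store.append(message)


stored = []     # stands in for a sink such as HDFS
monitored = []  # stands in for a real-time monitoring listener

# source | processor | sink, with a tap listening after the processor
sink(tap(processor(source([20.0, 25.0, 35.0])), monitored.append), stored)
print(stored, monitored)
```

The tap sees every message but the stream's output is unchanged, which is why monitoring can be added or removed without touching the main pipeline.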

Page 95: Data Driven Action : A Primer on Data Science

Tools Used: Other

• Python (programming language): scripts to interface with sensors and send readings to RabbitMQ; real time model scoring

• RabbitMQ (message broker): queue sensor readings; short term readings cache if connection to Spring-XD drops

• HDFS (Hadoop distributed file system): store sensor readings, other data and models

• HAWQ (SQL query engine for Hadoop): access data stored in Hadoop using SQL; provide access to R for modeling (via pl/r)

• R (statistical programming language/application): batch modeling

• MADlib (scalable SQL machine learning library): batch modeling

• Redis (in-memory data store): short term application data storage (e.g. counters, aggregates)

Page 96: Data Driven Action : A Primer on Data Science

Framework For Monitoring & Modeling (Offline)

[Diagram: Sensors (S1, S2, … SN) are read by Python Listeners (P1, P2, … PN), which feed the RabbitMQ message broker (Source). Spring-XD components: Real-time Monitoring (Tap), Batch Model Training (Job), Real-time Model Scoring (Processor), and HDFS/HAWQ (Sink).]

Page 97: Data Driven Action : A Primer on Data Science

Framework For Monitoring & Modeling (Online)

[Diagram: Sensors (S1, S2, … SN) are read by Python Listeners (P1, P2, … PN), which feed the RabbitMQ message broker (Source). Spring-XD components: Real-time Monitoring (Tap), Online Learning (Tap), and HDFS/HAWQ (Sink).]

Page 98: Data Driven Action : A Primer on Data Science

Server Room Test Use Case – Model Development

Server Data

• 4x Rack Sensors (USB)

• 1x Server Room Sensor (USB)

• 1x Outside Room Sensor (Arduino)

• 1x Outdoor Sensor (Arduino)

• A/C settings temperature

• Ganglia RRD data (Server logs)

External Data

• Weather Underground 10 Day Forecast

[Diagram labels: Data Cleanup & Integration; Feature Generation; Time series models (ARIMA, VARs); Event prediction models (Log Reg, SVM)]

Page 99: Data Driven Action : A Primer on Data Science

Server Room Test Use Case - Application

• Node.js application used to serve results remotely

• D3.js visualization of observed and predicted readings

• Python package ‘smtplib’ used to send email alerts when a failure or out of range temperature event is predicted
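An alert of this kind could be built and sent with the standard library roughly as follows. This is a sketch under stated assumptions: the addresses, the SMTP host and port, the subject line, and the helper names are placeholders, and the 41°F to 95°F range is the safe operating range quoted earlier in the talk. The send call is left commented out so the example runs without a mail server.

```python
import smtplib
from email.message import EmailMessage

SAFE_RANGE_F = (41.0, 95.0)  # safe operating range quoted earlier in the talk


def build_alert(predicted_temp_f, sender, recipient):
    """Return an EmailMessage if the prediction is out of range, else None."""
    low, high = SAFE_RANGE_F
    if low <= predicted_temp_f <= high:
        return None
    message = EmailMessage()
    message["Subject"] = "Server room temperature alert"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(
        f"Predicted temperature {predicted_temp_f:.1f}F is outside "
        f"the safe range {low:.0f}F to {high:.0f}F."
    )
    return message


def send_alert(message, host="localhost", port=25):
    """Send the message through an SMTP server (host and port are placeholders)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)


alert = build_alert(101.3, "monitor@example.com", "ops@example.com")
if alert is not None:
    print(alert["Subject"])
    # send_alert(alert)  # uncomment once a real SMTP host is configured
```

Separating message construction from sending keeps the threshold logic testable without network access.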

Page 100: Data Driven Action : A Primer on Data Science

Key Takeaways

• Organizations that have embraced data collection from sensors and are using this data to generate real time actionable insights are already generating huge amounts of value

• Many challenges exist when working with streaming data; these can be addressed using a framework built around Spring-XD

• Flexibility to plug and play the best tool for the job is crucial in implementing scalable real time systems

Page 101: Data Driven Action : A Primer on Data Science

Check out the Pivotal Data Science Blog!

http://blog.pivotal.io/data-science-pivotal

Page 102: Data Driven Action : A Primer on Data Science

FOR FURTHER INFO, CHECK OUT…

• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data

• Pivotal Blog @ http://blog.pivotal.io

• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal

• Pivotal Academy @ https://pivotal.biglms.com

• Or reach out to your local Pivotal Account Executive…

Page 103: Data Driven Action : A Primer on Data Science

Safe Harbor Statement

The following is intended to outline the general direction of Pivotal's offerings. It is intended for information purposes only and may not be incorporated into any contract. Any information regarding pre-release of Pivotal offerings, future updates or other planned modifications is subject to ongoing evaluation by Pivotal and is subject to change. This information is provided without warranty of any kind, express or implied, and is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions regarding Pivotal's offerings. These purchasing decisions should only be based on features currently available. The development, release, and timing of any features or functionality described for Pivotal's offerings in this presentation remain at the sole discretion of Pivotal. Pivotal has no obligation to update forward looking information in this presentation.