![Page 1: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/1.jpg)
From Data to Decisions: New Strategies for Deploying Analytics Using Clouds
Robert Grossman, Open Data Group
July 29, 2009
![Page 2: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/2.jpg)
Cloud computing has changed analytic infrastructure and enabled new classes of analytic algorithms. It’s time to rethink your analytic strategy.
Analytic Strategy
Analytics Analytic Infrastructure
Overview
![Page 3: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/3.jpg)
Part 1
Quick Review of Clouds
![Page 4: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/4.jpg)
What is a Cloud?
Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center.
No standard definition.
Cloud architectures are not new.
What is new:
– Scale
– Ease of use
– Pricing model
![Page 5: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/5.jpg)
Scale is new.
![Page 6: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/6.jpg)
Elastic, Usage Based Pricing Is New
1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
Elastic, usage based pricing turns capex into opex. Clouds can manage surges in computing needs.
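The equivalence above is just machine-hour arithmetic. A minimal sketch, assuming a purely usage-based price with no minimums or volume discounts; the rate of 10 cents per machine-hour is a hypothetical figure chosen for illustration, not a quoted price:

```python
# Hypothetical price per machine-hour, in cents (an assumption for illustration).
RATE_CENTS_PER_MACHINE_HOUR = 10


def usage_cost_cents(machines: int, hours: int,
                     rate: int = RATE_CENTS_PER_MACHINE_HOUR) -> int:
    """Cost under pure usage-based pricing: pay only for machine-hours used."""
    return machines * hours * rate


# One machine for 120 hours versus 120 machines for one hour.
slow_and_narrow = usage_cost_cents(machines=1, hours=120)
fast_and_wide = usage_cost_cents(machines=120, hours=1)

print(slow_and_narrow, fast_and_wide)
```

Both configurations consume 120 machine-hours, so under this pricing assumption the only difference is how long you wait for the result.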
![Page 7: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/7.jpg)
Simplicity Offered By the Cloud is New
+ .. and you have a computer ready to work.
A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
![Page 8: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/8.jpg)
Two Types of Clouds
On-demand resources & services over a network at the scale of a data center
On-demand computing instances (IaaS)
– IaaS: Amazon EC2, S3, etc.; Eucalyptus
– Supports many Web 2.0 applications/users
On-demand cloud services for large data cloud applications (PaaS for large data clouds)
– GFS/MapReduce/Bigtable, Hadoop, Sector, …
– Manage and compute with large data (say 10+ TB)
![Page 9: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/9.jpg)
Cloud Architectures – How Do You Fill a Data Center?
[Diagram: two ways to fill a data center. On one side, apps run on a stack that provides on-demand computing capacity: Cloud Storage Services, Cloud Compute Services (MapReduce & Generalizations), Cloud Data Services (BigTable, etc.), and Quasi-relational Data Services. On the other side, apps run on on-demand computing instances.]
![Page 10: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/10.jpg)
Part 2
What is Analytic Infrastructure ...
… and why you should care.
![Page 11: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/11.jpg)
What is Analytics?
Short Definition: Using data to make decisions.
Longer Definition: Using data to take actions and make decisions using models that are empirically derived and statistically valid.
It is important to understand the difference between reporting and analytics.
![Page 12: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/12.jpg)
Risk Models
Direct Marketing Models
Online Models
![Page 13: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/13.jpg)
What is the Size of Your Data?
Small
– Fits into memory
Medium
– Too large for memory
– But fits into a database
– N.B. databases are designed for safe writing of rows
Large
– Too large for a database
– But can use specialized file system (column-wise)
– Or storage cloud (Google File System, Hadoop DFS)
![Page 14: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/14.jpg)
(Very Simplified) Architectural View
The Predictive Model Markup Language (PMML) is an XML language for statistical and data mining models (www.dmg.org).
With PMML, it is easy to move models between applications and platforms.
[Diagram: Data → Model Producer → PMML Model]
![Page 15: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/15.jpg)
(Simplified) Architectural View
PMML also supports XML elements to describe data preprocessing.
[Diagram: Data → Data Pre-processing (features) → Model Producer (algorithms to estimate models) → PMML Model]
![Page 16: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/16.jpg)
Three Important Interfaces
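[Diagram: In the Modeling Environment, data flows through Data Pre-processing into a Model Producer, which emits a PMML Model. In the Deployment Environment, data flows into a Model Consumer, which reads the PMML Model and produces scores; Post Processing turns the scores into actions. The three interfaces are marked 1, 2, and 3 on the diagram.]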
![Page 17: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/17.jpg)
Actually, This is Typically a Component in a Workflow
![Page 18: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/18.jpg)
With the proper analytic infrastructure, cloud computing can be used for data preprocessing, for scoring, for producing models, and as a platform for other services in the analytic infrastructure.
![Page 19: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/19.jpg)
Part 3
Cloud Programming Models for Working With Large Data
![Page 20: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/20.jpg)
Map-Reduce Example
Both input & output are (key, value) pairs
Input is a file with one document per record
User specifies map function
– key = document URL
– value = terms that document contains
map: (“doc cdickens”, “it was the best of times”) → (“it”, 1), (“was”, 1), (“the”, 1), (“best”, 1)
![Page 21: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/21.jpg)
Example (cont’d)
MapReduce library gathers together all pairs with the same key value (shuffle/sort phase)
The user-defined reduce function combines all the values associated with the same key
reduce:
– key = “it”, values = 1, 1 → (“it”, 2)
– key = “was”, values = 1, 1 → (“was”, 2)
– key = “best”, values = 1 → (“best”, 1)
– key = “worst”, values = 1 → (“worst”, 1)
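The map, shuffle/sort, and reduce phases of this word-count example can be sketched in a single process. This is only an illustration of the programming model, not how Hadoop or any other framework implements it. The function names are hypothetical, and the input text is assumed to be the full sentence "it was the best of times it was the worst of times", since the reduce output on the slide includes “worst”:

```python
from collections import defaultdict


def map_fn(doc_key, text):
    """Map: emit a (term, 1) pair for every term in the document."""
    for term in text.split():
        yield term, 1


def shuffle(pairs):
    """Shuffle/sort: gather together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(term, values):
    """Reduce: combine all the values associated with one key."""
    return term, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce over (document key, text) records."""
    mapped = (pair for key, text in records for pair in map_fn(key, text))
    grouped = shuffle(mapped)
    return dict(reduce_fn(term, values) for term, values in grouped.items())


# Assumed input: one record whose value is the full sentence.
records = [("doc cdickens", "it was the best of times it was the worst of times")]
counts = word_count(records)
print(counts["it"], counts["was"], counts["best"], counts["worst"])
```

In a real cluster the user writes only the map and reduce functions; the library handles the grouping step and distributes the work across machines.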
![Page 22: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/22.jpg)
Part 4
Using Clouds for Scoring (Model Consumers)
![Page 23: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/23.jpg)
What is a Statistical/Data Mining Model?
Infrastructure
– Inputs: data attributes, mining attributes
– Outputs, targets
– Transformations
– Segmented models, ensembles of models
Models that are part of a standard
– Trees, SVMs, neural networks, cluster models, etc.
– In this case, only need to specify parameters
Arbitrary models
– e.g. arbitrary code that takes inputs to outputs
![Page 24: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/24.jpg)
From an Architectural Viewpoint
In an operational environment in which models are being deployed, it may be useful to “Just say no to viewing models as arbitrary code.”
The deployment can be much shorter if a scoring engine reads a PMML file instead of integrating a new piece of code containing a model.
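To illustrate why reading a PMML file can shorten deployment, here is a minimal sketch of a scoring engine that handles one model type, a linear regression. The PMML document, its field names (`age`, `balance`, `risk`), and its coefficients are hypothetical, and the sketch assumes the PMML 4.4 namespace. A production scoring engine would also handle transformations, missing values, and the other model types in the standard:

```python
import xml.etree.ElementTree as ET

# Assumed PMML 4.4 namespace; other PMML versions use a different URI.
NS = {"p": "http://www.dmg.org/PMML-4_4"}

# Hypothetical model: field names and coefficients are made up for illustration.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="hypothetical risk score"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="balance" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="balance"/>
      <MiningField name="risk" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.25"/>
      <NumericPredictor name="balance" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_linear_model(pmml_text):
    """Read the intercept and coefficients from a PMML RegressionModel."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Score one record: intercept plus the weighted sum of its inputs."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
result = score(model, {"age": 40.0, "balance": 3.0})
print(result)
```

Replacing the model means replacing the PMML text; the `score` function and the code that calls it stay unchanged, which is the point of treating models as data rather than as arbitrary code.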
![Page 25: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/25.jpg)
Model Producers/Consumers in Clouds
Model Consumers take analytic models and use them to score data
– Very easy to deploy in a cloud
– Deploy a scoring engine in a cloud and then simply read PMML files
– Very easy to scale up with cloud surges
Model Producers take data and produce models
– Data parallel applications can be ported to clouds.
– Others require weighing several factors.
![Page 26: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/26.jpg)
[Diagram: Data → Data Pre-processing → Model Producer → PMML Model; data → Model Consumer (reading the PMML Model) → scores → Post Processing → actions]
Modeling can be done in-house.
Scoring engine deployed in a cloud.
Sometimes it makes sense to do the pre-processing in the cloud, especially if the data is there.
![Page 27: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/27.jpg)
Summary
| Innovation | Impact |
| --- | --- |
| PMML | With PMML, it is easy to move models and preprocessing between apps; supports life cycle management of models. |
| Scoring engines | Simplifies deployment of models; enables scoring in clouds. |
| Large data clouds | 1) Can preprocess data to build features on TB size datasets; 2) Can build analytic models on TB size datasets. |
![Page 28: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5](https://reader030.vdocuments.us/reader030/viewer/2022032514/55d55000bb61ebfe588b4574/html5/thumbnails/28.jpg)
For More Information
Contact information:
Robert Grossman
blog.rgrossman.com
www.rgrossman.com
www.opendatagroup.com