![Page 1: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/1.jpg)
1 © Copyright 2012 EMC Corporation. All rights reserved.
MAD SKILLS: NEW ANALYSIS PRACTICES FOR BIG DATA
![Page 2: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/2.jpg)
Big Data Has Arrived
THE DIGITAL UNIVERSE WILL GROW 44X IN THE NEXT 10 YEARS
Source: 2011 IDC Digital Universe Study
![Page 3: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/3.jpg)
Big Data: Hype or Reality?
• Do we have a Big Data problem in New Zealand?
• Do we have a Big Data problem in my organisation?
• Do I really need to care?
![Page 4: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/4.jpg)
Big Data: Hype or Reality?
• Do we have a Big Data problem in New Zealand? Maybe.
• Do we have a Big Data problem in my organisation? Maybe.
• Do I really need to care? ABSOLUTELY.
• Big Data Practices is about widely applicable
![Page 5: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/5.jpg)
Today
Shadow systems
‘Shallow’ Business Intelligence
Static schemas accrete over time
Slow-moving models
Slow-moving data
Departmental warehouses
Data Sources
Images
EDW
![Page 6: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/6.jpg)
A Common Analytics Environment
Data Warehouses, File Systems
SAS/ACCESS
![Page 7: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/7.jpg)
A Common Analytics Environment
Data Warehouse, File Systems
Cluster Computer
SAS/CONNECT SAS/ACCESS
![Page 8: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/8.jpg)
A Common Analytics Environment
Data Warehouse, File Systems
Cluster Computer
SAS/CONNECT SAS/ACCESS
Bag of Tricks
- DATA Step Bravado
- SQL Pushdown
- SASFILE
- SAS Views
- Compression
![Page 9: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/9.jpg)
A Leaner Configuration
Parallel Database
SAS/ACCESS
Greenplum ... ...
... ...
![Page 10: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/10.jpg)
An Integrated Architecture
Parallel Database
SAS/ACCESS SAS/CONNECT
Greenplum ... ...
... ...
![Page 11: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/11.jpg)
SAS High Performance Appliance on Greenplum
Parallel Greenplum Database
SAS/ACCESS SAS/CONNECT
Greenplum ... ...
... ...
![Page 12: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/12.jpg)
SAS-GP High-Performance Analytics
Worker Node 1
Worker Node 2
Worker Node N
Master
Analytical Computation and data request sent to the worker nodes
![Page 13: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/13.jpg)
SAS-GP High-Performance Analytics
Worker Node 1
Worker Node 2
Worker Node N
Master
Data request sent to the database, data slice moved into memory
![Page 14: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/14.jpg)
SAS-GP High-Performance Analytics
Worker Node 1
Worker Node 2
Worker Node N
Master
Analytic Processing with internode communication
![Page 15: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/15.jpg)
SAS-GP High-Performance Analytics
Worker Node 1
Worker Node 2
Worker Node N
Master
Worker node results returned to the Master Node, finalize computation
![Page 16: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/16.jpg)
SAS-GP High-Performance Analytics
Worker Node 1
Worker Node 2
Worker Node N
Root Node
Result returned to the client
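The request flow on the SAS-GP slides (request sent to the workers, each worker loads its data slice, partial results return to the master, the master finalizes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads stand in for worker nodes, the in-memory lists stand in for database slices, the statistic (mean and variance) is chosen for simplicity, and the internode communication step is omitted. The names `worker_partial` and `master_mean_variance` are hypothetical and are not part of any SAS or Greenplum API.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_partial(data_slice):
    """Worker step: scan the local slice and return partial statistics."""
    n = len(data_slice)
    total = sum(data_slice)
    total_sq = sum(v * v for v in data_slice)
    return n, total, total_sq


def master_mean_variance(slices):
    """Master step: dispatch to every worker, gather partials, finalize."""
    # Threads are a stand-in for worker nodes (assumption for illustration).
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(worker_partial, slices))
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance


# Hypothetical data slices, one per worker node.
slices = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
mean, variance = master_mean_variance(slices)
print(mean, variance)
```

Only the three small partial values per worker travel to the master, which is the point of the architecture: the raw rows never leave the node that holds them.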
![Page 17: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/17.jpg)
How do we sort out this mess?
Shadow systems
‘Shallow’ Business Intelligence
Static schemas accrete over time
Slow-moving models
Slow-moving data
Departmental warehouses
Data Sources
Images
![Page 18: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/18.jpg)
Keep the Enterprise Data Warehouse
Enterprise Data Warehouse
• Single Source of Truth
• Heavy data governance and quality
• Operational reporting
• Financial consolidation
![Page 19: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/19.jpg)
Add an Analytics Data Cloud as a Complement
Commodity Hardware
Virtual Machines
Public Cloud
SAS/Greenplum/Hadoop
Enterprise Data Warehouse
• Single Source of Truth
• Heavy data governance and quality
• Operational reporting
• Financial consolidation
Analytics Data Cloud
• Source of all raw data (often 10X size of EDW)
• Self-service infrastructure to support multiple marts and sandboxes
• Ad hoc, business-led analytics solutions
![Page 20: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/20.jpg)
MAD Analytics
![Page 21: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/21.jpg)
Magnetic
ADC PLATFORM
Data
Analytics
Chorus
Agile
Data Fast ETL/ELT
Simple linear models Trend analysis
Model selection
Model design
![Page 22: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/22.jpg)
Agile
get data into the cloud
analyze and model in the cloud
push results back into the cloud
![Page 23: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/23.jpg)
Deep
Past, Facts: What happened where and when?
Past, Interpretation: How and why did it happen?
Future, Facts: What will happen?
Future, Interpretation: How can we do better?
![Page 24: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/24.jpg)
Different Phases of Analytics
Data Exploration, Data Prep, Modeling, Model Fit, Scoring
DATA EXPLORATION
• Frequency
• Histogram
• Bar Chart
• Box Plot Chart
• Correlation Matrix
TRANSFORMATION
• Aggregation
• Row Filtering
• Deriving New Variables
• Pivoting
• Normalizing
PREDICTIVE MODEL
• Linear Regression
• Logistic Regression
• Naïve Bayes Classifier
• Decision Trees
• Neural Networks
DATA MINING
• Association Rule
• K-means Clustering
MODEL FIT STATISTICS
• Goodness of Fit
• ROC
• Significance statistics for all independent variables
SCORING
• Linear Regression
• Logistic Regression
• Naïve Bayes Classifier
• Decision Trees
• Neural Networks
![Page 25: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/25.jpg)
In-Database Machine Learning
• Goal: Build models using all available data
• Principle: Avoid using samples if possible.
• Principle: Bring computation to data, not the other way round.
• In practice: Write machine learning algorithms in (parallelised) data languages like SQL, SAS, and MapReduce.
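As a small illustration of bringing the computation to the data, the sketch below fits a one-variable linear regression using nothing but SQL aggregates, so only five numbers leave the database rather than every row. It assumes an in-memory SQLite database as a stand-in for a parallel database, and a hypothetical table `obs(x, y)` filled with synthetic data; it is not the implementation used by any product named in this deck.

```python
import sqlite3

# In-memory SQLite is a stand-in for the parallel database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
# Synthetic rows following y = 3 + 2x, for illustration only.
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(float(i), 3.0 + 2.0 * i) for i in range(100)],
)

# A single pass of SQL aggregates yields the sufficient statistics
# for ordinary least squares with one predictor.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM obs"
).fetchone()

# Closed-form least-squares estimates computed from the aggregates.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(intercept, slope)
conn.close()
```

Because `SUM` and `COUNT` are computed independently on each data segment and then merged, the same query shape parallelises naturally, and it uses every row rather than a sample.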
![Page 26: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/26.jpg)
Design Pattern – Online Learning
• Process data one record at a time using an incrementally maintained model; adjust the model every time it makes a prediction error
• Examples: perceptron, online SVMs, Bayesian filters, etc.
• Such algorithms can be implemented using SAS DATA steps or SQL aggregate functions
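A minimal sketch of the online-learning pattern, using the perceptron named in the examples above. The class keeps the model as its state and exposes a `step` method that consumes one record at a time, which mirrors how a user-defined aggregate or a DATA step with retained variables would be structured. The class name, method names, and the tiny data stream are all hypothetical.

```python
class OnlinePerceptron:
    """Incrementally maintained linear classifier with labels in {-1, +1}."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.mistakes = 0

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else -1

    def step(self, x, y):
        """Consume one record; update the model only on a prediction error."""
        if self.predict(x) != y:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]
            self.b += y
            self.mistakes += 1


# Hypothetical, linearly separable stream of (features, label) records.
stream = [
    ((2.0, 1.0), 1),
    ((1.0, 3.0), -1),
    ((3.0, 0.5), 1),
    ((0.5, 2.0), -1),
]

model = OnlinePerceptron(n_features=2)
for _ in range(10):  # several passes over the stream
    for x, y in stream:
        model.step(x, y)

print(model.w, model.b, model.mistakes)
```

The model state is just a weight vector and a bias, so it fits comfortably in the transition state of an aggregate function no matter how many rows are scanned.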
![Page 27: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/27.jpg)
Design Pattern – Parallel Ensemble Learning
• Break a (large) dataset into i.i.d. subsets, one residing on each node, learn a model on each subset in parallel, and then combine the models appropriately
• Examples: random forests, ensembles of SVMs, etc.
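The pattern can be sketched with a deliberately simple base learner. In the example below each "node" fits a least-squares line on its own subset, and the ensemble prediction is the average of the per-node predictions. This is an illustration under assumptions, not a random forest or an SVM ensemble: the data are synthetic, threads stand in for nodes, and averaging is only one of several ways to combine models.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fit_line(points):
    """Least-squares fit of y = intercept + slope * x on one subset."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


def ensemble_predict(models, x):
    """Combine the per-node models by averaging their predictions."""
    return sum(a + b * x for a, b in models) / len(models)


# Synthetic data around y = 1 + 0.5x with small noise (illustration only).
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(0.0, 20.0)
    data.append((x, 1.0 + 0.5 * x + rng.gauss(0.0, 0.1)))

# Rows were generated independently, so a round-robin split gives
# subsets that are approximately i.i.d. samples of the whole.
n_nodes = 3
subsets = [data[i::n_nodes] for i in range(n_nodes)]

# Threads are a stand-in for database segments or cluster nodes.
with ThreadPoolExecutor(max_workers=n_nodes) as pool:
    models = list(pool.map(fit_line, subsets))

prediction = ensemble_predict(models, 10.0)
print(models, prediction)
```

No training data crosses node boundaries; only the fitted coefficients are gathered, which keeps the combine step cheap.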
![Page 28: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/28.jpg)
Design Pattern – MapReduce
• Repeatedly apply a Map function to transform (local) chunks of data and then use a Reduce function to consolidate the transformed results
• Examples: parallel LDA, k-Means, Naive Bayes, etc.
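As a hedged sketch of this pattern, here is one-dimensional k-means written as repeated map and reduce steps. The map function assigns each point in a local chunk to its nearest centroid and emits per-centroid partial sums; the reduce function merges those partials so new centroids can be computed. Chunk contents, starting centroids, and function names are hypothetical, and the loop runs sequentially where a real system would run the map step on every node at once.

```python
from functools import reduce


def map_chunk(chunk, centroids):
    """Map: per-centroid (sum, count) for the points in one local chunk."""
    partial = {}
    for p in chunk:
        k = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        s, c = partial.get(k, (0.0, 0))
        partial[k] = (s + p, c + 1)
    return partial


def reduce_partials(a, b):
    """Reduce: merge two dictionaries of per-centroid (sum, count)."""
    merged = dict(a)
    for k, (s, c) in b.items():
        s0, c0 = merged.get(k, (0.0, 0))
        merged[k] = (s0 + s, c0 + c)
    return merged


def kmeans(chunks, centroids, iterations=10):
    """Repeat map and reduce; keep a centroid unchanged if it gets no points."""
    for _ in range(iterations):
        partials = [map_chunk(chunk, centroids) for chunk in chunks]
        totals = reduce(reduce_partials, partials)
        centroids = [
            totals[k][0] / totals[k][1] if k in totals else centroids[k]
            for k in range(len(centroids))
        ]
    return centroids


# Hypothetical chunks, one per node, with two obvious clusters.
chunks = [[1.0, 1.2, 9.8], [0.8, 10.0, 10.2], [1.0, 10.0]]
centroids = kmeans(chunks, [0.0, 5.0])
print(centroids)
```

Each iteration moves only k pairs of numbers per chunk, regardless of how many points the chunk holds.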
![Page 29: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/29.jpg)
Design Pattern – Prediction Markets
![Page 30: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/30.jpg)
Japanese Telco: What People Are Talking About
![Page 31: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/31.jpg)
Traffic Network Modelling
![Page 32: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/32.jpg)
Massively Parallel Model Learning
• Solving tens of thousands of statistical modelling problems, one for each road in the city, in parallel:
    libname adc greenplm server=gplum db=traffic port=5432 user=keesiong …
    proc sql;
      select origin, dest,
             linregr(travel_time, array[peak_period(entry_time), …, origin_vol, dest_vol])
      from adc.route_travel_info
      group by origin, dest;
• A model: t(x) = 466 + 7.72 peakPeriod(x) + 22.5 workDay(x) + 0.378 originVol(x) + 0.691 destVol(x)
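The "one model per group" idea behind the query above can be mimicked in plain Python. The sketch below groups rows by `(origin, dest)` and fits a separate least-squares line for each route. It is a simplification under stated assumptions: a single predictor (`origin_vol`) is used instead of the full feature array, the rows are invented for illustration, and the groups are fitted in a sequential loop, whereas the database fits them in parallel.

```python
from collections import defaultdict


def fit_line(points):
    """Least-squares fit of y = intercept + slope * x for one group."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical rows: (origin, dest, origin_vol, travel_time).
rows = [
    ("A", "B", 10.0, 105.0),
    ("A", "B", 20.0, 110.0),
    ("A", "B", 30.0, 115.0),
    ("B", "C", 10.0, 220.0),
    ("B", "C", 20.0, 240.0),
    ("B", "C", 40.0, 280.0),
]

# Equivalent of GROUP BY origin, dest.
groups = defaultdict(list)
for origin, dest, vol, travel_time in rows:
    groups[(origin, dest)].append((vol, travel_time))

# One independent regression per route.
models = {route: fit_line(points) for route, points in groups.items()}
for route, (intercept, slope) in sorted(models.items()):
    print(route, intercept, slope)
```

Because each route's model depends only on that route's rows, the groups can be fitted independently, which is what lets tens of thousands of regressions run in parallel.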
![Page 33: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/33.jpg)
It’s MAD, but is it Mad?
Commodity Hardware
Virtual Machines
Public Cloud
SAS/Greenplum/Hadoop
Enterprise Data Warehouse
• Single Source of Truth
• 1 Logical Model
• Heavy data governance and quality
• Operational reporting
• Financial consolidation
Analytics Data Cloud
• Source of all the raw data (often 10X size of the EDW)
• Self-service infrastructure to support multiple marts and sandboxes
• Rapid analytic iteration, and business-led solutions
![Page 34: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/34.jpg)
Public Cloud Computing Services
![Page 35: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/35.jpg)
Democratisation of Data
![Page 36: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/36.jpg)
Democratisation of Data
![Page 37: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/37.jpg)
Democratisation of Data
![Page 38: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/38.jpg)
Helping Organizations Evolve From This…
Line Of Business User
Business
I.T. Department
Database Administrator
Business Intelligence Analyst
![Page 39: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/39.jpg)
To This…
Line Of Business User
Data Platform Administrator
Business Intelligence Analysts
Data Scientists
![Page 40: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/40.jpg)
Top 3 Steps You Should Take On Your Journey To Big Data Analytics
![Page 41: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/41.jpg)
3. Put all your data to work.
![Page 42: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/42.jpg)
2. Have a data strategy. Model less, iterate more.
![Page 43: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/43.jpg)
1. First invest in people, then technology.
![Page 44: Mad skills new analysis practices for big data](https://reader034.vdocuments.us/reader034/viewer/2022051818/54be65494a7959ce148b45fa/html5/thumbnails/44.jpg)
“A journey of a thousand miles begins with a single step”
- Lao-tzu, Chinese philosopher (531 BC)
First Step: Walk towards the SAS-Greenplum Technical Session.