Copyright © 2013, SAS Institute Inc. All rights reserved.
CONTEMPORARY ANALYTICAL ECOSYSTEM
PATRICK HALL, SAS INSTITUTE
Agenda
• (Optional) History Lesson
• 2015 Buzzwords
• Machine Learning for X
• Citizen Data Scientist
• High Quality, Distributed, Open-Source Analytics
• Closing
Data growth
[Chart: World’s Data in Zettabytes, 1991 to 2016 (vertical axis 0.00 to 50.00)]
Data growth
(1 zettabyte = 1 billion terabytes)
In 2008
Typical server hard drive was 500 GB with a transfer rate of 98 MB/sec
An entire disk could be transferred in 85 minutes
In 2013
Typical server hard drive was 4 TB with a transfer rate of 150 MB/sec
An entire disk could be transferred in 440 minutes
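The transfer times on this slide follow from dividing capacity by transfer rate. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB) and a constant sequential rate; the function name is illustrative only:

```python
def transfer_minutes(capacity_gb: float, rate_mb_per_sec: float) -> float:
    """Minutes needed to read a full disk sequentially at a constant rate."""
    capacity_mb = capacity_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    return capacity_mb / rate_mb_per_sec / 60


minutes_2008 = transfer_minutes(500, 98)    # 500 GB drive at 98 MB/sec
minutes_2013 = transfer_minutes(4000, 150)  # 4 TB drive at 150 MB/sec

print(f"2008: {minutes_2008:.0f} minutes")
print(f"2013: {minutes_2013:.0f} minutes")
print(f"Capacity grew {4000 / 500:.0f}x, transfer rate grew {150 / 98:.2f}x")
```

Under these assumptions the 2013 figure comes out a little above the slide's rounded 440 minutes. The point is the ratio: capacity grew far faster than the rate at which a single disk can be read.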
[Chart: Average Price 1MB RAM, 2000 to 2010 (vertical axis $0.00 to $1.20)]
[Chart: CPU Speed in MHz, 1978 to 2008 (vertical axis 0 to 4000)]
• Disk capacities are getting bigger, but disks are not spinning faster
• Processors are not running much faster, but they have more cores
• RAM is becoming affordable
So …
• To handle all of this new data we distribute it on clusters of computers
• Most modern analytical architectures take advantage of in-memory, distributed processing
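One way to picture distributed, in-memory processing is a statistic computed from per-node partial results. The sketch below simulates this on a single machine with plain Python lists; the four "nodes", the round-robin split, and all function names are assumptions for illustration, not any particular platform's API:

```python
from functools import reduce


def partition(rows, n_parts):
    """Split rows round-robin into n_parts lists, standing in for cluster nodes."""
    return [rows[i::n_parts] for i in range(n_parts)]


def map_partial(part):
    """Run on each node: summarize the local partition as (sum, count)."""
    return (sum(part), len(part))


def combine(a, b):
    """Merge two partial summaries; associative, so merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1])


rows = [float(v) for v in range(1, 101)]    # toy data: 1.0 .. 100.0
parts = partition(rows, n_parts=4)          # four hypothetical worker nodes
partials = [map_partial(p) for p in parts]  # on a real cluster these run in parallel
total, count = reduce(combine, partials)
distributed_mean = total / count

print(distributed_mean)
```

Only the small (sum, count) pairs travel between nodes, while the rows stay where they are stored.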
Yesterday’s state-of-the-art:
• Multicore CPU
• GPU
• Solid state drive (SSD)
[Diagram labels: Analyst, Workstation, Data, MPI Based, Software, Client, Data, Software Server, Analyst]
Today’s state-of-the-art:
• Distributed storage platform
  Hadoop Distributed File System (HDFS)
  Massively parallel (MPP) databases
• Distributed analytics platform
  Hadoop MapReduce, disk-enabled
  SAS® High-Performance Analytics or SAS® LASR Analytic Server, in-memory
  Spark MLlib, H2O.ai, in-memory
[Diagram labels: MPI Based, Data Scientist, Distributed Data and Software on Multiple Servers, Software, Client]
Agenda
• (Optional) History Lesson
• 2015 Buzzwords
• Machine Learning for X
• Citizen Data Scientist
• High Quality, Distributed, Open-Source Analytics
• Closing
Buzzwords
• “IoT”
• “Cloud”
[Hype cycle diagram: Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Plateau of Productivity, with a “You are here.” marker]
Buzzwords: Internet of Things
• Sensor Data?
• Streaming Analytics?
• Privacy?
Buzzwords: Cloud Machine Learning Platforms
[Diagram: Data In, Machine Learning, Insights Out; examples shown: Recommendation, Data Visualization, Image Recognition]
Security ??
Buzzwords
• “Big Data”
[Hype cycle diagram: Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Plateau of Productivity, with a “You are here.” marker]
• “Hadoop Corporate Adoption Remains Low”
• Death of RDBMS exaggerated
• Big data adoption will require time
Agenda
• (Optional) History Lesson
• 2015 Buzzwords
• Machine Learning for X
• Citizen Data Scientist
• High Quality, Distributed, Open-Source Analytics
• Closing
Databases
Statistics
KDD
AI
Computational Neuroscience
Data Mining
Data Science
Machine Learning
Pattern Recognition
Data Mining
Machine Learning
TRANSDUCTION
REINFORCEMENT LEARNING
EVOLUTIONARY LEARNING
SUPERVISED LEARNING (Know y)
• Regression, LASSO regression, Logistic regression, Ridge regression
• Decision tree, Gradient boosting, Random forests
• Neural networks, SVM, Naïve Bayes, Neighbors, Gaussian processes
UNSUPERVISED LEARNING (Don’t know y)
• A priori rules, Clustering
• k-means clustering, Mean shift clustering, Spectral clustering
• Kernel density estimation, Nonnegative matrix factorization, PCA
• Kernel PCA, Sparse PCA
• Singular value decomposition, SOM
SEMI-SUPERVISED LEARNING (Sometimes know y)
• Prediction and classification*, Clustering*, EM, TSVM, Manifold regularization, Autoencoders
• Multilayer perceptron, Restricted Boltzmann machines
*In semi-supervised learning, supervised prediction and classification algorithms are often combined with clustering.
Machine Learning for X
Let X = {Healthcare, Asset Protection, Manufacturing, Energy, Government, Security, Text Mining, …}
• There is a desire to apply advances in machine learning more broadly.
Let X = Healthcare
• Ensemble models for epidemiology
• Predicting hospital readmission
• Looking forward: Electronic Medical Records (EMR)
Let X = Asset Protection
$A = Q \Lambda Q^{-1}$
$\Lambda_{ii} = \lambda_i$
$\lambda_2 > t$
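The slide's notation is an eigendecomposition: A is factored into a matrix Q of eigenvectors and a diagonal matrix Λ of eigenvalues, and the second eigenvalue is compared with a threshold t. The sketch below works this through for a small symmetric 2x2 matrix in plain Python. The example matrix, the descending eigenvalue order, and the threshold value are all assumptions, since the slide does not say how A or t are chosen:

```python
from math import hypot, sqrt


def eig_sym_2x2(a, b, d):
    """Eigendecomposition of the symmetric matrix [[a, b], [b, d]], with b != 0.

    Returns eigenvalues in descending order and unit eigenvectors as columns of Q.
    """
    mean = (a + d) / 2
    radius = sqrt(((a - d) / 2) ** 2 + b ** 2)
    lams = [mean + radius, mean - radius]
    vecs = []
    for lam in lams:
        vx, vy = lam - d, b  # solves (A - lam*I) v = 0 when b != 0
        norm = hypot(vx, vy)
        vecs.append((vx / norm, vy / norm))
    q = [[vecs[0][0], vecs[1][0]],
         [vecs[0][1], vecs[1][1]]]
    return lams, q


def reconstruct(lams, q):
    """Compute Q * Lambda * Q^T; for symmetric A, Q is orthogonal so Q^-1 = Q^T."""
    return [[sum(q[i][k] * lams[k] * q[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]


lams, q = eig_sym_2x2(2.0, 1.0, 2.0)  # hypothetical A = [[2, 1], [1, 2]]
a_rebuilt = reconstruct(lams, q)      # should match A up to rounding
t = 0.5                               # hypothetical threshold
second_exceeds_t = lams[1] > t

print(lams, a_rebuilt, second_exceeds_t)
```

For larger matrices a numerical library routine would replace the closed-form 2x2 solution used here.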
Let X = Manufacturing
• Quality Control
• Deep Learning
What is deep learning?
Using a neural network with more than two hidden layers for a supervised or unsupervised learning task
Why is deep important?
Multiple Scales:
Unsupervised training
Target vector   Input vectors
y               x1     x2     x3
1               2.54   1.65   0.02
0               1.14   0.70   0.82
1               0.99   0.51   2.11
⁞               ⁞      ⁞      ⁞
Supervised Neural Network: inputs x1 x2 x3, hidden units h11 h12, output y
Unsupervised Neural Network (known as an autoencoder): inputs x1 x2 x3, hidden units h11 h12, outputs x1 x2 x3
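The two networks differ only in what they are trained to output: the supervised network predicts y from the inputs, while the autoencoder ignores y and reproduces its own inputs. A small sketch of the training pairs each one would see, using the three rows from the slide's table (the function names are illustrative only):

```python
# Rows copied from the slide's table: (y, x1, x2, x3)
rows = [
    (1, 2.54, 1.65, 0.02),
    (0, 1.14, 0.70, 0.82),
    (1, 0.99, 0.51, 2.11),
]


def supervised_pairs(rows):
    """(inputs, target) pairs: the network learns to predict y from x."""
    return [(list(r[1:]), r[0]) for r in rows]


def autoencoder_pairs(rows):
    """(inputs, inputs) pairs: y is ignored and the network reconstructs x."""
    return [(list(r[1:]), list(r[1:])) for r in rows]


sup = supervised_pairs(rows)
auto = autoencoder_pairs(rows)

for (x, y), (x_in, x_out) in zip(sup, auto):
    print(f"supervised: {x} -> {y}    autoencoder: {x_in} -> {x_out}")
```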
Unsupervised training and stacked layers
Many separate, unsupervised, single hidden-layer networks are used to initialize
a larger supervised network in a layerwise fashion
[Diagram: deep network with inputs x1 to x5, hidden layers h11 to h14, h21 to h23, and h31 to h32, and output y, shown beside the single hidden-layer networks used to pre-train it: x1 to x5 → h11 to h14 → x1 to x5; h11 to h14 → h21 to h23 → h11 to h14; h21 to h23 → h31 to h32 → h21 to h23]
The weights from layerwise pre-training can be used as an initialization for training the entire deep network!
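The layerwise scheme can be read as a recipe: each hidden layer of the deep network gets its own single hidden-layer autoencoder, whose inputs and outputs are the units of the layer below. A minimal sketch that lists those autoencoder shapes for the 5-4-3-2 network in the diagram (the function name is illustrative, and no training is performed here):

```python
def pretraining_plan(layer_sizes):
    """Return (inputs, hidden, outputs) for each single hidden-layer autoencoder.

    layer_sizes lists the input width followed by each hidden-layer width.
    """
    return [(n_in, n_hidden, n_in)
            for n_in, n_hidden in zip(layer_sizes, layer_sizes[1:])]


# Widths from the slide: x1..x5, h11..h14, h21..h23, h31..h32
plan = pretraining_plan([5, 4, 3, 2])

for step, (n_in, n_hidden, n_out) in enumerate(plan, start=1):
    print(f"autoencoder {step}: {n_in} inputs -> {n_hidden} hidden -> {n_out} outputs")
```

Each autoencoder's input-to-hidden weights would then seed the matching layer of the deep network before supervised training of the whole stack.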
Let X = Government
Shadow
Let X = Security
[Diagram labels: Target Layer, Input Layer, Extractable Features]
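/* Average-linkage hierarchical clustering; CCC and PSEUDO request cluster-count statistics */
PROC CLUSTER data=face.sas_gezichten4 method=average
     plots=all outtree=face.gezichtentree ccc pseudo;
   id name;   /* label each observation by its name */
RUN;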
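For readers unfamiliar with average linkage, the idea is to repeatedly merge the two clusters whose members are closest on average. The following is a rough plain-Python sketch of that idea on made-up 2-D points. It is not how PROC CLUSTER is implemented, and it uses ordinary Euclidean distance, which may differ from the distance the procedure uses by default:

```python
from itertools import combinations
from math import dist


def average_linkage(points, n_clusters):
    """Merge the two clusters with the smallest mean pairwise distance
    until n_clusters remain. Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: avg_dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters


# Hypothetical toy data: two well-separated groups
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
clusters = average_linkage(points, n_clusters=2)
print(clusters)
```

This brute-force version recomputes every pairwise average on each pass, which is fine for a handful of points but not for real data sets.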
Let X = Energy Production
[Diagram: Input Layer (Partially Corrupted Input Features), five layers of Hidden Neurons (h1 to h5), Target Layer (Uncorrupted Output Features); label: Extractable Features]
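A network that receives partially corrupted inputs and is trained to output the uncorrupted features is commonly called a denoising autoencoder. The sketch below shows only how such training pairs might be built, assuming masking noise (randomly zeroing a fraction of the inputs). The noise type, the 30% rate, and the feature values are assumptions, not details from the slide:

```python
import random


def corrupt(features, fraction, rng):
    """Return a copy of features with roughly `fraction` of entries set to 0.0."""
    return [0.0 if rng.random() < fraction else x for x in features]


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
clean = [0.8, 0.1, 0.5, 0.9, 0.3, 0.7]  # hypothetical uncorrupted features

# Each pair is (corrupted input, uncorrupted target)
training_pairs = [(corrupt(clean, 0.3, rng), clean) for _ in range(3)]

for noisy, target in training_pairs:
    print(noisy, "->", target)
```

Because the target is always the clean vector, the hidden layers have to learn features that survive the corruption.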
[Diagram labels: hidden units h51 and h52; Target 1, Target 2, Target 3, Target 4; weights W51, W52, W53, W54]
h51 Edge Weights Handwritten Eight
Agenda
• (Optional) History Lesson
• 2015 Buzzwords
• Machine Learning for X
• Citizen Data Scientist
• High Quality, Distributed, Open-Source Analytics
• Closing
Citizen Data Scientist
People using data …
• for the good of humanity!
• because their boss said so. (grumble, grumble)
Citizen Data Scientist
• DataKind
• Watson Analytics
• BeyondCore
Agenda
• (Optional) History Lesson
• 2015 Buzzwords
• Machine Learning for X
• Citizen Data Scientist
• High Quality, Distributed, Open-Source Analytics
• Closing
High Quality, Distributed, Open-Source Analytics
High Quality, Distributed, Open-Source Analytics
• Excellent Core, ETL, and basic analytics
• RDD Dataframe
• MLlib is immature: Production analytics?
High Quality, Distributed, Open-Source Analytics
• Innovative data compression routines
• Full-featured analytics
Agenda
• (Optional) History Lesson
• 2015 Buzzwords
• Machine Learning for X
• Citizen Data Scientist
• High Quality, Distributed, Open-Source Analytics
• Closing
Don’t be afraid
… yet
The ‘Analytics’ folks: “I just built 850 new models. When can you put them into production?”
The ‘IT’ folks: “!!!???!!!”
Where you can find me …
DC Data Community Meetup - July 24th
“Playing Nice: Using PMML, Python, R, and SAS for Production Analytics”
SAS Data Mining Community: https://communities.sas.com/community/support-communities/sas_data_mining_and_text_mining
Quora: www.quora.com
GitHub: github.com/jphall663, github.com/sassoftware
Twitter: @jpatrickhall