Machine Learning Overview (with SAS Software)
TRANSCRIPT
MACHINE LEARNING WITH SAS WORKSHOP: GETTING THE MOST OUT OF YOUR DATA
Longhow Lam
AGENDA AND SOME READING MATERIAL
Intro & positioning of machine learning
SAS platform for machine learning
Overview of specific methods
Some examples
Further reading
An experimental comparison of classification techniques for imbalanced credit scoring data sets using SAS® Enterprise Miner: http://support.sas.com/resources/papers/proceedings12/129-2012.pdf
Benchmarking state-of-the-art classification algorithms for credit scoring: a ten-year update: http://www.business-school.ed.ac.uk/waf/crc_archive/2013/42.pdf
An absolute recommendation for more detail: The Elements of Statistical Learning, Hastie, Tibshirani & Friedman, http://www-stat.stanford.edu/~tibs/ElemStatLearn
LONGHOW LAM SHORT BIO
MSc Mathematics (1995), Vrije Universiteit Amsterdam (drs. wiskunde); MTD Applied Statistics (1997), Technical University Delft (two-year AIO programme in applied statistics)
10+ years SAS experience (Base, Stat, Guide, Miner, VA, VS); 10+ years R experience (An Introduction to R)
10+ years predictive modeling experience:
ABN AMRO – risk modeler: Basel credit risk, ALM models
Business & Decision – quantitative consultant: ING Belgium, Fortis, Leaseplan, Belgium Post
Experian – data miner: Collection Score, Delphi credit score, consulting
Follow me: @longhowlam
INTRO MACHINE LEARNING
Wikipedia: "Machine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data. Such algorithms operate by building a model based on inputs and using that to make predictions or decisions, rather than following only explicitly programmed instructions."
MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR
Statistical modeling
Supervised learning
Clustering
Unsupervised learning
Data mining
Machine learning
Dimension reduction
Association rules
Recommender
Autoencoders
Self-organizing maps
SAS SOFTWARE FOR MACHINE LEARNING (AND DATA MINING)
THE ANALYTICS LIFECYCLE
Identify / formulate problem → data preparation → data exploration → transform & select → build model → validate model → deploy model → evaluate / monitor results
BUSINESS ANALYST: SAS Enterprise Guide, SAS Visual Analytics / SAS Visual Statistics
DATA MINER / DATA SCIENTIST: Enterprise Miner, Text Miner, SAS IMSTAT, Recommender
BUSINESS MANAGER: SAS In-Database Scoring, SAS Decision Manager
IT SYSTEMS MANAGEMENT: SAS Model Manager
EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES
PROC HPBNET data=creditdata structure=markovblanket;
   model default = x1 LTV income age;
   selection = Y;
RUN;
HIGH-PERFORMANCE MACHINE LEARNING
Machine learning algorithms designed to run on single-blade or multi-blade distributed-memory environments.
EASILY DEPLOYABLE: MACHINE LEARNING WITH SAS
Manage rules + data + models
Deployment flexibility: batch, real time, stored process, in-database
Drive reuse and consistency
PREDICT SOMEONE'S INCOME
Predict someone's income from his/her age: collect some data (an analytical base table), plot the data (income versus age), and fit a line through the points:
Income = 152 + 1102 × Age
IS THIS MACHINE LEARNING?
MACHINE LEARNING ADDRESSING SOME MODELING ISSUES
The problem may not be linear: X², X³, log(X), sqrt(X), 1/X, ……
You do not have one input variable: X1, X2, X3, ……, X567
Interactions and correlations between input variables
[Figure: analytical base table (age, income, gender) with derived inputs such as male/female dummies]
MACHINE LEARNING: WHY IT CAN MATTER € € €
Suppose we have an untargeted direct mailing of 100,000 'letters' to randomly sampled prospects:
Conversion rate is around 1%; profit per conversion €80; cost per mailing €0.70.
Total ROI = 100,000 × 1% × €80 − 100,000 × €0.70 = €10,000
Now we have a targeted mailing with a machine learning predictive model that uses prospect input data and can distinguish between high / low responders.
MACHINE LEARNING: WHY IT CAN MATTER € € €
Decile      N     Conversion   Profit   Cumulative
1         10000     2.00%        9000        9000
2         10000     1.50%        5000       14000
3         10000     1.00%        1000       15000
4         10000     1.00%        1000       16000
5         10000     1.00%        1000       17000
6         10000     1.00%        1000       18000
7         10000     1.00%        1000       19000
8         10000     0.80%        -600       18400
9         10000     0.50%       -3000       15400
10        10000     0.20%       -5400       10000
The profit by using a model to send letters only to the first 7 deciles is now €19,000 (instead of €10,000). If you have 100 of such campaigns a year, that means an increase of €0.9 mln.
MACHINE LEARNING: WHY IT CAN MATTER € € €
Decile      N     Conversion   Profit   Cumulative
1         10000     3.00%       17000       17000
2         10000     2.00%        9000       26000
3         10000     1.40%        4200       30200
4         10000     1.15%        2200       32400
5         10000     1.00%        1000       33400
6         10000     0.60%       -2200       31200
7         10000     0.40%       -3800       27400
8         10000     0.30%       -4600       22800
9         10000     0.10%       -6200       16600
10        10000     0.05%       -6600       10000
The profit by using a much better model to send letters only to the first 5 deciles is now €33,400 (instead of €10,000). If you have 100 of such campaigns a year, that means an increase of €2.34 mln.
MACHINE LEARNING: WHY IT CAN MATTER € € €
Decile      N     Conversion   Profit   Cumulative
1         10000     3.35%       19800       19800
2         10000     2.23%       10840       30640
3         10000     1.30%        3400       34040
4         10000     1.10%        1800       35840
5         10000     1.00%        1000       36840
6         10000     0.55%       -2600       34240
7         10000     0.28%       -4760       29480
8         10000     0.25%       -5000       24480
9         10000     0.05%       -6600       17880
10        10000     0.02%       -6840       11040
Now let's suppose we have an even slightly better model than the last one. The profit from sending letters only to the first 5 deciles is now €36,840. If you have 100 of such campaigns a year, that means an increase of €2.68 mln.
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
Classical regression
Decision trees
Dimension reduction
Bagging & boosting
Support vector machines
K-nearest neighbour
Neural networks / deep learning
Bayesian networks
Text mining
Recommendation engine
"CLASSICAL" REGRESSION
LINEAR & LOGISTIC REGRESSION
Numeric target variable: Income = a + b × Age
[Figure: straight-line fit of income versus age]
Binary target variable: P(Churn) = exp(a + b × Age) / (1 + exp(a + b × Age))
[Figure: logistic curve of P(Churn) versus age, between 0 and 1]
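A minimal sketch of both regressions in SAS (the dataset CUSTOMERS and its variables AGE, INCOME and the 0/1 flag CHURN are hypothetical, not from the slides):

proc reg data=customers;
   /* numeric target: straight-line fit Income = a + b x Age */
   model income = age;
run;
quit;

proc logistic data=customers;
   /* binary target: P(churn) modeled via the logit link */
   model churn(event='1') = age;
run;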
SPLINE REGRESSION: MODELING NON-LINEARITIES
Often there is a non-linear relation:
• Transformation of inputs: X², X³, log(X), etc. …
• Buckets / binning of variables
[Figure: Y or logit(Y) versus X]
Smoothing splines
SPLINE REGRESSION: MODELING NON-LINEARITIES
Smoothing splines: piecewise polynomials that are glued together at knots, fitted by minimizing Σᵢ (yᵢ − f(xᵢ))² + λ ∫ f''(t)² dt.
Two special cases for λ:
λ = 0: any function that interpolates the data
λ = ∞: simple least-squares line fit
Choose λ by cross-validation.
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION
Extracted data from a car sales site: for many cars we have the kilometres driven and the car price. For the Opel Astra we have 2,360 cars. What is the relation between km driven and car sales price?
[Figure: too much smoothing and too little smoothing]
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION
0.2 is the optimal smoothing parameter
Some other car makes/models with spline estimates of car depreciation versus kilometres driven.
Hmmm, my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left…
SPLINE REGRESSION: MODELING NON-LINEARITIES
In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines.
ADAPTIVEREG supports more than one input; linear, logistic, Poisson and GLM regressions; combines regression splines with model selection methods; and supports partitioning of data into training, validation and testing roles.
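As a hedged sketch of such a spline fit (the dataset ASTRA with variables PRICE and KM is hypothetical, mirroring the Opel Astra example above):

proc adaptivereg data=astra;
   /* multivariate adaptive regression splines: knots and basis
      functions are selected automatically */
   model price = km;
run;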
DECISION TREES
DECISION TREES
How does it work? A simple example: suppose we have the following group of people: 50% response, 50% no response.
We have/know age and marital status.
50/50
  Age ≤ 45: 30/70
    Married/Divorced: 20/80
    Unmarried: 60/40
  Age > 45: 60/40
DECISION TREES: REGRESSION & CLASSIFICATION
Target   X1   X2   X3    X4   X5
Y        12   A    456   12   X
N        21   B    456   15   X
Y        32   A    545   13   U
Y        34   C    443   11   U
N        23   A    345   17   U
N        13   B    567   12   X
N        45   A    654   19   X
…        …    …    …     …    …
Y        46   A    657   21   X
A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split
4. On the two new data sets, apply 1, 2, 3 again…
5. Stop somewhere…
• How to split? X1 or X2?
• When to stop?
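A minimal sketch of such a recursive splitter in SAS (PROC HPSPLIT from SAS/STAT; the dataset MAIL and its variables are hypothetical):

proc hpsplit data=mail maxdepth=4;
   class response marital;          /* categorical target and input       */
   model response = age marital;    /* candidate inputs to split on       */
   grow entropy;                    /* splitting criterion (steps 2 and 3) */
   prune costcomplexity;            /* 'stop somewhere': prune afterwards */
run;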
DECISION TREES
How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.
Why is the split X1 < t1 better than X1 < s1? Regression: mean squared error. Classification: misclassification rate, cross-entropy, chi-squared.
[Figure: regression tree, mean square error of split s1 versus split t1 in a Y-versus-x scatter]
REGRESSION & CLASSIFICATION
DECISION TREES
(Same splitting criteria as on the previous slide.)
[Figure: classification tree, misclassification rate of split s1 versus split t1]
REGRESSION & CLASSIFICATION
DECISION TREES (REGRESSION & CLASSIFICATION)
When to stop? Not too early, not too late.
Pruning: remove parts of the tree.
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C4.5, C5.0
CART (Classification and Regression Trees)
The difference is mainly in the different splitting options
DECISION TREES: PROS AND CONS
Pros: interaction between variables; interpretable rules; missing values easy to incorporate.
Cons: unstable; "lack of smoothness"; fit of obvious (non)linear relations.
[Figure: example tree with splits male/female, income < 45K, age < 33 and response rates; spline fit of the Opel Astras]
DIMENSION REDUCTION
PRINCIPAL COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data.
The transformation W is such that: the largest variance is in the first coordinate, the second largest variance is in the second coordinate, etc. …
PRINCIPAL COMPONENTS ANALYSIS
[Figure: scatter of data points in the (X1, X2) plane with the principal directions P1 and P2]
PRINCIPAL COMPONENTS ANALYSIS
[Figure: the same data plotted in the transformed (P1, P2) coordinates]
PRINCIPAL COMPONENTS ANALYSIS
The math behind: P = XW. With two dimensions:

[ p11  p21 ]   [ x11  x21 ]
[  ⋮    ⋮  ] = [  ⋮    ⋮  ] [ w11  w21 ]
[ p1n  p2n ]   [ x1n  x2n ] [ w12  w22 ]

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.
In general, it turns out that the columns of W are the eigenvectors of the matrix XᵀX.
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here.
Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs
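A minimal PCA sketch in SAS (the dataset ABT with inputs X1-X100 is hypothetical); PROC PRINCOMP works on the correlation matrix by default, which takes care of the scaling:

proc princomp data=abt out=scores n=2;
   /* OUT= adds the first N=2 principal component scores (PRIN1, PRIN2),
      ready for plotting, outlier detection or PCA regression */
   var x1-x100;
run;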
PRINCIPAL COMPONENTS: DIMENSION REDUCTION
P = XW. Now only take the first L columns of W: P_L = X W_L.
For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.
In matrix dimensions: P (10,000 × 100) = X (10,000 × 100) W (100 × 100), and P_L (10,000 × 2) = X (10,000 × 100) W_L (100 × 2).
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number].
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number].
Take only k ≪ r singular values: A ≈ A_k = U_k Σ_k V_kᵀ.
A data point d can now be represented by a k-dimensional point.
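A small SAS/IML sketch of this truncation (the 4 × 3 example matrix is made up):

proc iml;
   A = {4 0 2,
        1 3 0,
        0 1 5,
        2 2 2};
   call svd(U, Q, V, A);                       /* A = U*diag(Q)*V`     */
   k = 2;                                      /* keep only k << r     */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;   /* rank-k approximation */
   print Q, Ak;
quit;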
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 ≈ 8 mln numbers
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 15 largest SVs: 1% of the data
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 75 largest SVs: 5% of the data
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.
X1, X2, X3, …, X500
Cluster {X1, X21, X35, X430, …} → representative X35
Cluster {X17, X29, X353, X490, …} → representative X29
Cluster {X37, X95, X251, X393, …} → representative X251
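A hedged sketch with PROC VARCLUS (the dataset ABT with inputs X1-X500 is hypothetical):

proc varclus data=abt maxclusters=10 short;
   /* divisive clustering of the 500 variables into 10 clusters;
      pick one representative per cluster, e.g. the variable with
      the lowest 1-R**2 ratio in the output */
   var x1-x500;
run;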
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough, let multiple models vote for a prediction: bootstrap aggregation (bagging).
This only makes sense if the underlying models are different enough and have some predictive power.
[Figure: random samples drawn from the data, one model per sample, combined into a final model]
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Randomly choose m inputs, m ≪ P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
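A minimal random forest sketch (PROC HPFOREST ships with SAS Enterprise Miner; the dataset TRAIN and its variables are hypothetical):

proc hpforest data=train maxtrees=100 vars_to_try=10;
   /* 100 bootstrapped trees, m = 10 inputs tried per split */
   target response / level=binary;
   input age income x1-x8 / level=interval;
run;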
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and random forest (100 subtrees) fitted on the simulated data.
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear that the forest can produce much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient boosting: M iterations, m = 1, 2, …, M.
At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals r_m, using inputs x, to "correct" the previous learner:
F_m = F_{m−1} + γ·h_m
[Figure: pseudo-residuals r1, r2, …, rM and inputs x feeding successive base learners, ending in the final model F_M]
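A hedged sketch of gradient boosting in SAS (PROC TREEBOOST ships with SAS Enterprise Miner; the dataset TRAIN and its variables are hypothetical):

proc treeboost data=train iterations=200 shrinkage=0.1 maxdepth=3;
   /* M = 200 small trees; SHRINKAGE= plays the role of the step size */
   target response / level=binary;
   input age income x1-x8 / level=interval;
run;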
SUPPORT VECTOR MACHINES
SUPPORT VECTOR MACHINES (SVM)
Suppose we have a separable classification problem: find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear; we could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
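The formulas behind these four bullets are the standard ones (following Hastie, Tibshirani & Friedman, recommended in the reading list):

\begin{align*}
\text{Separable: } & \max_{\beta,\beta_0,\|\beta\|=1} M
   \quad \text{s.t. } y_i(x_i^T\beta+\beta_0) \ge M \\
\text{Non-separable: } & \min_{\beta,\beta_0} \tfrac{1}{2}\|\beta\|^2 + C\sum_i \xi_i
   \quad \text{s.t. } y_i(x_i^T\beta+\beta_0) \ge 1-\xi_i,\; \xi_i \ge 0 \\
\text{Lagrange dual: } & \max_{\alpha} \sum_i \alpha_i
   - \tfrac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j \langle x_i, x_j \rangle
   \quad \text{s.t. } 0 \le \alpha_i \le C,\; \sum_i \alpha_i y_i = 0 \\
\text{Kernels: } & \langle x_i, x_j \rangle \;\to\; K(x_i, x_j),
   \;\text{e.g. } K(x, x') = \exp(-\gamma\|x-x'\|^2)
\end{align*}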
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable, but in 3D space they are.
K-NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
[Figure: the 5 nearest neighbours of x0: 3 of them are red, 2 of them are green, so we predict x0 to be red.]
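A minimal k-NN classification sketch in SAS, via nearest-neighbour discriminant analysis (the datasets TRAIN and QUERY with variables X1, X2 and class COLOUR are hypothetical):

proc discrim data=train test=query testout=pred method=npar k=5;
   /* classify every query point by the majority vote of its
      5 nearest neighbours in the training data */
   class colour;     /* e.g. red / green */
   var x1 x2;
run;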
K-NN METHOD
[Figure: decision boundaries for 1 nearest neighbour versus 15 nearest neighbours]
K-NN METHOD
Use different numbers k of nearest neighbours and compare test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values for k were used; k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK LINEAR REGRESSION
Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4
[Figure: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding a neural network compute node f.]
f is the so-called activation function. This could be the logit function, but other choices are possible.
There are four weights w's that have to be determined.
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula, the prediction formula for a NN is given by:
Z_m = σ(α_{0m} + α_mᵀ X)
Y = g(β_0 + βᵀ Z)
[Figure: inputs X1, X2, X3, X4 (age, income, region, gender), hidden layer Z1, Z2, Z3, output Y, with weights α and β.]
The functions g and σ are defined as σ(v) = 1 / (1 + e^(−v)); in case of a binary classifier, g is of the same logistic form.
The model weights α and β have to be estimated from the data.
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back-propagation algorithm:
Randomly choose small values for all wᵢ's. For each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w according to the gradient step wᵢ(new) = wᵢ(old) − η · ∂E/∂wᵢ
4. Stop if the error E is small enough
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use inputs to predict the inputs
[Figure: autoencoder network: inputs X1–X4, an ENCODE part, a 2-dimensional middle layer (for visualisation), a DECODE part reproducing X1–X4.]
A linear activation function corresponds with 2-dimensional principal components analysis.
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Often more hidden layers with many nodes.
[Figure: ENCODE / DECODE layers; INPUT → OUTPUT = INPUT]
NEURAL NET CARS EXAMPLE
2-dimensional PCA versus an autoencoder network 25 – 15 – 2 – 15 – 25
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data=autoencoderTraining
   dmdbcat=work.autoencoderTrainingCat;
   performance compile details cpucount=12 threads=yes;
   /* defaults: act=tanh combine=linear              */
   /* ids are used as layer indicators - see figure 6 */
   /* inputs and targets should be standardized       */
   archi MLP hidden=5;
   hidden 300 / id=h1;
   hidden 100 / id=h2;
   hidden 2   / id=h3 act=linear;
   hidden 100 / id=h4;
   hidden 300 / id=h5;
   input corruptedPixel1-corruptedPixel400 / id=i level=int std=std;
   target pixel1-pixel400 / act=identity id=t level=int std=std;
   /* before preliminary training weights will be random */
   initial random=123;
   prelim 10 preiter=10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data
BAYESIAN NETWORKS
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting:
Parse & filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.
Then apply traditional data mining: clustering, prediction, machine learning.
TEXT MINING BASICS
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" ("I walk down the street in Amsterdam 1057DK with my bike")
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitly.com/sdrtw" ("She did not walk but cycled on her blue bike bitly.com/sdrtw")
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" ("My two-wheeler is broken, what a bad piece of iron $$")

TERM-DOCUMENT MATRIX A
Terms                         Doc 1   Doc 2   Doc 3
+Fiets (noun)                   1       1       1
Fietsen (verb)                  0       1       0
Blauwe (adjective)              0       1       0
Amsterdam (location)            1       0       0
+Lopen (verb)                   1       1       0
Straat (noun)                   1       0       0
Kapot (adverb)                  0       0       1
Slecht                          0       0       1
Stuk Ijzer                      0       0       1
1057DK (postal code)            1       0       0
bitly.com/sdrtw (Internet)      0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be many thousands].
Take only the first k ≪ r singular values: A ≈ U_k Σ_k V_kᵀ.
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn, fraud); apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).
[Figure: documents grouped into Topic 1, Topic 2, Topic 3]
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users × 100K items, roughly 0.01% filled.
User–item matrix (data):
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1
User 4's item ratings: -, -, 1, 2, 5. After some math… the predicted ratings for user 4 are: 3.21, 4.82, 1, 2, 5.
Recommend item 2.
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms: slope one (slope1), k-nearest neighbours (knn)
Model-based algorithms: matrix factorization (SVD – LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE
Item-item based: y = x + b, a regression with slope equal to 1.
Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j.
Sample rating database:
Customer  Item A  Item B  Item C
John         5       3       2
Mark         3       4       -
Lucy         -       2       5
For example, the predicted rating of Lucy for item A: the average difference between A and B is ((5 − 3) + (3 − 4)) / 2 = 0.5, so Lucy's predicted rating for A is 2 + 0.5 = 2.5.
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".
How to determine the neighbors, and how many (k) to use?
How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments
[Figure: similarity w and neighbors N in the user–item matrix]
RE METHODS: PEARSON CORRELATION
a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1.

sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )
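A small SAS/IML check of this formula (the ratings of users a and b on their three common items are made up):

proc iml;
   ra = {5 3 2};                       /* ratings of user a on items in P  */
   rb = {3 4 2};                       /* ratings of user b on items in P  */
   da = ra - mean(ra`);                /* center on the user's mean rating */
   db = rb - mean(rb`);
   sim = sum(da#db) / (sqrt(ssq(da)) * sqrt(ssq(db)));  /* Pearson similarity */
   print sim;                          /* value between -1 and 1           */
quit;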
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?
Factorize the user–item matrix: R (m × n) ≈ U (m × k) · V (k × n), users by items, with k hidden factors.
Select a loss function (squared error); select the number of hidden factors k; solve the optimization problem with L-BFGS or ALS.
Predict a new rating: R̂_ij = U_iᵀ V_j
Minimize the prediction error: min_{U,V} Σ_{i,j} (R_ij − U_iᵀ V_j)² + λ (‖U_i‖² + ‖V_j‖²)
RE METHODS CLUSTER
kNN within one subgroup.
[Figure: user/item profiles and user/item ratings → clustering → predictions]
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data.
IF item A and B THEN item C; IF item X THEN item Y.
Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule.
Support(X, Y) = (# trxs with X and Y) / (total # trxs)
Lift = Support(X, Y) / (Support(X) · Support(Y))
Support & lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
PROC RECOMMEND recom = rs.IENS;
   /* add a recommendation system */
   ADD rs.IENS item=item user=user rating=rating;
   /* add tables */
   ADDTABLE LHL1209.IENS_UIR recom=rs.IENS type=rating vars=(item user rating);
   /* method SVD LBFGS with 20 factors */
   METHOD svd / factors=20 label="svd" fconv=1e-3 gconv=1e-3 maxiter=100
      maxfeval=5000 function=L2 lambda=0.2 technique=lbfgs;
   RUN;
   METHOD arm / label="ARM";
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT / method=svd label="svd" num=3 users=("Longhow Lam");
RUN;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)
CONS:
• Unfamiliar to a broader audience; (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem
PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So, can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used Text Miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
Predicted review score versus given review score:
R² linear regression = 0.5; R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
First 100 digits of the MNIST data with their KNOWN labels in red.
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
70/30 training/validation split. Techniques tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels; our best model predicted the label for these digits.
The first 100 predicted digits, together with the handwritten digits, are displayed here.
Red numbers are predicted labels; we see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[Audio: spoken digits 1 and 2]
AGENDA AND SOME READING MATERIAL
Intro amp positioning of Machine learning SAS platform for Machine learning Overview of Specific methods Some examples
Further reading
An experimental comparison of classification techniques for imbalanced credit scoring data sets using SASreg Enterprise Minerhttpsupportsascomresourcespapersproceedings12129-2012pdf
Benchmarking state-of-the-art classification algorithms for credit scoring A ten-year updatehttpwwwbusiness-schooledacukwafcrc_archive201342pdf
An absolute recommender for more detail The elements of statistical learning Hasting Tibshirani amp Friedman httpwww-statstanfordedu~tibsElemStatLearn
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LONGHOW LAM SHORT BIO
MSc Mathematics (1995) Vrije Universiteit Amsterdam (drs wiskunde) MTD Applied Statistics (1997) Technical University Delft (twee jarige AIO toegepaste statistiek)
10+ year SAS experience (Base Stat Guide Miner VA VS) 10+ year R experience ( An introduction to R)
10 + year predictive modeling experience ABNAMRO ndash Risk modeler
Basel Credit risk ALM models BusinessampDecision ndash Quantitative consultant
ING Belgium Fortis Leaseplan Belgium Post
Experian ndash data mininer Collection Score Delphi credit score consulting
longhowlamFollow me
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
INTRO MACHINE LEARNING
WikipedialdquoMachine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data Such algorithms operate by building a model based on inputs and using that to make predictions or decisions rather than following only explicitly programmed instructionsrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR
Statisticalmodeling
SupervisedLearning
Clustering
UnsupervisedLearning
Data mining
Machine learning
Dimensionreduction
Association rules
Recommender
Autoencoders
Self organizing
maps
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SAS SOFTWAREFOR MACHINE LEARNING (AND DATA MINING)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IDENTIFY FORMULATE
PROBLEM
DATAPREPARATION
DATAEXPLORATION
TRANSFORMamp SELECT
BUILDMODEL
VALIDATEMODEL
DEPLOYMODEL
EVALUATE MONITORRESULTS
SAS In-Database ScoringSAS Decision Manager
BUSINESSMANAGER
SAS Model Manager
IT SYSTEMS MANAGEMENT
SAS Enterprise Guide
BUSINESSANALYST
Enterprise Miner Text MinerSAS IMSTAT Recommender
DATA MINER DATA SCIENTIST
THE ANALYTICS LIFECYCLE
SAS Visual AnalyticsSAS Visual Statistics
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES
PROC hpbnet data = creditdata structure = markovblanket model default = x1 LTV income age selction = YRUN
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING
Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments
HIGH PERFORMANCE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Manage Rules + Data + Models
Deployment flexibility BatchReal TimeStored ProcessIn Database
Drive Reuse and Consistency
EASY DEPLOYABLE
Model
Data
Rules
Model
MACHINE LEARNING WITH SAS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICT SOMEONErsquoS INCOME
Income = 152 + 1102 times Age
Age
Income
Predict someones income from hisher age
Collect some data
Plot the data
Analytical Base Table
IS THIS MACHINE LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING ADDRESSING SOME MODELING ISSUES
The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip
You do not have one input variable X1 X2 X3helliphellipX567
Interactions en correlations between input variables
age
income
male
female
Analytical base table Derived inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects
Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000
Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 200 9000 9000
2 10000 150 5000 14000
3 10000 100 1000 15000
4 10000 100 1000 16000
5 10000 100 1000 17000
6 10000 100 1000 18000
7 10000 100 1000 19000
8 10000 080 -600 18400
9 10000 050 -3000 15400
10 10000 020 -5400 10000
The profit by using a model to sent letters only to the first 7 deciles is now
euro 19000 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 09 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 300 17000 17000
2 10000 200 9000 26000
3 10000 140 4200 30200
4 10000 115 2200 32400
5 10000 100 1000 33400
6 10000 060 -2200 31200
7 10000 040 -3800 27400
8 10000 030 -4600 22800
9 10000 010 -6200 16600
10 10000 005 -6600 10000
The profit by using a much better model to sent letters only to the first 5 deciles is now
euro 33400 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 234 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 335 19800 19800
2 10000 223 10840 30640
3 10000 130 3400 34040
4 10000 110 1800 35840
5 10000 100 1000 36840
6 10000 055 -2600 34240
7 10000 028 -4760 29480
8 10000 025 -5000 24480
9 10000 005 -6600 17880
10 10000 002 -6840 11040
Now lets suppose we have even a slightly better model than the last one
euro 36840
If you have 100 of such campaigns a year that means an increase of
euro 268 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines
K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
ldquoCLASSICALrdquo REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LINEAR amp LOGISTIC REGRESSION
Income = a + b times Age
Age
Income
Age
P(Churn)1
0
P(Churn) =
Numeric target variable Binairy target variable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables
Y logit(y)
X
Smoothing Splines
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines Piecewise polynomials that are glued together at knots
Two special cases for λ
λ = 0 Any function that interpolates the data
λ = infin Simple Least square line fit
Choose λ by cross validation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
Y = f(X, w) = f(w1 + w2·X2 + w3·X3 + w4·X4)

[Diagram: a single neural network compute node; the constant input 1 enters with weight w1 (the bias), and the inputs X2, X3, X4 enter with weights w2, w3, w4.]

f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula form, the prediction formula for a NN is given by (the slide's own formula is an image in this transcript; below is the standard single-hidden-layer notation of Hastie et al.):

Z_m = σ(α_0m + α_m^T X), m = 1, …, M
Y = f(X) = g(β_0 + β^T Z)

[Diagram: inputs X1-X4 (age, income, region, gender), hidden layer Z1-Z3, output nodes Y / N; α are the input-to-hidden weights, β the hidden-to-output weights.]

The functions g and σ are defined as: σ(v) = 1 / (1 + e^(−v)), the sigmoid; g is the identity for a numeric target. In case of a binary classifier, g is the logistic function, so the output is a probability.

The model weights α and β have to be estimated from the data.
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back-propagation algorithm:

Randomly choose small values for all wi's. For each data point (observation):

1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w according to the gradient step w_new = w_old − γ · ∂E/∂w, with learning rate γ.
4. Stop if the error E is small enough.
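A worked one-weight illustration of step 3 (hypothetical numbers, not from the slide): take a trivial net f = w·x, a data point with x = 2 and actual y = 1, current weight w = 0.3 and learning rate γ = 0.1. The prediction is 0.6, the error E = (1 − 0.6)² = 0.16, the gradient ∂E/∂w = −2x(actual − prediction) = −1.6, and the updated weight is w = 0.3 − 0.1 · (−1.6) = 0.46.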
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS AUTOENCODERS
https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use inputs to predict the inputs.

[Diagram: inputs X1-X4, ENCODE to a 2-dimensional middle layer, DECODE back to X1-X4.]

A linear activation function corresponds with 2-dimensional principal components analysis. The 2-dimensional middle layer can be used for visualisation.
NEURAL NETS AUTOENCODERS
https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes are used.

[Diagram: INPUT, ENCODE, DECODE, OUTPUT = INPUT.]
NEURAL NET CARS EXAMPLE
2-dimensional PCA vs. an autoencoder network 25-15-2-15-25.
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR               */
   /* IDS ARE USED AS LAYER INDICATORS (SEE FIGURE 6)    */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED          */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.
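The graph structure encodes a factorization of the joint distribution: each node depends only on its parents,

P(X_1, …, X_n) = Π_i P(X_i | parents(X_i)).

For example (a hypothetical three-node network in the spirit of the credit data used with PROC HPBNET earlier in this deck): with age → default and LTV → default, the joint distribution factorizes as P(age) · P(LTV) · P(default | age, LTV), so only three (conditional) probability tables have to be estimated from the training data.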
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting:

Parse & filter: part of speech, entity detection, mixed / numeric / abbrev., stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.
TEXT MINING BASICS
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" [Dutch: "I walk down the street in Amsterdam 1057DK with my bike"]
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" ["She did not walk but cycled with her blue bike"]
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" ["My two-wheeler is broken, what a bad piece of iron"]

TERM DOCUMENT MATRIX A

Terms                        Doc 1   Doc 2   Doc 3
+Fiets (noun)                  1       1       1
Fietsen (verb)                 0       1       0
Blauwe (adjective)             0       1       0
Amsterdam (location)           1       0       0
+Lopen (verb)                  1       1       0
Straat (noun)                  1       0       0
Kapot (adverb)                 0       0       1
Slecht                         0       0       1
Stuk Ijzer                     0       0       1
1057DK (postal code)           1       0       0
bitlycomsdrtw (Internet)       0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term-document matrix:

• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition first.
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is then no longer a long vector of m word counts but a much shorter vector, say of length 300.

Matrix SVD decomposition: A = U Σ V^T, where Σ is a diagonal matrix with r singular values [r could be many thousands].

Take only the first k << r singular values: A ≈ A_k = U_k Σ_k V_k^T.
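As a sense of scale (hypothetical numbers, chosen to match the "say of length 300" above): with m = 50,000 terms and n = 16,000 documents, A is 50,000 × 16,000. Taking k = 300 gives U_k of size 50,000 × 300, Σ_k of 300 × 300 and V_k^T of 300 × 16,000, so every document is represented by a 300-dimensional vector instead of 50,000 word counts.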
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.
RECOMMENDATION ENGINE: Which products should I recommend to my customers?
RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items, roughly 0.01% filled.

User-Item Matrix - Data

          Item 1   Item 2   Item 3   Item 4   Item 5
User 1      3        2        5        4        5
User 2      -        -        -        1        1
User 3      1        -        2        5        -
User 4      -        -        1        2        5
User 5      2        1        4        2        3
User 6      2        3        -        5        1
User 7      5        1        -        3        4
User 8      -        1        -        4        1
User 9      2        3        2        4        2
User 10     -        1        3        -        1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predicted ratings for User 4 are: 3.21, 4.82, 1, 2, 5. Recommend item 2.
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms: slope one (slope1), k nearest neighbours (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble
RE METHODS SLOPE ONE
Item-item based: y = x + b, with the slope equal to 1.

Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the rating of user u for item j; the prediction for item i is a weighted average of r_uj plus the average deviation between items i and j.

Sample rating database:

Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
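A worked illustration on this sample database (standard weighted slope-one computation): the average deviation between items A and B is ((5 − 3) + (3 − 4)) / 2 = 0.5, and between A and C it is (5 − 2) / 1 = 3. Lucy's predicted rating for item A is then ((2 + 0.5) · 2 + (5 + 3) · 1) / (2 + 1) ≈ 4.33, where the weights 2 and 1 count the users who rated both items of each pair.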
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighbourhood".

How to determine the neighbours, and how many (k) to use?

How to compute the similarity / distance measure w?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Figure: similarity w and the set of neighbours N.]
RE METHODS: PEARSON CORRELATION

a, b: users; r_a,p: the rating of user a for item p; P: the set of items rated both by a and b. Possible similarity values between −1 and 1.

sim(a, b) = Σ_{p∈P} (r_a,p − r̄_a)(r_b,p − r̄_b) / ( √(Σ_{p∈P} (r_a,p − r̄_a)²) · √(Σ_{p∈P} (r_b,p − r̄_b)²) )
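A worked illustration using the sample rating database from the slope-one slide: for John and Mark we have P = {A, B}, with means r̄_John = (5 + 3)/2 = 4 and r̄_Mark = (3 + 4)/2 = 3.5 over P. The numerator is (5 − 4)(3 − 3.5) + (3 − 4)(4 − 3.5) = −1 and the denominator is √(1² + (−1)²) · √((−0.5)² + 0.5²) = √2 · √0.5 = 1, so sim(John, Mark) = −1: on the items they share, their ratings move in exactly opposite directions.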
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?

Factorize the user-item rating matrix R (m × n, users × items) as R ≈ U V, with U an m × k matrix and V a k × n matrix, where k is the number of hidden factors.

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: R̂_ij = U_i^T V_j

Minimize the prediction error: min_{U,V} Σ_ij (R_ij − U_i^T V_j)² + λ(‖U_i‖² + ‖V_j‖²)
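A small numeric illustration of the prediction step (hypothetical factor values, not from the slide): with k = 2 hidden factors, user factors U_4 = (0.8, 1.2) and item factors V_2 = (1.5, 2.1), the predicted rating is R̂_42 = U_4^T V_2 = 0.8 · 1.5 + 1.2 · 2.1 = 3.72. Predicted ratings like the 3.21 and 4.82 on the user-item matrix slide can be produced by exactly this kind of dot product once U and V have been fitted.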
RE METHODS CLUSTER
First cluster the users (or items) on their user/item profiles or ratings; then apply knn within one subgroup to compute the predictions.

[Diagram: user/item profile and user/item rating, clustering, knn within one subgroup, predictions.]
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, such as

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = (# trxs with X and Y) / (total # trxs)

Lift(X, Y) = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift examples: Diapers, Beer: 0.8; Diapers, Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
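A worked example of these formulas (hypothetical counts): out of 1000 transactions, 100 contain diapers, 80 contain beer and 20 contain both. Then Support(diapers, beer) = 20/1000 = 0.02, Support(diapers) = 0.10 and Support(beer) = 0.08, giving Lift = 0.02 / (0.10 · 0.08) = 2.5: the combination occurs 2.5 times as often as it would if diapers and beer were bought independently.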
METHOD ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors   = 20
      label     = svd
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      MAXFEVAL  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = svd
      Num    = 3
      users  = ("Longhow Lam");
   RUN;
QUIT;
LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High-performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
Predicted review score vs. given review score.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

R² linear regression = 0.5; R² neural net = 0.6.
IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

The first 100 digits of the MNIST data and their KNOWN labels (in red).
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
70/30 training/validation split. Models tried:

• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100 and 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels; our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we obviously see some mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Two embedded audio clips: spoken digits 1 and 2]
LONGHOW LAM SHORT BIO
MSc Mathematics (1995) Vrije Universiteit Amsterdam (drs wiskunde) MTD Applied Statistics (1997) Technical University Delft (twee jarige AIO toegepaste statistiek)
10+ year SAS experience (Base Stat Guide Miner VA VS) 10+ year R experience ( An introduction to R)
10 + year predictive modeling experience ABNAMRO ndash Risk modeler
Basel Credit risk ALM models BusinessampDecision ndash Quantitative consultant
ING Belgium Fortis Leaseplan Belgium Post
Experian ndash data mininer Collection Score Delphi credit score consulting
longhowlamFollow me
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
INTRO MACHINE LEARNING
WikipedialdquoMachine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data Such algorithms operate by building a model based on inputs and using that to make predictions or decisions rather than following only explicitly programmed instructionsrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR
Statisticalmodeling
SupervisedLearning
Clustering
UnsupervisedLearning
Data mining
Machine learning
Dimensionreduction
Association rules
Recommender
Autoencoders
Self organizing
maps
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SAS SOFTWAREFOR MACHINE LEARNING (AND DATA MINING)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IDENTIFY FORMULATE
PROBLEM
DATAPREPARATION
DATAEXPLORATION
TRANSFORMamp SELECT
BUILDMODEL
VALIDATEMODEL
DEPLOYMODEL
EVALUATE MONITORRESULTS
SAS In-Database ScoringSAS Decision Manager
BUSINESSMANAGER
SAS Model Manager
IT SYSTEMS MANAGEMENT
SAS Enterprise Guide
BUSINESSANALYST
Enterprise Miner Text MinerSAS IMSTAT Recommender
DATA MINER DATA SCIENTIST
THE ANALYTICS LIFECYCLE
SAS Visual AnalyticsSAS Visual Statistics
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES
PROC hpbnet data = creditdata structure = markovblanket model default = x1 LTV income age selction = YRUN
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING
Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments
HIGH PERFORMANCE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Manage Rules + Data + Models
Deployment flexibility BatchReal TimeStored ProcessIn Database
Drive Reuse and Consistency
EASY DEPLOYABLE
Model
Data
Rules
Model
MACHINE LEARNING WITH SAS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICT SOMEONErsquoS INCOME
Income = 152 + 1102 times Age
Age
Income
Predict someones income from hisher age
Collect some data
Plot the data
Analytical Base Table
IS THIS MACHINE LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING ADDRESSING SOME MODELING ISSUES
The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip
You do not have one input variable X1 X2 X3helliphellipX567
Interactions en correlations between input variables
age
income
male
female
Analytical base table Derived inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects
Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000
Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 200 9000 9000
2 10000 150 5000 14000
3 10000 100 1000 15000
4 10000 100 1000 16000
5 10000 100 1000 17000
6 10000 100 1000 18000
7 10000 100 1000 19000
8 10000 080 -600 18400
9 10000 050 -3000 15400
10 10000 020 -5400 10000
The profit by using a model to sent letters only to the first 7 deciles is now
euro 19000 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 09 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 300 17000 17000
2 10000 200 9000 26000
3 10000 140 4200 30200
4 10000 115 2200 32400
5 10000 100 1000 33400
6 10000 060 -2200 31200
7 10000 040 -3800 27400
8 10000 030 -4600 22800
9 10000 010 -6200 16600
10 10000 005 -6600 10000
The profit by using a much better model to sent letters only to the first 5 deciles is now
euro 33400 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 234 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 335 19800 19800
2 10000 223 10840 30640
3 10000 130 3400 34040
4 10000 110 1800 35840
5 10000 100 1000 36840
6 10000 055 -2600 34240
7 10000 028 -4760 29480
8 10000 025 -5000 24480
9 10000 005 -6600 17880
10 10000 002 -6840 11040
Now lets suppose we have even a slightly better model than the last one
euro 36840
If you have 100 of such campaigns a year that means an increase of
euro 268 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines
K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
ldquoCLASSICALrdquo REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LINEAR amp LOGISTIC REGRESSION
Income = a + b times Age
Age
Income
Age
P(Churn)1
0
P(Churn) =
Numeric target variable Binairy target variable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables
Y logit(y)
X
Smoothing Splines
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines Piecewise polynomials that are glued together at knots
Two special cases for λ
λ = 0 Any function that interpolates the data
λ = infin Simple Least square line fit
Choose λ by cross validation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
INTRO MACHINE LEARNING
WikipedialdquoMachine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data Such algorithms operate by building a model based on inputs and using that to make predictions or decisions rather than following only explicitly programmed instructionsrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR
Statisticalmodeling
SupervisedLearning
Clustering
UnsupervisedLearning
Data mining
Machine learning
Dimensionreduction
Association rules
Recommender
Autoencoders
Self organizing
maps
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SAS SOFTWAREFOR MACHINE LEARNING (AND DATA MINING)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IDENTIFY FORMULATE
PROBLEM
DATAPREPARATION
DATAEXPLORATION
TRANSFORMamp SELECT
BUILDMODEL
VALIDATEMODEL
DEPLOYMODEL
EVALUATE MONITORRESULTS
SAS In-Database ScoringSAS Decision Manager
BUSINESSMANAGER
SAS Model Manager
IT SYSTEMS MANAGEMENT
SAS Enterprise Guide
BUSINESSANALYST
Enterprise Miner Text MinerSAS IMSTAT Recommender
DATA MINER DATA SCIENTIST
THE ANALYTICS LIFECYCLE
SAS Visual AnalyticsSAS Visual Statistics
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES
PROC hpbnet data = creditdata structure = markovblanket model default = x1 LTV income age selction = YRUN
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING
Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments
HIGH PERFORMANCE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Manage Rules + Data + Models
Deployment flexibility BatchReal TimeStored ProcessIn Database
Drive Reuse and Consistency
EASY DEPLOYABLE
Model
Data
Rules
Model
MACHINE LEARNING WITH SAS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICT SOMEONErsquoS INCOME
Income = 152 + 1102 times Age
Age
Income
Predict someones income from hisher age
Collect some data
Plot the data
Analytical Base Table
IS THIS MACHINE LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING ADDRESSING SOME MODELING ISSUES
The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip
You do not have one input variable X1 X2 X3helliphellipX567
Interactions en correlations between input variables
age
income
male
female
Analytical base table Derived inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects
Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000
Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 200 9000 9000
2 10000 150 5000 14000
3 10000 100 1000 15000
4 10000 100 1000 16000
5 10000 100 1000 17000
6 10000 100 1000 18000
7 10000 100 1000 19000
8 10000 080 -600 18400
9 10000 050 -3000 15400
10 10000 020 -5400 10000
The profit by using a model to sent letters only to the first 7 deciles is now
euro 19000 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 09 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 300 17000 17000
2 10000 200 9000 26000
3 10000 140 4200 30200
4 10000 115 2200 32400
5 10000 100 1000 33400
6 10000 060 -2200 31200
7 10000 040 -3800 27400
8 10000 030 -4600 22800
9 10000 010 -6200 16600
10 10000 005 -6600 10000
The profit by using a much better model to sent letters only to the first 5 deciles is now
euro 33400 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 234 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 335 19800 19800
2 10000 223 10840 30640
3 10000 130 3400 34040
4 10000 110 1800 35840
5 10000 100 1000 36840
6 10000 055 -2600 34240
7 10000 028 -4760 29480
8 10000 025 -5000 24480
9 10000 005 -6600 17880
10 10000 002 -6840 11040
Now lets suppose we have even a slightly better model than the last one
euro 36840
If you have 100 of such campaigns a year that means an increase of
euro 268 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines
K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
ldquoCLASSICALrdquo REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LINEAR amp LOGISTIC REGRESSION
Income = a + b times Age
Age
Income
Age
P(Churn)1
0
P(Churn) =
Numeric target variable Binairy target variable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables
Y logit(y)
X
Smoothing Splines
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines Piecewise polynomials that are glued together at knots
Two special cases for λ
λ = 0 Any function that interpolates the data
λ = infin Simple Least square line fit
Choose λ by cross validation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
DECISION TREES: REGRESSION & CLASSIFICATION
How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.
Why is split X1 < t1 better than X1 < s1?
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared
Regression tree: mean squared error
[Plots: Y versus X under split s1 and under split t1]
DECISION TREES: REGRESSION & CLASSIFICATION
How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.
Why is split X1 < t1 better than X1 < s1?
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared
Classification tree: misclassification rate
[Plot: split s1 versus split t1]
DECISION TREES (REGRESSION & CLASSIFICATION)
When to stop? Not too early, not too late.
Pruning: remove parts of the tree.
DECISION TREES: SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C4.5, C5.0
CART (Classification And Regression Trees)
The difference is mainly in the available splitting options.
DECISION TREES: PROS AND CONS
Pros: interaction between variables; interpretable rules; missing values easy to incorporate.
Cons: unstable; "lack of smoothness"; fit of obvious (non)linear relations.
[Example tree with splits on gender (man/vrouw), income < 45K and age < 33, with response rates per leaf; scatter plot of Opel Astra prices]
DIMENSION REDUCTION
PRINCIPAL COMPONENTS ANALYSIS
Linear transformation of the data to uncorrelated data.
The transformation W is such that: the largest variance is in the first coordinate, the second-largest variance is in the second coordinate, etc.
PRINCIPAL COMPONENTS ANALYSIS
[Scatter plot: data points in the (X1, X2) plane, with the principal directions P1 and P2 drawn through the cloud]
PRINCIPAL COMPONENTS ANALYSIS
[Scatter plot: the same data expressed in principal-component coordinates (P1, P2)]
PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND IT
With two dimensions, P = X W:

\begin{pmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{pmatrix}
= \begin{pmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{pmatrix}
  \begin{pmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{pmatrix}

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.
In general it turns out that the columns of W are the eigenvectors of the matrix XᵀX.
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here.
Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the principal components instead of the original inputs
A minimal SAS sketch is shown below.
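A minimal sketch, assuming a table mydata with 100 numeric inputs (names are illustrative); PROC PRINCOMP works on the correlation matrix by default, so the inputs are scaled automatically:

proc princomp data=mydata out=scores n=2;  /* keep the first 2 components */
  var x1-x100;                             /* correlation-based: inputs are scaled */
run;
/* the output table 'scores' contains Prin1 and Prin2,
   usable for plotting or as inputs to a PCA regression */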
PRINCIPAL COMPONENTS: DIMENSION REDUCTION
P = X W. Now only take the first L columns of W: P_L = X W_L.
For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.
P = X W:     (10000 × 100) = (10000 × 100) (100 × 100)
P_L = X W_L: (10000 × 2)   = (10000 × 100) (100 × 2)
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ
Σ is diagonal with r singular values [r could be a large number].
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number].
Take only k ≪ r singular values: A_k = U_k Σ_k V_kᵀ
A data point d can now be represented by a k-dimensional point. A sketch in SAS/IML follows below.
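A minimal SAS/IML sketch of the rank-k approximation (table and variable names are illustrative):

proc iml;
  use mydata;  read all var _NUM_ into A;  close;   /* any numeric matrix */
  call svd(U, Q, V, A);                             /* A = U*diag(Q)*V` */
  k  = 15;                                          /* keep the k largest singular values */
  Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;         /* rank-k reconstruction of A */
quit;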
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 ≈ 8 mln numbers
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
SVD, 15 largest SVs: 1% of the data
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
SVD, 75 largest SVs: 5% of the data
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling. For example:
cluster {X1, X21, X35, X430, …} → use X35
cluster {X17, X29, X353, X490, …} → use X29
cluster {X37, X95, X251, X393, …} → use X251
A minimal SAS sketch is shown below.
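A minimal sketch, assuming 500 numeric inputs x1–x500 (illustrative names); PROC VARCLUS groups correlated variables, and its output suggests one representative per cluster (the variable with the lowest 1−R² ratio):

proc varclus data=mydata maxclusters=10 short;
  var x1-x500;     /* split the variables into at most 10 correlated clusters */
run;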
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
[Screenshots: variable clustering output in SAS]
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging).
This only makes sense if the underlying models are different enough and have some predictive power.
[Diagram: random samples drawn from the data, one model fitted per sample, combined into a final model]
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m ≪ P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree, the random forest prediction is the majority vote of all trees; in case of a regression tree, it is the average of all trees. A minimal SAS sketch follows below.
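A minimal sketch with the high-performance forest procedure from SAS Enterprise Miner (dataset and variable names are illustrative; VARS_TO_TRY plays the role of m):

proc hpforest data=mydata maxtrees=100 vars_to_try=10;  /* 100 bootstrap trees, m = 10 */
  target default / level=binary;      /* classification: majority vote of the trees */
  input x1-x500  / level=interval;
run;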
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
[Plots: a decision tree and a random forest (100 sub-trees) fitted on the simulated data]
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
It is easy to see that the forest can produce much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW
Gradient boosting runs M iterations, m = 1, 2, …, M. At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals, using the inputs x, to "correct" the previous learner:
F_m = F_(m-1) + γ·h_m
with pseudo-residuals r_m recomputed at each step.
[Diagram: inputs x and residuals r_1, r_2, …, r_M feeding successive trees, ending in the final model F_M]
The standard update rules are spelled out below.
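The slide's formulas were images; in standard form (added here for reference) the recipe is:

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma), \qquad
r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \qquad
F_m(x) = F_{m-1}(x) + \gamma_m\, h_m(x)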
SUPPORT VECTOR MACHINES
SUPPORT VECTOR MACHINES (SVM)
Suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M; so the green line would be better than the blue line.
If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. X², X³ or spline(X).
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
The standard formulations are sketched below.
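The original formulas were shown as images; these are the standard formulations from the literature (e.g. The Elements of Statistical Learning), added here for reference:

\min_{\beta,\beta_0} \tfrac{1}{2}\lVert\beta\rVert^2
  \quad \text{s.t. } y_i(x_i^{T}\beta + \beta_0) \ge 1 \qquad \text{(separable)}

\min_{\beta,\beta_0} \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i
  \quad \text{s.t. } y_i(x_i^{T}\beta + \beta_0) \ge 1 - \xi_i,\ \xi_i \ge 0 \qquad \text{(non-separable)}

\max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle
  \quad \text{s.t. } 0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0 \qquad \text{(Lagrange dual)}

The kernel trick replaces the inner product ⟨x_i, x_j⟩ by a kernel K(x_i, x_j).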
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable in the plane, but in 3D space they are.
K-NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
[Plot: the 5 nearest neighbours of x0]
3 of them are red and 2 of them are green, so we predict x0 to be red. A minimal SAS sketch follows below.
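A minimal sketch of k-NN classification via nonparametric discriminant analysis (dataset and variable names are illustrative):

proc discrim data=train test=new testout=pred method=npar k=5;
  class colour;        /* the label: red / green */
  var x1 x2;           /* coordinates used for the distance */
run;
/* 'pred' holds the majority-vote classification of each query point */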
K-NN METHOD
[Plots: decision boundaries of the 1-nearest-neighbour and 15-nearest-neighbour classifiers]
K-NN METHOD
Use different numbers k of nearest neighbours and compare test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES
Extracted house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE: DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values of k were used: k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION
Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4
[Diagram: input nodes 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding one compute node]
Neural network compute node: f is the so-called activation function. This could be the logit function, but other choices are possible.
There are four weights w's that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION
[Diagram: inputs X1–X4 (age, income, region, gender), a hidden layer Z1–Z3, and output Y, with weights α on the input-to-hidden links and β on the hidden-to-output links]
In formulas, the prediction of a neural network is given by the expressions below. The functions g and σ are the output and activation functions; in case of a binary classifier, g is the logistic function. The model weights α and β have to be estimated from the data.
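The slide's formulas were images; a reconstruction of the standard single-hidden-layer network in the notation used here (α, β, g, σ):

Z_m = \sigma(\alpha_{0m} + \alpha_m^{T} X), \quad m = 1,\dots,M, \qquad
Y = g\!\left(\beta_0 + \beta^{T} Z\right), \qquad
\sigma(v) = \frac{1}{1 + e^{-v}}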
NEURAL NETWORKS: ESTIMATING THE WEIGHTS
Back-propagation algorithm: randomly choose small values for all weights wᵢ. Then, for each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w according to the update rule below
4. Stop if the error E is small enough
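The adjustment in step 3 was shown as an image; the standard gradient-descent update it refers to, with learning rate η (symbol added here), is:

w_i \leftarrow w_i - \eta \,\frac{\partial E}{\partial w_i}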
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS
https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.
[Diagram: inputs X1–X4 encoded to a small middle layer and decoded back to X1–X4: ENCODE → DECODE]
A linear activation function corresponds with 2-dimensional principal components analysis.
A 2-dimensional middle layer can be used for visualisation.
NEURAL NETS: AUTOENCODERS
https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Often more hidden layers with many nodes.
[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]
NEURAL NET: CARS EXAMPLE
[Plots: 2-dimensional PCA versus an autoencoder network 25 – 15 – 2 – 15 – 25]
NEURAL NETS: AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
  data= autoencoderTraining
  dmdbcat= work.autoencoderTrainingCat;
  performance compile details cpucount= 12 threads= yes;
  /* DEFAULTS: ACT= TANH COMBINE= LINEAR        */
  /* IDS ARE USED AS LAYER INDICATORS           */
  /* INPUTS AND TARGETS SHOULD BE STANDARDIZED  */
  archi MLP hidden= 5;
  hidden 300 / id= h1;
  hidden 100 / id= h2;
  hidden 2   / id= h3 act= linear;
  hidden 100 / id= h4;
  hidden 300 / id= h5;
  input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
  target pixel1-pixel400 / act= identity id= t level= int std= std;
  /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
  initial random= 123;
  prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting:
• Parse & filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words
• Apply traditional data mining: clustering, prediction, machine learning
TEXT MINING BASICS
Three (Dutch) example documents:
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

TERM-DOCUMENT MATRIX A
Terms                      Doc 1  Doc 2  Doc 3
+Fiets (znmw)                1      1      1
Fietsen (ww)                 0      1      0
Blauwe (bvg)                 0      1      0
Amsterdam (locatie)          1      0      0
+Lopen (ww)                  1      1      0
Straat (znmw)                1      0      0
Kapot (bijw)                 0      0      1
Slecht                       0      0      1
Stuk Ijzer                   0      0      1
1057DK (postcode)            1      0      0
bitlycomsdrtw (Internet)     0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING: TERM-DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term-document matrix:
• often more terms than documents
• rows could be strongly correlated
• the matrix is often very sparse
Apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be many thousands].
Take only the first k ≪ r singular values: A_k = U_k Σ_k V_kᵀ
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
TEXT MINING APPLICATIONS
• Combine customer structured data and unstructured data to better predict behaviour (churn, fraud); apply machine learning to create a model f to predict the target.
• Automatically generate topics within large document collections; apply clustering techniques to classify documents into clusters (topics).
[Diagram: documents grouped into Topic 1, Topic 2, Topic 3]
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items → ≈ 0.01% filled.

User–Item Matrix   Item 1  Item 2  Item 3  Item 4  Item 5
User 1               3       2       5       4       5
User 2               -       -       -       1       1
User 3               1       -       2       5       -
User 4               -       -       1       2       5
User 5               2       1       4       2       3
User 6               2       3       -       5       1
User 7               5       1       -       3       4
User 8               -       1       -       4       1
User 9               2       3       2       4       2
User 10              -       1       3       -       1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predicted ratings for User 4 are: 3.21, 4.82, 1, 2, 5.
Recommend Item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: slope one (slope1), k nearest neighbours (knn)
• Model-based algorithms: matrix factorization (SVD – LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE
Item–item based: y = x + b, a regression with slope equal to 1.
Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j.

Sample rating database:
Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
RE METHODS: K NEAREST NEIGHBOURS
The rating r_ui is determined by the ratings "in the neighborhood":
• How to determine the neighbours N, and how many (k) to use?
• How to compute the similarity/distance measure w?
  – Pearson's correlation coefficient
  – cosine distance
  – other adjustments
RE METHODS: PEARSON CORRELATION
For users a and b, let r_{a,p} be the rating of user a for item p, and P the set of items rated both by a and b. Possible similarity values lie between −1 and 1:

sim(a, b) = \frac{\sum_{p \in P} (r_{a,p} - \bar r_a)(r_{b,p} - \bar r_b)}
                 {\sqrt{\sum_{p \in P} (r_{a,p} - \bar r_a)^2}\ \sqrt{\sum_{p \in P} (r_{b,p} - \bar r_b)^2}}
RE METHODS: K NEAREST NEIGHBOURS METHOD
[Screenshot]
RE METHODS: MATRIX FACTORIZATION
How do we fill in the missing data? Factorize the m × n user–item matrix R as R ≈ U V, with U an m × k and V a k × n matrix, where k is the number of hidden factors.
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS
Predict a new rating as \hat R_{ij} = U_i^{T} V_j and minimize the prediction error:

\min_{U,V} \sum_{i,j} \left(R_{ij} - U_i^{T} V_j\right)^2 + \lambda \left(\lVert U_i\rVert^2 + \lVert V_j\rVert^2\right)
RE METHODS: CLUSTER
First cluster the users/items on their profiles or ratings, then apply k-NN within one subgroup to generate the predictions.
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:
IF item A and B THEN item C; IF item X THEN item Y
Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule:
Support(X→Y) = (# transactions with X and Y) / (total # transactions)
Lift = Support(X, Y) / ( Support(X) · Support(Y) )
Example support values: Diapers → Beer 0.8; Diapers → Candles 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
RE METHOD: ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
proc recommend recom = rs.IENS;
  /* add a recommendation system */
  add rs.IENS item = item user = user rating = rating;
  /* add tables */
  addtable LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
  /* method SVD-LBFGS with 20 factors */
  method svd /
    factors = 20
    label = "svd"
    fconv = 1e-3
    gconv = 1e-3
    maxiter = 100
    maxfeval = 5000
    function = L2
    lambda = 0.2
    technique = lbfgs;
  run;
  method arm /
    label = "ARM";
  run;
  /* information on the recommender system */
  info;
quit;
/* prediction with the SVD method */
proc recommend recom = rs.IENS;
  predict /
    method = svd
    label = "svd"
    num = 3
    users = ("Longhow Lam");
run;
quit;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)
CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem
PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING?
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
• Text mining
• Image recognition
• Sound recognition
• Strange faces
So can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
[Plot: predicted review score versus given review score]
R² linear regression = 0.5; R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
[Image: the first 100 digits of the MNIST data and their KNOWN labels in red]
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES
70/30 training/validation split; models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA: APPLY MODEL ON THE TEST SET
28,000 digits without known labels. Our best model predicted the label for these digits.
The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are predicted labels; we see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[Audio: spoken digits "1" and "2"]
MACHINE LEARNING WITH SAS
PREDICT SOMEONE'S INCOME
Predict someone's income from his/her age: collect some data (an analytical base table), plot the data, and fit a line:
Income = 152 + 1102 × Age
[Scatter plot: Income versus Age with the fitted regression line]
IS THIS MACHINE LEARNING?
MACHINE LEARNING: ADDRESSING SOME MODELING ISSUES
• The problem may not be linear: X², X³, log(X), √X, 1/X, …
• You do not have one input variable: X1, X2, X3, …, X567
• Interactions and correlations between input variables
[Illustration: analytical base table with derived inputs, e.g. age and income split by male/female]
MACHINE LEARNING: WHY IT CAN MATTER € € €
Suppose we have an untargeted direct mailing of 100,000 'letters' to randomly sampled prospects:
• Conversion rate is around 1%
• Profit per conversion: €80
• Cost per mailing: €0.70
Total ROI = 100,000 × 1% × €80 − 100,000 × €0.70 = €10,000
Now we have a targeted mailing with a machine learning predictive model that uses prospect input data and can distinguish between high / low responders. (The per-decile arithmetic behind the tables below is spelled out first.)
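Per decile of 10,000 prospects, ranked by predicted response, the profit in the tables below is computed as

\text{Profit} = 10{,}000 \times \text{conversion} \times €80 \;-\; 10{,}000 \times €0.70

e.g. a decile with 2.00% conversion yields 10,000 × 2.00% × €80 − €7,000 = €9,000.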
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N   Conversion   Profit   Cumulative
1       10000      2.00%       9000         9000
2       10000      1.50%       5000        14000
3       10000      1.00%       1000        15000
4       10000      1.00%       1000        16000
5       10000      1.00%       1000        17000
6       10000      1.00%       1000        18000
7       10000      1.00%       1000        19000
8       10000      0.80%       -600        18400
9       10000      0.50%      -3000        15400
10      10000      0.20%      -5400        10000

The profit by using a model to send letters only to the first 7 deciles is now €19,000 (instead of €10,000). If you have 100 of such campaigns a year, that means an increase of €0.9 mln.
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N   Conversion   Profit   Cumulative
1       10000      3.00%      17000        17000
2       10000      2.00%       9000        26000
3       10000      1.40%       4200        30200
4       10000      1.15%       2200        32400
5       10000      1.00%       1000        33400
6       10000      0.60%      -2200        31200
7       10000      0.40%      -3800        27400
8       10000      0.30%      -4600        22800
9       10000      0.10%      -6200        16600
10      10000      0.05%      -6600        10000

The profit by using a much better model to send letters only to the first 5 deciles is now €33,400 (instead of €10,000). If you have 100 of such campaigns a year, that means an increase of €2.34 mln.
MACHINE LEARNING: WHY IT CAN MATTER € € €

Now let's suppose we have an even slightly better model than the last one:

Decile      N   Conversion   Profit   Cumulative
1       10000      3.35%      19800        19800
2       10000      2.23%      10840        30640
3       10000      1.30%       3400        34040
4       10000      1.10%       1800        35840
5       10000      1.00%       1000        36840
6       10000      0.55%      -2600        34240
7       10000      0.28%      -4760        29480
8       10000      0.25%      -5000        24480
9       10000      0.05%      -6600        17880
10      10000      0.02%      -6840        11040

Sending letters only to the first 5 deciles now yields €36,840. If you have 100 of such campaigns a year, that means an increase of €2.68 mln.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines
K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
ldquoCLASSICALrdquo REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LINEAR amp LOGISTIC REGRESSION
Income = a + b times Age
Age
Income
Age
P(Churn)1
0
P(Churn) =
Numeric target variable Binairy target variable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables
Y logit(y)
X
Smoothing Splines
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines Piecewise polynomials that are glued together at knots
Two special cases for λ
λ = 0 Any function that interpolates the data
λ = infin Simple Least square line fit
Choose λ by cross validation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R² Linear regression = 0.5, R² Neural net = 0.6
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

Set-up: 70/30 training/validation split. Models tried (the winning fit is sketched below the list):
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets (3, 6, 12, 24, 48, 100, 200 neurons)
• Seven multi-layer neural nets
• Three random forests (100, 500 and 1000 trees)
• 8, 16 and 24 nearest neighbours
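The deck does not show the fitting code. As an illustration only, a minimal SAS sketch of the winning 8-nearest-neighbour model using PROC DISCRIM, assuming the split data sets are named mnist_train and mnist_valid and hold pixel1-pixel784 as inputs and label as the known digit (all data set and variable names are assumptions):

proc discrim data = mnist_train
             test = mnist_valid
             testout = knn_pred     /* scored validation set (assumed output name) */
             method = npar k = 8;   /* non-parametric classification: 8 nearest neighbours */
   class label;                     /* the known digit, 0-9 */
   var pixel1-pixel784;             /* the 28 x 28 = 784 pixel inputs */
run;

The validation misclassification rate then follows from comparing label with the predicted class in knn_pred.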
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels.

Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here.

Red numbers are the predicted labels. We see some obvious mistakes…
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IDENTIFY FORMULATE
PROBLEM
DATAPREPARATION
DATAEXPLORATION
TRANSFORMamp SELECT
BUILDMODEL
VALIDATEMODEL
DEPLOYMODEL
EVALUATE MONITORRESULTS
SAS In-Database ScoringSAS Decision Manager
BUSINESSMANAGER
SAS Model Manager
IT SYSTEMS MANAGEMENT
SAS Enterprise Guide
BUSINESSANALYST
Enterprise Miner Text MinerSAS IMSTAT Recommender
DATA MINER DATA SCIENTIST
THE ANALYTICS LIFECYCLE
SAS Visual AnalyticsSAS Visual Statistics
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES
PROC hpbnet data = creditdata structure = markovblanket model default = x1 LTV income age selction = YRUN
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING
Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments
HIGH PERFORMANCE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Manage Rules + Data + Models
Deployment flexibility BatchReal TimeStored ProcessIn Database
Drive Reuse and Consistency
EASY DEPLOYABLE
Model
Data
Rules
Model
MACHINE LEARNING WITH SAS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICT SOMEONErsquoS INCOME
Income = 152 + 1102 times Age
Age
Income
Predict someones income from hisher age
Collect some data
Plot the data
Analytical Base Table
IS THIS MACHINE LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING ADDRESSING SOME MODELING ISSUES
The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip
You do not have one input variable X1 X2 X3helliphellipX567
Interactions en correlations between input variables
age
income
male
female
Analytical base table Derived inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects
Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000
Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 200 9000 9000
2 10000 150 5000 14000
3 10000 100 1000 15000
4 10000 100 1000 16000
5 10000 100 1000 17000
6 10000 100 1000 18000
7 10000 100 1000 19000
8 10000 080 -600 18400
9 10000 050 -3000 15400
10 10000 020 -5400 10000
The profit by using a model to sent letters only to the first 7 deciles is now
euro 19000 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 09 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 300 17000 17000
2 10000 200 9000 26000
3 10000 140 4200 30200
4 10000 115 2200 32400
5 10000 100 1000 33400
6 10000 060 -2200 31200
7 10000 040 -3800 27400
8 10000 030 -4600 22800
9 10000 010 -6200 16600
10 10000 005 -6600 10000
The profit by using a much better model to sent letters only to the first 5 deciles is now
euro 33400 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 234 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 335 19800 19800
2 10000 223 10840 30640
3 10000 130 3400 34040
4 10000 110 1800 35840
5 10000 100 1000 36840
6 10000 055 -2600 34240
7 10000 028 -4760 29480
8 10000 025 -5000 24480
9 10000 005 -6600 17880
10 10000 002 -6840 11040
Now lets suppose we have even a slightly better model than the last one
euro 36840
If you have 100 of such campaigns a year that means an increase of
euro 268 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines
K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
ldquoCLASSICALrdquo REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LINEAR amp LOGISTIC REGRESSION
Income = a + b times Age
Age
Income
Age
P(Churn)1
0
P(Churn) =
Numeric target variable Binairy target variable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables
Y logit(y)
X
Smoothing Splines
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines Piecewise polynomials that are glued together at knots
Two special cases for λ
λ = 0 Any function that interpolates the data
λ = infin Simple Least square line fit
Choose λ by cross validation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data=autoencoderTraining dmdbcat=work.autoencoderTrainingCat;
   performance compile details cpucount=12 threads=yes;
   /* DEFAULTS: ACT=TANH, COMBINE=LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden=5;
   hidden 300 / id=h1;
   hidden 100 / id=h2;
   hidden 2   / id=h3 act=linear;
   hidden 100 / id=h4;
   hidden 300 / id=h5;
   input corruptedPixel1 - corruptedPixel400 / id=i level=int std=std;
   target pixel1 - pixel400 / act=identity id=t level=int std=std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random=123;
   prelim 10 preiter=10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS – ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
TEXT MINING
TEXT MINING BASICS

“Advanced” word counting.

Parse & Filter:
• Part of speech
• Entity detection
• Mixed / numeric / abbrev.
• Stemming
• Spell checks
• Stop list
• Synonym list
• Multi-term words

Apply traditional data mining:
• Clustering
• Prediction / machine learning
TEXT MINING BASICS

Document 1: “Ik loop over straat in Amsterdam 1057DK met mijn fiets”
Document 2: “Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw”
Document 3: “Mijn tweewieler is kapot wat een slecht stuk ijzer $%$&”

TERM DOCUMENT MATRIX A (Dutch example documents)

Terms                      Doc 1   Doc 2   Doc 3
+Fiets (znmw)                1       1       1
Fietsen (ww)                 0       1       0
Blauwe (bvg)                 0       1       0
Amsterdam (locatie)          1       0       0
+Lopen (ww)                  1       1       0
Straat (znmw)                1       0       0
Kapot (bijw)                 0       0       1
Slecht                       0       0       1
Stuk Ijzer                   0       0       1
1057DK (postcode)            1       0       0
bitlycomsdrtw (Internet)     0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse
⇒ Apply singular value decomposition first.
TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal with r singular values [r could be many thousands].

Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.
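As a minimal sketch of this truncation in SAS/IML (the tiny 3-by-4 term-document matrix is hypothetical; CALL SVD returns the singular values in the vector q):

proc iml;
   A = {1 1 1 0,            /* rows = terms, columns = documents */
        0 1 0 1,
        1 0 0 1};
   call svd(U, q, V, A);    /* A = U*diag(q)*V` */
   k = 2;                   /* keep only k << r singular values */
   Ak = U[, 1:k] * diag(q[1:k]) * V[, 1:k]`;
   print q Ak;
quit;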
TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn / fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items → ~0.01% filled.

User - Item Matrix – Data
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:                -      -      1      2      5
After some math… the predictions for User 4 are:  3.21   4.82   1   2   5
⇒ Recommend item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: Slope one (slope1), K nearest neighbors (knn)
• Model-based algorithms: Matrix factorization (SVD - LBFGS)
• Market basket analysis: Association rules mining (arm)
• Mixture of different methods: Clustering (cluster), Ensemble
RE METHODS: SLOPE ONE

Item-item based: y = x + b, a regression with slope equal to 1 (see notes).

Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings “in the neighborhood”.

How to determine the neighbors, and how many (k) to use?
How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments
RE METHODS: PEARSON CORRELATION

a, b : users
r_{a,p} : rating of user a for item p
P : set of items rated both by a and b
• Possible similarity values between −1 and 1

sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )
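A minimal sketch of this computation (not from the deck): the Pearson similarity between two users is just the Pearson correlation of their ratings over the items both rated; the dataset and variable names are hypothetical:

data pairRatings;                 /* one row per item rated by both users */
   input item $ rating_a rating_b;
   datalines;
A 5 3
B 3 4
C 2 2
;
run;

proc corr data=pairRatings pearson;
   var rating_a rating_b;        /* sim(a,b) = the Pearson correlation */
run;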
RE METHODS: K NEAREST NEIGHBORS METHOD
RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the (m × n) user-item matrix R as R ≈ U V, with U (m × k) and V (k × n); k is the number of hidden factors.

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS / ALS

Predict a new rating: R̂_ij = U_i^T V_j

Minimize the prediction error:
min_{U,V} Σ_{i,j} (R_ij − U_i^T V_j)² + λ( ‖U_i‖² + ‖V_j‖² )
RE METHODS: CLUSTER

Knn within one subgroup:
[Diagram: user/item profiles → clustering → user/item ratings within a cluster → predictions]
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc. rules mining: identify frequent itemsets (rules) in the transaction data:
IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X,Y) = (# trxs with X and Y) / (total # trxs)
Lift(X → Y) = Support(X,Y) / ( Support(X) · Support(Y) )

Support & Lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
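The support/lift arithmetic is simple enough to verify in a small DATA step; the transaction counts below are hypothetical:

data ruleStats;
   nTotal = 1000;                 /* total number of transactions */
   nX  = 200;                     /* transactions containing X    */
   nY  = 250;                     /* transactions containing Y    */
   nXY = 120;                     /* transactions containing both */
   supportX  = nX  / nTotal;
   supportY  = nY  / nTotal;
   supportXY = nXY / nTotal;
   lift = supportXY / (supportX * supportY);   /* 0.12 / 0.05 = 2.4 */
   put lift=;
run;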
METHOD: ENSEMBLE

Linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3 maxiter = 100
      MAXFEVAL = 5000 function = L2 lambda = 0.2 technique = lbfgs;
   RUN;
   METHOD ARM / label = "ARM";
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
run;
QUIT;
LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions often “automatically” taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent “for free”)
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces…
So, can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews, and transform reviews to data points in SVD space.
Predicted review score vs. given review score.

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R² linear regression = 0.5; R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA

42,000 pictures of hand-written digits; each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

• 70/30 training/validation split
• PCA regression on the 50 largest PCs
• Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.
The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are predicted labels; we see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE [recordings of the spoken digits 1 and 2]
MACHINE LEARNING WITH SAS
PREDICT SOMEONE'S INCOME

Predict someone's income from his/her age: collect some data (Analytical Base Table), plot the data, and fit a line.

Income = 152 + 1102 × Age

[Scatter plot: Income vs. Age with fitted line]

IS THIS MACHINE LEARNING?
MACHINE LEARNING: ADDRESSING SOME MODELING ISSUES

• The problem may not be linear: X², X³, log(X), sqrt(X), 1/X, ……
• You do not have one input variable: X1, X2, X3, ……, X567
• Interactions and correlations between input variables

[Example: analytical base table with age, income, gender (male/female) and derived inputs]
MACHINE LEARNING: WHY IT CAN MATTER € € €

Suppose we have an untargeted direct mailing of 100,000 'letters' to randomly sampled prospects:
• Conversion rate is around 1%
• Profit per conversion: €80
• Cost per mailing: €0.70
Total ROI = 100,000 × 1% × €80 − 100,000 × €0.70 = €10,000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data and can distinguish between high / low responders.
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N        Conversion    Profit     Cumulative
1         10,000       2.00%       €9,000       €9,000
2         10,000       1.50%       €5,000      €14,000
3         10,000       1.00%       €1,000      €15,000
4         10,000       1.00%       €1,000      €16,000
5         10,000       1.00%       €1,000      €17,000
6         10,000       1.00%       €1,000      €18,000
7         10,000       1.00%       €1,000      €19,000
8         10,000       0.80%        −€600      €18,400
9         10,000       0.50%      −€3,000      €15,400
10        10,000       0.20%      −€5,400      €10,000

The profit by using a model to send letters only to the first 7 deciles is now €19,000 (instead of €10,000). If you have 100 of such campaigns a year, that means an increase of €0.9 mln.
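A minimal sketch reproducing the decile-profit arithmetic above (€80 profit per conversion, €0.70 cost per letter, 10,000 letters per decile); the dataset name is hypothetical:

data decileProfit;
   input decile conversion;           /* conversion as a fraction */
   n = 10000;
   profit = n*conversion*80 - n*0.70; /* e.g. decile 1: 16,000 - 7,000 = 9,000 */
   cumulative + profit;               /* running total over the deciles */
   datalines;
1 0.0200
2 0.0150
3 0.0100
4 0.0100
5 0.0100
6 0.0100
7 0.0100
8 0.0080
9 0.0050
10 0.0020
;
run;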
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N        Conversion    Profit     Cumulative
1         10,000       3.00%      €17,000      €17,000
2         10,000       2.00%       €9,000      €26,000
3         10,000       1.40%       €4,200      €30,200
4         10,000       1.15%       €2,200      €32,400
5         10,000       1.00%       €1,000      €33,400
6         10,000       0.60%      −€2,200      €31,200
7         10,000       0.40%      −€3,800      €27,400
8         10,000       0.30%      −€4,600      €22,800
9         10,000       0.10%      −€6,200      €16,600
10        10,000       0.05%      −€6,600      €10,000

The profit by using a much better model to send letters only to the first 5 deciles is now €33,400 (instead of €10,000). If you have 100 of such campaigns a year, that means an increase of €2.34 mln.
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N        Conversion    Profit     Cumulative
1         10,000       3.35%      €19,800      €19,800
2         10,000       2.23%      €10,840      €30,640
3         10,000       1.30%       €3,400      €34,040
4         10,000       1.10%       €1,800      €35,840
5         10,000       1.00%       €1,000      €36,840
6         10,000       0.55%      −€2,600      €34,240
7         10,000       0.28%      −€4,760      €29,480
8         10,000       0.25%      −€5,000      €24,480
9         10,000       0.05%      −€6,600      €17,880
10        10,000       0.02%      −€6,840      €11,040

Now let's suppose we have an even slightly better model than the last one: €36,840. If you have 100 of such campaigns a year, that means an increase of €2.68 mln.
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

• Classical regression
• Decision trees
• Dimension reduction
• Bagging & boosting
• Support vector machines
• K-nearest neighbour
• Neural networks / deep learning
• Bayesian networks
• Text mining
• Recommendation engines
“CLASSICAL” REGRESSION
LINEAR & LOGISTIC REGRESSION

Numeric target variable: Income = a + b × Age

Binary target variable: P(Churn) = 1 / (1 + exp(−(a + b × Age))), a value between 0 and 1

[Plots: linear fit of Income vs. Age; logistic curve of P(Churn) vs. Age]
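A minimal sketch of both fits, assuming a hypothetical dataset WORK.CUSTOMERS with variables income, age and a 0/1 churn flag:

proc reg data=customers;
   model income = age;           /* linear regression: income = a + b*age */
run;

proc logistic data=customers descending;
   model churn = age;            /* logistic regression: P(churn = 1)    */
run;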
SPLINE REGRESSION: MODELING NON LINEARITIES

Often there is a non-linear relation:
• Transformation of inputs: X², X³, log(X), etc…
• Buckets / binning of variables
• Smoothing splines

[Plot: Y / logit(Y) versus X]
SPLINE REGRESSION: MODELING NON LINEARITIES

Smoothing splines: piecewise polynomials that are glued together at knots, minimizing the standard smoothing-spline criterion Σᵢ (yᵢ − f(xᵢ))² + λ ∫ f″(t)² dt.

Two special cases for λ:
• λ = 0: any function that interpolates the data
• λ = ∞: simple least squares line fit

Choose λ by cross validation.
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION

Extracted data from a car sales site: for many cars we have the kilometres driven and the car price. For the Opel Astra we have 2,360 cars. What is the relation between km driven and car sales price?

[Plots: too much smoothing and too little smoothing]
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION

0.2 is the optimal smoothing parameter.
Some other car makes/models with spline estimates of car depreciation versus kilometres driven. Hmmm, my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left…
SPLINE REGRESSION: MODELING NON LINEARITIES

In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines. ADAPTIVEREG:
• supports more than one input
• linear, logistic, Poisson and GLM regressions
• combines regression splines and model selection methods
• supports partitioning of data into training, validation and testing roles
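A minimal ADAPTIVEREG sketch for the car example, assuming a hypothetical dataset WORK.ASTRA with variables price and km:

proc adaptivereg data=astra;
   model price = km;                      /* adaptive regression splines */
   partition fraction(validate=0.3);      /* 70/30 training/validation split */
run;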
DECISION TREES
DECISION TREES

How does it work? A simple example. Suppose we have the following group of people: 50% response, 50% no response. We have/know Age and Marital Status.

50/50
├─ Age ≤ 45 → 30/70
│   ├─ Married/Divorced → 20/80
│   └─ Unmarried → 60/40
└─ Age > 45 → 60/40
DECISION TREES: REGRESSION & CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
… … … … … …
… … … … … …
Y 46 A 657 21 X
A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split
4. On the two new data sets, apply 1, 2, 3 again…
5. Stop somewhere…

• How to split? X1 or X2?
• When to stop?
DECISION TREES: REGRESSION & CLASSIFICATION

How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.

Why is the split X1 < t1 better than X1 < s1?
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared

Regression tree: mean square error.
[Plots: Y versus x under split s1 and split t1]
Classification tree: misclassification rate.
[Plots: class membership versus x under split s1 and split t1]
DECISION TREES (REGRESSION & CLASSIFICATION)

When to stop? Not too early, not too late.
Pruning: remove parts of the tree.
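A minimal sketch of growing and pruning a tree in SAS, assuming a hypothetical dataset WORK.TRAIN with a categorical target and inputs x1-x5:

proc hpsplit data=train;
   class target x2 x5;               /* categorical target and inputs */
   model target = x1-x5;
   grow entropy;                     /* splitting criterion */
   prune costcomplexity;             /* pruning method */
run;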
DECISION TREES: SOME COMMON TYPES

• CHAID (chi-squared automatic interaction detection)
• C4.5, C5.0
• CART (Classification And Regression Trees)

The difference is mainly in the different splitting options.
DECISION TREES: PROS AND CONS

Pros: interaction between variables; interpretable rules; missing values easy to incorporate.
Cons: unstable; “lack of smoothness”; fit of obvious (non)linear relations.

[Example trees: response rate split by gender, income < 45K, age < 33; Opel Astra price data]
DIMENSION REDUCTION
PRINCIPAL COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data. The transformation W is such that:
• The largest variance is in the first coordinate
• The second largest variance is in the second coordinate
• Etc…
PRINCIPAL COMPONENTS ANALYSIS

[Scatter plot: data points in (X1, X2) with principal directions P1 and P2]
PRINCIPAL COMPONENTS ANALYSIS

[Plot: the same data in the rotated coordinates P1, P2]
PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND IT

With two dimensions, P = X W:

[ p11  p21 ]   [ x11  x21 ]
[  …    …  ] = [  …    …  ] [ w11  w21 ]
[ p1n  p2n ]   [ x1n  x2n ] [ w12  w22 ]

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general, it turns out that the columns of W are the eigenvectors of the matrix X^T X.
PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use PCs instead of the original inputs
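A minimal sketch, assuming a hypothetical dataset WORK.ABT with inputs x1-x100:

proc princomp data=abt out=scores std;   /* std standardizes the scores */
   var x1-x100;                          /* inputs are scaled internally */
run;
/* WORK.SCORES now contains the component scores Prin1, Prin2, ...;
   keep e.g. only Prin1-Prin2 for visualisation or PCA regression. */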
PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = X W. Now only take the first L columns of W: P_L = X W_L.

For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.

P   = X W:    (10,000 by 100) = (10,000 by 100) (100 by 100)
P_L = X W_L:  (10,000 by 2)   = (10,000 by 100) (100 by 2)
SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal with r singular values [r could be a large number].
SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal with r singular values [r could be a large number].

Take only k << r singular values: A_k = U_k Σ_k V_k^T.

A data point d can now be represented by a k-dimensional point.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

Original: 2448 × 3264 ≈ 8 mln numbers.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

SVD with the 15 largest SVs: ≈ 1% of the data.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

SVD with the 75 largest SVs: ≈ 5% of the data.
VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.

All inputs: X1, X2, X3, …, X500
Cluster 1: X1, X21, X35, X430, … → use X35
Cluster 2: X17, X29, X353, X490, … → use X29
Cluster 3: X37, X95, X251, X393, … → use X251
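A minimal sketch, assuming a hypothetical dataset WORK.ABT with inputs x1-x500:

proc varclus data=abt maxclusters=10 short;
   var x1-x500;                /* groups correlated inputs into clusters */
run;
/* Pick one representative variable per cluster, e.g. the one with the
   lowest "1 - R**2 ratio" in the cluster summary. */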
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough: let multiple models vote for a prediction.

Bootstrap Aggregation (Bagging): draw random samples from the data, fit a model on each sample, and combine the models into a final model.

This only makes sense if the underlying models are different enough and have some predictive power.
BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
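A minimal sketch, assuming PROC HPFOREST and a hypothetical dataset WORK.TRAIN with a binary target y and interval inputs x1-x10:

proc hpforest data=train maxtrees=100 vars_to_try=3;  /* m = 3 inputs per split */
   target y / level=binary;
   input x1-x10 / level=interval;
run;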
FOREST VS TREE: EXAMPLE ON SIMULATED DATA

Decision tree and random forest (100 sub-trees) fitted on the simulated data.
FOREST VS TREE: EXAMPLE ON SIMULATED DATA

It is clear to see that the forest produces much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting, M iterations m = 1, 2, …, M.

At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals r_m, using inputs x, to “correct” the previous learner:

F_m = F_{m-1} + γ·h_m

[Diagram: inputs x and pseudo-residuals r1, r2, …, rM at each step, ending in the final model F_M]
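A minimal sketch, assuming PROC TREEBOOST (the gradient boosting procedure behind the Enterprise Miner Gradient Boosting node) and a hypothetical dataset WORK.TRAIN; option names may vary by release:

proc treeboost data=train iterations=100 shrinkage=0.1;  /* M = 100, gamma = 0.1 */
   target y / level=binary;
   input x1-x10 / level=interval;
run;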
SUPPORT VECTOR MACHINES
SUPPORT VECTOR MACHINES (SVM)

Suppose we have a separable classification problem: find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: “the kernel trick”.
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification (the standard formulation):
min ‖β‖ subject to y_i (x_i^T β + β_0) ≥ 1 for all i

Non-separable classification: allow slack variables ξ_i ≥ 0,
y_i (x_i^T β + β_0) ≥ 1 − ξ_i, with Σ ξ_i ≤ C

Non-separable classification rewritten using the Lagrange dual problem; kernels K(x, x′) to model nonlinear behaviour.
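As a sketch only, assuming PROC HPSVM (SAS Enterprise Miner) and a hypothetical dataset WORK.TRAIN; statements are kept minimal and option names should be checked against your release:

proc hpsvm data=train;
   input x1 x2 / level=interval;
   target y;                  /* binary target */
   kernel linear;             /* swap for a polynomial/RBF kernel for nonlinear boundaries */
run;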
https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.
K – NEAREST NEIGHBOUR
K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

[Example: the 5 nearest neighbours of x0 — 3 of them are red, 2 of them are green, so we predict x0 to be red.]
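A minimal sketch of k-NN scoring in SAS via PROC DISCRIM, assuming hypothetical datasets WORK.TRAIN (labelled points) and WORK.SCORE (query points):

proc discrim data=train test=score testout=predicted
             method=npar k=5;      /* nonparametric: 5 nearest neighbours */
   class colour;                   /* target label, e.g. red/green */
   var x1 x2;                      /* coordinates */
run;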
K-NN METHOD

[Decision boundaries: 1-nearest neighbour vs. 15-nearest neighbours]
K-NN METHOD

Use different numbers k of nearest neighbours; compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES

Extracted house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING
Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments
HIGH PERFORMANCE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Manage Rules + Data + Models
Deployment flexibility BatchReal TimeStored ProcessIn Database
Drive Reuse and Consistency
EASY DEPLOYABLE
Model
Data
Rules
Model
MACHINE LEARNING WITH SAS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICT SOMEONErsquoS INCOME
Income = 152 + 1102 times Age
Age
Income
Predict someones income from hisher age
Collect some data
Plot the data
Analytical Base Table
IS THIS MACHINE LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING ADDRESSING SOME MODELING ISSUES
The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip
You do not have one input variable X1 X2 X3helliphellipX567
Interactions en correlations between input variables
age
income
male
female
Analytical base table Derived inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects
Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000
Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 200 9000 9000
2 10000 150 5000 14000
3 10000 100 1000 15000
4 10000 100 1000 16000
5 10000 100 1000 17000
6 10000 100 1000 18000
7 10000 100 1000 19000
8 10000 080 -600 18400
9 10000 050 -3000 15400
10 10000 020 -5400 10000
The profit by using a model to sent letters only to the first 7 deciles is now
euro 19000 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 09 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 300 17000 17000
2 10000 200 9000 26000
3 10000 140 4200 30200
4 10000 115 2200 32400
5 10000 100 1000 33400
6 10000 060 -2200 31200
7 10000 040 -3800 27400
8 10000 030 -4600 22800
9 10000 010 -6200 16600
10 10000 005 -6600 10000
The profit by using a much better model to sent letters only to the first 5 deciles is now
euro 33400 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 234 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 335 19800 19800
2 10000 223 10840 30640
3 10000 130 3400 34040
4 10000 110 1800 35840
5 10000 100 1000 36840
6 10000 055 -2600 34240
7 10000 028 -4760 29480
8 10000 025 -5000 24480
9 10000 005 -6600 17880
10 10000 002 -6840 11040
Now lets suppose we have even a slightly better model than the last one
euro 36840
If you have 100 of such campaigns a year that means an increase of
euro 268 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines
K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
ldquoCLASSICALrdquo REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LINEAR amp LOGISTIC REGRESSION
Income = a + b times Age
Age
Income
Age
P(Churn)1
0
P(Churn) =
Numeric target variable Binairy target variable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables
Y logit(y)
X
Smoothing Splines
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines Piecewise polynomials that are glued together at knots
Two special cases for λ
λ = 0 Any function that interpolates the data
λ = infin Simple Least square line fit
Choose λ by cross validation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, …, M.

At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo residuals, using the inputs x, to "correct" the previous learner:

F_m = F_(m-1) + γ·h_m

[Diagram: inputs x with pseudo residuals r_1, r_2, …, r_M at each step; final model F_M]
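A minimal sketch with SAS Enterprise Miner's gradient boosting procedure (dataset, variables and option values are illustrative; the shrinkage option plays the role of γ):

proc treeboost data=work.train iterations=100 shrinkage=0.1 maxdepth=3;
   /* each iteration fits a small tree on the pseudo residuals */
   target default / level=binary;
   input x1-x50 / level=interval;
run;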
SUPPORT VECTOR MACHINES

Support vector machines (SVM): suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If the problem is not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
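The formula images on this slide did not survive extraction; for reference, a standard statement of these problems (notation after Hastie, Tibshirani & Friedman, not taken from the original slide):

Separable:
\max_{\beta,\beta_0,\ \lVert\beta\rVert=1} M \quad \text{s.t.}\quad y_i(x_i^T\beta+\beta_0) \ge M,\ i=1,\dots,N

Non-separable:
\min_{\beta,\beta_0} \lVert\beta\rVert \quad \text{s.t.}\quad y_i(x_i^T\beta+\beta_0) \ge 1-\xi_i,\ \ \xi_i\ge 0,\ \ \sum_i \xi_i \le C

Lagrange dual:
\max_{\alpha} \sum_i \alpha_i - \tfrac12 \sum_{i,j} \alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle

where the inner product \langle x_i, x_j\rangle can be replaced by a kernel K(x_i, x_j) to model non-linear behaviour.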
https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.
K - NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

5 nearest neighbours of x0: 3 of them are red, 2 of them are green, so we predict x0 to be red.
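A minimal sketch of k-NN classification in SAS with PROC DISCRIM; the datasets work.train and work.query are hypothetical. METHOD=NPAR with K= requests the nonparametric k-nearest-neighbour rule.

proc discrim data=work.train test=work.query testout=work.pred
             method=npar k=5;
   class target;          /* the class predicted by majority vote */
   var x1 x2;             /* coordinates used for the distance */
run;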
K-NN METHOD

1 nearest neighbour vs 15 nearest neighbours.

Using different numbers k of nearest neighbours: test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.

Comparing different nearest neighbours in SAS Enterprise Miner.

30% of the data was used as a validation set. In Enterprise Miner different values for k were used; k=5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1·1 + w2·X2 + w3·X3 + w4·X4

[Diagram: inputs 1, X2, X3, X4 with weights w1-w4 feeding a single compute node f]

Neural network compute node: f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formulas, the prediction formula for a NN is given by the composition shown below.

[Diagram: inputs X1 (age), X2 (income), X3 (region), X4 (gender) → hidden layer Z1, Z2, Z3 → output Y; weights α on the first layer, β on the second]

The functions g and σ are defined below, including the case of a binary classifier. The model weights α and β have to be estimated from the data.
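The formula images were lost in extraction; a standard single-hidden-layer formulation consistent with the slide's α/β/g/σ notation (after Hastie, Tibshirani & Friedman) is:

Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1,\dots,M
Y = g(\beta_0 + \beta^T Z), \qquad \sigma(v) = \frac{1}{1+e^{-v}}

For a binary classifier, g is taken to be the logistic function as well, so the output is a probability.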
NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back propagation algorithm:

Randomly choose small values for all wi's. For each data point (observation):

1 Calculate the neural net prediction
2 Calculate the error E (for example E = (actual - prediction)²)
3 Adjust the weights w according to the update rule below
4 Stop if the error E is small enough
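The update-rule image in step 3 was lost; the standard gradient-descent step it refers to, with an assumed learning rate η, is:

w_{new} = w_{old} - \eta\, \frac{\partial E}{\partial w}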
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: inputs X1-X4 → ENCODE → 2-dimensional middle layer → DECODE → outputs X1-X4]

A 2-dimensional middle layer can be used for visualisation. A linear activation function corresponds with 2-dimensional principal components analysis.
NEURAL NETS: AUTOENCODERS

Often more hidden layers with many nodes are used.

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]
NEURAL NET: CARS EXAMPLE

2-dimensional PCA vs an autoencoder network 25 - 15 - 2 - 15 - 25.
NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH; COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS

BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
TEXT MINING

TEXT MINING BASICS

"Advanced" word counting:

• Parse & filter: part of speech, entity detection, mixed/numeric/abbreviations
• Stemming, spell checks, stop list, synonym list, multi-term words

Then apply traditional data mining: clustering, prediction, machine learning.
TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets."
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw."
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$."

TERM DOCUMENT MATRIX A

Terms                        Doc 1   Doc 2   Doc 3
+Fiets (znmw)                  1       1       1
Fietsen (ww)                   0       1       0
Blauwe (bvg)                   0       1       0
Amsterdam (locatie)            1       0       0
+Lopen (ww)                    1       1       0
Straat (znmw)                  1       0       0
Kapot (bijw)                   0       0       1
Slecht                         0       0       1
Stuk Ijzer                     0       0       1
1057DK (postcode)              1       0       0
bitlycomsdrtw (Internet)       0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:

• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition:

A = U Σ V^T

with Σ diagonal, holding the r singular values [r could be many thousands].

Take only the first k << r singular values:

A_k = U_k Σ_k V_k^T

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.
TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Diagram: Topic 1, Topic 2, Topic 3]
RECOMMENDATION ENGINE: WHICH PRODUCT SHOULD I RECOMMEND TO MY CUSTOMERS?

RECOMMENDATION ENGINE: USER - ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives ~0.01% filled.

User - Item Matrix - Data

           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:         -     -     1     2     5
After some math… predictions:  3.21  4.82  1     2     5

Recommend item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE

Item-item based: y = x + b, with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j.

Sample rating database:

Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
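A worked slope-one prediction under the column layout assumed above (the calculation is illustrative, not from the slide). Average differences: dev(A,B) = ((5-3) + (3-4)) / 2 = 0.5, over the 2 users who rated both A and B; dev(A,C) = (5-2) / 1 = 3. Lucy's predicted rating for item A is then the weighted combination:

( (0.5 + 2) × 2 + (3 + 5) × 1 ) / (2 + 1) = 13/3 ≈ 4.33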
RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: similarity w, neighbors N]
RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between -1 and 1.

sim(a,b) = \frac{\sum_{p\in P}(r_{a,p}-\bar r_a)(r_{b,p}-\bar r_b)}{\sqrt{\sum_{p\in P}(r_{a,p}-\bar r_a)^2}\,\sqrt{\sum_{p\in P}(r_{b,p}-\bar r_b)^2}}
RE METHODS: K NEAREST NEIGHBORS METHOD [illustration]
RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

R (m × n, users × items)  ≈  U (m × k) × V (k × n)

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: \hat R_{ij} = U_i^T V_j

Minimize the prediction error:

\min_{U,V} \sum_{i,j} (R_{ij} - U_i^T V_j)^2 + \lambda\,(\lVert U_i\rVert^2 + \lVert V_j\rVert^2)
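A minimal SAS/IML sketch of this factorization by stochastic gradient descent on the regularized squared error (illustrative only: the ratings matrix is made up, and PROC RECOMMEND's SVD method uses L-BFGS instead):

proc iml;
   R = {5 3 . 1,
        4 . . 1,
        1 1 . 5,
        1 . . 4,
        . 1 5 4};                         /* users x items, . = missing */
   k = 2;  eta = 0.01;  lambda = 0.2;
   call randseed(123);
   U = j(nrow(R), k);  call randgen(U, "Normal", 0, 0.1);
   V = j(ncol(R), k);  call randgen(V, "Normal", 0, 0.1);
   do iter = 1 to 2000;                   /* SGD sweeps over known ratings */
      do i = 1 to nrow(R);
         do m = 1 to ncol(R);
            if R[i, m] ^= . then do;
               e = R[i, m] - U[i, ] * V[m, ]`;             /* error */
               U[i, ] = U[i, ] + eta * (e * V[m, ] - lambda * U[i, ]);
               V[m, ] = V[m, ] + eta * (e * U[i, ] - lambda * V[m, ]);
            end;
         end;
      end;
   end;
   Rhat = U * V`;                         /* filled-in rating matrix */
   print Rhat[format=6.2];
quit;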
RE METHODS: CLUSTER

First cluster the users/items on their profiles or ratings; then apply knn within one subgroup to generate the predictions.

[Diagram: user/item profile and user/item rating → clustering → knn within one subgroup → predictions]
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X,Y) = #trxs{X and Y} / #total trxs
Lift(X,Y) = Support(X,Y) / ( Support(X) × Support(Y) )

Support & lift example: Diapers → Beer 0.8; Diapers → Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
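A worked example with made-up numbers: if Support(X) = 0.10, Support(Y) = 0.10 and Support(X,Y) = 0.025, then Lift = 0.025 / (0.10 × 0.10) = 2.5, so buyers of X are 2.5 times more likely to also buy Y than an average customer.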
METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;
/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
RUN;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance & scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs given review score:
R² linear regression = 0.5
R² neural net = 0.6
IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
MNIST DATA: APPLY THE MODEL ON THE TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are predicted labels; we obviously see some mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH AN IPHONE

[Audio fragments: digits 1 and 2]
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING ADDRESSING SOME MODELING ISSUES
The problem may not be linear: you may need transformations such as X2, X3, log(X), sqrt(X), 1/X, ...
You do not have just one input variable: X1, X2, X3, ..., X567.
There are interactions and correlations between the input variables.
[Slide graphic: an analytical base table with derived inputs such as age, income, male/female]
MACHINE LEARNING: WHY IT CAN MATTER € € €
Suppose we have an untargeted direct mailing of 100,000 'letters' to randomly sampled prospects.
Conversion rate is around 1%. Profit per conversion: € 80. Cost per mailing: € 0.70.
Total ROI = 100,000 × 1% × € 80 - 100,000 × € 0.70 = € 10,000
Now suppose we have a targeted mailing with a machine learning predictive model that uses prospect input data to distinguish between high and low responders.
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile    N      Conversion   Profit   Cumulative
1         10000  2.00%        9000     9000
2         10000  1.50%        5000     14000
3         10000  1.00%        1000     15000
4         10000  1.00%        1000     16000
5         10000  1.00%        1000     17000
6         10000  1.00%        1000     18000
7         10000  1.00%        1000     19000
8         10000  0.80%        -600     18400
9         10000  0.50%        -3000    15400
10        10000  0.20%        -5400    10000

The profit by using a model to send letters only to the first 7 deciles is now
€ 19,000 (instead of € 10,000).
If you have 100 such campaigns a year, that means an increase of
€ 0.9 mln.
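The table arithmetic above can be reproduced with a few lines of SAS. A minimal sketch, assuming a hypothetical dataset DECILES with columns decile, n and conv (conversion rate in percent), and using the € 80 profit per conversion and € 0.70 cost per letter from above:

data decile_profit;
   set deciles;
   profit = n * (conv / 100) * 80 - n * 0.70;  /* conversion margin minus mailing cost */
   cumulative + profit;                        /* running total over the deciles */
run;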
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile    N      Conversion   Profit   Cumulative
1         10000  3.00%        17000    17000
2         10000  2.00%        9000     26000
3         10000  1.40%        4200     30200
4         10000  1.15%        2200     32400
5         10000  1.00%        1000     33400
6         10000  0.60%        -2200    31200
7         10000  0.40%        -3800    27400
8         10000  0.30%        -4600    22800
9         10000  0.10%        -6200    16600
10        10000  0.05%        -6600    10000

The profit by using a much better model to send letters only to the first 5 deciles is now
€ 33,400 (instead of € 10,000).
If you have 100 such campaigns a year, that means an increase of
€ 2.34 mln.
MACHINE LEARNING: WHY IT CAN MATTER € € €

Now let's suppose we have an even slightly better model than the last one.

Decile    N      Conversion   Profit   Cumulative
1         10000  3.35%        19800    19800
2         10000  2.23%        10840    30640
3         10000  1.30%        3400     34040
4         10000  1.10%        1800     35840
5         10000  1.00%        1000     36840
6         10000  0.55%        -2600    34240
7         10000  0.28%        -4760    29480
8         10000  0.25%        -5000    24480
9         10000  0.05%        -6600    17880
10        10000  0.02%        -6840    11040

The profit from mailing only the first 5 deciles is now € 36,840 (instead of € 10,000).
If you have 100 such campaigns a year, that means an increase of
€ 2.68 mln.
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression, Decision trees, Dimension reduction, Bagging & Boosting, Support vector machines, K-Nearest Neighbour, Neural networks / deep learning, Bayesian networks, Text mining, Recommendation engines
"CLASSICAL" REGRESSION
LINEAR & LOGISTIC REGRESSION

Numeric target variable: Income = a + b × Age
Binary target variable: P(Churn) = 1 / (1 + exp(-(a + b × Age)))
[Plots: income versus age with a fitted straight line; P(Churn) between 0 and 1 versus age with a fitted logistic curve]
SPLINE REGRESSION: MODELING NON-LINEARITIES

Often there is a non-linear relation:
• Transformation of inputs: X2, X3, log(X), etc...
• Buckets / binning of variables
• Smoothing splines
[Plot: Y or logit(Y) versus X with a smooth non-linear fit]
SPLINE REGRESSION: MODELING NON-LINEARITIES

Smoothing splines: piecewise polynomials that are glued together at knots.
Two special cases for the smoothing parameter λ:
λ = 0: any function that interpolates the data
λ = ∞: simple least-squares line fit
Choose λ by cross validation.
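The role of λ can be made precise with the usual penalized least-squares criterion for smoothing splines (the standard formulation, as in Hastie, Tibshirani & Friedman):

    \min_f \sum_{i=1}^{n} \{ y_i - f(x_i) \}^2 + \lambda \int f''(t)^2 \, dt

With λ = 0 the penalty vanishes and any interpolating function is optimal; with λ = ∞ only f'' = 0 survives, which gives the least-squares line.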
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION

Extracted data from a car sales site. For many cars we have the kilometres driven and the car price. For the Opel Astra we have 2,360 cars. What is the relation between km driven and car sales price?
[Plots: too much smoothing and too little smoothing]
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION

0.2 is the optimal smoothing parameter.
Some other car makes/models with spline estimates of car depreciation versus kilometres driven.
Hmmm, my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left...
SPLINE REGRESSION: MODELING NON-LINEARITIES

In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines.
ADAPTIVEREG supports more than one input; linear, logistic, Poisson and GLM regressions; combines both regression splines and model selection methods; and supports partitioning of the data into training, validation and testing roles.
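A minimal sketch of such a fit for the Opel Astra data, assuming a hypothetical dataset ASTRA with variables price and km (the smoothing parameter is chosen by generalized cross validation by default):

proc tpspline data=astra;
   model price = (km);   /* (km) marks km as the smoothing variable */
run;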
DECISION TREES
DECISION TREES

How does it work? A simple example. Suppose we have the following group of people: 50% Response, 50% No Response. We have/know Age and Marital Status.

50/50 (Response / No Response)
  Age <= 45: 30/70
    Married/Divorced: 20/80
    UnMarried: 60/40
  Age > 45: 60/40
DECISION TREES: REGRESSION & CLASSIFICATION

Target   X1   X2   X3    X4   X5
Y        12   A    456   12   X
N        21   B    456   15   X
Y        32   A    545   13   U
Y        34   C    443   11   U
N        23   A    345   17   U
N        13   B    567   12   X
N        45   A    654   19   X
...      ...  ...  ...   ...  ...
Y        46   A    657   21   X

A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split on
4. On the two new data sets, apply 1, 2, 3 again...
5. Stop somewhere...

• How to split? On X1 or X2?
• When to stop?
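A minimal sketch of such a recursive splitting fit in SAS, assuming a hypothetical dataset TRAIN with the target and inputs x1-x5 above (x2 and x5 categorical):

proc hpsplit data=train;
   class target x2 x5;              /* categorical target and inputs */
   model target = x1 x2 x3 x4 x5;
   prune costcomplexity;            /* prune the grown tree afterwards */
run;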
DECISION TREES: REGRESSION & CLASSIFICATION

How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.
Why is split X1 < t1 better than X1 < s1? Compare a splitting criterion:
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared
[Plots: a regression tree split, Y versus x, comparing split s1 and split t1 by mean squared error]
DECISION TREES: REGRESSION & CLASSIFICATION

The same comparison for a classification tree:
[Plots: a classification tree split, comparing split s1 and split t1 by misclassification rate]
DECISION TREES (REGRESSION & CLASSIFICATION)

When to stop? Not too early, not too late.
Pruning: remove parts of the tree.
DECISION TREES: SOME COMMON TYPES

• CHAID (chi-squared automatic interaction detection)
• C4.5, C5.0
• CART (Classification And Regression Trees)

The difference is mainly in the different splitting options.
DECISION TREES: PROS AND CONS

Pros: interaction between variables; interpretable rules; missing values easy to incorporate.
Cons: unstable; "lack of smoothness"; fit of obvious (non)linear relations.
[Example tree: male/female, income < 45K, age < 33, response rates; and the Opel Astra data fitted with a tree]
DIMENSION REDUCTION
PRINCIPAL COMPONENTS ANALYSIS

A linear transformation of the data to uncorrelated data.
The transformation W is such that:
• The largest variance is in the first coordinate
• The second largest variance is in the second coordinate
• Etc...
PRINCIPAL COMPONENTS ANALYSIS

[Scatter plot: a cloud of points in the (X1, X2) plane with the two principal directions P1 and P2 drawn through it]
PRINCIPAL COMPONENTS ANALYSIS

[Scatter plot: the same data after transformation, plotted in the (P1, P2) coordinates]
PRINCIPAL COMPONENTS ANALYSIS

The math behind it, with two dimensions: P = XW, i.e.

    \begin{bmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{bmatrix}
    =
    \begin{bmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{bmatrix}
    \begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{bmatrix}

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.
In general, it turns out that the columns of W are the eigenvectors of the matrix X^T X.
PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.
Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs
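A minimal sketch of PCA in SAS, assuming a hypothetical dataset INPUTS with numeric variables x1-x100 (PROC PRINCOMP works on the correlation matrix by default, so the inputs are effectively scaled):

proc princomp data=inputs out=scores n=2;
   var x1-x100;       /* keep the first 2 principal component scores in SCORES */
run;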
PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = XW. Now take only the first L columns of W: PL = X WL.
For example, for visualization use only the first 2 or 3 columns, so that PL has only 2 or 3 columns that can be visualized in scatter or contour plots.

P  = X W:    (10000 × 2 kept from 100) : (10000 × 100) = (10000 × 100)(100 × 100)
PL = X WL:   (10000 × 2) = (10000 × 100)(100 × 2)
SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ V^T, where Σ is diagonal with r singular values [r could be a large number].
SINGULAR VALUE DECOMPOSITION

Take only k << r singular values: A_k = U_k Σ_k V_k^T.
A data point d can now be represented by a k-dimensional point.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

Original image: 2448 × 3264 ~ 8 mln numbers.
SVD with the 15 largest SVs: ~ 1% of the data.
SVD with the 75 largest SVs: ~ 5% of the data.
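These percentages follow from the storage needed for a rank-k reconstruction: U_k, Σ_k and V_k^T together hold k(m + n + 1) numbers. For the 2448 × 3264 photo, k = 15 gives 15 × (2448 + 3264 + 1) ≈ 86,000 numbers ≈ 1% of the original 8 mln, and k = 75 gives ≈ 428,000 ≈ 5%.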
VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling.

X1, X2, X3, ..., X500, grouped into clusters with one representative each, e.g.:
{X1, X21, X35, X430, ...} -> X35
{X17, X29, X353, X490, ...} -> X29
{X37, X95, X251, X393, ...} -> X251
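A minimal sketch with PROC VARCLUS, assuming a hypothetical dataset INPUTS with variables x1-x500; the variable with the lowest 1-R**2 ratio in each cluster is the usual choice of representative:

proc varclus data=inputs maxclusters=10 short;
   var x1-x500;    /* divisive clustering of the 500 inputs into 10 clusters */
run;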
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging).
This only makes sense if the underlying models are different enough and have some predictive power.
[Diagram: random samples drawn from the data, a model fitted on each sample, combined into a final model]
BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Randomly choose m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree, the random forest prediction is the majority vote of all trees; in case of a regression tree, it is the average of all trees.
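A minimal sketch, assuming a hypothetical dataset TRAIN with a binary target y and interval inputs x1-x10 (VARS_TO_TRY plays the role of m << P):

proc hpforest data=train maxtrees=100 vars_to_try=3;
   target y / level=binary;
   input x1-x10 / level=interval;
run;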
FOREST VS TREE: EXAMPLE ON SIMULATED DATA

A decision tree and a random forest (100 sub-trees) fitted on the simulated data. It is clear that the forest can produce much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting runs M iterations, m = 1, 2, ..., M. At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals r_m, using the inputs x, to "correct" the previous learner. The pseudo-residuals r_im are recomputed at each step, and the final model is F_M.

F_m = F_{m-1} + γ_m · h_m
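For a differentiable loss function L, the pseudo-residuals have the usual definition

    r_{im} = - \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}

so for squared-error loss they are simply the ordinary residuals y_i - F_{m-1}(x_i).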
SUPPORT VECTOR MACHINES
Support vector machines (SVM): suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M; the green line would then be better than the blue line.
If the problem is not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x2, x3 or spline(x).
The beauty of SVM is that in the calculations of the decision boundary we do not need to use these transformations explicitly: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
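The formulas behind these four bullets are the standard ones (as in Hastie, Tibshirani & Friedman). Separable:

    \min_{\beta, \beta_0} \tfrac{1}{2} \|\beta\|^2 \quad \text{s.t. } y_i (x_i^T \beta + \beta_0) \ge 1

Non-separable, with the penalty budget governed by C:

    \min_{\beta, \beta_0} \tfrac{1}{2} \|\beta\|^2 + C \sum_i \xi_i \quad \text{s.t. } y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i, \; \xi_i \ge 0

Lagrange dual problem, in which the data only enter through inner products:

    \max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t. } 0 \le \alpha_i \le C, \; \sum_i \alpha_i y_i = 0

The kernel trick replaces \langle x_i, x_j \rangle by a kernel K(x_i, x_j), e.g. a polynomial or radial basis kernel.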
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable, but in 3D space they are.
K-NEAREST NEIGHBOUR
K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, ..., xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
[Plot: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red]
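A minimal sketch of this majority vote in SAS, assuming hypothetical datasets TRAIN (with class label colour and coordinates x1, x2) and NEW holding the query points:

proc discrim data=train test=new testout=pred
             method=npar k=5;      /* nonparametric: 5 nearest neighbours */
   class colour;
   var x1 x2;
run;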
K-NN METHOD

[Plots: decision boundaries for 1 nearest neighbour and for 15 nearest neighbours]
K-NN METHOD

Use different numbers k of nearest neighbours and compare test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES

Extracted house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = f(w1·1 + w2·X2 + w3·X3 + w4·X4)

[Diagram: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding a compute node f]

f is the so-called activation function of the compute node. This could be the logit function, but other choices are possible. There are four weights w that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION

[Network diagram: inputs X1, X2, X3, X4 (age, income, region, gender), a hidden layer Z1, Z2, Z3, and output Y, with weights α and β]

In formulas, the prediction of such a network is given by (the standard single-hidden-layer form, as in Hastie, Tibshirani & Friedman):

    Z_j = σ(α_{0j} + α_j^T X)
    Y = g(β_0 + β^T Z)

The functions σ and g are defined as: σ is the sigmoid σ(v) = 1 / (1 + e^{-v}); g is the output function, the identity for a numeric target, and in case of a binary classifier a sigmoid as well, so that Y can be read as a probability.
The model weights α and β have to be estimated from the data.
NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm: randomly choose small values for all weights w_i, then for each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual - prediction)^2)
3. Adjust the weights w in the direction that decreases E (the gradient descent step w <- w - η · ∂E/∂w, with learning rate η)
4. Stop if the error E is small enough
DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS

https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.
[Diagram: inputs X1-X4 ENCODEd into a 2-node middle layer, then DECODEd back to X1-X4]
A 2-dimensional middle layer can be used for visualisation. With a linear activation function this corresponds to 2-dimensional principal components analysis.
NEURAL NETS: AUTOENCODERS

Often there are more hidden layers with many nodes.
[Diagram: INPUT -> ENCODE layers -> DECODE layers -> OUTPUT = INPUT]
NEURAL NET: CARS EXAMPLE

2-dimensional PCA compared with an autoencoder network 25 - 15 - 2 - 15 - 25.
NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, ..., x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;
   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR            */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6   */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED         */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random = 123;
   prelim 10 preiter = 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
TEXT MINING
TEXT MINING BASICS

"Advanced" word counting:
• Parse & filter: part of speech, entity detection, mixed / numeric / abbrev., stemming, spell checks, stop list, synonym list, multi-term words
• Apply traditional data mining: clustering, prediction, machine learning
TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam, 1057DK, met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot, wat een slecht stuk ijzer $$"

TERM DOCUMENT MATRIX A
Terms                       Doc 1  Doc 2  Doc 3
+Fiets (znmw)               1      1      1
Fietsen (ww)                0      1      0
Blauwe (bvg)                0      1      0
Amsterdam (locatie)         1      0      0
+Lopen (ww)                 1      1      0
Straat (znmw)               1      0      0
Kapot (bijw)                0      0      1
Slecht                      0      0      1
Stuk Ijzer                  0      0      1
1057DK (postcode)           1      0      0
bitlycomsdrtw (Internet)    0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse
Apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal with r singular values [r could be many thousands].
Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.
A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.
TEXT MINING: APPLICATIONS

Combine structured customer data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f that predicts the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives a matrix that is ~ 0.01% filled.

User - Item Matrix - Data
         Item 1  Item 2  Item 3  Item 4  Item 5
User 1   3       2       5       4       5
User 2   -       -       -       1       1
User 3   1       -       2       5       -
User 4   -       -       1       2       5
User 5   2       1       4       2       3
User 6   2       3       -       5       1
User 7   5       1       -       3       4
User 8   -       1       -       4       1
User 9   2       3       2       4       2
User 10  -       1       3       -       1

User 4's item ratings:            -     -     1  2  5
After some math, the estimates:   3.21  4.82  1  2  5
Recommend item 2!
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE

Item-item based: y = x + b, a regression with slope equal to 1.
Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j.

Sample rating database:
Customer  Item A  Item B  Item C
John      5       3       2
Mark      3       4       -
Lucy      2       -       5
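A worked example with this table: the average difference between item B and item A over the users who rated both (John and Mark) is ((3 - 5) + (4 - 3)) / 2 = -0.5, and between item B and item C (John only) it is 3 - 2 = 1. Slope one then predicts Lucy's rating for item B as the weighted average (2 · (2 - 0.5) + 1 · (5 + 1)) / (2 + 1) = 3.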
RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood":
• How to determine the neighbors N, and how many (k) to use?
• How to compute the similarity/distance measure w?
  - Pearson's correlation coefficient
  - Cosine distance
  - Other adjustments
RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between -1 and 1:

    \mathrm{sim}(a, b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2} \; \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factorize the m × n user-item matrix R into U (m × k) and V (k × n): R ≈ UV.
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: \hat{R}_{ij} = U_i^T V_j
Minimize the prediction error:

    \min_{U,V} \sum_{i,j} (R_{ij} - U_i^T V_j)^2 + \lambda \left( \|U_i\|^2 + \|V_j\|^2 \right)
RE METHODS: CLUSTER

First cluster the users/items on their profiles and ratings; then apply knn within one subgroup to generate the predictions.
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:
IF item A and B THEN item C
IF item X THEN item Y
Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

    Support(X -> Y) = (# trxs with X and Y) / (total # trxs)
    Lift(X -> Y) = Support(X -> Y) / (Support(X) × Support(Y))

Example support & lift values: Diapers -> Beer 0.8; Diapers -> Candles 0.018.
For example, a lift of 2.5 means: people who have X are 2.5 times more likely to also buy Y than people who don't have X.
RE METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD, L-BFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      maxfeval = 5000
      function = L2
      lambda = 0.2
      technique = lbfgs;
   RUN;
   METHOD arm /
      label = "ARM";
   RUN;
   /* Information on the recommender system */
   INFO;
QUIT;
/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
run;
QUIT;
LAST SLIDE: PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar with a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.
So, can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

[Plot: predicted review score vs given review score]
R2 linear regression = 0.5
R2 neural net = 0.6
IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
[Image: the first 100 digits of the MNIST data with their KNOWN labels in red]
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
MNIST DATA: APPLY MODEL ON TEST SET
28,000 digits without known labels. Our best model predicted the label for these digits.
The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are predicted labels. We see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[Two audio clips: "1" and "2".]
MACHINE LEARNING: WHY IT CAN MATTER € € €
Suppose we have an untargeted direct mailing of 100,000 'letters' to randomly sampled prospects. Conversion rate is around 1%. Profit per conversion: €80. Cost per mailing: €0.70.
Total ROI = 100,000 × 1% × €80 − 100,000 × €0.70 = €10,000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data and can distinguish between high / low responders.
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N   Conversion   Profit   Cumulative
    1   10000       2.00%     9000         9000
    2   10000       1.50%     5000        14000
    3   10000       1.00%     1000        15000
    4   10000       1.00%     1000        16000
    5   10000       1.00%     1000        17000
    6   10000       1.00%     1000        18000
    7   10000       1.00%     1000        19000
    8   10000       0.80%     -600        18400
    9   10000       0.50%    -3000        15400
   10   10000       0.20%    -5400        10000

The profit by using a model to send letters only to the first 7 deciles is now €19,000 (instead of €10,000). If you have 100 of such campaigns a year, that means an increase of €0.9 mln.
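To make the arithmetic behind this table explicit, here is a minimal SAS data step sketch (the dataset name decile_profit and the layout are my own), assuming €80 profit per conversion and €0.70 cost per letter as above:

data decile_profit;
   input decile n conv_pct;
   profit = n * (conv_pct / 100) * 80 - n * 0.70;  /* revenue minus mailing cost */
   cumulative + profit;                            /* running total over deciles */
datalines;
1 10000 2.00
2 10000 1.50
3 10000 1.00
4 10000 1.00
5 10000 1.00
6 10000 1.00
7 10000 1.00
8 10000 0.80
9 10000 0.50
10 10000 0.20
;
run;

proc print data=decile_profit noobs;
run;

Mailing stops being profitable at the decile where the expected revenue n × conversion × €80 drops below the mailing cost n × €0.70.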
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N   Conversion   Profit   Cumulative
    1   10000       3.00%    17000        17000
    2   10000       2.00%     9000        26000
    3   10000       1.40%     4200        30200
    4   10000       1.15%     2200        32400
    5   10000       1.00%     1000        33400
    6   10000       0.60%    -2200        31200
    7   10000       0.40%    -3800        27400
    8   10000       0.30%    -4600        22800
    9   10000       0.10%    -6200        16600
   10   10000       0.05%    -6600        10000

The profit by using a much better model to send letters only to the first 5 deciles is now €33,400 (instead of €10,000). If you have 100 of such campaigns a year, that means an increase of €2.34 mln.
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N   Conversion   Profit   Cumulative
    1   10000       3.35%    19800        19800
    2   10000       2.23%    10840        30640
    3   10000       1.30%     3400        34040
    4   10000       1.10%     1800        35840
    5   10000       1.00%     1000        36840
    6   10000       0.55%    -2600        34240
    7   10000       0.28%    -4760        29480
    8   10000       0.25%    -5000        24480
    9   10000       0.05%    -6600        17880
   10   10000       0.02%    -6840        11040

Now let's suppose we have an even slightly better model than the last one. Sending letters only to the first 5 deciles now gives a profit of €36,840 (instead of €10,000). If you have 100 of such campaigns a year, that means an increase of €2.68 mln.
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
• Classical regression
• Decision trees
• Dimension reduction
• Bagging & boosting
• Support vector machines
• K-nearest neighbour
• Neural networks / deep learning
• Bayesian networks
• Text mining
• Recommendation engine
"CLASSICAL" REGRESSION
LINEAR & LOGISTIC REGRESSION

Numeric target variable (linear regression):
Income = a + b × Age

Binary target variable (logistic regression):
P(Churn) = 1 / (1 + exp(−(a + b × Age)))

[Plots: a fitted straight line of Income against Age, and a fitted S-shaped curve of P(Churn), between 0 and 1, against Age.]
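A minimal sketch of both models in SAS; the dataset and variable names (customers, income, churn, age) are made up for illustration:

proc reg data=customers;
   model income = age;        /* numeric target: linear regression */
run;

proc logistic data=customers;
   model churn(event='1') = age;   /* binary target: logistic regression */
run;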
SPLINE REGRESSION: MODELING NON-LINEARITIES
Often there is a non-linear relation. Ways to handle this:
• Transformation of inputs: X², X³, log(X), etc.
• Buckets / binning of variables
• Smoothing splines

[Plot: Y, or logit(Y), against X.]
SPLINE REGRESSION: MODELING NON-LINEARITIES
Smoothing splines: piecewise polynomials that are glued together at knots, with a smoothing parameter λ (see the criterion below). Two special cases for λ:
• λ = 0: any function that interpolates the data
• λ = ∞: simple least-squares line fit
Choose λ by cross-validation.
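For reference, the criterion that λ controls, in its standard form as in The Elements of Statistical Learning (recommended at the start of this workshop), is:

\[
\min_f \; \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 \;+\; \lambda \int f''(t)^2 \, dt
\]

With λ = 0 only the fit term matters (any interpolant wins); with λ = ∞ the roughness penalty forces f'' = 0, i.e. a straight least-squares line.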
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION
Extracted data from a car sales site. For many cars we have the kilometres driven and the car price. For the Opel Astra we have 2,360 cars: what is the relation between km driven and car sales price?
[Plots: too much smoothing and too little smoothing.]
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION
0.2 is the optimal smoothing parameter.
Some other car makes/models with spline estimates of car depreciation versus kilometres driven.
Hmmm, my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left…
SPLINE REGRESSION: MODELING NON-LINEARITIES
In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines. ADAPTIVEREG supports:
• more than one input
• linear, logistic, Poisson and GLM regressions
• combining regression splines and model selection methods
• partitioning of data into training, validation and testing roles
(A TPSPLINE sketch follows below.)
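A minimal PROC TPSPLINE sketch for the Opel Astra example above; the dataset and variable names (opel, price, km) are made up:

proc tpspline data=opel;
   model price = (km);   /* thin-plate smoothing spline of price on km */
run;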
DECISION TREES
DECISION TREES
How does it work? A simple example. Suppose we have the following group of people: 50% response / 50% no response. We have/know Age and Marital Status.

Root: 50/50
• Age ≤ 45: 30/70
   • Married/Divorced: 20/80
   • Unmarried: 60/40
• Age > 45: 60/40

(response / no-response percentages per node)
DECISION TREES: REGRESSION & CLASSIFICATION

Target   X1   X2    X3   X4   X5
  Y      12   A   4.56   12   X
  N      21   B   4.56   15   X
  Y      32   A   5.45   13   U
  Y      34   C   4.43   11   U
  N      23   A   3.45   17   U
  N      13   B   5.67   12   X
  N      45   A   6.54   19   X
  …       …   …     …     …   …
  Y      46   A   6.57   21   X

A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split on
4. On the two new data sets, apply 1, 2, 3 again…
5. Stop somewhere…

Open questions: how to split (X1 or X2?), and when to stop? (A code sketch follows below; the next slides look at both questions.)
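A minimal sketch of such a recursive splitter using PROC HPSPLIT (SAS/STAT); the dataset and variable names are made up:

proc hpsplit data=train maxdepth=4;
   class response marital;            /* categorical target and input */
   model response = age marital income;
   grow entropy;                      /* splitting criterion */
   prune costcomplexity;              /* prune the grown tree back */
run;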
DECISION TREES: REGRESSION & CLASSIFICATION
How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.
Why is the split X1 < t1 better than X1 < s1? Judge splits by:
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared

[Plot: a regression tree, comparing the mean squared error of split s1 versus split t1 on Y against x.]
DECISION TREES: REGRESSION & CLASSIFICATION
[Plot: the same comparison for a classification tree, judging split s1 versus split t1 by the misclassification rate.]
DECISION TREES (REGRESSION & CLASSIFICATION)
When to stop? Not too early, not too late.
Pruning: remove parts of the tree.
DECISION TREES: SOME COMMON TYPES
• CHAID (chi-squared automatic interaction detection)
• C4.5 / C5.0
• CART (classification and regression trees)
The difference is mainly in the different splitting options.
DECISION TREES: PROS AND CONS
Pros: interaction between variables; interpretable rules; missing values are easy to incorporate.
Cons: unstable; "lack of smoothness"; poor fit of obvious (non)linear relations.

[Examples: a small tree with splits such as man/woman, income < 45K, age < 33, showing response rates; and a tree fit to the Opel Astra price data.]
DIMENSION REDUCTION
PRINCIPAL COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data.
The transformation W is such that:
• The largest variance is in the first coordinate
• The second largest variance is in the second coordinate
• Etc.
PRINCIPAL COMPONENTS ANALYSIS
[Scatter plot of points in the (X1, X2) plane, with the two principal directions P1 and P2 drawn through the cloud.]
PRINCIPAL COMPONENTS ANALYSIS
[The same cloud plotted on the principal-component axes P1 and P2.]
PRINCIPAL COMPONENTS ANALYSIS
The math behind: P = XW.

With two dimensions:
\[
\begin{bmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{bmatrix}
=
\begin{bmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{bmatrix}
\begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{bmatrix}
\]

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general, it turns out that the columns of W are the eigenvectors of the matrix XᵀX.
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here.
Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs
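A minimal PROC PRINCOMP sketch (made-up dataset inputs with 100 interval variables); the OUT= dataset contains the component scores Prin1, Prin2, … that PCA regression or a visualisation would use:

proc princomp data=inputs out=scores std;
   var x1-x100;       /* computed on the correlation matrix, i.e. scaled inputs */
run;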
PRINCIPAL COMPONENTS: DIMENSION REDUCTION
P = XW. Now take only the first L columns of W:
P_L = X W_L
For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.

P  = X W:    (10000 × 100) = (10000 × 100) (100 × 100)
P_L = X W_L: (10000 × 2)   = (10000 × 100) (100 × 2)
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ
Σ is diagonal with r singular values [r could be a large number].
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number].
Take only k << r singular values: A_k = U_k Σ_k V_kᵀ.
A data point d can now be represented by a k-dimensional point.
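A minimal PROC IML sketch of this truncation, as used in the picture example below; the dataset name photo (rows of grayscale pixel values) is made up:

proc iml;
   use photo;
   read all var _NUM_ into A;                 /* pixel matrix */
   close photo;
   call svd(U, Q, V, A);                      /* A = U*diag(Q)*V` */
   k = 15;                                    /* keep the 15 largest singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;  /* rank-k approximation of the image */
quit;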
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 ≈ 8 mln numbers.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
SVD with the 15 largest SVs: 1% of the data.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
SVD with the 75 largest SVs: 5% of the data.
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling. (See the sketch below.)

[Illustration: from the full set X1, X2, X3, …, X500 to one representative per cluster, e.g. X35 for {X1, X21, X35, X430, …}, X29 for {X17, X29, X353, X490, …}, X251 for {X37, X95, X251, X393, …}.]
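A minimal PROC VARCLUS sketch (made-up dataset); a common follow-up is to keep, per cluster, the variable with the lowest 1 − R² ratio from the output:

proc varclus data=inputs maxclusters=10 short;
   var x1-x500;    /* oblique principal-component clustering of the inputs */
run;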
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
[Two screenshots of variable-clustering output in SAS follow.]
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough: let multiple models vote for a prediction.
Bootstrap aggregation (bagging): draw random (bootstrap) samples from the data, fit a model on each sample, and combine the fitted models into a final model.
This only makes sense if the underlying models are different enough and have some predictive power.
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree, the random forest prediction is the majority vote of all trees; in case of a regression tree, it is the average of all trees. (A code sketch follows below.)
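A minimal sketch with PROC HPFOREST (the high-performance random forest procedure); the option names MAXTREES= and VARS_TO_TRY= are as I recall them from the documentation, and the dataset and variables are made up:

proc hpforest data=train maxtrees=500 vars_to_try=10;
   target response / level=binary;    /* classification forest */
   input x1-x50 / level=interval;
run;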
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
A decision tree and a random forest (100 sub-trees) fitted on the simulated data.
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
It is easy to see that the forest can produce much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW
Gradient boosting performs M iterations, m = 1, 2, …, M. At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals r_m, using the inputs x, to "correct" the previous learner:
F_m = F_(m−1) + γ·h_m
The pseudo-residuals r_1, r_2, …, r_M are recomputed at each step; after M iterations the final model is F_M. (A code sketch follows below.)
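A minimal sketch with PROC TREEBOOST (the gradient tree boosting procedure shipped with SAS Enterprise Miner); the option names ITERATIONS= and SHRINKAGE= are as I recall them from the documentation, and the dataset and variables are made up:

proc treeboost data=train iterations=200 shrinkage=0.1;
   target response / level=binary;    /* M = 200 boosting iterations, gamma = 0.1 */
   input x1-x50 / level=interval;
run;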
SUPPORT VECTOR MACHINES
SUPPORT VECTOR MACHINES (SVM)
Suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
(The standard formulations are sketched below.)
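The formulations in their standard form, following The Elements of Statistical Learning (recommended at the start of this workshop):

\[
\begin{aligned}
\text{Separable:}\quad & \min_{\beta,\beta_0}\ \tfrac{1}{2}\lVert\beta\rVert^2
 \quad \text{s.t. } y_i(x_i^T\beta + \beta_0) \ge 1 \\
\text{Non-separable:}\quad & \min_{\beta,\beta_0}\ \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i
 \quad \text{s.t. } y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i,\ \xi_i \ge 0 \\
\text{Lagrange dual:}\quad & \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i\alpha_j\, y_i y_j \langle x_i, x_j\rangle
 \quad \text{s.t. } 0 \le \alpha_i \le C,\ \textstyle\sum_i \alpha_i y_i = 0 \\
\text{Kernel trick:}\quad & \langle x_i, x_j\rangle \to K(x_i, x_j)
\end{aligned}
\]

The dual depends on the inputs only through inner products, which is why a kernel K can replace them without ever computing the non-linear mapping.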
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable, but in 3D space they are.
K-NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: among the 5 nearest neighbours of x0, 3 of them are red and 2 of them are green, so we predict x0 to be red.
K-NN METHOD
[Decision boundaries for 1 nearest neighbour versus 15 nearest neighbours.]
K-NN METHOD
Use different numbers k of nearest neighbours and compare the test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
(A scoring sketch follows below.)
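Majority-vote scoring of the k nearest training points can be done with PROC DISCRIM's nonparametric method; a minimal sketch with made-up dataset and variable names:

proc discrim data=train test=new testout=pred method=npar k=5;
   class label;     /* target classes */
   var x1 x2;       /* inputs used for the distance */
run;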
K-NN EXAMPLE: DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.
Comparing different numbers of nearest neighbours in SAS Enterprise Miner.
K-NN EXAMPLE: DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values for k were tried: k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION
Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4

[Diagram: inputs 1, X2, X3, X4 feed a single neural network compute node with weights w1, w2, w3, w4.]
f is the so-called activation function of the compute node. This could be the logit function, but other choices are possible. There are four weights w that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION

[Diagram: inputs X1–X4 (age, income, region, gender) feed a hidden layer Z1, Z2, Z3 with weights α; the hidden layer feeds the output Y with weights β.]

In formula, the prediction formula for a NN is given by composing the hidden layer and the output layer; the functions g and σ are defined below. In the case of a binary classifier, g is again a sigmoid. The model weights α and β have to be estimated from the data.
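The standard single-hidden-layer formulation, as in Hastie, Tibshirani & Friedman (referenced at the start):

\[
Z_m = \sigma\!\left(\alpha_{0m} + \alpha_m^{T} X\right),\quad m = 1,\dots,3,
\qquad
f(X) = g\!\left(\beta_0 + \beta^{T} Z\right),
\qquad
\sigma(v) = \frac{1}{1 + e^{-v}}
\]

Here g is the identity for regression; for a binary classifier g(v) = 1/(1 + e^{-v}), so that f(X) is the predicted probability.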
NEURAL NETWORKS: ESTIMATING THE WEIGHTS
Back-propagation algorithm: randomly choose small values for all weights w_i, then for each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w in the direction that decreases the error: w ← w − η·∂E/∂w (the standard gradient-descent update with learning rate η)
4. Stop if the error E is small enough
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS
https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.

[Diagram: inputs X1–X4 are ENCODEd into a 2-dimensional middle layer and DECODEd back to X1–X4.]
A linear activation function corresponds with 2-dimensional principal components analysis. The 2-dimensional middle layer can be used for visualisation.
NEURAL NETS: AUTOENCODERS
Often there are more hidden layers with many nodes: INPUT → ENCODE → DECODE → OUTPUT, with OUTPUT = INPUT.
NEURAL NET: CARS EXAMPLE
2-dimensional PCA versus an autoencoder network 25 – 15 – 2 – 15 – 25.
NEURAL NETS: AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;
   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR             */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6    */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED          */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random = 123;
   prelim 10 preiter = 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting:
• Parse & filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words
• Apply traditional data mining: clustering, prediction, machine learning
TEXT MINING BASICS

Example documents (in Dutch; fiets = bicycle, lopen = to walk, tweewieler = two-wheeler):
Document 1: "Ik loop over straat in Amsterdam, 1057DK, met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets, bitly.com/sdrtw"
Document 3: "Mijn tweewieler is kapot, wat een slecht stuk ijzer, $$"

TERM-DOCUMENT MATRIX A:
Terms                          Doc 1  Doc 2  Doc 3
+Fiets (noun)                    1      1      1
Fietsen (verb)                   0      1      0
Blauwe (adjective)               0      1      0
Amsterdam (location)             1      0      0
+Lopen (verb)                    1      1      0
Straat (noun)                    1      0      0
Kapot (adverb)                   0      0      1
Slecht                           0      0      1
Stuk Ijzer                       0      0      1
1057DK (postal code)             1      0      0
bitly.com/sdrtw (internet)       0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING: TERM-DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term-document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse
Apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be many thousands].
Take only the first k << r singular values: A_k = U_k Σ_k V_kᵀ.
A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.
TEXT MINING: APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn, fraud); apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.
RECOMMENDATION ENGINE: WHICH PRODUCT SHOULD I RECOMMEND MY CUSTOMERS?
RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items → ~0.01% filled.

User–item matrix (data):
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predicted ratings for User 4 are: 3.21, 4.82, 1, 2, 5.
Recommend item 2!
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: slope one (slope1), k-nearest neighbours (knn)
• Model-based algorithms: matrix factorization (SVD – LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE
Item-item based: fit y = x + b, a line with slope equal to 1.
Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j. (A worked example follows below.)

Sample rating database:
Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        2       -       5
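A worked example on the table above, using the standard weighted Slope One prediction (my own arithmetic, not on the slide). Predict Lucy's rating for item B: the average deviations towards B are dev(B,A) = ((3−5) + (4−3)) / 2 = −0.5 (John and Mark rated both A and B) and dev(B,C) = (3−2) / 1 = 1 (only John rated both C and B). Weighting each deviation by the number of users behind it:

r(Lucy, B) = [ 2·(2 + (−0.5)) + 1·(5 + 1) ] / (2 + 1) = (3 + 6) / 3 = 3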
RE METHODS: K-NEAREST NEIGHBOURS
The rating r_ui is determined by the ratings "in the neighbourhood" N.
• How to determine the neighbours, and how many (k) to use?
• How to compute the similarity / distance measure w?
   • Pearson's correlation coefficient
   • Cosine distance
   • Other adjustments
RE METHODS: PEARSON CORRELATION
a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values lie between −1 and +1:

\[
sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}
{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\;\sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}
\]
RE METHODS: K-NEAREST NEIGHBOURS METHOD
RE METHODS: MATRIX FACTORIZATION
How do we fill in the missing data? Factorize the m × n rating matrix R (users × items) as R ≈ U·V, with U of size m × k and V of size k × n:
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:
\[ \hat{R}_{ij} = U_i^{T} V_j \]
Minimize the prediction error:
\[ \min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^{T} V_j \right)^2 + \lambda \left( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \right) \]
RE METHODS: CLUSTER
k-NN within one subgroup.
[Diagram: user/item profiles and user/item ratings feed a clustering step; k-NN is then applied within one subgroup to produce the predictions.]
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule:

Support(X,Y) = (# trxs with X and Y) / (total # trxs)
Lift = Support(X,Y) / ( Support(X) · Support(Y) )

Support examples: Diapers → Beer: 0.8%; Diapers → Candles: 0.018%.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
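A quick arithmetic illustration with hypothetical numbers: if Support(X) = 4%, Support(Y) = 5% and Support(X,Y) = 0.5%, then under independence we would expect Support(X)·Support(Y) = 0.2%, so Lift = 0.5% / 0.2% = 2.5.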
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 200 9000 9000
2 10000 150 5000 14000
3 10000 100 1000 15000
4 10000 100 1000 16000
5 10000 100 1000 17000
6 10000 100 1000 18000
7 10000 100 1000 19000
8 10000 080 -600 18400
9 10000 050 -3000 15400
10 10000 020 -5400 10000
The profit by using a model to sent letters only to the first 7 deciles is now
euro 19000 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 09 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 300 17000 17000
2 10000 200 9000 26000
3 10000 140 4200 30200
4 10000 115 2200 32400
5 10000 100 1000 33400
6 10000 060 -2200 31200
7 10000 040 -3800 27400
8 10000 030 -4600 22800
9 10000 010 -6200 16600
10 10000 005 -6600 10000
The profit by using a much better model to sent letters only to the first 5 deciles is now
euro 33400 (instead of euro 10000)
If you have 100 of such campaigns a year that means an increase of
euro 234 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 335 19800 19800
2 10000 223 10840 30640
3 10000 130 3400 34040
4 10000 110 1800 35840
5 10000 100 1000 36840
6 10000 055 -2600 34240
7 10000 028 -4760 29480
8 10000 025 -5000 24480
9 10000 005 -6600 17880
10 10000 002 -6840 11040
Now lets suppose we have even a slightly better model than the last one
euro 36840
If you have 100 of such campaigns a year that means an increase of
euro 268 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines
K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
ldquoCLASSICALrdquo REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LINEAR amp LOGISTIC REGRESSION
Income = a + b times Age
Age
Income
Age
P(Churn)1
0
P(Churn) =
Numeric target variable Binairy target variable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables
Y logit(y)
X
Smoothing Splines
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines Piecewise polynomials that are glued together at knots
Two special cases for λ
λ = 0 Any function that interpolates the data
λ = infin Simple Least square line fit
Choose λ by cross validation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS PEARSON CORRELATION

a, b: users
r_{a,p}: rating of user a for item p
P: set of items rated both by a and b
• Possible similarity values between −1 and 1

sim(a, b) = [ Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) ] / [ √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) ]
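As a quick check of the formula, a small Python sketch (illustrative, with made-up ratings) that computes the Pearson similarity over the common items P:

    import math

    # Pearson similarity between two users; only items rated by both are used.
    def pearson_sim(ra, rb):
        common = sorted(set(ra) & set(rb))        # P: items rated by both
        ma = sum(ra[p] for p in common) / len(common)
        mb = sum(rb[p] for p in common) / len(common)
        num = sum((ra[p] - ma) * (rb[p] - mb) for p in common)
        den = (math.sqrt(sum((ra[p] - ma) ** 2 for p in common))
               * math.sqrt(sum((rb[p] - mb) ** 2 for p in common)))
        return num / den if den else 0.0

    print(pearson_sim({"A": 5, "B": 3, "C": 2}, {"A": 4, "B": 2, "C": 1}))  # 1.0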
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?

Factor the m × n user–item matrix R into R ≈ U V, with U (m × k) and V (k × n).

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: R̂_ij = U_i^T V_j
Minimize the prediction error: min_{U,V} Σ_{i,j} (R_ij − U_i^T V_j)² + λ(‖U_i‖² + ‖V_j‖²)
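A minimal sketch of this factorization (illustrative Python using stochastic gradient descent rather than the L-BFGS/ALS of PROC RECOMMEND; the matrix is a made-up toy, 0 = missing):

    import numpy as np

    # Toy user-item matrix R (0 = missing), factored as R ~ U V with k factors;
    # SGD on observed cells minimizes (R_ij - U_i^T V_j)^2 + lam*(|U_i|^2+|V_j|^2).
    R = np.array([[3, 2, 5, 4, 5],
                  [0, 0, 0, 1, 1],
                  [0, 0, 1, 2, 5]], dtype=float)
    observed = [(i, j) for i in range(R.shape[0])
                for j in range(R.shape[1]) if R[i, j] > 0]

    k, lam, lr = 2, 0.02, 0.01
    rng = np.random.default_rng(1)
    U = rng.normal(scale=0.1, size=(R.shape[0], k))
    V = rng.normal(scale=0.1, size=(k, R.shape[1]))

    for _ in range(2000):                   # plain SGD epochs
        for i, j in observed:
            err = R[i, j] - U[i] @ V[:, j]
            U[i] += lr * (err * V[:, j] - lam * U[i])
            V[:, j] += lr * (err * U[i] - lam * V[:, j])

    print(np.round(U @ V, 2))               # missing cells now hold predictions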
RE METHODS CLUSTER
First apply clustering on the user/item profiles and user/item ratings; then apply knn within one subgroup to generate the predictions.
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, for example
IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X, Y) = (# trxs with X and Y) / (total # trxs)
Lift(X, Y) = Support(X, Y) / (Support(X) × Support(Y))

Support & lift example: Diapers → Beer 0.8; Diapers → Candles 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
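A small sketch of these two measures (illustrative Python over made-up transactions):

    # Support and lift over a toy list of market baskets.
    transactions = [
        {"diapers", "beer"}, {"diapers", "beer", "milk"},
        {"diapers", "candles"}, {"milk"}, {"beer"},
    ]
    n = len(transactions)

    def support(*items):
        s = set(items)
        return sum(s <= t for t in transactions) / n   # fraction of baskets

    sup_xy = support("diapers", "beer")
    lift = sup_xy / (support("diapers") * support("beer"))
    print(f"support = {sup_xy:.2f}, lift = {lift:.2f}")  # 0.40, 1.11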
METHOD ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3
      maxiter = 100 maxfeval = 5000 function = L2
      lambda = 0.2 technique = lbfgs;
   RUN;
   METHOD arm / label = "ARM";
   RUN;
   /* Information on the recommender system */
   INFO;
QUIT;
/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT / method = svd label = "svd" num = 3 users = ("Longhow Lam");
RUN;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform them to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
Predicted review score vs. given review score:
R² linear regression = 0.5, R² neural net = 0.6
IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8-nearest neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
• 70/30 training/validation split
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
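As an illustration of this experiment setup, a sketch with scikit-learn (its small 8×8 digits set stands in for the 42,000 MNIST images; the 70/30 split and k = 8 follow the slide):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # 70/30 training/validation split and an 8-nearest-neighbour model.
    X, y = load_digits(return_X_y=True)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=0.7, random_state=1)

    knn = KNeighborsClassifier(n_neighbors=8).fit(X_train, y_train)
    misclass = 1 - knn.score(X_valid, y_valid)
    print(f"misclassification rate: {misclass:.3f}")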
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels; our best model predicted the label for these digits.
The first 100 predicted digits, together with the handwritten digits, are displayed here.
Red numbers are predicted labels; we see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE ("1", "2")
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N     Conversion   Profit   Cumulative
  1      10,000     3.00%      17,000     17,000
  2      10,000     2.00%       9,000     26,000
  3      10,000     1.40%       4,200     30,200
  4      10,000     1.15%       2,200     32,400
  5      10,000     1.00%       1,000     33,400
  6      10,000     0.60%      -2,200     31,200
  7      10,000     0.40%      -3,800     27,400
  8      10,000     0.30%      -4,600     22,800
  9      10,000     0.10%      -6,200     16,600
 10      10,000     0.05%      -6,600     10,000

The profit from using a much better model to send letters to only the first 5 deciles is now € 33,400 (instead of the € 10,000 you get when mailing all 10 deciles). If you have 100 of such campaigns a year, that means an increase of (€ 33,400 − € 10,000) × 100 = € 2.34 mln.
MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N     Conversion   Profit   Cumulative
  1      10,000     3.35%      19,800     19,800
  2      10,000     2.23%      10,840     30,640
  3      10,000     1.30%       3,400     34,040
  4      10,000     1.10%       1,800     35,840
  5      10,000     1.00%       1,000     36,840
  6      10,000     0.55%      -2,600     34,240
  7      10,000     0.28%      -4,760     29,480
  8      10,000     0.25%      -5,000     24,480
  9      10,000     0.05%      -6,600     17,880
 10      10,000     0.02%      -6,840     11,040

Now let's suppose we have an even slightly better model than the last one: mailing only the first 5 deciles now yields € 36,840. If you have 100 of such campaigns a year, that means an increase of (€ 36,840 − € 10,000) × 100 = € 2.68 mln.
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MACHINE LEARNING WHY IT CAN MATTER euro euro euro
Decile N Conversion Profit Cumulative1 10000 335 19800 19800
2 10000 223 10840 30640
3 10000 130 3400 34040
4 10000 110 1800 35840
5 10000 100 1000 36840
6 10000 055 -2600 34240
7 10000 028 -4760 29480
8 10000 025 -5000 24480
9 10000 005 -6600 17880
10 10000 002 -6840 11040
Now lets suppose we have even a slightly better model than the last one
euro 36840
If you have 100 of such campaigns a year that means an increase of
euro 268 mln
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines
K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
ldquoCLASSICALrdquo REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LINEAR amp LOGISTIC REGRESSION
Income = a + b times Age
Age
Income
Age
P(Churn)1
0
P(Churn) =
Numeric target variable Binairy target variable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables
Y logit(y)
X
Smoothing Splines
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines Piecewise polynomials that are glued together at knots
Two special cases for λ
λ = 0 Any function that interpolates the data
λ = infin Simple Least square line fit
Choose λ by cross validation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal holding the r singular values [r could be many thousands].
Take only the first k << r singular values: A ≈ A_k = U_k Σ_k V_kᵀ.
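A minimal SAS/IML sketch of this truncation; the tiny term-document matrix and k = 2 are illustrative, not from the deck:

proc iml;
   /* toy term-document matrix A: terms in rows, documents in columns */
   A = {1 1 1,
        0 1 0,
        1 0 0,
        1 1 0,
        1 0 0,
        0 0 1};
   call svd(U, Q, V, A);        /* A = U * diag(Q) * V` */
   k = 2;                       /* keep only the k largest singular values */
   Uk = U[, 1:k];  Qk = Q[1:k];  Vk = V[, 1:k];
   /* each document (column of A) becomes a k-dimensional point */
   docCoords = Vk # Qk`;
   print docCoords;
quit;

Each row of docCoords is one document in the reduced k-dimensional space, ready for further mining.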
TEXT MINING APPLICATIONS
Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items, ~ 0.01% filled.

User-Item Matrix (data):
           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:                -      -      1      2      5
After some math, the predictions are: 3.21   4.82   1      2      5
→ Recommend item 2.
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: Slope One (slope1), K nearest neighbors (knn)
• Model-based algorithms: Matrix factorization (SVD - LBFGS)
• Market basket analysis: Association rules mining (arm)
• Mixture of different methods: Clustering (cluster), Ensemble
RE METHODS SLOPE ONE
Item-item based: fit y = x + b, a regression with slope equal to 1.
Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the rating of user u for item j; the average difference between two items is computed over the users who rated both.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5

For example: the average difference between Item A and Item B is ((5-3) + (3-4)) / 2 = 0.5, so Lucy's predicted rating for Item A is 2 + 0.5 = 2.5 (see the sketch below).
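A minimal SAS/IML sketch of that slope-one arithmetic; the matrix below just encodes the sample table, with . denoting a missing rating:

proc iml;
   /* rows: John, Mark, Lucy; columns: items A, B, C */
   R = {5 3 2,
        3 4 .,
        . 2 5};
   /* average difference between items A and B over users who rated both */
   both  = loc(R[,1] ^= . & R[,2] ^= .);
   devAB = mean(R[both, 1] - R[both, 2]);   /* ((5-3) + (3-4)) / 2 = 0.5 */
   /* slope-one prediction of Lucy's rating for item A from her B rating */
   predLucyA = R[3, 2] + devAB;             /* 2 + 0.5 = 2.5 */
   print devAB predLucyA;
quit;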
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".
• How to determine the neighbors, and how many (k) to use?
• How to compute the similarity/distance measure w?
  - Pearson's correlation coefficient
  - Cosine distance
  - Other adjustments
[figure: similarity w between users, neighborhood N]
RE METHODS
PEARSON CORRELATION
• a, b: users
• r_{a,p}: rating of user a for item p
• P: set of items rated both by a and b
• Possible similarity values between -1 and 1

sim(a,b) = Σ_{p∈P} (r_{a,p} - r̄_a)(r_{b,p} - r̄_b) / ( √(Σ_{p∈P} (r_{a,p} - r̄_a)²) · √(Σ_{p∈P} (r_{b,p} - r̄_b)²) )
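In SAS the same Pearson similarity can be obtained from PROC CORR; a minimal sketch, assuming a hypothetical table work.pairs with one row per item rated by both users:

proc corr data=work.pairs pearson;
   var rating_a rating_b;   /* the correlation equals sim(a, b) */
run;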
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?
R (m × n, users × items) ≈ U (m × k) · V (k × n)
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS
Predict a new rating: R̂_ij = U_iᵀ V_j
Minimize the prediction error: min_{U,V} Σ_{ij} (R_ij - U_iᵀ V_j)² + λ(‖U_i‖² + ‖V_j‖²)
RE METHODS CLUSTER
First cluster the users/items on their profiles and ratings, then apply kNN within one subgroup to make the predictions.
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C
IF item X THEN item Y
Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.
Support(X→Y) = (# trxs with X and Y) / (total # trxs)
Lift = Support(X→Y) / ( Support(X) · Support(Y) )
Support & lift: Diapers → Beer 0.8; Diapers → Candles 0.018.
For example, a lift of 2.5 means: if people have X they are 2.5 times more likely to buy Y than if they don't have X.
METHOD ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
proc recommend recom = rsIENS;
   /* add a recommendation system */
   add rsIENS item = item user = user rating = rating;
   /* add tables */
   addtable LHL1209.IENS_UIR recom = rsIENS type = rating vars = (item user rating);
   /* method SVD LBFGS with 20 factors */
   method svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      maxfeval = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   run;
   method arm /
      label = "ARM";
   run;
   /* information on the recommender system */
   info;
quit;
/* prediction with the SVD method */
proc recommend recom = rsIENS;
   predict /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
run;
quit;
LAST SLIDE: PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience; (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High-performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So, can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
Predicted review score vs. given review score:
R² linear regression = 0.5; R² neural net = 0.6.
IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
The first 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
Setup: 70/30 training/validation split, comparing
• PCA regression on the 50 largest PCs
• seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• seven multi-layer neural nets
• three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels. Our best model predicted the label for these digits. The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[audio: spoken digits "1" and "2"]
OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS
Classical regression, decision trees, dimension reduction, bagging & boosting, support vector machines, k-nearest neighbour, neural networks / deep learning, Bayesian networks, text mining, recommendation engines.
"CLASSICAL" REGRESSION
LINEAR & LOGISTIC REGRESSION
Numeric target variable: Income = a + b × Age.
Binary target variable: P(Churn) = 1 / (1 + exp(-(a + b × Age))).
[figures: income vs. age with a fitted line; P(churn) vs. age as an S-curve between 0 and 1]
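Minimal sketches of both models, assuming a hypothetical table work.customers with variables income, age and a 0/1 churn flag:

/* linear regression: income = a + b * age */
proc reg data=work.customers;
   model income = age;
run;

/* logistic regression: P(churn) modeled via the logit of age */
proc logistic data=work.customers;
   model churn(event='1') = age;
run;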
SPLINE REGRESSION: MODELING NON-LINEARITIES
Often there is a non-linear relation:
• transformation of inputs: X², X³, log(X), etc…
• buckets / binning of variables
• smoothing splines
[figure: y, or logit(y), versus x]
SPLINE REGRESSION: MODELING NON-LINEARITIES
Smoothing splines: piecewise polynomials that are glued together at knots, minimizing the penalized criterion Σ_i (y_i - f(x_i))² + λ ∫ f''(t)² dt.
Two special cases for λ:
• λ = 0: any function that interpolates the data
• λ = ∞: the simple least-squares line fit
Choose λ by cross-validation.
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION
Extracted data from a car sales site: for many cars we have the kilometres driven and the car price. For the Opel Astra we have 2360 cars. What is the relation between km driven and car sales price?
[figure: too much smoothing and too little smoothing]
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION
0.2 is the optimal smoothing parameter.
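A minimal PROC TPSPLINE sketch for such a fit, assuming a hypothetical table work.astra with columns km and price:

proc tpspline data=work.astra;
   model price = (km);       /* smoothing parameter chosen by GCV by default */
   output out=work.fitted;   /* the data plus the fitted spline values */
run;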
Some other car makes/models with spline estimates of car depreciation versus kilometres driven.
Hmmm, my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left…
SPLINE REGRESSION: MODELING NON-LINEARITIES
In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines. ADAPTIVEREG:
• supports more than one input
• fits linear, logistic, Poisson and GLM regressions
• combines both regression splines and model selection methods
• supports partitioning of the data into training, validation and testing roles
A sketch of ADAPTIVEREG follows below.
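A minimal PROC ADAPTIVEREG sketch on the same hypothetical work.astra table; the 30% validation fraction is illustrative:

proc adaptivereg data=work.astra;
   model price = km;                   /* regression splines + model selection */
   partition fraction(validate=0.3);   /* training / validation roles */
run;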
DECISION TREES
How does it work? A simple example. Suppose we have the following group of people: 50% response, 50% no response. We have/know age and marital status.
• Overall: 50/50
• Age ≤ 45: 30/70; Age > 45: 60/40
• Within Age ≤ 45 — Married/Divorced: 20/80; Unmarried: 60/40
DECISION TREES: REGRESSION & CLASSIFICATION

Target   X1   X2   X3    X4   X5
Y        12   A    456   12   X
N        21   B    456   15   X
Y        32   A    545   13   U
Y        34   C    443   11   U
N        23   A    345   17   U
N        13   B    567   12   X
N        45   A    654   19   X
…        …    …    …     …    …
Y        46   A    657   21   X

A recursive splitting algorithm:
1. Loop through all inputs.
2. Determine per input how to split.
3. Take the best input to split on.
4. On the two new data sets, apply 1, 2, 3 again…
5. Stop somewhere…
• How to split? X1 or X2?
• When to stop?
DECISION TREES: REGRESSION & CLASSIFICATION
How to split? The number of splits is usually 2 or 3; more splits will exhaust the data too fast. Why is the split X1 < t1 better than X1 < s1? Judge a split by:
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared
[figure: regression tree, mean square error of split s1 vs. split t1 on y vs. x]
The same idea for a classification tree, judging split s1 vs. split t1 by the misclassification rate.
DECISION TREES (REGRESSION & CLASSIFICATION)
When to stop? Not too early, not too late. Pruning: remove parts of the tree.
DECISION TREES: SOME COMMON TYPES
• CHAID (chi-squared automatic interaction detection)
• C4.5, C5.0
• CART (Classification And Regression Trees)
The difference is mainly in the different splitting options; a sketch with PROC HPSPLIT follows below.
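A minimal classification-tree sketch with PROC HPSPLIT, assuming a hypothetical table work.people with a response flag, age and marital status; the split criterion and pruning method are the options being illustrated:

proc hpsplit data=work.people;
   class response marital;
   model response = age marital;
   grow entropy;               /* splitting criterion (cross-entropy) */
   prune costcomplexity;       /* when to stop: prune the grown tree */
run;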
Decision trees: pros and cons.
Pros: interactions between variables; interpretable rules; missing values are easy to incorporate.
Cons: unstable; "lack of smoothness"; trouble fitting obvious (non)linear relations.
[example figures: response rate split by gender, income < 45K, age < 33; stepwise tree fit on the Opel Astra prices]
DIMENSION REDUCTION
PRINCIPAL COMPONENTS ANALYSIS
A linear transformation of the data to uncorrelated data. The transformation W is such that:
• the largest variance is in the first coordinate
• the second largest variance is in the second coordinate
• etc…
PRINCIPAL COMPONENTS ANALYSIS
[figures: data points in the (X1, X2) plane with principal directions P1 and P2; the same points expressed in the (P1, P2) coordinates]
PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND IT
With two dimensions, P = X·W:

[ p11  p21 ]   [ x11  x21 ]
[  …    …  ] = [  …    …  ] · [ w11  w21 ]
[ p1n  p2n ]   [ x1n  x2n ]   [ w12  w22 ]

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.
In general, it turns out that the columns of W are the eigenvectors of the matrix XᵀX.
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here. Applications of PCA:
• dimension reduction
• visualisation
• outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs (see the sketch below)
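A minimal PCA sketch with PROC PRINCOMP, assuming hypothetical inputs x1-x100 in work.train; note that PRINCOMP works on the correlation matrix by default, which takes care of the scaling:

proc princomp data=work.train out=work.scores n=2;
   var x1-x100;   /* the first two PCs land in work.scores as Prin1, Prin2 */
run;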
PRINCIPAL COMPONENTS: DIMENSION REDUCTION
P = X·W. Now take only the first L columns of W: P_L = X·W_L.
For example, for visualization use only the first 2 or 3 columns, so that P_L has only 2 or 3 columns that can be visualized in scatter or contour plots.
Dimensions: P (10000 × 100) = X (10000 × 100) · W (100 × 100), while P_L (10000 × 2) = X (10000 × 100) · W_L (100 × 2).
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal holding the r singular values [r could be a large number].
SINGULAR VALUE DECOMPOSITION
Take only k << r singular values: A ≈ A_k = U_k Σ_k V_kᵀ. A data point d can now be represented by a k-dimensional point.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 ≈ 8 mln numbers.
SVD with the 15 largest SVs: 1% of the data.
SVD with the 75 largest SVs: 5% of the data. (A sketch of this rank-k reconstruction follows below.)
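A minimal SAS/IML sketch of such a rank-k reconstruction; a random matrix stands in for the photo:

proc iml;
   call randseed(123);
   A = j(100, 150);                  /* stand-in for the grayscale photo */
   call randgen(A, "Normal");
   call svd(U, Q, V, A);             /* A = U * diag(Q) * V` */
   k = 15;
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;   /* rank-k approximation */
   kept = k * (nrow(A) + ncol(A) + 1) / (nrow(A) * ncol(A));
   print kept;   /* fraction of the original numbers that must be stored */
quit;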
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.
X1, X2, X3, …, X500 →
• {X1, X21, X35, X430, …} → use X35
• {X17, X29, X353, X490, …} → use X29
• {X37, X95, X251, X393, …} → use X251
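A minimal sketch with PROC VARCLUS, assuming hypothetical inputs x1-x500 in work.train:

proc varclus data=work.train maxclusters=10 short;
   var x1-x500;   /* then pick one representative variable per cluster */
run;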
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough, let multiple models vote for a prediction: bootstrap aggregation (bagging). This only makes sense if the underlying models are different enough and have some predictive power.
[figure: data → random samples → models → final model]
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees. Apply the following steps repeatedly:
1. Generate a bootstrap sample.
2. Randomly choose m inputs, with m << P.
3. Fit a tree on the bootstrap sample with the m inputs (do not prune).
In case of a classification tree, the random forest prediction is the majority vote of all trees; in case of a regression tree, it is the average of all trees. A sketch follows below.
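A minimal random-forest sketch with PROC HPFOREST, assuming a hypothetical binary target y and interval inputs x1-x50 in work.train:

proc hpforest data=work.train maxtrees=500 vars_to_try=7;
   target y / level=binary;         /* majority vote over the trees */
   input x1-x50 / level=interval;   /* m = 7 inputs tried per split */
run;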
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
A decision tree and a random forest (100 sub-trees) fitted on the simulated data. It is clear to see that the forest can produce much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW
Gradient boosting runs M iterations, m = 1, 2, …, M. At each successive iteration a base learner h_m (a decision tree) is fit on the pseudo-residuals r_m, using the inputs x, to "correct" the previous learner:
F_m = F_{m-1} + γ·h_m
[figure: inputs x with pseudo-residuals r_1, r_2, …, r_M feeding successive base learners; final model F_M]
A sketch follows below.
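A minimal gradient-boosting sketch with PROC TREEBOOST (Enterprise Miner), on the same hypothetical work.train layout; the shrinkage option plays the role of γ:

proc treeboost data=work.train iterations=200 shrinkage=0.1;
   target y / level=binary;
   input x1-x50 / level=interval;
run;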
SUPPORT VECTOR MACHINES
Support vector machines (SVM): suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M; the green line would then be better than the blue line.
If the problem is not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).
The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification: maximize the margin, e.g. min ‖β‖ subject to y_i(x_iᵀβ + β_0) ≥ 1 for all i.
• Non-separable classification: introduce slack variables ξ_i ≥ 0 with y_i(x_iᵀβ + β_0) ≥ 1 − ξ_i and Σ_i ξ_i ≤ C.
• Non-separable classification rewritten using the Lagrange dual problem: the inputs then enter only through inner products ⟨x_i, x_j⟩.
• Kernels to model non-linear behaviour: replace ⟨x_i, x_j⟩ by a kernel K(x_i, x_j).
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable in 2D, but in 3D space they are.
K-NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours (see the sketch below).
[figure] 5 nearest neighbours of x0: 3 of them are red, 2 of them are green, so we predict x0 to be red.
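A minimal k-NN classification sketch via PROC DISCRIM with the nonparametric method, assuming hypothetical tables work.train (labelled points) and work.query (the query points x0):

proc discrim data=work.train test=work.query testout=work.pred
             method=npar k=5;   /* classify by the 5 nearest neighbours */
   class colour;                /* e.g. red / green */
   var x1 x2;
run;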
K-NN METHOD
[figures: decision boundaries for 1 nearest neighbour and for 15 nearest neighbours]
K-NN METHOD
Use different numbers k of nearest neighbours and compare the test and training errors. Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price? For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE: DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values for k were used; the k=5 nearest neighbours model has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK LINEAR REGRESSION
Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4
[figure: inputs 1, X2, X3, X4 with weights w1…w4 feeding one compute node f]
In the neural network compute node, f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w that have to be determined.
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula form, the prediction formula for a NN is given by:
Z_m = σ(α_{0m} + α_mᵀ X),   Y = g(β_0 + βᵀ Z)
[figure: inputs X1–X4 (age, income, region, gender), hidden layer Z1, Z2, Z3, output Y]
The function σ is typically the sigmoid σ(v) = 1 / (1 + e^{−v}); in case of a binary classifier the output function g is also a sigmoid, so that Y is a probability. The model weights α and β have to be estimated from the data.
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back-propagation algorithm:
Randomly choose small values for all weights w_i. Then, for each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w according to the gradient: w_new = w − r · ∂E/∂w, with learning rate r.
4. Stop if the error E is small enough.
DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.
[figure: X1–X4 → ENCODE → 2-dimensional middle layer → DECODE → X1–X4]
A linear activation function corresponds with 2-dimensional principal components analysis; the 2-dimensional middle layer can be used for visualisation.
NEURAL NETS: AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Often there are more hidden layers with many nodes: INPUT → ENCODE → DECODE → OUTPUT = INPUT.
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
ldquoCLASSICALrdquo REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LINEAR amp LOGISTIC REGRESSION
Income = a + b times Age
Age
Income
Age
P(Churn)1
0
P(Churn) =
Numeric target variable Binairy target variable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables
Y logit(y)
X
Smoothing Splines
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines Piecewise polynomials that are glued together at knots
Two special cases for λ
λ = 0 Any function that interpolates the data
λ = infin Simple Least square line fit
Choose λ by cross validation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
Matrix SVD decomposition: A = U Σ V^T, where Σ is diagonal with r singular values [could be many thousands].
Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.
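A minimal SAS/IML sketch of this truncation; the tiny matrix A and the choice k = 2 are made up for illustration:

proc iml;
   /* toy term-document matrix: rows = terms, columns = documents */
   A = {1 1 1,
        0 1 0,
        1 0 0,
        1 1 0,
        0 0 1};
   call svd(U, Q, V, A);                       /* A = U * diag(Q) * V` */
   k = 2;                                      /* keep only the k largest singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;   /* rank-k approximation of A */
   docs_k = V[, 1:k];                          /* each document as a k-dimensional point */
   print Q, docs_k;
quit;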
TEXT MINING APPLICATIONS
Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud).
Apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).
[Diagram: documents grouped into Topic 1, Topic 2, Topic 3]
RECOMMENDATION ENGINE: Which products should I recommend to my customers?
RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives ~0.01% filled.

User - Item Matrix (data):
           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings: -, -, 1, 2, 5
After some math… the predicted ratings for User 4 are: 3.21, 4.82, 1, 2, 5
Recommend item 2 (the highest predicted rating among the unrated items).
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: Slope One (slope1), K nearest neighbors (knn)
• Model-based algorithms: Matrix factorization (SVD – LBFGS)
• Market basket analysis: Association rules mining (arm)
• Mixture of different methods: Clustering (cluster), Ensemble
RE METHODS: SLOPE ONE

Item-item based: y = x + b, i.e. a regression line with the slope fixed to 1.
Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         2        -        5
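To make the mechanics concrete with the table above: the average difference B − A over the two users who rated both (John, Mark) is ((3 − 5) + (4 − 3)) / 2 = −0.5, so Lucy's A rating predicts B as 2 − 0.5 = 1.5; the difference B − C (John only) is 3 − 2 = 1, predicting 5 + 1 = 6 from her C rating; weighting by the pair counts gives (2 · 1.5 + 1 · 6) / 3 = 3. A minimal weighted slope-one sketch in PROC SQL, assuming a hypothetical long-format table RATINGS(user, item, rating):

proc sql;
   /* average rating difference (deviation) and support for every ordered item pair */
   create table devs as
   select a.item as item_i, b.item as item_j,
          count(*)                 as n_users,
          avg(a.rating - b.rating) as dev
   from ratings as a
        inner join ratings as b
        on a.user = b.user and a.item ne b.item
   group by a.item, b.item;

   /* predicted rating of each user for item_i, weighted by the pair support;  */
   /* items the user has already rated would still be filtered out afterwards  */
   create table preds as
   select r.user, d.item_i as item,
          sum((r.rating + d.dev) * d.n_users) / sum(d.n_users) as pred
   from ratings as r
        inner join devs as d
        on r.item = d.item_j
   group by r.user, d.item_i;
quit;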
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".
How to determine the neighbors, and how many (k) to use?
How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments
[Diagram: similarity w, neighbors N]
RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b.
• Possible similarity values between −1 and 1.

$$ sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\,\sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}} $$
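A quick SAS/IML check of this formula, on made-up ratings of two users over four items they both rated:

proc iml;
   /* ratings of users a and b on the items both rated (made-up values) */
   ra = {5, 3, 2, 4};
   rb = {4, 3, 1, 5};
   sim = corr(ra || rb)[1, 2];   /* Pearson correlation, lies between -1 and 1 */
   print sim;
quit;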
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?
[Diagram: the m × n rating matrix R (rows = users, columns = items) is factorized as U (m × k) times V (k × n)]
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem, solved with L-BFGS or ALS

Predict a new rating: $\hat{R}_{ij} = U_i^T V_j$
Minimize the prediction error: $\min_{U,V} \sum_{i,j} (R_{ij} - U_i^T V_j)^2 + \lambda(\|U_i\|^2 + \|V_j\|^2)$
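As a self-contained illustration (not PROC RECOMMEND itself): a SAS/IML sketch that fits this factorization on a made-up 4 × 3 rating matrix with plain stochastic gradient descent; PROC RECOMMEND's own solvers are L-BFGS or ALS, and all names and values here are invented.

proc iml;
   /* made-up 4 x 3 rating matrix; 0 = missing rating */
   R = {5 3 0,
        4 0 1,
        1 1 0,
        0 2 5};
   m = nrow(R);  n = ncol(R);  k = 2;          /* k hidden factors */
   call randseed(123);
   U = j(m, k);  call randgen(U, "Normal", 0, 0.1);
   V = j(n, k);  call randgen(V, "Normal", 0, 0.1);
   eta = 0.05;  lambda = 0.02;                 /* step size and penalty */
   do epoch = 1 to 200;
      do i = 1 to m;
         do jj = 1 to n;
            if R[i, jj] > 0 then do;           /* update only on observed cells */
               e = R[i, jj] - U[i, ] * V[jj, ]`;
               U[i, ] = U[i, ] + eta * (e * V[jj, ] - lambda * U[i, ]);
               V[jj, ] = V[jj, ] + eta * (e * U[i, ] - lambda * V[jj, ]);
            end;
         end;
      end;
   end;
   Rhat = U * V`;                              /* predictions, also for the missing cells */
   print Rhat;
quit;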
RE METHODS: CLUSTER
[Diagram: cluster the user/item profiles and ratings first ("Clustering"), then apply knn within one subgroup to generate the predictions]
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:
Support(X → Y) = (# trxs with X and Y) / (total # trxs)
Lift(X → Y) = Support(X → Y) / (Support(X) · Support(Y))

Support & Lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times as likely to buy Y as when they don't have X.
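For instance, with hypothetical counts: 1000 transactions, 100 containing X, 80 containing Y and 20 containing both give Support(X → Y) = 20 / 1000 = 0.02 and Lift = 0.02 / (0.1 × 0.08) = 2.5, i.e. buyers of X are 2.5 times as likely to also buy Y.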
RE METHOD: ENSEMBLE
Linear combination of the previous methods to achieve better performance.
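For example (hypothetical weights, tuned on a validation set): final predicted rating = 0.5 · r_knn + 0.3 · r_svd + 0.2 · r_slope1.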
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;
   METHOD arm / label = ARM;
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT / method = svd label = svd num = 3 users = ("Longhow Lam");
run;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
Predicted review score vs. given review score:
R² linear regression = 0.5; R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
70/30 training/validation split. Techniques tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors
8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels.
Our best model predicted the label for these digits.
The first 100 predicted digits together with the handwritten digits are displayed here.
Red numbers are the predicted labels; we see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[Audio: recordings of the spoken digits "1" and "2"]
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables
Y logit(y)
X
Smoothing Splines
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines Piecewise polynomials that are glued together at knots
Two special cases for λ
λ = 0 Any function that interpolates the data
λ = infin Simple Least square line fit
Choose λ by cross validation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
TEXT MINING BASICS
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets." (I walk along the street in Amsterdam 1057DK with my bike.)
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw." (She did not walk but cycled on her blue bike, plus a shortened URL.)
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$." (My two-wheeler is broken, what a bad piece of iron.)

TERM DOCUMENT MATRIX A

Terms                        Doc 1  Doc 2  Doc 3
+Fiets (znmw / noun)           1      1      1
Fietsen (ww / verb)            0      1      0
Blauwe (bvg / adjective)       0      1      0
Amsterdam (locatie)            1      0      0
+Lopen (ww / verb)             1      1      0
Straat (znmw / noun)           1      0      0
Kapot (bijw / adverb)          0      0      1
Slecht                         0      0      1
Stuk Ijzer                     0      0      1
1057DK (postcode)              1      0      0
bitlycomsdrtw (Internet)       0      1      0

- Each text document is a (very) long vector of word counts (often with many zeros)
- Apply further mining on this matrix A
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix:
- Often more terms than documents
- Rows could be strongly correlated
- Matrix is often very sparse
Apply singular value decomposition first.
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
Matrix SVD decomposition: A = U \Sigma V^T, with \Sigma diagonal with r singular values [could be many thousands].
Take only the first k << r singular values: A_k = U_k \Sigma_k V_k^T.
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
TEXT MINING APPLICATIONS
- Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.
- Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).
[Figure: documents grouped into Topic 1, Topic 2, Topic 3]
RECOMMENDATION ENGINE: Which product should I recommend my customers?
RECOMMENDATION ENGINE USER - ITEM MATRIX: EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives roughly 0.01% filled.

User - Item Matrix - Data:

          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1

User 4's item ratings:                      -     -     1     2     5
After some math, the recommendations are:  3.21  4.82   1     2     5
→ Recommend item 2.
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
- Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
- Model-based algorithms: matrix factorization (SVD - LBFGS)
- Market basket analysis: association rules mining (arm)
- Mixture of different methods: clustering (cluster), ensemble
RE METHODS SLOPE ONE
- Item-item based: y = x + b with slope equal to 1 (see notes).
- Weight w_ij: the number of users having rated both items i and j.
- Rating r_uj: the average rating computed from item j.

Sample rating database:

Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
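A worked computation may help; it assumes the usual weighted slope-one scheme and the table alignment above (Mark rated items A and B, Lucy rated items B and C). To predict Lucy's rating for item A: the average deviation of A over B is ((5-3) + (3-4)) / 2 = 0.5 (two users rated both items), and of A over C it is (5-2) / 1 = 3 (one user). Weighting by these counts gives

r(Lucy, A) = ((2 + 0.5)·2 + (5 + 3)·1) / (2 + 1) = 13/3 ≈ 4.33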
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".
- How to determine the neighbors, and how many (k) to use?
- How to compute the similarity / distance measure? Pearson's correlation coefficient, cosine distance, other adjustments.
[Figure: similarity w, neighbors N]
RE METHODS PEARSON CORRELATION
- a, b: users
- r_{a,p}: rating of user a for item p
- P: set of items rated both by a and b
- Possible similarity values between -1 and 1

sim(a, b) = \frac{\sum_{p \in P} (r_{a,p} - \bar r_a)(r_{b,p} - \bar r_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar r_a)^2}\,\sqrt{\sum_{p \in P} (r_{b,p} - \bar r_b)^2}}
RE METHODS K NEAREST NEIGHBORS METHOD
[Figure: k nearest neighbors prediction]
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?
Factorize the m × n user-item matrix R (users in the rows, items in the columns) as R ≈ U V, with U an m × k and V a k × n matrix, where k is the number of hidden factors.
- Select a loss function (squared error)
- Select the number of hidden factors k
- Optimization problem: solved with L-BFGS or ALS

Predict a new rating: \hat R_{ij} = U_i^T V_j
Minimize the prediction error:

\min_{U, V} \sum_{i,j} \left( R_{ij} - U_i^T V_j \right)^2 + \lambda \left( \|U_i\|^2 + \|V_j\|^2 \right)
RE METHODS CLUSTER
First cluster the users/items on their profiles or ratings, then apply knn within one subgroup to compute the predictions.
[Figure: user/item profile and user/item rating → clustering → knn within one subgroup → predictions]
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining:
- Identify frequent itemsets (rules) in the transaction data:
  IF item A and B THEN item C
  IF item X THEN item Y
- Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X, Y) = (# transactions with X and Y) / (total # transactions)
Lift = Support(X, Y) / (Support(X) · Support(Y))

Support & lift examples: Diapers-Beer 0.8, Diapers-Candles 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
RE METHOD ENSEMBLE
- Linear combination of the previous methods
- Achieve better performance
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   add rs.IENS / item = item user = user rating = rating;
   /* Add tables */
   addtable LHL1209.IENS_UIR / recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   method svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      maxfeval = 5000
      function = L2
      lambda = 0.2
      technique = lbfgs;
   run;
   method arm /
      label = "ARM";
   run;
   /* information on the recommender system */
   info;
quit;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   predict /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
run;
quit;
LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
- Unfamiliar to a broader audience, (more) difficult to explain
- Black-box approach (you are rejected: the computer says NO)
- Often the relations can already be modeled with classical regression models
- It allows you to not think about the business problem

PROS:
- Often less data preparation (manual tuning) necessary (just throw it into the algorithm...)
- Interactions are often "automatically" taken into account
- Superior for text mining, image & speech recognition
- Better lift possible (a few percent "for free")
- It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
- Many different techniques
- Easy to use GUIs combined with flexible coding
- High performance scalability
- Easily deployable
SOME MACHINE LEARNING EXAMPLES
- Text mining
- Image recognition
- Sound recognition
- Strange faces
So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
- Used text miner to parse and filter the reviews,
- and transformed the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
[Figure: predicted review score vs given review score]
R² linear regression = 0.5; R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits; each digit is a picture of 28 by 28 pixels, so a 784 dimensional vector.
[Figure: first 100 digits of the MNIST data and their KNOWN labels in red]
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
70/30 training/validation split; techniques tried:
- PCA regression on the 50 largest PCs
- Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
- Seven multi layer neural nets
- Three random forests: 100, 500 and 1000 trees
- 8, 16 and 24 nearest neighbors
8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels; our best model predicted the label for these digits.
The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes...
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[Audio: spoken digits "1" and "2"]
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing splines: piecewise polynomials that are glued together at knots, with two special cases for the smoothing parameter λ:
- λ = 0: any function that interpolates the data
- λ = ∞: simple least squares line fit
Choose λ by cross validation.
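To make the role of λ concrete: a smoothing spline is the minimizer of the standard penalized least-squares criterion (this formula is not on the slide, but it is the usual textbook definition):

\hat f = \arg\min_f \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 + \lambda \int f''(t)^2 \, dt

With λ = 0 the penalty vanishes, so any interpolating function is optimal; with λ = ∞ the fit must have f'' = 0 everywhere, which gives the least squares straight line.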
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION
Extracted data from a car sales site. For many cars we have the kilometres driven and the car price; for the Opel Astra we have 2,360 cars. What is the relation between km driven and car sales price?
[Figure: too much smoothing and too little smoothing]
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION
0.2 is the optimal smoothing parameter.
[Figure: some other car makes/models with spline estimates of car depreciation versus kilometres driven]
Hmmm, my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left...
SPLINE REGRESSION MODELING NON LINEARITIES
In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines. ADAPTIVEREG:
- supports more than one input;
- fits linear, logistic, Poisson and other GLM regressions;
- combines both regression splines and model selection methods;
- supports partitioning of data into training, validation and testing roles.
A minimal sketch is shown below.
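For instance, a minimal PROC ADAPTIVEREG sketch for the car price example (dataset and variable names are illustrative, not from the deck):

proc adaptivereg data=astra;
   model price = km;                    /* regression spline of price on kilometres driven */
   partition fraction(validate=0.3);    /* hold out 30% of the rows for validation */
run;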
DECISION TREES
How does it work? A simple example. Suppose we have the following group of people: 50% response, 50% no response. We have/know age and marital status. The tree splits the 50/50 root as follows:

- Age <= 45: 30/70
   - Married/Divorced: 20/80
   - Unmarried: 60/40
- Age > 45: 60/40
DECISION TREES REGRESSION & CLASSIFICATION

Target  X1  X2  X3   X4  X5
Y       12  A   456  12  X
N       21  B   456  15  X
Y       32  A   545  13  U
Y       34  C   443  11  U
N       23  A   345  17  U
N       13  B   567  12  X
N       45  A   654  19  X
...     ..  ..  ...  ..  ..
Y       46  A   657  21  X

A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split on
4. On the two new data sets, apply steps 1, 2, 3 again...
5. Stop somewhere...

Key questions: how to split (X1 or X2?) and when to stop. A hedged SAS sketch follows after this list.
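In SAS such a tree can be grown with PROC HPSPLIT; a minimal sketch on data shaped like the table above (dataset name and option values are illustrative):

proc hpsplit data=mydata maxdepth=6;
   class target x2 x5;              /* categorical target and inputs */
   model target = x1 x2 x3 x4 x5;
   grow entropy;                    /* splitting criterion */
   prune costcomplexity;            /* prune the grown tree */
run;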
DECISION TREES REGRESSION & CLASSIFICATION
How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.
Why is the split X1 < t1 better than X1 < s1?
- Regression: mean squared error
- Classification: misclassification rate, cross-entropy, chi-squared
[Figure: regression tree, mean square error for split s1 versus split t1]
DECISION TREES REGRESSION & CLASSIFICATION
The same criteria apply to a classification tree, judged for example by the misclassification rate.
[Figure: classification tree, misclassification rate for split s1 versus split t1]
DECISION TREES (REGRESSION & CLASSIFICATION)
When to stop? Not too early, not too late.
Pruning: remove parts of the tree.
DECISION TREES SOME COMMON TYPES
- CHAID (chi-squared automatic interaction detection)
- C4.5, C5.0
- CART (Classification And Regression Trees)
The difference is mainly in the different splitting options.
DECISION TREES PROS AND CONS
Pros:
- Interaction between variables
- Interpretable rules
- Missing values easy to incorporate
Cons:
- Unstable
- "Lack of smoothness"
- Fit of obvious (non)linear relations
[Figures: example tree (male/female, income < 45K, age < 33) with response rates, and a tree fit to the Opel Astra data]
DIMENSION REDUCTION

PRINCIPAL COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data.
The transformation W is such that:
- the largest variance is in the first coordinate,
- the second largest variance is in the second coordinate,
- etc...
PRINCIPAL COMPONENTS ANALYSIS
[Figures: data points in the (X1, X2) plane with principal directions P1 and P2, and the same data in the rotated (P1, P2) coordinates]
PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND
With two dimensions, P = XW:

\begin{pmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{pmatrix}
=
\begin{pmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{pmatrix}
\begin{pmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{pmatrix}

- w11 and w12 are the loadings corresponding to the first principal component.
- w21 and w22 are the loadings corresponding to the second principal component.
In general P = XW; it turns out that the columns of W are the eigenvectors of the matrix X^T X.
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here.
Applications of PCA:
- Dimension reduction
- Visualisation
- Outlier / anomaly detection
- PCA regression: use the PCs instead of the original inputs
A minimal SAS sketch follows below.
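A minimal PROC PRINCOMP sketch (dataset and variable names are illustrative); by default the procedure works on the correlation matrix, i.e. on scaled inputs:

proc princomp data=inputs out=scores n=2;
   var x1-x100;    /* the first 2 principal component scores are written to work.scores */
run;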
PRINCIPAL COMPONENTS DIMENSION REDUCTION
P = XW. Now only take the first L columns of W: P_L = X W_L.
For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.
Dimensions: if X is (10000 × 100) and W is (100 × 100), then P is (10000 × 100); with W_L (100 × 2), P_L is (10000 × 2).
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U \Sigma V^T, with \Sigma diagonal with r singular values [r could be a large number].
SINGULAR VALUE DECOMPOSITION
Take only k << r singular values: A_k = U_k \Sigma_k V_k^T.
A datapoint d can now be represented by a k dimensional point.
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
- Original: 2448 × 3264 ≈ 8 mln numbers
- SVD with the 15 largest SVs: ≈ 1% of the data
- SVD with the 75 largest SVs: ≈ 5% of the data
[Figures: original photo and the two SVD reconstructions]
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated, so use only one input per cluster for predictive modeling. For example, out of X1, X2, X3, ..., X500:
- cluster X1, X21, X35, X430, ... → use X35
- cluster X17, X29, X353, X490, ... → use X29
- cluster X37, X95, X251, X393, ... → use X251
A hedged PROC VARCLUS sketch follows after this list.
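In SAS this is available through PROC VARCLUS; a minimal sketch (dataset and variable names are illustrative):

proc varclus data=inputs maxclusters=10 short;
   var x1-x500;    /* divide the 500 inputs into at most 10 correlated clusters */
run;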
[Screenshots: variable clustering output in SAS]
BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough, let multiple models vote for a prediction: bootstrap aggregation (bagging).
This only makes sense if the underlying models are different enough and have some predictive power.
[Figure: data → random samples → individual models → final model]
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Randomly choose m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree, the random forest prediction is the majority vote of all trees; in case of a regression tree, it is the average of all trees. A hedged SAS sketch follows below.
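In SAS a random forest can be fit with PROC HPFOREST; a minimal sketch (dataset, target and input names are illustrative):

proc hpforest data=mydata maxtrees=500 vars_to_try=10;
   target y / level=binary;           /* classification target */
   input x1-x100 / level=interval;    /* numeric inputs */
run;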
FOREST VS TREE EXAMPLE ON SIMULATED DATA
[Figures: a decision tree and a random forest (100 sub-trees) fitted on the simulated data]
It is clear to see that the forest can produce much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient boosting performs M iterations, m = 1, 2, ..., M. At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo residuals, using the inputs x, to "correct" the previous learner; pseudo residuals r_1, r_2, ..., r_M are computed at each step. The model is updated as

F_m = F_{m-1} + γ·h_m

and the final model is F_M.
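For completeness (this definition is standard, not spelled out on the slide): the pseudo residuals in step m are the negative gradients of the loss with respect to the current model,

r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \ldots, n

so for squared-error loss they reduce to the ordinary residuals y_i - F_{m-1}(x_i).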
SUPPORT VECTOR MACHINES

SUPPORT VECTOR MACHINES (SVM)
Suppose we have a separable classification problem: find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
If the problem is not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non linear mappings to the inputs, e.g. x², x³ or spline(x).
The beauty of SVM is that in the calculation of the decision boundary we do not need to use these transformations explicitly: "the kernel trick".
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
- Separable classification
- Non separable classification
- Non separable classification rewritten using the Lagrange dual problem
- Kernels to model nonlinear behaviour
(The formulas belonging to these bullets are reproduced below.)
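The formulas themselves did not survive extraction; the standard formulations (as in The Elements of Statistical Learning, recommended earlier in this deck) are:

Separable:      \min_{\beta, \beta_0} \tfrac{1}{2}\|\beta\|^2  subject to  y_i(x_i^T \beta + \beta_0) \ge 1
Non separable:  \min_{\beta, \beta_0} \tfrac{1}{2}\|\beta\|^2 + C \sum_i \xi_i  subject to  y_i(x_i^T \beta + \beta_0) \ge 1 - \xi_i,  \xi_i \ge 0
Lagrange dual:  \max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j  subject to  0 \le \alpha_i \le C,  \sum_i \alpha_i y_i = 0
Kernels:        replace the inner product x_i^T x_j by K(x_i, x_j), e.g. a polynomial or radial basis kernel.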
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable, but in 3D space they are.
K - NEAREST NEIGHBOUR

K-NN METHOD
- No model is fitted. Given a query point x0, find the k points x1, x2, ..., xk that are closest in distance to x0.
- Classify x0 using the majority vote among the k neighbours.
Example: among the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.
K-NN METHOD
[Figure: decision boundaries for 1 nearest neighbour versus 15 nearest neighbours]
K-NN METHOD
Use different numbers k of nearest neighbours and compare the test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
- handwritten digits
- satellite image scenes
- EKG patterns
A hedged SAS sketch follows after this list.
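In SAS, a k-NN classifier can be obtained via the nonparametric method of PROC DISCRIM; a minimal sketch (dataset and variable names are illustrative):

proc discrim data=train test=score testout=pred method=npar k=5;
   class target;    /* 5-nearest-neighbour classification of the score set */
   var x1-x10;
run;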
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.
K-NN EXAMPLE DUTCH HOUSE PRICES
[Screenshot: comparing different nearest neighbours in SAS Enterprise Miner]
30% of the data was used as validation set. In Enterprise Miner different values for k were used; k=5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION
Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4
[Figure: a single compute node f with constant input 1 and inputs X2, X3, X4, connected by weights w1, w2, w3, w4]
f is the so-called activation function of the neural network compute node. This could be the logit function, but other choices are possible. There are four weights w's that have to be determined.
NEURAL NETWORKS MATHEMATICAL FORMULATION
[Figure: inputs X1 (age), X2 (income), X3 (region), X4 (gender); hidden layer Z1, Z2, Z3 with weights α; output Y with weights β]
In formulas, the prediction of such a neural net is given by composing the hidden layer and the output layer. The functions g and σ are the output and activation functions (their standard definitions are reproduced below); in case of a binary classifier g is the logistic function. The model weights α and β have to be estimated from the data.
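The formulas were lost in extraction; the standard single-hidden-layer formulation (as in Hastie, Tibshirani & Friedman, recommended earlier) is:

Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \ldots, M
f(X) = g(\beta_0 + \beta^T Z), \quad \text{with } \sigma(v) = \frac{1}{1 + e^{-v}}

and for a binary classifier g is again the sigmoid, so that f(X) can be read as a probability.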
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm:
Randomly choose small values for all wi's. For each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual - prediction)²).
3. Adjust the weights w in the direction that decreases E, i.e. the standard gradient descent step w ← w - η·∂E/∂w.
4. Stop if the error E is small enough.
DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price
Too much smoothing and too little smoothing
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLESPLINE REGRESSION
02 is the optimal smoothing paramter
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items → roughly 0.01% of the cells filled.

User - Item Matrix – Data:
           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:                        -      -      1      2      5
After some math… the recommendations are:   3.21   4.82     1      2      5
Recommend item 2!
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b, with the slope equal to 1 (see notes).
Item-item based.
Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j.
Sample rating database:
Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        2       -       5
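A minimal slope-one sketch on this sample table (plain Python; the prediction below is Lucy's rating for item B):

ratings = {"John": {"A": 5, "B": 3, "C": 2},
           "Mark": {"A": 3, "B": 4},
           "Lucy": {"A": 2, "C": 5}}

def predict(user, target):
    num, den = 0.0, 0
    for other, r_other in ratings[user].items():
        # average deviation target - other, over users that rated both items
        diffs = [r[target] - r[other] for r in ratings.values()
                 if target in r and other in r]
        if diffs:
            w = len(diffs)                   # weight w_ij = # common raters
            num += (r_other + sum(diffs) / w) * w
            den += w
    return num / den

print(predict("Lucy", "B"))   # 3.0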
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".
How to determine the neighbors, and how many (k) to use?
How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments
[Figure: the target cell r_ui with its similarity weights w and neighbors N.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION

a, b = users; r_{a,p} = rating of user a for item p; P = set of items rated both by a and b.
Possible similarity values between −1 and 1.

$$ sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar r_a)(r_{b,p} - \bar r_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar r_a)^2}\ \sqrt{\sum_{p \in P} (r_{b,p} - \bar r_b)^2}} $$
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?
[Diagram: the m × n user–item matrix R (rows = users, columns = items) is factorized as R ≈ UᵀV, with U a k × m matrix and V a k × n matrix.]
Select a loss function (squared error).
Select the number of hidden factors k.
Optimization problem: L-BFGS, ALS.
Predict a new rating: $\hat R_{ij} = U_i^T V_j$.
Minimize the prediction error:
$$ \min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^T V_j \right)^2 + \lambda \left( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \right) $$
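A minimal sketch of this factorization, assuming numpy and using plain stochastic gradient descent on the observed cells instead of L-BFGS/ALS (toy matrix, 0 = unobserved):

import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
m, n = R.shape
k, lam, lr = 2, 0.02, 0.01            # hidden factors, lambda, learning rate

rng = np.random.default_rng(123)
U = rng.normal(scale=0.1, size=(k, m))
V = rng.normal(scale=0.1, size=(k, n))

for _ in range(5000):
    for i, j in zip(*R.nonzero()):                # observed ratings only
        err = R[i, j] - U[:, i] @ V[:, j]
        U[:, i] += lr * (err * V[:, j] - lam * U[:, i])
        V[:, j] += lr * (err * U[:, i] - lam * V[:, j])

print(np.round(U.T @ V, 1))   # predictions, including the missing cells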
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Cluster first, then apply knn within one subgroup.
[Diagram: user/item profiles and user/item ratings → clustering → predictions.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C
IF item X THEN item Y
Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.
$$ \mathrm{Support}(X \Rightarrow Y) = \frac{\#\,\mathrm{trxs\ with\ } X \mathrm{\ and\ } Y}{\mathrm{total}\ \#\,\mathrm{trxs}}, \qquad \mathrm{Lift} = \frac{\mathrm{Support}(X,Y)}{\mathrm{Support}(X)\,\mathrm{Support}(Y)} $$
Support & lift examples: Diapers → Beer 0.8; Diapers → Candles 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
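A small worked example with hypothetical counts: out of 1000 transactions, 120 contain diapers (X), 150 contain beer (Y), and 30 contain both:
$$ \mathrm{Support}(X,Y) = \tfrac{30}{1000} = 0.03, \quad \mathrm{Support}(X) = 0.12, \quad \mathrm{Support}(Y) = 0.15 $$
$$ \mathrm{Lift} = \frac{0.03}{0.12 \times 0.15} \approx 1.67 $$
So diaper buyers are about 1.67 times as likely to also buy beer as an average customer.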
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rs.IENS;
  /* Add a recommendation system */
  ADD rs.IENS item = item user = user rating = rating;
  /* Add tables */
  ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
  /* Method SVD LBFGS with 20 factors */
  METHOD svd /
    factors   = 20
    label     = svd
    fconv     = 1e-3
    gconv     = 1e-3
    maxiter   = 100
    MAXFEVAL  = 5000
    function  = L2
    lamda     = 0.2
    technique = lbfgs;
  RUN;
  METHOD arm /
    label = ARM;
  RUN;
  /* information on the recommender system */
  INFO;
QUIT;
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
  PREDICT /
    method = svd
    label  = svd
    Num    = 3
    users  = ("Longhow Lam");
run;
QUIT;
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So: can a machine read, see and hear?
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs. given review score.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
R² linear regression = 0.5; R² neural net = 0.6
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
First 100 digits of the MNIST data, with their KNOWN labels in red.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8-nearest neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
Setup: 70/30 training/validation split. Techniques tried:
• PCA regression on the 50 largest PC's
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels. We see some obvious mistakes…
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION
0.2 is the optimal smoothing parameter.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makes/models with spline estimates of car depreciation versus kilometres driven.
Hmmm… my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left…
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines. Supports:
• more than one input
• linear, logistic, Poisson, GLM regressions
• combining regression splines and model selection methods
• partitioning of data into training, validation and testing roles
(A small illustration follows below.)
SPLINE REGRESSION
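A minimal sketch of a smoothing-spline fit, assuming scipy and hypothetical depreciation data (the smoothing factor s plays the role of the tuning parameter mentioned above):

import numpy as np
from scipy.interpolate import UnivariateSpline

km = np.linspace(0, 150_000, 60)
rng = np.random.default_rng(1)
value = 100 * np.exp(-km / 90_000) + rng.normal(0, 3, km.size)  # fake % of value

spline = UnivariateSpline(km, value, s=km.size * 9)   # larger s = smoother curve
print(spline(50_000))   # estimated % of original value at 50,000 km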
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work? A simple example. Suppose we have the following group of people: 50% response, 50% no response.
We have/know: age and marital status.
[Example tree: the root is 50/50; splitting on age gives 30/70 for Age ≤ 45 and 60/40 for Age > 45; the Age ≤ 45 node splits on marital status into 20/80 for Married/Divorced and 60/40 for Unmarried.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
… … … … … …
… … … … … …
Y 46 A 657 21 X
A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split on
4. On the two new data sets, apply steps 1, 2, 3 again…
5. Stop somewhere…
• How to split? (X1 or X2, and where?)
• When to stop?
(A split-search sketch follows below.)
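A minimal sketch of step 2 for one numeric input, assuming numpy: every midpoint between consecutive x values is a candidate split, scored by the summed squared error of the two halves (toy data):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.1, 0.9, 1.0, 1.2, 3.9, 4.1, 4.0, 3.8])

def sse(v):                     # squared error around the group mean
    return ((v - v.mean()) ** 2).sum() if v.size else 0.0

candidates = [(x[i] + x[i + 1]) / 2 for i in range(x.size - 1)]
best = min(candidates, key=lambda t: sse(y[x <= t]) + sse(y[x > t]))
print(best)   # 4.5 -- separates the low-y points from the high-y points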
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.
Why is the split X1 < t1 better than X1 < s1? Judge it with a criterion:
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared
[Figure: regression tree, mean squared error for split s1 vs. split t1 on an x–Y scatter.]
REGRESSION & CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.
Why is the split X1 < t1 better than X1 < s1? Judge it with a criterion:
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared
[Figure: classification tree, misclassification rate for split s1 vs. split t1.]
REGRESSION & CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regression & classification)
When to stop? Not too early, not too late.
Pruning: remove parts of the tree.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C4.5, C5.0
CART (Classification And Regression Trees)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees: pros and cons.
Pros: interaction between variables; interpretable rules; missing values are easy to incorporate.
Cons: unstable; "lack of smoothness"; obvious (non-)linear relations take many splits to fit.
[Figures: an example response-rate tree splitting on man/woman, income < 45K and age < 33, and a regression tree on the Opel Astra data.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that: the largest variance is in the first coordinate, the second largest variance is in the second coordinate, etc…
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS

[Figure: scatter of data points in the (X1, X2) plane, with the principal directions P1 and P2 drawn through the point cloud.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS

[Figure: the same data re-expressed in the principal components P1 and P2.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS
The Math behind
With two dimensions:
$$ P = XW : \quad \begin{bmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{bmatrix} \begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{bmatrix} $$
w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.
In general, it turns out that the columns of W are the eigenvectors of the matrix $X^TX$.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here!
Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PC's instead of the original inputs
(A short sketch follows below.)
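A minimal sketch, assuming numpy: PCA on standardized inputs via the eigenvectors of XᵀX, keeping the first two components (toy data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # two correlated inputs

Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # scaling matters
eigvals, W = np.linalg.eigh(Xs.T @ Xs)           # columns of W = eigenvectors
W_L = W[:, np.argsort(eigvals)[::-1][:2]]        # loadings of the 2 largest PCs

P_L = Xs @ W_L                                   # scores, ready for a scatter plot
print(P_L.shape)   # (200, 2)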
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS: DIMENSION REDUCTION

$P = XW$. Now take only the first L columns of W: $P_L = XW_L$.
For example, for visualization only use the first 2 or 3 columns, so that $P_L$ only has 2 or 3 columns that can be visualized in scatter or contour plots.
Dimensions: P (10000 × 100) = X (10000 × 100) · W (100 × 100), and P_L (10000 × 2) = X (10000 × 100) · W_L (100 × 2).
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition:
$$ A = U \Sigma V^T $$
with $\Sigma$ diagonal with r singular values [r could be a large number].
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Take only $k \ll r$ singular values:
$$ A_k = U_k \Sigma_k V_k^T $$
A data point d can now be represented by a k-dimensional point.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 ≈ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD: 15 largest SV's ≈ 1% of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD: 75 largest SV's ≈ 5% of the data
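A minimal sketch of this rank-k reconstruction, assuming numpy (random grayscale values stand in for the photo; the SVD of a matrix this size takes a little while):

import numpy as np

img = np.random.default_rng(7).random((2448, 3264))   # stand-in for the photo

U, s, Vt = np.linalg.svd(img, full_matrices=False)
k = 15
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # A_k = U_k Sigma_k V_k^T

# storage: k * (rows + cols + 1) numbers instead of rows * cols
print(k * (2448 + 3264 + 1) / img.size)               # ~0.01 of the data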
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling.
[Diagram: X1, X2, X3, …, X500 grouped into clusters, e.g. {X1, X21, X35, X430, …} represented by X35; {X17, X29, X353, X490, …} by X29; {X37, X95, X251, X393, …} by X251.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough: let multiple models vote for a prediction — bootstrap aggregation (bagging).
This only makes sense if the underlying models are different enough and have some predictive power.
[Diagram: random samples drawn from the data, one model fitted per sample, combined into a final model.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging & Boosting: Random Forests

Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
(A sketch follows below.)
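A minimal sketch of those steps, assuming numpy and scikit-learn; for brevity the m inputs are drawn once per tree, whereas a real random forest redraws them at every split:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                       # P = 10 inputs
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)

trees = []
for _ in range(100):
    rows = rng.integers(0, len(X), len(X))           # 1. bootstrap sample
    cols = rng.choice(10, size=3, replace=False)     # 2. choose m = 3 inputs
    fit = DecisionTreeRegressor().fit(X[rows][:, cols], y[rows])  # 3. no pruning
    trees.append((fit, cols))

x_new = rng.normal(size=(1, 10))
print(np.mean([t.predict(x_new[:, c])[0] for t, c in trees]))  # average of all trees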
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient boosting: M iterations, m = 1, 2, …, M.
At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals, using the inputs x, to "correct" the previous learner:
F_m = F_{m−1} + γ·h_m
[Diagram: the inputs x and the pseudo-residuals r_1, r_2, …, r_M at each step feed the successive base learners; the final model is F_M.]
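A minimal sketch of this loop for squared-error loss, assuming numpy and scikit-learn (with squared error, the pseudo-residuals are just the ordinary residuals):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

gamma, M = 0.1, 100
F = np.full_like(y, y.mean())                 # F_0: constant start model
for m in range(M):
    r = y - F                                 # pseudo-residuals r_m
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F = F + gamma * h.predict(X)              # F_m = F_{m-1} + gamma * h_m

print(np.mean((y - F) ** 2))                  # training error after M rounds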
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM). Suppose we have a separable classification problem.
Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
(The standard forms are sketched below.)
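The formulas on this slide were lost in the export; the standard forms (as in The Elements of Statistical Learning) are:

Separable:
$$ \min_{\beta, \beta_0} \tfrac{1}{2}\lVert\beta\rVert^2 \quad \text{s.t. } y_i (x_i^T \beta + \beta_0) \ge 1 $$
Non-separable (slack variables $\xi_i$, budget C):
$$ \min_{\beta, \beta_0} \tfrac{1}{2}\lVert\beta\rVert^2 + C \sum_i \xi_i \quad \text{s.t. } y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i,\ \xi_i \ge 0 $$
Lagrange dual:
$$ \max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t. } 0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0 $$
Kernels: the inner product $x_i^T x_j$ is replaced by $K(x_i, x_j)$, e.g. a polynomial or radial basis kernel.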
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linearly not separable, but in 3D space they are.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
[Figure: the 5 nearest neighbours of x0 — 3 of them are red, 2 of them are green, so we predict x0 to be red.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
[Figures: decision boundaries for 1 nearest neighbour and for 15 nearest neighbours.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours and compare test and training errors.
Despite its simplicity, k-nearest neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extracted house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price: estimate the price by taking the k closest house-for-sale prices (a sketch follows below).
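A minimal sketch, assuming numpy; coordinates and prices are hypothetical stand-ins for the scraped postal-code data:

import numpy as np

rng = np.random.default_rng(2)
coords = rng.uniform(0, 100, size=(1000, 2))          # postal codes with a price
prices = 200_000 + 50_000 * rng.normal(size=1000)     # their asking prices

def knn_price(query_xy, k=5):
    d = np.linalg.norm(coords - query_xy, axis=1)     # distance to every code
    return prices[np.argsort(d)[:k]].mean()           # average of the k closest

print(knn_price(np.array([50.0, 50.0])))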
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k = 5 nearest neighbours has the lowest average squared error.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4
[Diagram: a constant input 1 and the inputs X2, X3, X4 enter a single compute node f with weights w1, w2, w3, w4.]
f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w's that have to be determined.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula, the prediction formula for a NN is given by composing the layers.
[Diagram: the inputs X1–X4 (age, income, region, gender) connect through weights α to a hidden layer Z1, Z2, Z3, which connects through weights β to the output Y.]
The functions g and σ are the output and activation functions; in case of a binary classifier the output is a probability. The model weights α and β have to be estimated from the data.
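The formulas themselves were lost in the export; in the usual notation (cf. The Elements of Statistical Learning), a single-hidden-layer network is:

$$ Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M $$
$$ Y = g(\beta_0 + \beta^T Z), \qquad \sigma(v) = \frac{1}{1 + e^{-v}} $$
For a binary classifier, g can be the logistic function, so that Y is a probability.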
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all w_i's. For each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w against the gradient: w ← w − η · ∂E/∂w
4. Stop if the error E is small enough
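A minimal numeric sketch of that update rule on a single weight (model prediction = w·x, squared error, learning rate η = 0.05; toy numbers):

x, y = 2.0, 10.0          # one data point; the best weight would be 5
w, eta = 0.0, 0.05

for step in range(100):
    pred = w * x
    grad = -2 * (y - pred) * x    # dE/dw for E = (y - w*x)^2
    w -= eta * grad               # adjust w against the gradient
print(w)                          # converges towards 5.0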
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use the inputs to predict the inputs.
[Diagram: the inputs X1–X4 are ENCODEd into a small middle layer and DECODEd back to X1–X4.]
A linear activation function with a 2-dimensional middle layer corresponds to 2-dimensional principal components analysis; the 2-dimensional middle layer can be used for visualisation.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes.
[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT.]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2-dimensional PCA vs. an autoencoder network 25 – 15 – 2 – 15 – 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Some other car makemodels with spline estimates of car depreciation versus kilometres driven
Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MODELING NON LINEARITIES
In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines
Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles
SPLINE REGRESSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS: PEARSON CORRELATION
a, b : users
r_{a,p} : rating of user a for item p
P : set of items rated both by a and b
• Possible similarity values between -1 and 1

sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2} \, \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}
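A direct transcription of this formula as a Python/numpy sketch (the helper name pearson_sim is ours):

import numpy as np
def pearson_sim(ra, rb):
    """Similarity of users a and b over the items both have rated (nan = not rated)."""
    both = ~np.isnan(ra) & ~np.isnan(rb)
    da = ra[both] - ra[both].mean()
    db = rb[both] - rb[both].mean()
    return (da * db).sum() / np.sqrt((da**2).sum() * (db**2).sum())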
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS: MATRIX FACTORIZATION
How do we fill in the missing data?
Approximate the m × n user-item rating matrix R by a product of two low-rank matrices: R ≈ U V, with U of size m × k (users) and V of size k × n (items).
• Select loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS
Predict new rating: \hat{R}_{ij} = U_i^T V_j
Minimize prediction error: \min_{U,V} \sum_{i,j} (R_{ij} - U_i^T V_j)^2 + \lambda ( \|U_i\|^2 + \|V_j\|^2 )
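A minimal sketch of this objective (Python/numpy, using plain stochastic gradient steps rather than the L-BFGS/ALS solvers named on the slide; the toy matrix is made up):

import numpy as np
rng = np.random.default_rng(0)
R = np.array([[3, 2, 5, 4, 5],
              [np.nan, np.nan, np.nan, 1, 1],
              [1, np.nan, 2, 5, np.nan]], dtype=float)   # tiny user-item matrix
m, n, k, lam, lr = R.shape[0], R.shape[1], 2, 0.1, 0.02
U = 0.1 * rng.standard_normal((m, k))
V = 0.1 * rng.standard_normal((k, n))
obs = list(zip(*np.where(~np.isnan(R))))                 # observed (user, item) cells
for _ in range(2000):                                    # stochastic gradient steps
    i, j = obs[rng.integers(len(obs))]
    ui, vj = U[i].copy(), V[:, j].copy()
    err = R[i, j] - ui @ vj
    U[i]    += lr * (err * vj - lam * ui)
    V[:, j] += lr * (err * ui - lam * vj)
print(U @ V)                                             # filled-in rating matrix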
RE METHODS: CLUSTER
First apply clustering on the user/item profiles or user/item ratings; then apply kNN within one subgroup to generate the predictions.
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc. rules mining: identify frequent itemsets (rules) in the transaction data:
IF item A and B THEN item C
IF item X THEN item Y
Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule.
Support(X → Y) = (# trxs with X and Y) / (total # trxs)
Lift(X → Y) = Support(X, Y) / (Support(X) · Support(Y))
Support & lift examples: Diapers – Beer: 0.8; Diapers – Candles: 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
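Support and lift come down to simple counting; a minimal Python sketch over invented toy transactions:

# toy transactions as sets of items
trxs = [{"diapers", "beer"}, {"diapers", "beer", "candles"},
        {"beer"}, {"diapers"}, {"beer", "candles"}]
def support(items):
    """Fraction of transactions containing all of the given items."""
    return sum(items <= t for t in trxs) / len(trxs)
def lift(x, y):
    return support(x | y) / (support(x) * support(y))
print(support({"diapers", "beer"}), lift({"diapers"}, {"beer"}))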
METHOD: ENSEMBLE
Take a linear combination of the previous methods to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd / factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3
                maxiter = 100 maxfeval = 5000 function = L2
                lambda = 0.2 technique = lbfgs;
   RUN;
   METHOD arm / label = "ARM";
   RUN;
   /* Information on the recommender system */
   INFO;
QUIT;
/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT / method = svd label = "svd" num = 3 users = ("Longhow Lam");
RUN;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING (COMPARED TO TRADITIONAL LINEAR / LOGISTIC REGRESSION)
CONS:
• Unfamiliar to a broader audience; (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem
PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING?
• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces…
So can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
Predicted review score vs. given review score:
R² linear regression = 0.5; R² neural net = 0.6
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES
70/30 training/validation split:
• PCA regression on the 50 largest PCs
• Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors
8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA: APPLY MODEL ON TEST SET
28,000 digits without known labels. Our best model predicted the label for these digits.
The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are predicted labels; we see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[Two recorded digit sounds: "1" and "2"]
MODELING NON LINEARITIES: SPLINE REGRESSION
In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines.
ADAPTIVEREG supports more than one input; linear, logistic, Poisson and GLM regressions; combines both regression splines and model selection methods; and supports partitioning of the data into training, validation and testing roles.
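A minimal sketch of the regression-spline idea (Python/numpy, not the ADAPTIVEREG implementation): expand x into hinge functions max(0, x − t) at assumed knot locations and fit by least squares.

import numpy as np
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.2 * rng.standard_normal(200)        # non-linear toy data
knots = [2.5, 5.0, 7.5]                               # assumed knot locations
B = np.column_stack([np.ones_like(x), x] + [np.maximum(0, x - t) for t in knots])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)          # piecewise-linear spline fit
yhat = B @ coef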
DECISION TREES
DECISION TREES
How does it work? A simple example. Suppose we have the following group of people: 50% response, 50% no response. We have/know: age and marital status.
Root: [50 / 50]
Split on age: Age ≤ 45 → [30 / 70]; Age > 45 → [60 / 40]
Split on marital status: Married/Divorced → [20 / 80]; Unmarried → [60 / 40]
DECISION TREES: REGRESSION & CLASSIFICATION
Target   X1   X2   X3    X4   X5
Y        12   A    456   12   X
N        21   B    456   15   X
Y        32   A    545   13   U
Y        34   C    443   11   U
N        23   A    345   17   U
N        13   B    567   12   X
N        45   A    654   19   X
…        …    …    …     …    …
Y        46   A    657   21   X
A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split
4. On the two new data sets, apply 1, 2, 3 again…
5. Stop somewhere…
• How to split? X1 or X2?
• When to stop?
(A small split-search sketch follows below.)
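As promised, a minimal sketch of step 2 (Python/numpy, our illustration of an exhaustive split search; here with the squared-error criterion and the X1 column of the table above, target coded 1/0):

import numpy as np
def best_split(x, y):
    """Exhaustively search the threshold t that minimizes total squared error."""
    best_t, best_err = None, np.inf
    for t in np.unique(x)[1:]:              # candidate thresholds between data points
        left, right = y[x < t], y[x >= t]
        err = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
x = np.array([12, 21, 32, 34, 23, 13, 45, 46.])
y = np.array([1, 0, 1, 1, 0, 0, 0, 1.])     # Y/N target coded 1/0
print(best_split(x, y))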
DECISION TREES: REGRESSION & CLASSIFICATION
How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.
Why is the split X1 < t1 better than X1 < s1?
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared
[Figure: regression tree, mean square error — Y against x, split s1 vs. split t1]
[Figure: classification tree, misclassification rate — x, split s1 vs. split t1]
DECISION TREES (REGRESSION & CLASSIFICATION)
When to stop? Not too early, not too late!
Pruning: remove parts of the tree.
DECISION TREES: SOME COMMON TYPES
• CHAID (chi-squared automatic interaction detection)
• C4.5, C5.0
• CART (Classification And Regression Trees)
The difference is mainly in the different splitting options.
DECISION TREES: PROS AND CONS
Pros: interaction between variables; interpretable rules; missing values easy to incorporate.
Cons: unstable; "lack-of-smoothness"; fit of obvious (non)linear relations.
[Example figure: response rates for Opel Astras, split by gender (male/female), income < 45K and age < 33]
DIMENSION REDUCTION
PRINCIPAL COMPONENTS ANALYSIS
Linear transformation of the data to uncorrelated data.
The transformation W is such that: the largest variance is in the first coordinate; the second largest variance is in the second coordinate; etc…
PRINCIPAL COMPONENTS ANALYSIS
[Figure: data points in the (X1, X2) plane with principal directions P1 and P2]
[Figure: the same data rotated onto the principal components P1 and P2]
PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND
With two dimensions, P = XW:

\begin{bmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{bmatrix} \begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{bmatrix}

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.
In general, it turns out that the columns of W are the eigenvectors of the matrix X^T X.
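This is short enough to write out directly; a Python/numpy sketch on invented correlated data (scaling first, as the next slide stresses):

import numpy as np
rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))  # correlated toy data
X = (X - X.mean(axis=0)) / X.std(axis=0)       # scaling the inputs matters
eigval, W = np.linalg.eigh(X.T @ X)            # columns of W = eigenvectors of X'X
order = np.argsort(eigval)[::-1]               # sort by decreasing variance
W = W[:, order]
P = X @ W                                      # uncorrelated principal components
PL = X @ W[:, :2]                              # first 2 columns: dimension reduction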
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here!
Applications of PCA: dimension reduction; visualisation; outlier / anomaly detection; PCA regression (use the PCs instead of the original inputs).
PRINCIPAL COMPONENTS: DIMENSION REDUCTION
P = XW. Now only take the first L columns of W: P_L = X W_L.
For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.
Dimensions: P = XW: (10000 × 100) = (10000 × 100) · (100 × 100)
P_L = X W_L: (10000 × 2) = (10000 × 100) · (100 × 2)
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal with r singular values [r could be a large number].
SINGULAR VALUE DECOMPOSITION
Take only k << r singular values: A_k = U_k Σ_k V_k^T.
A datapoint d can now be represented by a k-dimensional point.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 pixels ≈ 8 mln numbers.
SVD with the 15 largest SVs: ≈ 1% of the data.
SVD with the 75 largest SVs: ≈ 5% of the data.
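What these slides do comes down to a few lines; a Python/numpy sketch, with a random array standing in for the greyscale photo (same 2448 × 3264 shape as on the slide):

import numpy as np
img = np.random.rand(2448, 3264)               # stand-in for the greyscale photo
U, s, Vt = np.linalg.svd(img, full_matrices=False)
k = 15                                         # keep the 15 largest singular values
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # low-rank approximation of the image
kept = k * (U.shape[0] + Vt.shape[1] + 1)      # numbers stored for the approximation
print(kept / img.size)                         # ≈ 0.01, the "1% of the data" above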
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.
X1, X2, X3, …, X500 →
cluster {X1, X21, X35, X430, …} → use X35
cluster {X17, X29, X353, X490, …} → use X29
cluster {X37, X95, X251, X393, …} → use X251
(A clustering sketch follows below.)
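A minimal sketch of the idea (Python with scipy hierarchical clustering on 1 − |correlation|; this is our illustration, not the Enterprise Miner Variable Clustering node shown here):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
rng = np.random.default_rng(3)
X = rng.standard_normal((500, 20)) + rng.standard_normal((500, 1))  # correlated inputs
D = 1 - np.abs(np.corrcoef(X, rowvar=False))        # distance = 1 - |correlation|
np.fill_diagonal(D, 0.0)
labels = fcluster(linkage(squareform(D, checks=False), method="average"),
                  t=10, criterion="maxclust")       # at most 10 clusters of inputs
reps = [np.where(labels == c)[0][0] for c in np.unique(labels)]  # one input per cluster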
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough, let multiple models vote for a prediction: bootstrap aggregation (bagging). Take random samples from the data, fit a model on each sample, and combine them into a final model.
This only makes sense if the underlying models are different enough and have some predictive power.
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees.
Apply the underlying steps repeatedly (see the sketch after this list):
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
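A minimal sketch with scikit-learn (our library choice for illustration, not the SAS tooling used in the deck); max_features corresponds to the m << P random inputs per split:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
rng = np.random.default_rng(4)
X = rng.standard_normal((500, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # non-linear toy target
rf = RandomForestClassifier(n_estimators=100,    # 100 bootstrapped trees
                            max_features="sqrt") # m << P inputs per split
rf.fit(X, y)
print(rf.predict(X[:5]))                         # majority vote of all trees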
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
It is clear that the forest can produce much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW
Gradient boosting: M iterations, m = 1, 2, …, M.
At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals, using inputs x, to "correct" the previous learner. Pseudo-residuals r_{i,m} are computed at each step:
F_m = F_{m-1} + γ·h_m
[Figure: inputs x and residuals r1, r2, …, rM feeding successive base learners, combined into the final model F_M]
(A small sketch of this loop follows below.)
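The schematic as a short Python sketch (scikit-learn trees as base learners, squared-error loss, a fixed step γ; all choices here are ours for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, (300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)
F = np.full_like(y, y.mean())        # F0: start from the mean
gamma, learners = 0.1, []
for m in range(100):                 # M = 100 boosting iterations
    r = y - F                        # pseudo-residuals for squared-error loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F = F + gamma * h.predict(X)     # Fm = Fm-1 + gamma * hm
    learners.append(h)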
SUPPORT VECTOR MACHINES
Support vector machines (SVM): suppose we have a separable classification problem.
Find a linear decision boundary between the two groups with maximum margin M; so the green line would be better than the blue line.
If the problem is not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).
The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
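A minimal scikit-learn sketch (our library choice for illustration): C bounds the total penalty and kernel="rbf" applies the kernel trick, so the non-linear mapping never has to be written out.

import numpy as np
from sklearn.svm import SVC
rng = np.random.default_rng(6)
X = rng.standard_normal((200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)   # not linearly separable
clf = SVC(C=1.0, kernel="rbf")                  # C: penalty budget; rbf: kernel trick
clf.fit(X, y)
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))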
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model nonlinear behaviour
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable, but in 3D space they are.
K ndash NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
Example: take the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red. (A sketch of exactly this vote follows below.)
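The whole method in a few lines of Python/numpy (toy data invented for illustration):

import numpy as np
def knn_predict(X, y, x0, k=5):
    """Majority vote among the k training points closest to query x0."""
    d = np.linalg.norm(X - x0, axis=1)        # distances to all training points
    nn = np.argsort(d)[:k]                    # indices of the k nearest neighbours
    return np.bincount(y[nn]).argmax()
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
y = np.array([0, 0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))   # -> 0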
K-NN METHOD
[Figure: decision boundaries with 1 nearest neighbour vs. 15 nearest neighbours]
K-NN METHOD
Use different numbers k of nearest neighbours and compare the test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like: handwritten digits, satellite image scenes, EKG patterns.
K-NN EXAMPLE: DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE: DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values for k were used; k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION
Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4
[Figure: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding one compute node f]
f is the so-called activation function. This could be the logit function, but other choices are possible.
There are four weights w's that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION
In formula, the prediction of a NN is given by a network with inputs X (X1 = age, X2 = income, X3 = region, X4 = gender), a hidden layer (Z1, Z2, Z3) and output Y, with weights α (inputs → hidden layer) and β (hidden layer → output).
The functions g and σ are defined as Z_m = σ(α_m^T X) and Y = g(β^T Z), where σ is typically the sigmoid σ(v) = 1 / (1 + e^{−v}). In case of a binary classifier, g is for example the logistic function.
The model weights α and β have to be estimated from the data.
NEURAL NETWORKS: ESTIMATING THE WEIGHTS
Back propagation algorithm: randomly choose small values for all w_i's. For each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w with a gradient step, w ← w − η·∂E/∂w
4. Stop if the error E is small enough
(A tiny numeric sketch of these steps follows below.)
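A minimal one-hidden-layer sketch (Python/numpy; sigmoid for σ, identity for g, plain batch gradient steps — standard choices assumed for illustration, not taken from the slides):

import numpy as np
rng = np.random.default_rng(7)
X = rng.standard_normal((200, 4))
y = X[:, 0] - 2 * X[:, 1]                   # toy regression target
alpha = 0.1 * rng.standard_normal((4, 3))   # input -> hidden weights
beta  = 0.1 * rng.standard_normal(3)        # hidden -> output weights
sigmoid = lambda v: 1 / (1 + np.exp(-v))
eta = 0.05
for _ in range(500):
    Z = sigmoid(X @ alpha)                  # hidden layer
    pred = Z @ beta                         # output (identity g)
    err = pred - y
    dbeta = Z.T @ err / len(y)              # gradients of the squared error
    dalpha = X.T @ ((err[:, None] * beta) * Z * (1 - Z)) / len(y)
    beta -= eta * dbeta
    alpha -= eta * dalpha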
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS
https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.
[Figure: X1–X4 → ENCODE → 2-dimensional middle layer → DECODE → X1–X4]
A linear activation function corresponds with 2-dimensional principal components analysis.
The 2-dimensional middle layer can be used for visualisation.
NEURAL NETS: AUTOENCODERS
https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Often more hidden layers with many nodes are used.
[Figure: INPUT → ENCODE → DECODE → OUTPUT = INPUT]
NEURAL NET: CARS EXAMPLE
2-dimensional PCA vs. an autoencoder network 25 – 15 – 2 – 15 – 25.
NEURAL NETS: AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE
proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
(A small sketch of estimating one conditional probability table follows below.)
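A minimal sketch of the third bullet (Python/pandas; the smoker → cough edge and the data are invented for illustration):

import pandas as pd
# toy training data: two binary variables with an assumed edge smoker -> cough
df = pd.DataFrame({"smoker": [1, 1, 0, 0, 1, 0, 1, 0],
                   "cough":  [1, 0, 0, 0, 1, 1, 1, 0]})
# conditional probability table P(cough | smoker), one row per parent value
cpt = pd.crosstab(df["smoker"], df["cough"], normalize="index")
print(cpt)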
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response
We haveknow Age and Marital Status
5050
Agele 45 Agegt 45
3070
6040
MarriedDivorced UnMarried
2080
6040
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So, can a machine read, see and hear?
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R² linear regression = 0.5, R² neural net = 0.6
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
First 100 digits of the MNIST data and their KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
70/30 training/validation split
PCA regression on the 50 largest PCs
Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
Seven multi-layer neural nets
Three random forests: 100, 500 and 1000 trees
8, 16 and 24 nearest neighbours
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are the predicted labels. We see some obvious mistakes…
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How does it work? A simple example. Suppose we have the following group of people: 50% Response, 50% No Response.
We have/know Age and Marital Status.
All: 50 / 50
├─ Age ≤ 45: 30 / 70
│   ├─ Married/Divorced: 20 / 80
│   └─ Unmarried: 60 / 40
└─ Age > 45: 60 / 40
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES: REGRESSION & CLASSIFICATION
Target  X1   X2   X3    X4   X5
Y       12   A    456   12   X
N       21   B    456   15   X
Y       32   A    545   13   U
Y       34   C    443   11   U
N       23   A    345   17   U
N       13   B    567   12   X
N       45   A    654   19   X
…       …    …    …     …    …
Y       46   A    657   21   X
A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split
4. On the two new data sets, apply 1-2-3 again…
5. Stop somewhere…
• How to split? X1 or X2? • When to stop?
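In SAS such a tree can be grown with PROC HPSPLIT. A minimal sketch, assuming a data set mydata with a categorical target response and inputs age and marital_status (these names are illustrative, not from the slides):

proc hpsplit data=mydata maxdepth=4;
   class response marital_status;          /* categorical target and input */
   model response = age marital_status;    /* recursive splitting on the inputs */
   grow entropy;                           /* splitting criterion (cross-entropy) */
   prune costcomplexity;                   /* stop "not too early, not too late" */
run;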
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES: REGRESSION & CLASSIFICATION
How to split? The number of splits is usually 2 or 3; more splits will exhaust the data too fast.
Why is the split X1 < t1 better than X1 < s1? Regression: mean squared error. Classification: misclassification rate, cross-entropy, chi-squared.
Regression tree: mean squared error.
[figure: Y vs. x scatter with candidate splits s1 and t1]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES: REGRESSION & CLASSIFICATION
How to split? The number of splits is usually 2 or 3; more splits will exhaust the data too fast.
Why is the split X1 < t1 better than X1 < s1? Regression: mean squared error. Classification: misclassification rate, cross-entropy, chi-squared.
Classification tree: misclassification rate.
[figure: two classes along x with candidate splits s1 and t1]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regression & classification)
When to stop? Not too early, not too late.
Pruning: remove parts of the tree.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C4.5, C5.0
CART (Classification And Regression Trees)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees: pros and cons
Pros: interaction between variables; interpretable rules; missing values easy to incorporate.
Cons: unstable; "lack of smoothness"; hard to fit obvious (non)linear relations.
[example tree: male / female, income < 45K, age < 33, response rates per segment, 'Opel Astras']
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data.
The transformation W is such that the largest variance is in the first coordinate, the second largest variance is in the second coordinate, etc.…
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS
[figure: scatter of data points in (X1, X2) with principal directions P1 and P2]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS
[figure: the same data plotted in the transformed coordinates P1 and P2]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS
The math behind: P = X·W. With two dimensions:
[ p11  p21 ]   [ x11  x21 ]
[ ...  ... ] = [ ...  ... ] · [ w11  w21 ]
[ p1n  p2n ]   [ x1n  x2n ]   [ w12  w22 ]
w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.
In general, it turns out that the columns of W are the eigenvectors of the matrix XᵀX.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here.
Applications of PCA: dimension reduction; visualisation;
outlier / anomaly detection;
PCA regression: use the PCs instead of the original inputs.
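A minimal sketch of PCA in SAS with PROC PRINCOMP, using sashelp.cars as an illustrative input (PRINCOMP works on the correlation matrix by default, i.e. it scales the inputs):

proc princomp data=sashelp.cars out=pcs n=2;
   var mpg_city mpg_highway horsepower weight;   /* numeric inputs */
run;
/* pcs now contains the scores Prin1, Prin2 for visualisation or PCA regression */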
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS: DIMENSION REDUCTION
P = X·W. Now only take the first L columns of W: P_L = X·W_L.
For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.
Dimensions: P = X·W → (10000 × 100) = (10000 × 100)(100 × 100); P_L = X·W_L → (10000 × 2) = (10000 × 100)(100 × 2).
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values (r could be a large number).
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values (r could be a large number).
Take only k << r singular values: A ≈ A_k = U_k Σ_k V_kᵀ.
A datapoint d can now be represented by a k-dimensional point.
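A minimal SAS/IML sketch of such a rank-k reconstruction on a toy matrix (the SVD call returns U, the singular values Q, and V):

proc iml;
   A = {4 0 2, 1 3 0, 0 2 5, 3 1 1};            /* toy 4 x 3 matrix */
   call svd(U, Q, V, A);                        /* A = U * diag(Q) * V` */
   k = 2;                                       /* keep only k << r singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;    /* rank-k approximation of A */
   print Q, Ak;
quit;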
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 ≈ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 15 largest SVs: 1% of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 75 largest SVs: 5% of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.
All inputs: X1, X2, X3, …, X500
Cluster {X1, X21, X35, X430, …} → use X35
Cluster {X17, X29, X353, X490, …} → use X29
Cluster {X37, X95, X251, X393, …} → use X251
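A minimal sketch with PROC VARCLUS (data set and input names are illustrative). It builds the clusters; one representative per cluster, e.g. the variable with the lowest 1−R² ratio in the cluster output, can then go into the predictive model:

proc varclus data=mydata maxclusters=10 short;
   var x1-x500;    /* cluster the 500 correlated inputs */
run;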
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING & BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough, let multiple models vote for a prediction: bootstrap aggregation (bagging).
This only makes sense if the underlying models are different enough and have some predictive power.
[diagram: data → random samples → one model per sample → vote → final model]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees.
Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
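A minimal sketch of a random forest in SAS with PROC HPFOREST (data set and variable names are illustrative; vars_to_try plays the role of m):

proc hpforest data=mydata maxtrees=100 vars_to_try=5;
   target y / level=binary;           /* classification: majority vote */
   input x1-x10 / level=interval;     /* m = 5 inputs sampled per split */
run;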
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient boosting: M iterations, m = 1, 2, …, M.
At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals, using inputs x, to "correct" the previous learner: F_m = F_{m−1} + γ·h_m.
The pseudo-residuals r_m are recomputed at each step.
[diagram: inputs x with residuals r1, r2, …, rM feeding the successive learners; final model F_M]
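For the common case of squared-error loss the pseudo-residuals are just the ordinary residuals, which makes the scheme concrete:

r_im = − ∂L(y_i, F(x_i)) / ∂F(x_i) |_{F = F_{m−1}} = y_i − F_{m−1}(x_i)
F_m(x) = F_{m−1}(x) + γ · h_m(x)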
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM): suppose we have a separable classification problem.
Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).
The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model nonlinear behaviour
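The standard textbook formulations of these four problems (cf. The Elements of Statistical Learning) are:

Separable:       min_{w,b} ½‖w‖²                     s.t.  y_i (wᵀx_i + b) ≥ 1
Non-separable:   min_{w,b,ξ} ½‖w‖² + C Σ_i ξ_i       s.t.  y_i (wᵀx_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
Lagrange dual:   max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j   s.t.  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0
Kernel trick:    replace the inner product x_iᵀx_j by a kernel K(x_i, x_j)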
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable, but in 3D space they are.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
Example: among the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
[figures: decision boundaries using 1 nearest neighbour vs. 15 nearest neighbours]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours; compare test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like: handwritten digits, satellite image scenes, EKG patterns.
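In SAS, k-NN classification is available through PROC DISCRIM with a nonparametric method. A minimal sketch, with illustrative data set and variable names:

proc discrim data=train test=score testout=pred
             method=npar k=5;    /* majority vote among the 5 nearest neighbours */
   class label;                  /* the class to predict */
   var x1 x2;                    /* coordinates used for the distance */
run;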
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values for k were used; k=5 nearest neighbours has the lowest average squared error.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS / DEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK: LINEAR REGRESSION
Y = f(X, w) = w1·1 + w2·X2 + w3·X3 + w4·X4
[diagram: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding one compute node f]
f is the so-called activation function. This could be the logit function, but other choices are possible.
There are four weights w's that have to be determined.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS: MATHEMATICAL FORMULATION
In formula, the prediction of a NN with one hidden layer is given by (in the notation of The Elements of Statistical Learning):
Z_m = σ(α_0m + α_mᵀ X),  m = 1, …, M
Y = g(β_0 + βᵀ Z)
[diagram: inputs X1–X4 (age, income, region, gender), hidden layer Z1–Z3, output Y]
The functions g and σ are the output and activation functions; in case of a binary classifier, g is e.g. the logistic function.
The model weights α and β have to be estimated from the data.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS: ESTIMATING THE WEIGHTS
Back-propagation algorithm:
Randomly choose small values for all wi's. For each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w according to the gradient descent step w_new = w − α·∂E/∂w.
4. Stop if the error E is small enough.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.
[diagram: inputs X1–X4 → ENCODE → 2-dimensional middle layer → DECODE → outputs X1–X4]
A linear activation function corresponds with 2-dimensional principal components analysis.
The 2-dimensional middle layer can be used for visualisation.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Often more hidden layers with many nodes.
[diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2-dimensional PCA vs. autoencoder network 25–15–2–15–25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS: AUTOENCODER EXAMPLE
proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two-dimensional representation of the 400-dimensional 'digit' data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
"Advanced" word counting.
Parse & filter: part of speech, entity detection, mixed/numeric abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.
Apply traditional data mining: clustering, prediction, machine learning.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" ("I walk down the street in Amsterdam 1057DK with my bike")
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" ("She did not walk but cycled with her blue bike")
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" ("My two-wheeler is broken, what a bad piece of iron")
TERM-DOCUMENT MATRIX A:
Terms                        Doc 1  Doc 2  Doc 3
+Fiets (noun)                  1      1      1
Fietsen (verb)                 0      1      0
Blauwe (adjective)             0      1      0
Amsterdam (location)           1      0      0
+Lopen (verb)                  1      1      0
Straat (noun)                  1      0      0
Kapot (adverb)                 0      0      1
Slecht                         0      0      1
Stuk Ijzer                     0      0      1
1057DK (postal code)           1      0      0
bitlycomsdrtw (Internet)       0      1      0
• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values (r could be many thousands).
Take only the first k << r singular values: A ≈ A_k = U_k Σ_k V_kᵀ.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).
[figure: word clouds for Topic 1, Topic 2, Topic 3]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items → ~0.01% filled.
User–Item matrix (data):
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1
User 4's item ratings: -, -, 1, 2, 5. After some math… the predictions for User 4 are: 3.21, 4.82, 1, 2, 5.
Recommend item 2.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Item-item based predictor of the form y = x + b, with slope equal to 1.
Weight w_ij: the number of users having rated both items i and j. Rating r̄_uj: the average rating computed from item j.
Sample rating database:
Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        2       -       5
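A worked example of the slope-one idea on this table (assuming the missing cells are Mark's item C and Lucy's item B): the average difference between items B and A over the users who rated both (John and Mark) is ((3 − 5) + (4 − 3)) / 2 = −0.5, so Lucy's predicted rating for item B is her item A rating plus this difference: 2 − 0.5 = 1.5.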
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".
How to determine the neighbors, and how many (k) to use?
How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments
[figure: item neighbourhood with similarity w and neighbors N]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS: PEARSON CORRELATION
a, b: users; r_a,p: the rating of user a for item p; P: the set of items rated both by a and b. Possible similarity values between −1 and 1.

sim(a, b) = Σ_{p ∈ P} (r_a,p − r̄_a)(r_b,p − r̄_b) / ( √( Σ_{p ∈ P} (r_a,p − r̄_a)² ) · √( Σ_{p ∈ P} (r_b,p − r̄_b)² ) )
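A minimal SAS/IML sketch of this similarity for two users on their commonly rated items (toy numbers):

proc iml;
   ra = {5 3 2};  rb = {3 4 2};          /* ratings of users a and b on P */
   da = ra - ra[:];  db = rb - rb[:];    /* center by each user's mean rating */
   sim = (da * db`) / sqrt(ssq(da) * ssq(db));
   print sim;                            /* value between -1 and 1 */
quit;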
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?
Factorize the user–item matrix R (m × n, users by items) as R ≈ U·V, with U (m × k) and V (k × n).
Select a loss function (squared error). Select the number of hidden factors k. Optimization problem: L-BFGS, ALS.
Predict a new rating: R̂_ij = U_iᵀ V_j.
Minimize the prediction error:  min_{U,V} Σ_{i,j} ( R_ij − U_iᵀ V_j )² + λ( ‖U_i‖² + ‖V_j‖² )
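A minimal SAS/IML sketch of this objective, fitted with plain stochastic gradient descent on a toy matrix (0 denotes a missing rating; in practice L-BFGS or ALS as on the slide would be used):

proc iml;
   R = {5 3 0, 4 0 1, 1 1 5};                     /* toy user-item ratings */
   m = nrow(R);  n = ncol(R);  k = 2;             /* k hidden factors */
   call randseed(123);
   U = j(m, k);  call randgen(U, "Uniform");  U = 0.1 * U;
   V = j(n, k);  call randgen(V, "Uniform");  V = 0.1 * V;
   eta = 0.05;  lambda = 0.2;                     /* step size, regularization */
   do iter = 1 to 2000;
      do i = 1 to m;
         do j2 = 1 to n;
            if R[i, j2] > 0 then do;              /* only observed ratings */
               e = R[i, j2] - U[i, ] * V[j2, ]`;  /* prediction error */
               U[i, ] = U[i, ] + eta * (e * V[j2, ] - lambda * U[i, ]);
               V[j2, ] = V[j2, ] + eta * (e * U[i, ] - lambda * V[j2, ]);
            end;
         end;
      end;
   end;
   print (U * V`)[format = 6.2];                  /* filled-in rating matrix */
quit;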
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
First apply clustering on the user/item profiles or user/item ratings; then apply knn within one subgroup to produce the predictions.
[diagram: user/item profiles + ratings → clustering → knn within one subgroup → predictions]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:
IF item A and B THEN item C; IF item X THEN item Y.
Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:
Support(X,Y) = (# trxs with X and Y) / (total # trxs)
Lift = Support(X,Y) / ( Support(X) · Support(Y) )
Support & lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of the previous methods,
to achieve better performance.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
/* Add a recommendation system */
PROC RECOMMEND recom = rs.IENS;
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD with L-BFGS, 20 factors */
   METHOD svd / factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
                maxiter = 100 maxfeval = 5000 function = L2
                lamda = 0.2 technique = lbfgs;
   RUN;
   METHOD arm / label = ARM;
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT / method = svd label = svd num = 3 users = ("Longhow Lam");
run;
QUIT;
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)
CONS:
 Unfamiliar to a broader audience, (more) difficult to explain
 Black-box approach (you are rejected: the computer says NO)
 Often the relations can already be modeled with classical regression models
 It allows you to not think about the business problem
PROS:
 Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
 Interactions are often "automatically" taken into account
 Superior for text mining, image & speech recognition
 Better lift possible (a few percent "for free")
 It allows you to not think about the business problem
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES REGRESSION amp CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 12 X
N 21 B 456 15 X
Y 32 A 545 13 U
Y 34 C 443 11 U
N 23 A 345 17 U
N 13 B 567 12 X
N 45 A 654 19 X
hellip hellip hellip hellip hellip hellip
hellip hellip hellip hellip hellip hellip
Y 46 A 657 21 X
A recursive splitting algorithm
1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip
bull How to split X1 or X2 bull When to stop
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Regression tree Mean square error
Split s1 Split t1
x
Y Y
x
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPAL COMPONENTS ANALYSIS
Linear transformation of the data to uncorrelated data. The transformation W is such that:
• The largest variance is in the first coordinate
• The second largest variance is in the second coordinate
• Etc…
PRINCIPAL COMPONENTS ANALYSIS
[Figure: scatter of data points in the (X1, X2) plane with the principal directions P1 and P2]
PRINCIPAL COMPONENTS ANALYSIS
[Figure: the same data plotted in the principal component coordinates (P1, P2)]
PRINCIPAL COMPONENTS ANALYSIS
The math behind, with two dimensions (in general: P = X W):

\begin{bmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{bmatrix}
=
\begin{bmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{bmatrix}
\begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{bmatrix}

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.
It turns out that the columns of W are the eigenvectors of the matrix X^T X.
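As a hedged sketch, the principal components can be computed with PROC PRINCOMP; the data set work.inputs and the variables x1-x100 are assumed names:

proc princomp data=work.inputs out=work.scores n=2;  /* keep the first 2 PCs      */
   var x1-x100;   /* by default computed from the correlation matrix,             */
run;              /* i.e. the inputs are scaled                                   */

The out= data set then contains the component scores Prin1 and Prin2.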
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here.
Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs
PRINCIPAL COMPONENTS: DIMENSION REDUCTION
P = X W. Now take only the first L columns of W: P_L = X W_L.
For example, for visualization use only the first 2 or 3 columns, so that P_L has only 2 or 3 columns that can be visualized in scatter or contour plots.
Dimensions: P (10000 by 100) = X (10000 by 100) W (100 by 100); P_L (10000 by 2) = X (10000 by 100) W_L (100 by 2).
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ V^T, where Σ is diagonal with r singular values [r could be a large number].
Take only the k << r largest singular values: A_k = U_k Σ_k V_k^T.
A data point d can now be represented by a k-dimensional point.
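A hedged sketch of the decomposition in SAS/IML; the table work.pixels and the cutoff k = 15 are assumptions:

proc iml;
   use work.pixels;
   read all into A;              /* read the numeric data into matrix A          */
   close work.pixels;
   call svd(U, q, V, A);         /* A = U*diag(q)*V`, q holds the singular values */
   k = 15;                       /* keep only the k largest singular values      */
   Ak = U[, 1:k] * diag(q[1:k]) * t(V[, 1:k]);   /* rank-k approximation of A    */
quit;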
SVD EXAMPLE, USING MY SON AS AN EXPERIMENT
Original: 2448 x 3264 ~ 8 mln numbers.
SVD, 15 largest SV's: 1% of the data.
SVD, 75 largest SV's: 5% of the data.
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling.
X1, X2, X3, …, X500
Cluster: X1, X21, X35, X430, … → use X35
Cluster: X17, X29, X353, X490, … → use X29
Cluster: X37, X95, X251, X393, … → use X251
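A minimal sketch with PROC VARCLUS, assuming a data set work.inputs with variables x1-x500 (hypothetical names):

proc varclus data=work.inputs maxclusters=10 short;
   var x1-x500;     /* groups correlated inputs into at most 10 clusters */
run;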
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging). This only makes sense if the underlying models are different enough and have some predictive power.
[Diagram: data → random samples → individual models → final model]
Bagging & Boosting: Random Forests
Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Randomly choose m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree, the random forest prediction is the majority vote of all trees. In case of a regression tree, the random forest prediction is the average of all trees.
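A hedged sketch with the high-performance forest procedure; the data set and variable names are assumptions:

proc hpforest data=work.train maxtrees=500 vars_to_try=10 seed=42;
   target y / level=binary;          /* majority vote over the trees        */
   input x1-x100 / level=interval;   /* m = 10 inputs considered per split  */
run;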
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
A decision tree and a random forest (100 sub-trees) fitted on the simulated data. It is clear to see that the forest produces much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW
Gradient boosting: M iterations, m = 1, 2, …, M.
At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals, using the inputs x, to "correct" the previous learner. The pseudo residuals rm (r1, r2, …, rM) are recomputed at each step, and the model is updated as Fm = Fm-1 + γ·hm; the final model is FM.
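For squared-error loss the scheme above can be written out compactly (a hedged restatement, not taken from the slide itself):

r_{im} = y_i - F_{m-1}(x_i)
h_m = \arg\min_h \sum_i \left( r_{im} - h(x_i) \right)^2
F_m(x) = F_{m-1}(x) + \gamma \, h_m(x)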
SUPPORT VECTOR MACHINES
Support vector machines (SVM). Suppose we have a separable classification problem.
Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, i.e. x2, x3 or spline(x).
The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
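The usual formulations behind these four bullets, as a hedged reconstruction in standard notation:

\text{separable:} \quad \min_{\beta,\beta_0} \tfrac{1}{2}\lVert\beta\rVert^2 \quad \text{s.t. } y_i(x_i^T\beta+\beta_0) \ge 1
\text{non-separable:} \quad \min_{\beta,\beta_0} \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i \quad \text{s.t. } y_i(x_i^T\beta+\beta_0) \ge 1-\xi_i,\ \xi_i \ge 0
\text{dual:} \quad \max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j K(x_i,x_j) \quad \text{s.t. } 0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0

The kernel K(x_i, x_j) replaces the inner product, which is what lets SVM model non-linear behaviour without computing the transformations explicitly.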
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable, but in 3D space they are.
K-NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
[Figure: the 5 nearest neighbours of x0: 3 of them are red, 2 of them are green, so we predict x0 to be red]
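A hedged k-NN classification sketch with PROC DISCRIM; the train/test set names are assumptions:

proc discrim data=work.train test=work.test testout=work.pred
             method=npar k=5;   /* nonparametric: 5 nearest neighbours */
   class y;                     /* the label the neighbours vote on    */
   var x1 x2;
run;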
K-NN METHOD
[Figure: decision boundaries for 1 nearest neighbour vs 15 nearest neighbours]
K-NN METHOD
Use different numbers k of nearest neighbours and compare test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.
Comparing different nearest neighbours in SAS Enterprise Miner.
K-NN EXAMPLE: DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k=5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION
Y = f(X, w) = f(w1·1 + w2·X2 + w3·X3 + w4·X4)
[Diagram: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding a single compute node f]
Neural network compute node: f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w's that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION
[Diagram: inputs X1-X4 (Age, Income, Region, Gender), hidden layer Z1, Z2, Z3, output Y; α weights on the input-to-hidden links, β weights on the hidden-to-output links]
In formula, the prediction formula for a NN is given by composing the hidden layer with the output layer. The functions g and σ are the output and activation functions; in case of a binary classifier, g returns a probability. The model weights α and β have to be estimated from the data.
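Following the standard single-hidden-layer formulation (a hedged reconstruction, e.g. Hastie, Tibshirani & Friedman), with a sigmoid activation:

Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \qquad \sigma(v) = \frac{1}{1 + e^{-v}}
f(X) = g(\beta_0 + \beta^T Z)

For a binary classifier, g is again a sigmoid, so f(X) can be read as P(Y = 1 | X).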
NEURAL NETWORKS: ESTIMATING THE WEIGHTS
Back propagation algorithm:
Randomly choose small values for all wi's. For each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual - prediction)2)
3. Adjust the weights w according to the gradient of the error
4. Stop if the error E is small enough
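Step 3 is the usual gradient-descent update, written out here with an assumed learning rate η:

w \leftarrow w - \eta \, \frac{\partial E}{\partial w}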
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS
https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.
[Diagram: inputs X1-X4, ENCODE, then DECODE back to X1-X4]
A linear activation function corresponds with 2-dimensional principal components analysis. A 2-dimensional middle layer can be used for visualisation.
NEURAL NETS: AUTOENCODERS
Often more hidden layers with many nodes.
[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]
NEURAL NET: CARS EXAMPLE
2-dimensional PCA vs an autoencoder network 25 - 15 - 2 - 15 - 25.
NEURAL NETS: AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR                  */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6       */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED             */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM    */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting.
Parse & Filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.
Then apply traditional data mining: clustering, prediction, machine learning.
TEXT MINING BASICS
(The example documents are in Dutch.)
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

TERM DOCUMENT MATRIX A:
Terms                        Doc 1  Doc 2  Doc 3
+Fiets (znmw)                  1      1      1
Fietsen (ww)                   0      1      0
Blauwe (bvg)                   0      1      0
Amsterdam (locatie)            1      0      0
+Lopen (ww)                    1      1      0
Straat (znmw)                  1      0      0
Kapot (bijw)                   0      0      1
Slecht                         0      0      1
Stuk Ijzer                     0      0      1
1057DK (postcode)              1      0      0
bitlycomsdrtw (Internet)       0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING: TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term-document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse
Apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A
Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal with r singular values [could be many thousands]. Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives roughly 0.01% filled.
User-Item Matrix - Data:
           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1
User 4's item ratings: -, -, 1, 2, 5. After some math… the recommendations for User 4 are: 3.21, 4.82, 1, 2, 5.
Recommend item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: slope one (slope1), k nearest neighbours (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE
Item-item based: y = x + b with slope equal to 1.
Weight wij: the number of users having rated both items i and j. Rating ruj: the average rating computed from item j.
Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
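As a hedged worked example on this table (the standard Slope One calculation): the average deviation between items A and B is ((5-3) + (3-4)) / 2 = 0.5, and between A and C it is (5-2) / 1 = 3. Lucy's prediction for item A then weights each deviation by the number of users that rated both items: (2·(2 + 0.5) + 1·(5 + 3)) / (2 + 1) = 13/3 ≈ 4.33.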
RE METHODS: K NEAREST NEIGHBOURS
The rating rui is determined by the ratings "in the neighborhood":
• How to determine the neighbours, and how many (k) to use?
• How to compute the similarity / distance measure? Pearson's correlation coefficient, cosine distance, other adjustments.
[Figure: similarity w and neighbours N around the target user-item pair]
RE METHODS: PEARSON CORRELATION
a, b: users; r_{a,p}: the rating of user a for item p; P: the set of items rated by both a and b. Possible similarity values lie between -1 and 1.

sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2} \; \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}
RE METHODS: MATRIX FACTORIZATION
How do we fill in the missing data? Factor the m × n user-item rating matrix R into user factors U (m × k) and item factors V (k × n): R ≈ U V.
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS
Predict a new rating: \hat{R}_{ij} = U_i^T V_j
Minimize the prediction error: \min_{U,V} \sum_{i,j} (R_{ij} - U_i^T V_j)^2 + \lambda (\lVert U_i \rVert^2 + \lVert V_j \rVert^2)
RE METHODS: CLUSTER
Cluster the user/item profiles and ratings first, then apply knn within one subgroup to generate the predictions.
[Diagram: user/item profile + user/item rating → clustering → knn within one subgroup → predictions]
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C; IF item X THEN item Y.
Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.
Support(X,Y) = (# transactions with X and Y) / (total # transactions)
Lift = Support(X,Y) / (Support(X) · Support(Y))
Example support & lift: Diapers → Beer 0.8; Diapers → Candles 0.018. And a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
RE METHOD: ENSEMBLE
Linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd
      factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3
      maxiter = 100 maxfeval = 5000 function = L2
      lambda = 0.2 technique = lbfgs;
   RUN;
   METHOD arm
      label = "ARM";
   RUN;
   /* Information on the recommender system */
   INFO;
QUIT;
/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
run;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)
CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem
PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
WHY SAS FOR MACHINE LEARNING?
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High-performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES
How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast
Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification
Mis-classification rate Cross-entropy Chi-Squared
Classification tree Mis classificatie rate
xSplit s1 Split t1
REGRESSION amp CLASSIFICATION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees (regressie amp classificatie)
When to stop Not too early not too late
PruningRemove parts the tree
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C45 C50
CART (Classification and Regression)
The difference is mainly in the different splitting options
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Support vector machines (SVM). Suppose we have a separable classification problem: find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If the problem is not separable, you have to allow that some points are on the wrong side. These points are penalized; the SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, i.e. x², x³ or spline(x).

The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".

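A minimal sketch in Python/scikit-learn (illustrative data): C bounds the total penalty for points on the wrong side, and the RBF kernel applies a non-linear mapping implicitly, i.e. the kernel trick.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)          # linear boundary
kernel_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)  # kernel trick

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:   ", kernel_svm.score(X, y))
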
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification (standard form):
$$\min_{\beta,\beta_0} \tfrac{1}{2}\|\beta\|^2 \quad \text{subject to } y_i(x_i^T\beta+\beta_0) \ge 1$$

Non-separable classification (slack variables $\xi_i$, total penalty bounded via the cost $C$):
$$\min_{\beta,\beta_0} \tfrac{1}{2}\|\beta\|^2 + C\sum_i \xi_i \quad \text{subject to } y_i(x_i^T\beta+\beta_0) \ge 1-\xi_i,\ \xi_i \ge 0$$

Non-separable classification rewritten using the Lagrange dual problem:
$$\max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^T x_j \quad \text{subject to } 0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0$$

Kernels to model non-linear behaviour: replace $x_i^T x_j$ by $K(x_i, x_j)$.

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable in 2D, but in 3D space they are.

K-NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, ..., xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: take the 5 nearest neighbours of x0. 3 of them are red, 2 of them are green, so we predict x0 to be red.

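A minimal sketch in Python/scikit-learn (illustrative data): no model is fitted; the query point is classified by the majority vote of its k = 5 nearest neighbours.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # just stores the data

x0 = np.array([[0.1, -0.2]])
print(knn.predict(x0))         # majority vote among the 5 neighbours
print(knn.predict_proba(x0))   # the vote shares
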
K-NN METHOD

[Figure: decision boundaries for 1 nearest neighbour vs. 15 nearest neighbours.]

K-NN METHOD

Using different numbers k of nearest neighbours gives different test and training errors.

Despite its simplicity, k-nearest-neighbours has been used successfully in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.

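As a sketch of this idea in Python/scikit-learn (made-up coordinates and prices, not the scraped data): estimate the price at a location as the average of the k closest for-sale prices.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(1000, 2))                  # stand-ins for lat/lon
price = 200 + 100 * coords[:, 0] + rng.normal(0, 10, 1000)  # price in K euro

model = KNeighborsRegressor(n_neighbors=5).fit(coords, price)
print(model.predict([[0.5, 0.5]]))  # estimate for a postal code without sales
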
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k=5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = f(w1·1 + w2·X2 + w3·X3 + w4·X4)

[Diagram: a single neural network compute node combines the constant input 1 and the inputs X2, X3, X4 with weights w1, ..., w4.]

f is the so-called activation function. This could be the logit function, but other choices are possible; with the identity activation this is exactly linear regression. There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula, the prediction of a neural network with one hidden layer is given by:

$$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \qquad Y = g(\beta_0 + \beta^T Z)$$

[Diagram: inputs X1, ..., X4 (age, income, region, gender) feed the hidden layer Z1, Z2, Z3, which feeds the output Y.]

The functions σ and g are, in the standard formulation, the sigmoid σ(v) = 1/(1 + e^(−v)) and, in case of a binary classifier, g is also the sigmoid so that the output is a probability.

The model weights α and β have to be estimated from the data.

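A minimal NumPy sketch of this forward pass (illustrative weights; σ and g both sigmoid, as for a binary classifier):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

X = np.array([35.0, 50.0, 2.0, 1.0])   # age, income, region, gender (coded)

rng = np.random.default_rng(0)
alpha0, alpha = np.zeros(3), rng.normal(size=(3, 4))  # hidden-layer weights
beta0, beta = 0.0, np.array([0.5, -0.3, 0.8])         # output weights

Z = sigmoid(alpha0 + alpha @ X)   # Z_m = sigma(alpha_0m + alpha_m' X)
Y = sigmoid(beta0 + beta @ Z)     # g(beta_0 + beta' Z): a probability
print(Y)
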
NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all weights wi. Then, for each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w according to the gradient-descent rule w_new = w_old − η · ∂E/∂w.
4. Stop if the error E is small enough.

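A minimal NumPy sketch of these steps for a single sigmoid node (illustrative data; the weight update is the gradient-descent rule above):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = rng.normal(scale=0.01, size=3)      # small random starting weights
eta = 0.5                               # learning rate
for _ in range(1000):
    pred = sigmoid(X @ w)               # 1. prediction
    E = np.mean((y - pred) ** 2)        # 2. error
    grad = -2 * ((y - pred) * pred * (1 - pred)) @ X / len(y)
    w -= eta * grad                     # 3. adjust weights
    if E < 0.05:                        # 4. stop if error small enough
        break
print(E, w)
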
DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: inputs X1, ..., X4 are encoded into a smaller middle layer and decoded back into X1, ..., X4 (ENCODE / DECODE).]

A linear activation function corresponds with 2-dimensional principal components analysis. A 2-dimensional middle layer can be used for visualisation.

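A minimal sketch in Python/scikit-learn (illustrative data): an autoencoder as a multilayer perceptron trained to reproduce its own inputs, with a 2-node middle layer (the 25–15–2–15–25 shape of the cars example further on):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 2))      # 2 hidden dimensions
X = base @ rng.normal(size=(2, 25))    # embedded in 25 dimensions

auto = MLPRegressor(hidden_layer_sizes=(15, 2, 15), activation="tanh",
                    max_iter=2000, random_state=0)
auto.fit(X, X)            # the inputs predict the inputs
print(auto.score(X, X))   # high R2: the 2-node bottleneck suffices here
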
NEURAL NETS: AUTOENCODERS

Often more hidden layers with many nodes are used.

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT.]

NEURAL NET: CARS EXAMPLE

[Figure: 2-dimensional PCA vs. an autoencoder network 25 – 15 – 2 – 15 – 25.]

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, ..., x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR            */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

[Figure: two-dimensional representation of the 400-dimensional 'digit' data.]

BAYESIAN NETWORKS

BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from the training data for each node (see the sketch below).
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.

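A minimal plain-Python sketch (hypothetical numbers): a two-node network Rain → WetGrass with its conditional probability tables, evaluated with the chain rule P(R, W) = P(R)·P(W | R):

p_rain = {True: 0.2, False: 0.8}                     # CPT for Rain
p_wet_given_rain = {True: {True: 0.9, False: 0.1},   # CPT for WetGrass | Rain
                    False: {True: 0.2, False: 0.8}}

# P(WetGrass = True), summing Rain out of the joint:
p_wet = sum(p_rain[r] * p_wet_given_rain[r][True] for r in (True, False))
print(p_wet)  # 0.2*0.9 + 0.8*0.2 = 0.34
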
TEXT MINING

TEXT MINING BASICS

"Advanced" word counting.

Parse & filter:
• Part of speech
• Entity detection
• Mixed / numeric / abbreviations
• Stemming
• Spell checks
• Stop list
• Synonym list
• Multi-term words

Then apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

(Dutch example documents about walking and cycling; the parser maps inflected forms and synonyms such as "fietste" and "tweewieler" to the stem +Fiets.)

TERM DOCUMENT MATRIX A:

Terms                        Doc 1   Doc 2   Doc 3
+Fiets (znmw)                  1       1       1
Fietsen (ww)                   0       1       0
Blauwe (bvg)                   0       1       0
Amsterdam (locatie)            1       0       0
+Lopen (ww)                    1       1       0
Straat (znmw)                  1       0       0
Kapot (bijw)                   0       0       1
Slecht                         0       0       1
Stuk Ijzer                     0       0       1
1057DK (postcode)              1       0       0
bitlycomsdrtw (Internet)       0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.

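A minimal sketch in Python/scikit-learn of the word counting (here without stemming, stop lists or entity detection; note scikit-learn builds the transpose, a documents-by-terms matrix):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Ik loop over straat in Amsterdam met mijn fiets",
        "Zij liep niet maar fietste met haar blauwe fiets",
        "Mijn tweewieler is kapot wat een slecht stuk ijzer"]

vec = CountVectorizer()
A = vec.fit_transform(docs)          # sparse count matrix
print(vec.get_feature_names_out())
print(A.toarray())
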
TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ V^T, where Σ is diagonal with r singular values [r could be many thousands].

Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.

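Continuing the previous sketch in Python/scikit-learn: a truncated SVD maps each document onto a short dense vector (k = 2 here; around 300 in the slides):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, random_state=0)
docs_k = svd.fit_transform(A)   # one k-dimensional point per document
print(docs_k)
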
TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Figure: documents grouped into Topic 1, Topic 2, Topic 3.]

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives roughly 0.01% filled.

User-Item Matrix Data:

           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings: (-, -, 1, 2, 5). After some math, the predicted ratings for User 4 are (3.21, 4.82, 1, 2, 5): recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: Slope one (slope1), K nearest neighbors (knn)
• Model-based algorithms: Matrix factorization (SVD - LBFGS)
• Market basket analysis: Association rules mining (arm)
• Mixture of different methods: Clustering (cluster), Ensemble

RE METHODS: SLOPE ONE

Item-item based: fit y = x + b, a line with slope equal to 1.

Weight wij: the number of users having rated both items i and j. Rating r̄j: the average rating computed for item j.

Sample rating database:

Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        -
Lucy          -        2        5

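A minimal plain-Python sketch of weighted Slope One on this sample database: the average difference between items A and B over the users who rated both is ((5−3) + (3−4))/2 = 0.5, so from Lucy's rating of 2 for B we get 2.5 via B and 8 via C, and a weighted prediction of (2·2.5 + 1·8)/3 ≈ 4.33:

ratings = {"John": {"A": 5, "B": 3, "C": 2},
           "Mark": {"A": 3, "B": 4},
           "Lucy": {"B": 2, "C": 5}}

def predict(user, item):
    num = den = 0.0
    for other, r_other in ratings[user].items():
        diffs = [r[item] - r[other] for r in ratings.values()
                 if item in r and other in r]
        if diffs:                  # weight w_ij = number of users rating both
            w = len(diffs)
            num += (sum(diffs) / w + r_other) * w
            den += w
    return num / den

print(predict("Lucy", "A"))   # ~4.33
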
RE METHODS: K NEAREST NEIGHBORS

The rating rui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use? How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: similarity w between users, neighbors N.]

RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1.

$$sim(a,b) = \frac{\sum_{p \in P}(r_{a,p} - \bar r_a)(r_{b,p} - \bar r_b)}{\sqrt{\sum_{p \in P}(r_{a,p} - \bar r_a)^2}\;\sqrt{\sum_{p \in P}(r_{b,p} - \bar r_b)^2}}$$

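A minimal plain-Python sketch of this similarity (hypothetical ratings):

from math import sqrt

def pearson_sim(ra, rb):
    P = set(ra) & set(rb)                      # items rated by both users
    ma = sum(ra[p] for p in P) / len(P)
    mb = sum(rb[p] for p in P) / len(P)
    num = sum((ra[p] - ma) * (rb[p] - mb) for p in P)
    den = (sqrt(sum((ra[p] - ma) ** 2 for p in P)) *
           sqrt(sum((rb[p] - mb) ** 2 for p in P)))
    return num / den if den else 0.0

print(pearson_sim({"A": 5, "B": 3, "C": 2}, {"A": 4, "B": 2, "C": 1}))  # 1.0
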
RE METHODS: K NEAREST NEIGHBORS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factor the m × n rating matrix R as R ≈ U V, with U an m × k matrix (users) and V a k × n matrix (items).

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: $\hat R_{ij} = U_i^T V_j$. Minimize the prediction error:

$$\min_{U,V} \sum_{i,j} \left(R_{ij} - U_i^T V_j\right)^2 + \lambda\left(\|U_i\|^2 + \|V_j\|^2\right)$$

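A minimal NumPy sketch (illustrative ratings; plain gradient descent rather than L-BFGS or ALS) of minimizing this regularized squared error:

import numpy as np

rng = np.random.default_rng(0)
R = np.array([[3, 2, 5, 4, 5],
              [0, 0, 0, 1, 1],
              [1, 0, 2, 5, 0]], dtype=float)   # 0 = missing rating
mask = R > 0

k, lam, eta = 2, 0.1, 0.01                     # hidden factors, lambda, step
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(2000):
    E = mask * (R - U @ V.T)          # errors on observed ratings only
    U += eta * (E @ V - lam * U)      # gradient steps on U ...
    V += eta * (E.T @ U - lam * V)    # ... and on V

print(np.round(U @ V.T, 2))           # filled-in rating matrix
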
RE METHODS: CLUSTER

First cluster the users/items on their profiles or ratings; then apply knn within one subgroup to make the predictions.

[Diagram: user/item profiles and ratings → clustering → knn within one subgroup → predictions.]

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = (# trxs with X and Y) / (total # trxs)
Lift = Support(X, Y) / (Support(X) · Support(Y))

Support examples: Diapers → Beer 0.8%, Diapers → Candles 0.018%.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.

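A minimal plain-Python sketch (made-up transactions) of these two measures:

transactions = [{"diapers", "beer"}, {"diapers", "beer", "milk"},
                {"beer"}, {"milk"}, {"diapers", "milk"}]

def support(*items):
    return sum(all(i in t for i in items) for t in transactions) / len(transactions)

sup_xy = support("diapers", "beer")
lift = sup_xy / (support("diapers") * support("beer"))
print(sup_xy, lift)   # 0.4 and 0.4/(0.6*0.6) ~ 1.11
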
METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.

/* Add a recommendation system */
PROC RECOMMEND recom = rs.IENS;
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lambda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience; (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High-performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

• Text mining
• Image recognition
• Sound recognition
• Strange faces

So, can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

[Figure: predicted review score vs. given review score.]

R² linear regression = 0.5, R² neural net = 0.6.

IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

[Figure: the first 100 digits of the MNIST data with their KNOWN labels in red.]

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split. Models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY THE MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

[Figure: the first 100 predicted digits together with the handwritten digits; red numbers are the predicted labels. We see some obvious mistakes...]

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Figure: two recorded digits, "1" and "2".]

DECISION TREES: SOME COMMON TYPES

• CHAID (chi-squared automatic interaction detection)
• C4.5 / C5.0
• CART (Classification and Regression Trees)

The difference is mainly in the different splitting options.

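A minimal sketch in Python/scikit-learn (CART-style; the criterion is one of the splitting options these types differ on):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree))   # the interpretable rules of the fitted tree
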
DECISION TREES: PROS AND CONS

Pros:
• Interaction between variables
• Interpretable rules
• Missing values are easy to incorporate

Cons:
• Unstable
• "Lack of smoothness"
• Fit of obvious (non)linear relations

[Example tree: split on male/female, then income < 45K and age < 33, predicting the response rate for Opel Astras.]

DIMENSION REDUCTION

PRINCIPAL COMPONENTS ANALYSIS

A linear transformation W of the data to uncorrelated data. The transformation W is such that:
• the largest variance is in the first coordinate,
• the second largest variance is in the second coordinate,
• etc.

PRINCIPAL COMPONENTS ANALYSIS

[Figures: a scatter of points in the (X1, X2) plane with the principal directions P1 and P2 overlaid, and the same data expressed in the (P1, P2) coordinates.]

PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND

With two dimensions, P = X W:

$$\begin{bmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{bmatrix} \begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{bmatrix}$$

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general it turns out that the columns of W are the eigenvectors of the matrix X^T X.

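A minimal NumPy sketch of exactly this computation (illustrative data): the columns of W are the eigenvectors of X^T X on centered, scaled data, and P = XW is uncorrelated:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.4]])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # scaling matters here

eigval, W = np.linalg.eigh(X.T @ X)        # eigenvectors of X^T X
W = W[:, ::-1]                             # largest variance first
P = X @ W                                  # the principal components
print(np.round(np.cov(P.T), 3))            # (near-)diagonal: uncorrelated
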
PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs

PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = X W. Now take only the first L columns of W: PL = X WL.

For example, for visualization use only the first 2 or 3 columns, so that PL has only 2 or 3 columns that can be visualized in scatter or contour plots.

Dimensions: P (10,000 by 100) = X (10,000 by 100) · W (100 by 100), versus PL (10,000 by 2) = X (10,000 by 100) · WL (100 by 2).

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal containing the r singular values [r could be a large number].

SINGULAR VALUE DECOMPOSITION

Take only k << r singular values: A_k = U_k Σ_k V_k^T. A data point d can then be represented by a k-dimensional point.

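A minimal NumPy sketch (a random matrix standing in for an image; the photo example below uses 2448 x 3264) of keeping only the k largest singular values:

import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(size=(500, 600))            # stand-in for a grayscale photo

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 15                                      # keep the 15 largest SVs
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]        # A_k = U_k Sigma_k V_k'
print(A_k.shape, np.linalg.norm(A - A_k) / np.linalg.norm(A))
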
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

[Figures: the original photo (2448 x 3264 ≈ 8 mln numbers), the reconstruction from the 15 largest SVs (1% of the data), and from the 75 largest SVs (5% of the data).]

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.

Example, from X1, X2, X3, ..., X500:
• cluster {X1, X21, X35, X430, ...} → use X35
• cluster {X17, X29, X353, X490, ...} → use X29
• cluster {X37, X95, X251, X393, ...} → use X251

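A minimal sketch in Python/SciPy (illustrative data; in SAS this is the territory of PROC VARCLUS): hierarchically cluster the inputs on their correlation and keep one representative variable per cluster:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
X = np.column_stack([base[:, i % 3] + 0.1 * rng.normal(size=500)
                     for i in range(9)])    # 9 inputs, 3 hidden clusters

dist = 1 - np.abs(np.corrcoef(X.T))         # correlation -> distance
cond = dist[np.triu_indices(9, 1)]          # condensed distance vector
labels = fcluster(linkage(cond, method="average"), t=3, criterion="maxclust")
reps = [int(np.where(labels == c)[0][0]) for c in np.unique(labels)]
print(labels, "use as inputs:", reps)
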
VARIABLE CLUSTERING TO REDUCE THE DIMENSION

[Figures: variable clustering results.]

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate
cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations
man vrouw
Inkomen lt 45 K Leeftijd lt 33
Response rate
Opel Astras
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DIMENSION REDUCTION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse
Apply singular value decomposition first.
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is then no longer a long vector of m word counts but a much shorter vector, say of length 300.
Matrix SVD decomposition: $A = U \Sigma V^T$, with $\Sigma$ diagonal, containing the r singular values (r could be many thousands).
Take only the first $k \ll r$ singular values: $A_k = U_k \Sigma_k V_k^T$.
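A minimal NumPy sketch of this truncation, with random counts standing in for a real term-document matrix:

import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(0.1, size=(200, 1000)).astype(float)  # toy counts: 200 docs x 1000 terms

U, s, Vt = np.linalg.svd(A, full_matrices=False)      # A = U @ diag(s) @ Vt
k = 30                                                # keep only k << r singular values
docs_k = U[:, :k] * s[:k]                             # each document as a k-dimensional vector
A_k = docs_k @ Vt[:k, :]                              # rank-k reconstruction of A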
TEXT MINING APPLICATIONS
• Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.
• Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).
(Figure: word clouds for Topic 1, Topic 2 and Topic 3.)
RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives roughly 0.01% filled cells.
User-Item Matrix data:
           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:                -     -     1     2     5
After some math, the predictions are: 3.21  4.82  1     2     5
So: recommend item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: Slope One (slope1), k-nearest neighbours (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE
• Predictors of the form y = x + b, with the slope fixed to 1
• Item-item based
• Weight w_ij: the number of users having rated both items i and j; rating r_uj: the rating of user u for item j, shifted by the average deviation between items i and j
A worked prediction on the sample table below is sketched right after it.
Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         2        -        5
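A worked Slope One sketch in plain Python on this table: dev(B, A) = −0.5 over 2 users and dev(B, C) = 1 over 1 user, so Lucy's predicted rating for item B comes out at 3.0.

ratings = {"John": {"A": 5, "B": 3, "C": 2},
           "Mark": {"A": 3, "B": 4},
           "Lucy": {"A": 2, "C": 5}}

def dev(i, j):
    # average of r_i - r_j over users who rated both, plus the co-rating count
    both = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    return (sum(ratings[u][i] - ratings[u][j] for u in both) / len(both), len(both))

num = den = 0.0
for j in ("A", "C"):                       # the items Lucy has rated
    d, w = dev("B", j)
    num += (ratings["Lucy"][j] + d) * w    # weight by the number of co-ratings
    den += w
print(num / den)                           # 3.0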
RE METHODS: K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood":
• How to determine the neighbours, and how many (k) to use?
• How to compute the similarity/distance measure? Pearson's correlation coefficient, cosine distance, other adjustments.
(Diagram: similarity w, neighbours N.)
RE METHODS: PEARSON CORRELATION
a, b: users; $r_{a,p}$: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1.

$$\mathrm{sim}(a,b)=\frac{\sum_{p\in P}(r_{a,p}-\bar r_a)(r_{b,p}-\bar r_b)}{\sqrt{\sum_{p\in P}(r_{a,p}-\bar r_a)^2}\,\sqrt{\sum_{p\in P}(r_{b,p}-\bar r_b)^2}}$$
RE METHODS: K NEAREST NEIGHBORS METHOD
RE METHODS: MATRIX FACTORIZATION
How do we fill in the missing data? Factor the m × n user-item rating matrix R as R ≈ U V, with U of size m × k and V of size k × n (k hidden factors):
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS
Predict a new rating: $\hat R_{ij} = U_i^T V_j$
Minimize the prediction error: $\min_{U,V}\sum_{i,j}\big(R_{ij}-U_i^T V_j\big)^2+\lambda\big(\lVert U_i\rVert^2+\lVert V_j\rVert^2\big)$
A minimal sketch of this optimization follows below.
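A minimal sketch with plain stochastic gradient descent on an assumed toy matrix (the L-BFGS or ALS optimizers named above would replace this update loop in practice):

import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4.]])     # 0 marks a missing rating
k, lam, lr = 2, 0.02, 0.01
rng = np.random.default_rng(1)
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(2000):
    for i, j in zip(*np.nonzero(R)):               # observed cells only
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - lam * U[i])     # gradient of squared error + L2
        V[j] += lr * (err * U[i] - lam * V[j])

print(np.round(U @ V.T, 1))   # predicted ratings, including the missing cells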
RE METHODS: CLUSTER
First cluster the user/item profiles and user/item ratings, then apply knn within one subgroup to generate the predictions.
(Diagram: user/item profile + user/item rating → clustering → knn within one subgroup → predictions.)
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C; IF item X THEN item Y.
Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X, Y) = (# transactions with both X and Y) / (total # transactions)
Lift(X → Y) = Support(X, Y) / (Support(X) · Support(Y))

Support & lift example: Diapers → Beer 0.8; Diapers → Candles 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
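A small sketch computing support and lift from raw baskets (hypothetical shopping data, not the slide's numbers):

baskets = [{"bread", "beer"}, {"bread", "diapers", "beer"},
           {"diapers", "beer"}, {"bread", "milk"}, {"diapers", "milk"}]

def support(*items):
    s = set(items)
    return sum(s <= b for b in baskets) / len(baskets)   # fraction of baskets containing s

sup_xy = support("diapers", "beer")                      # 2/5 = 0.4
lift = sup_xy / (support("diapers") * support("beer"))   # 0.4 / (0.6 * 0.6) ≈ 1.11
print(sup_xy, lift)                                      # lift > 1: bought together more than chance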
METHOD: ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;
   METHOD arm /
      label = "ARM";
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd label = "svd"
      num = 3 users = ("Longhow Lam");
run;
QUIT;
LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, and (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
Predicted review score vs. given review score:
R² linear regression = 0.5, R² neural net = 0.6
IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
(Figure: the first 100 digits of the MNIST data and their KNOWN labels in red.)
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES
70/30 training/validation split. Models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100 and 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA: APPLY THE MODEL ON THE TEST SET
28,000 digits without known labels. Our best model predicted the label for these digits. The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
(Audio samples: the spoken digits 1 and 2.)
DIMENSION REDUCTION

PRINCIPAL COMPONENTS ANALYSIS
A linear transformation of the data to uncorrelated data. The transformation W is such that:
• The largest variance is in the first coordinate
• The second largest variance is in the second coordinate
• Etc.
PRINCIPAL COMPONENTS ANALYSIS
(Figure: scatter of data points in the (X1, X2) plane, with the principal directions P1 and P2 drawn through the cloud.)
PRINCIPAL COMPONENTS ANALYSIS
(Figure: the same data re-plotted in the (P1, P2) coordinates.)
PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND IT
With two dimensions, $P = XW$:

$$\begin{bmatrix}p_{11}&p_{21}\\\vdots&\vdots\\p_{1n}&p_{2n}\end{bmatrix}=\begin{bmatrix}x_{11}&x_{21}\\\vdots&\vdots\\x_{1n}&x_{2n}\end{bmatrix}\begin{bmatrix}w_{11}&w_{21}\\w_{12}&w_{22}\end{bmatrix}$$

$w_{11}$ and $w_{12}$ are the loadings corresponding to the first principal component; $w_{21}$ and $w_{22}$ are the loadings corresponding to the second principal component. In general, it turns out that the columns of W are the eigenvectors of the matrix $X^TX$.
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here. Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs
PRINCIPAL COMPONENTS DIMENSION REDUCTION
$P = XW$. Now take only the first L columns of W: $P_L = XW_L$.
For example, for visualization use only the first 2 or 3 columns, so that $P_L$ has only 2 or 3 columns that can be visualized in scatter or contour plots.
Dimensions: in $P = XW$, (10000 × 100) = (10000 × 100)(100 × 100); in $P_L = XW_L$, (10000 × 2) = (10000 × 100)(100 × 2). A small sketch follows below.
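A NumPy sketch of this computation via the eigenvectors of $X^TX$, matching the 10000 × 100 example above (X is assumed to be centered and scaled):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 100))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # scaling the inputs is important

eigvals, W = np.linalg.eigh(X.T @ X)       # columns of W = eigenvectors of X'X
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
W = W[:, order]

P_L = X @ W[:, :2]                         # (10000 x 100)(100 x 2) = (10000 x 2)
print(P_L.shape)                           # ready for a scatter plot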
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: $A = U\Sigma V^T$, with $\Sigma$ diagonal, containing the r singular values (r could be a large number).
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: $A = U\Sigma V^T$. Take only $k \ll r$ singular values: $A_k = U_k\Sigma_k V_k^T$. A data point d can now be represented by a k-dimensional point.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 ≈ 8 mln numbers.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
SVD, 15 largest SVs: 1% of the data.
SVD EXAMPLE: USING MY SON AS AN EXPERIMENT
SVD, 75 largest SVs: 5% of the data.
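A NumPy sketch of the experiment, with a random matrix standing in for the photo: reconstruct from the 75 largest singular values and check what fraction of the numbers must be stored.

import numpy as np

img = np.random.rand(2448, 3264)                     # stand-in for the ~8 mln pixel photo
U, s, Vt = np.linalg.svd(img, full_matrices=False)   # this takes a little while

k = 75
img_k = (U[:, :k] * s[:k]) @ Vt[:k, :]               # rank-k approximation of the image

stored = k * (img.shape[0] + img.shape[1] + 1)       # U_k, V_k and the k singular values
print(stored / img.size)                             # ~0.05: roughly 5% of the original numbers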
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling (a clustering sketch follows below). For example, from X1, X2, X3, …, X500:
• cluster {X1, X21, X35, X430, …} → keep X35
• cluster {X17, X29, X353, X490, …} → keep X29
• cluster {X37, X95, X251, X393, …} → keep X251
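As an illustrative sketch outside SAS (not the PROC VARCLUS algorithm itself), one can cluster the inputs hierarchically on 1 − |correlation| and keep one representative per cluster:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 10))
X = np.hstack([base, base + 0.1 * rng.normal(size=(1000, 10))])  # 20 correlated inputs

dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))   # inputs-by-inputs distance
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=10, criterion="maxclust")    # ask for ten clusters

reps = [int(np.flatnonzero(labels == c)[0]) for c in np.unique(labels)]
print(reps)   # use only these columns for predictive modeling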
BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough, let multiple models vote for a prediction: bootstrap aggregation (bagging). This only makes sense if the underlying models are different enough and have some predictive power.
(Diagram: data → random samples → individual models → final model.)
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees. Apply the underlying steps repeatedly (a sketch follows below):
1. Generate a bootstrap sample
2. Randomly choose m inputs, with m ≪ P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree, the random forest prediction is the majority vote of all trees; in case of a regression tree, it is the average of all trees.
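A compact sketch of these steps with scikit-learn trees (illustrative, not the SAS implementation; for brevity the m inputs are drawn once per tree, whereas the full algorithm redraws them at every split):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=100, m=None, seed=0):
    n, p = X.shape
    m = m or max(1, int(np.sqrt(p)))                  # m << P inputs per tree
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)             # 1. bootstrap sample
        cols = rng.choice(p, size=m, replace=False)   # 2. choose m inputs
        tree = DecisionTreeClassifier()               # 3. fit, do not prune
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((cols, tree))
    return forest

def random_forest_predict(forest, X):
    votes = np.array([t.predict(X[:, cols]) for cols, t in forest]).astype(int)
    # majority vote of all trees (integer class labels assumed)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)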
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
Decision tree and random forest (100 subtrees) fitted on the simulated data.
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
It is clear that the forest produces much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW
Gradient boosting runs M iterations, m = 1, 2, …, M. At each successive iteration a base learner $h_m$ (a decision tree) is fit on the pseudo residuals $r_m$, using the inputs x, to "correct" the previous learner:
$F_m = F_{m-1} + \gamma \cdot h_m$
(Diagram: inputs x and pseudo residuals r1, r2, …, rM feeding successive trees; final model FM. A minimal sketch follows below.)
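A minimal sketch of this loop for squared-error loss, where the pseudo residuals are the plain residuals and each tree corrects the model built so far:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, gamma=0.1):
    F = np.full(len(y), y.mean())      # F_0: start from a constant model
    trees = []
    for _ in range(M):
        r = y - F                      # pseudo residuals r_m
        h = DecisionTreeRegressor(max_depth=3).fit(X, r)   # base learner h_m
        F = F + gamma * h.predict(X)   # F_m = F_{m-1} + gamma * h_m
        trees.append(h)
    return y.mean(), trees

def gradient_boost_predict(f0, trees, X, gamma=0.1):
    return f0 + gamma * sum(t.predict(X) for t in trees)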
SUPPORT VECTOR MACHINES

Support vector machines (SVM): suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M; the green line would then be better than the blue line.
If the problem is not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but under the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).
The beauty of SVM is that in the calculation of the decision boundary we do not need to use these transformations explicitly: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model nonlinear behaviour
Linearly not separable in 2D, but in 3D space they are: https://www.youtube.com/watch?v=3liCbRZPrZA
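A scikit-learn sketch of both points, with C bounding the total penalty and an RBF kernel supplying the non-linear mapping implicitly (the concentric-circles data is a stand-in for the video's example):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)  # no separating line exists in 2D
rbf = SVC(kernel="rbf", C=1.0).fit(X, y)        # kernel trick: non-linear boundary
                                                # without computing the mapping
print(linear.score(X, y), rbf.score(X, y))      # the rbf fit should be near 1.0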
K-NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours (a small sketch follows below).
Example: among the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.
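A minimal sketch of that vote in plain NumPy (y_train is assumed to hold small non-negative integer class labels):

import numpy as np

def knn_predict(X_train, y_train, x0, k=5):
    dist = np.linalg.norm(X_train - x0, axis=1)    # distance to every training point
    nearest = np.argsort(dist)[:k]                 # the k closest points
    return np.bincount(y_train[nearest]).argmax()  # majority vote, e.g. 3 red vs 2 green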
K-NN METHOD
(Figures: decision boundaries with 1 nearest neighbour vs. 15 nearest neighbours.)
K-NN METHOD
Use different numbers k of nearest neighbours and compare the test and training errors. Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price? For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.
Comparing different nearest neighbours in SAS Enterprise Miner:

K-NN EXAMPLE: DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values of k were tried; the k=5 nearest neighbours model has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION
$Y = f(X, w) = w_1 + w_2X_2 + w_3X_3 + w_4X_4$
(Diagram: inputs 1, X2, X3, X4 with weights w1, …, w4 feeding a single compute node f.)
f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w's that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION
In formula, the prediction formula for a NN is given by
$z_m = \sigma(\alpha_{0m} + \alpha_m^T X)$ and $Y = g(\beta_0 + \beta^T z)$.
(Diagram: inputs X1, …, X4, here age, income, region and gender; a hidden layer z1, z2, z3 with weights α; an output Y with weights β.)
The functions g and σ are the activation functions; σ is typically the sigmoid, and in case of a binary classifier g can be taken as the logistic function. The model weights α and β have to be estimated from the data.
NEURAL NETWORKS: ESTIMATING THE WEIGHTS
Back-propagation algorithm (a one-step sketch follows below):
Randomly choose small values for all wi's. Then, for each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w against the gradient: $w \leftarrow w - \eta \, \partial E / \partial w$
4. Stop if the error E is small enough
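A sketch of step 3 for a single compute node with squared error and logit activation (η is the learning rate; the full algorithm applies this update layer by layer via the chain rule):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, x, actual, eta=0.1):
    pred = sigmoid(w @ x)                   # the neural net prediction
    # chain rule on E = (actual - pred)^2 through the sigmoid:
    grad = -2 * (actual - pred) * pred * (1 - pred) * x
    return w - eta * grad                   # adjust the weights against the gradient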
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.
(Diagram: inputs X1, …, X4 → ENCODE → two-dimensional middle layer → DECODE → outputs X1, …, X4.)
A 2-dimensional middle layer can be used for visualisation; a linear activation function corresponds to 2-dimensional principal components analysis.
NEURAL NETS: AUTOENCODERS
Often more hidden layers with many nodes are used.
(Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT.)
NEURAL NET: CARS EXAMPLE
2-dimensional PCA vs. an autoencoder network 25 - 15 - 2 - 15 - 25.
NEURAL NETS: AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Linear transformation of data to uncorrelated data
The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
X1
X2
P 1
P 2
x x x x x x x
xx
x
x
xx
x
x
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[Waveforms of two recorded digits: 1 and 2]
PRINCIPAL COMPONENTS ANALYSIS
[Scatter plot: data points in the (X1, X2) plane; the principal directions P1 and P2 form the new axes]
PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND
With two dimensions, P = X W:

   [ p11  p21 ]     [ x11  x21 ]
   [  ..   .. ]  =  [  ..   .. ]  ·  [ w11  w21 ]
   [ p1n  p2n ]     [ x1n  x2n ]     [ w12  w22 ]

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.
In general, it turns out that the columns of W are the eigenvectors of the matrix XᵀX.
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here.
Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs
PRINCIPAL COMPONENTS: DIMENSION REDUCTION
P = X W. Now take only the first L columns of W: P_L = X W_L
For example, for visualization use only the first 2 or 3 columns, so that P_L has only 2 or 3 columns that can be visualized in scatter or contour plots.

P   = X · W   :  (10000 × 100) = (10000 × 100) · (100 × 100)
P_L = X · W_L :  (10000 × 2)   = (10000 × 100) · (100 × 2)
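As a concrete sketch, SAS/STAT's PROC PRINCOMP computes exactly this P = X W (the dataset mydata and the inputs x1-x100 are hypothetical). By default the procedure works on the correlation matrix, i.e. the inputs are standardized first, which is the scaling remark above:

proc princomp data=mydata out=scores n=2;
   /* OUT= returns the component scores: Prin1 and Prin2,
      the first two columns of P = XW */
   var x1-x100;
run;

The data set scores then contains Prin1 and Prin2 (the two columns of P_L), ready for a scatter plot.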
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ
Σ is diagonal with r singular values [r could be a large number].
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number].
Take only k << r singular values: A_k = U_k Σ_k V_kᵀ
A datapoint d can now be represented by a k-dimensional point.
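A minimal SAS/IML sketch of this rank-k reconstruction (the random 1000 × 100 matrix is a hypothetical stand-in for real data):

proc iml;
call randseed(123);
A = j(1000, 100);
call randgen(A, "Normal");                 /* hypothetical data matrix A   */
call svd(U, Q, V, A);                      /* A = U * diag(Q) * V`         */
k  = 15;                                   /* keep only k << r values      */
Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;  /* rank-k approximation of A    */
err = max(abs(A - Ak));                    /* reconstruction error         */
print err;
quit;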
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 ≈ 8 mln numbers
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 15 largest SVs ≈ 1% of the data
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 75 largest SVs ≈ 5% of the data
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling. A code sketch follows below.
X1, X2, X3, …, X500
Cluster 1: X1, X21, X35, X430, … → use X35
Cluster 2: X17, X29, X353, X490, … → use X29
Cluster 3: X37, X95, X251, X393, … → use X251
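A minimal sketch with SAS/STAT's PROC VARCLUS (the dataset mydata and inputs x1-x500 are hypothetical):

proc varclus data=mydata maxclusters=10 short;
   /* divisive clustering of the 500 inputs into at most 10 clusters */
   var x1-x500;
run;

The cluster listing reports a 1−R² ratio per variable; keeping the variable with the lowest ratio in each cluster gives one representative input per cluster.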
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough, let multiple models vote for a prediction: bootstrap aggregation (bagging).
This only makes sense if the underlying models are different enough and have some predictive power.
[Diagram: random samples drawn from the data, one model per sample, combined into a final model]
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees.
Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Randomly choose m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
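A minimal sketch, assuming the Enterprise Miner high-performance procedure HPFOREST is licensed (dataset train, binary target default and inputs x1-x20 are hypothetical; VARS_TO_TRY plays the role of m above):

proc hpforest data=train maxtrees=500 vars_to_try=5;
   /* 500 trees, each split considers a random subset of 5 inputs */
   target default / level=binary;
   input x1-x20 / level=interval;
run;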
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
It is clear that the forest produces much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW
Gradient boosting: M iterations, m = 1, 2, …, M.
At each successive iteration a base learner h_m (a decision tree) is fit on the pseudo-residuals r_m, using the inputs x, to "correct" the previous learner:
F_m = F_{m-1} + γ·h_m
Final model: F_M
[Diagram: inputs x and pseudo-residuals r_1, r_2, …, r_M feeding M successive base learners]
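To make the recipe concrete: with squared-error loss the pseudo-residuals reduce to ordinary residuals, and the scheme reads:

F_0(x) = ȳ                                   (start from the mean of the target)
for m = 1, …, M:
    r_im = y_i − F_{m−1}(x_i)                (pseudo-residuals = plain residuals here)
    fit a small regression tree h_m on (x_i, r_im)
    F_m(x) = F_{m−1}(x) + γ·h_m(x)           (γ a small shrinkage factor, e.g. 0.1)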
SUPPORT VECTOR MACHINES
Support vector machines (SVM): suppose we have a separable classification problem.
Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The problem might not be linear in the input space. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model nonlinear behaviour
The standard forms are sketched below.
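A sketch of the standard formulations (reconstructed here; the slide showed them as images):

Separable:      min_{β,β₀} ½‖β‖²                 s.t. y_i(x_iᵀβ + β₀) ≥ 1
Non-separable:  min_{β,β₀} ½‖β‖² + C·Σ_i ξ_i     s.t. y_i(x_iᵀβ + β₀) ≥ 1 − ξ_i,  ξ_i ≥ 0
Lagrange dual:  max_α Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩   s.t. 0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0
Kernel trick:   replace ⟨x_i, x_j⟩ by K(x_i, x_j), e.g. K(x, x′) = exp(−γ‖x − x′‖²)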
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable, but in 3D space they are.
K-NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
[Figure: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red]
K-NN METHOD
[Decision boundaries: 1 nearest neighbour vs. 15 nearest neighbours]
K-NN METHOD
Use different numbers k of nearest neighbours: test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.
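As a plain-code sketch outside Enterprise Miner, SAS/STAT's PROC DISCRIM offers k-NN classification via METHOD=NPAR. The datasets (prices, noprice), the binned target priceband and the coordinate inputs are hypothetical; note that DISCRIM predicts a class, so the continuous price is assumed binned into price bands first:

proc discrim data=prices test=noprice testout=predicted
             method=npar k=5;
   class priceband;    /* binned house price as the class target      */
   var lat lon;        /* postal-code coordinates as distance inputs  */
run;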
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE: DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k=5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION
Y = f(X,w) = f(w1 + w2·X2 + w3·X3 + w4·X4)
[Diagram: inputs 1, X2, X3, X4 enter one compute node with weights w1, w2, w3, w4]
f is the so-called activation function. This could be the logit function, but other choices are possible.
There are four weights w's that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION
In formula form, the prediction formula for a NN is given by:
[Diagram: inputs X1…X4 (age, income, region, gender), hidden layer Z1, Z2, Z3, output Y; weights α into the hidden layer and β into the output layer]
Z_j = σ(α_{0j} + α_jᵀX),   Y = g(β_0 + βᵀZ)
The functions g and σ are activation functions, e.g. the sigmoid σ(t) = 1/(1 + e^(−t)); in case of a binary classifier, g is the logistic function so that Y is a probability.
The model weights α and β have to be estimated from the data.
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back-propagation algorithm:
Randomly choose small values for all w_i's. For each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w according to w_new = w_old − α·∂E/∂w
4. Stop if the error E is small enough
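A toy SAS/IML illustration of one such weight update, for a single logistic compute node (all numbers are made up):

proc iml;
x     = {1, 0.5, -1.2, 2.0};              /* inputs, incl. the bias input 1 */
y     = 1;                                /* actual target                  */
w     = {0.01, 0.02, -0.01, 0.03};        /* small random starting weights  */
alpha = 0.1;                              /* learning rate                  */
p     = 1 / (1 + exp(-w` * x));           /* neural net prediction          */
E     = (y - p)##2;                       /* squared error                  */
grad  = -2 * (y - p) # p # (1 - p) # x;   /* dE/dw for the logistic f       */
w     = w - alpha * grad;                 /* adjust the weights             */
print E, w;
quit;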
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.
[Diagram: inputs X1…X4 → ENCODE → 2-dimensional middle layer → DECODE → outputs X1…X4]
A linear activation function corresponds to 2-dimensional principal components analysis; the 2-dimensional middle layer can be used for visualisation.
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Often more hidden layers with many nodes: INPUT → ENCODE → DECODE → OUTPUT (= INPUT)
NEURAL NET CARS EXAMPLE
2-dimensional PCA vs. autoencoder network 25 – 15 – 2 – 15 – 25
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data=autoencoderTraining dmdbcat=work.autoencoderTrainingCat;
   performance compile details cpucount=12 threads=yes;
   /* DEFAULTS: ACT=TANH, COMBINE=LINEAR               */
   /* IDs are used as layer indicators - see Figure 6  */
   /* Inputs and targets should be standardized        */
   archi MLP hidden=5;
   hidden 300 / id=h1;
   hidden 100 / id=h2;
   hidden 2   / id=h3 act=linear;
   hidden 100 / id=h4;
   hidden 300 / id=h5;
   input corruptedPixel1 - corruptedPixel400 / id=i level=int std=std;
   target pixel1-pixel400 / act=identity id=t level=int std=std;
   /* Before preliminary training the weights will be random */
   initial random=123;
   prelim 10 preiter=10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data
BAYESIAN NETWORKS
BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting:
• Parse & filter: part of speech, entity detection, mixed/numeric/abbrev., stemming, spell checks, stop list, synonym list, multi-term words
• Apply traditional data mining: clustering, prediction, machine learning
TEXT MINING BASICS
Document 1: "Ik loop over straat in Amsterdam, 1057DK, met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets, bitly.com/sdrtw"
Document 3: "Mijn tweewieler is kapot, wat een slecht stuk ijzer $$"

TERM–DOCUMENT MATRIX A
Terms                        Doc 1   Doc 2   Doc 3
+Fiets (znmw)                  1       1       1
Fietsen (ww)                   0       1       0
Blauwe (bvg)                   0       1       0
Amsterdam (locatie)            1       0       0
+Lopen (ww)                    1       1       0
Straat (znmw)                  1       0       0
Kapot (bijw)                   0       0       1
Slecht                         0       0       1
Stuk Ijzer                     0       0       1
1057DK (postcode)              1       0       0
bitly.com/sdrtw (Internet)     0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING: TERM–DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term–document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse
Apply singular value decomposition first.
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [could be many thousands].
Take only the first k << r singular values: A_k = U_k Σ_k V_kᵀ
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topic 1, topic 2, topic 3, …).
RECOMMENDATION ENGINE: WHICH PRODUCT SHOULD I RECOMMEND TO MY CUSTOMERS?
RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: with 1 mln users and 100K items, only ~0.01% is filled.

User–Item Matrix:
            Item 1   Item 2   Item 3   Item 4   Item 5
User 1        3        2        5        4        5
User 2        -        -        -        1        1
User 3        1        -        2        5        -
User 4        -        -        1        2        5
User 5        2        1        4        2        3
User 6        2        3        -        5        1
User 7        5        1        -        3        4
User 8        -        1        -        4        1
User 9        2        3        2        4        2
User 10       -        1        3        -        1

User 4's item ratings: -, -, 1, 2, 5
After some math… the predicted ratings for User 4 are: 3.21, 4.82, 1, 2, 5 → recommend item 2.
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: slope one (slope1), k nearest neighbours (knn)
• Model-based algorithms: matrix factorization (SVD – LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE
y = x + b, with slope equal to 1. Item–item based.
Weight w_ij: the number of users having rated both items i and j. Rating r_uj: user u's rating of item j, from which the average deviations between items are computed. A worked example follows below.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
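A worked prediction from this (reconstructed) table, using the weighted Slope One scheme: what rating would Lucy give Item A?

dev(A,B) = ((5−3) + (3−4)) / 2 = 0.5     (John and Mark rated both A and B)
dev(A,C) = (5−2) / 1 = 3                 (only John rated both A and C)
P(Lucy, A) = ( (2 + 0.5)·2 + (5 + 3)·1 ) / (2 + 1) = 13/3 ≈ 4.33

So Lucy's predicted rating for Item A is about 4.3.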
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".
How to determine the neighbors, and how many (k) to use?
How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments
[Diagram: similarity w, neighbors N]
RE METHODS: PEARSON CORRELATION
a, b: users; r_{a,p}: the rating of user a for item p; P: the set of items rated by both a and b. Possible similarity values between −1 and 1.

sim(a,b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )
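A small SAS/IML sketch of this similarity (the two rating vectors over the common item set P are hypothetical; here the means r̄_a and r̄_b are taken over the common items only):

proc iml;
start sim(ra, rb);
   da = ra - mean(ra);          /* center user a's ratings            */
   db = rb - mean(rb);          /* center user b's ratings            */
   return( sum(da # db) / sqrt(ssq(da) * ssq(db)) );
finish;
ra = {5, 3, 2};                 /* user a's ratings on the items in P */
rb = {3, 4, 1};                 /* user b's ratings on the same items */
s = sim(ra, rb);
print s;
quit;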
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS: MATRIX FACTORIZATION
How do we fill in the missing data?
Factor the m × n rating matrix R (users × items) as R ≈ U · V, with U an m × k matrix and V a k × n matrix, where k is the number of hidden factors.
Predict a new rating: R̂_ij = U_iᵀ V_j   (row i of U, column j of V)
Minimize the prediction error:  min_{U,V} Σ_ij ( R_ij − U_iᵀ V_j )² + λ( ‖U_i‖² + ‖V_j‖² )
Select a loss function (squared error) and the number of hidden factors k; the resulting optimization problem is solved with L-BFGS or ALS.
RE METHODS: CLUSTER
[Diagram: user/item profiles and user/item ratings → clustering → k-NN within one subgroup → predictions]
First cluster the users/items; then apply k-NN within one subgroup to make the predictions.
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C; IF item X THEN item Y
Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X,Y) = #transactions with X and Y / #total transactions
Lift(X→Y) = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift, for example: Diapers → Beer: 0.8; Diapers → Candles: 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
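A quick illustration with made-up numbers: if 2% of all transactions contain diapers, 4% contain beer, and 0.2% contain both, then

Lift(Diapers → Beer) = 0.002 / (0.02 · 0.04) = 2.5

so diaper buyers are 2.5 times more likely than average to also buy beer.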
METHOD: ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS / item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR / recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd / factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3
                maxiter = 100 maxfeval = 5000 function = L2
                lamda = 0.2 technique = lbfgs;
   RUN;
   METHOD arm / label = "ARM";
   RUN;
   /* Information on the recommender system */
   INFO;
QUIT;
/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT / method = svd label = "svd" Num = 3 users = ("Longhow Lam");
run;
QUIT;
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
P1
P2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
The Math behind
P = X W[ 119901 11 11990121
1199011119899 1199012119899
]=[ 11990911 11990921
1199091119899 1199092119899
] [11990811 119908 2111990812 119908 22]
w11 and w12 are the loadings corresponding to the first principle component
w21 and w22 are the loadings corresponding to the second principle component
With two dimensions In general
It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCADimension reductionVisualisation
Outlier anomalie detectie
PCA regression Use PC instead of the original inputs
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula form, the prediction formula for a NN with one hidden layer is given by:
Y = f(X) = g( β0 + Σm βm·Zm ),  where  Zm = σ( α0m + αmᵀX )
[Diagram: inputs X1 (Leeftijd / age), X2 (Inkomen / income), X3 (Regio / region), X4 (Geslacht / gender) feed a hidden layer Z1, Z2, Z3, which feeds the output Y.]
The functions g and σ are the output and activation functions; σ is typically the sigmoid σ(v) = 1 / (1 + e⁻ᵛ). In case of a binary classifier, g is also a sigmoid, so the output can be read as a probability.
The model weights α and β have to be estimated from the data.
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm:
Randomly choose small values for all wi's. Then, for each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w with a gradient step: wnew = wold − η · ∂E/∂w.
4. Stop if the error E is small enough.
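A sketch of these four steps for the single compute node from the earlier slide (Python with numpy; the learning rate, data and number of epochs are made-up choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.1, epochs=200):
    rng = np.random.default_rng(1)
    w = rng.normal(0, 0.01, X.shape[1])           # randomly chosen small start weights
    for _ in range(epochs):
        for xi, yi in zip(X, y):                  # for each data point (observation)
            pred = sigmoid(w @ xi)                # 1. neural net prediction
            err = pred - yi                       # 2. error: derivative of (actual - pred)^2 up to a factor
            grad = err * pred * (1 - pred) * xi   # chain rule through the activation f
            w -= lr * grad                        # 3. w_new = w_old - eta * dE/dw
    return w

X = np.array([[1, 0, 0, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 1]], float)  # first column: constant for w1
y = np.array([0, 1, 1, 0], float)
print(train(X, y))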
DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.
[Diagram: inputs X1…X4 are ENCODEd into a 2-node middle layer and DECODEd back into X1…X4.]
A linear activation function corresponds with 2-dimensional principal components analysis.
The 2-dimensional middle layer can be used for visualisation.
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Often there are more hidden layers with many nodes: ENCODE then DECODE, with OUTPUT = INPUT.
NEURAL NET CARS EXAMPLE
2-dimensional PCA compared with an autoencoder network 25 – 15 – 2 – 15 – 25
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS – ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting:
Parse & Filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.
Apply: traditional data mining, clustering, prediction, machine learning.
TEXT MINING BASICS
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" (I walk along the street in Amsterdam 1057DK with my bike)
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitly.com/sdrtw" (She did not walk but cycled with her blue bike bitly.com/sdrtw)
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" (My two-wheeler is broken, what a bad piece of iron $$)

TERM DOCUMENT MATRIX A
Terms                        Doc 1  Doc 2  Doc 3
+Fiets (znmw)                  1      1      1
Fietsen (ww)                   0      1      0
Blauwe (bvg)                   0      1      0
Amsterdam (locatie)            1      0      0
+Lopen (ww)                    1      1      0
Straat (znmw)                  1      0      0
Kapot (bijw)                   0      0      1
Slecht                         0      0      1
Stuk Ijzer                     0      0      1
1057DK (postcode)              1      0      0
bitly.com/sdrtw (Internet)     0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse
Apply a singular value decomposition first.
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
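A sketch of how the shorter document vectors come out of the truncated SVD (Python with numpy; the matrix sizes are toy assumptions, and a real term-document matrix would be sparse):

import numpy as np

A = np.random.rand(500, 100)                       # toy term-document matrix: 500 terms, 100 documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 10                                             # keep only the k largest singular values
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T       # one length-k vector per document
print(doc_vectors.shape)                           # (100, 10): each document is now a short vector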
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (Topic 1, Topic 2, Topic 3, …).
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items → ~0.01% filled.

User - Item Matrix – Data
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings: - - 1 2 5. After some math… the predicted ratings for User 4 are 3.21, 4.82, 1, 2, 5 → recommend item 2.
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE
Y = x + b with slope equal to 1; item-item based.
Weight wij: the number of users having rated both items i and j. Rating r̄uj: the average rating computed from item j.
Sample rating database:
Customer  Item A  Item B  Item C
John         5       3       2
Mark         3       4       -
Lucy         -       2       5
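A minimal sketch of (weighted) Slope One on exactly this sample table (Python; using the weighted variant is an assumption on my part):

# ratings copied from the sample rating database above
ratings = {
    'John': {'A': 5, 'B': 3, 'C': 2},
    'Mark': {'A': 3, 'B': 4},
    'Lucy': {'B': 2, 'C': 5},
}

def predict(user, target):
    num, den = 0.0, 0
    for j, r_uj in ratings[user].items():              # items the user did rate
        diffs = [r[target] - r[j] for r in ratings.values()
                 if target in r and j in r]            # users who rated both items
        if diffs:
            w = len(diffs)                             # weight w_ij: #users rating both i and j
            num += w * (r_uj + sum(diffs) / w)         # y = x + b with slope 1
            den += w
    return num / den

print(predict('Lucy', 'A'))   # 4.33, combining the B-based and C-based estimates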
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings "in the neighborhood", i.e. the standard weighted average r̂ui = ( Σj∈N wij·ruj ) / ( Σj∈N |wij| ), with similarity weights w over the neighbors N.
How to determine the neighbors, and how many (k) to use?
How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments
RE METHODS: PEARSON CORRELATION
a, b: users; ra,p: the rating of user a for item p; P: the set of items rated both by a and b. Possible similarity values between −1 and 1.

sim(a, b) = Σp∈P (ra,p − r̄a)(rb,p − r̄b) / ( √(Σp∈P (ra,p − r̄a)²) · √(Σp∈P (rb,p − r̄b)²) )
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?
R (m × n) ≈ U (m × k) · Vᵀ (k × n)   [rows: users, columns: items]
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS
Predict a new rating: R̂ij = Uiᵀ Vj
Minimize the prediction error (as sketched below):
min U,V  Σi,j ( Rij − Uiᵀ Vj )² + λ ( ‖Ui‖² + ‖Vj‖² )
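A small stochastic-gradient sketch of this factorization (Python with numpy; the deck uses L-BFGS or ALS, SGD here is just the simplest way to show the loss and the λ regularization):

import numpy as np

def factorize(R, k=2, lr=0.01, lam=0.2, epochs=2000):
    m, n = R.shape
    rng = np.random.default_rng(0)
    U, V = rng.normal(0, 0.1, (m, k)), rng.normal(0, 0.1, (n, k))
    observed = [(i, j) for i in range(m) for j in range(n) if not np.isnan(R[i, j])]
    for _ in range(epochs):
        for i, j in observed:
            e = R[i, j] - U[i] @ V[j]             # prediction error on an observed cell
            U[i] += lr * (e * V[j] - lam * U[i])  # gradient steps on the squared error
            V[j] += lr * (e * U[i] - lam * V[j])  # with L2 penalty lambda
    return U, V

R = np.array([[5, 3, np.nan], [4, np.nan, 1], [np.nan, 1, 5]], float)
U, V = factorize(R)
print(U @ V.T)    # filled-in rating matrix: U_i^T V_j predicts the missing cells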
RE METHODS: CLUSTER
First cluster the user/item profiles; then apply knn within one subgroup.
[Diagram: user/item profiles → clustering → user/item ratings within a cluster → predictions.]
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:
IF item A and B THEN item C
IF item X THEN item Y
Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:
Support(X, Y) = (# trxs containing X and Y) / (total # trxs)
Lift = Support(X, Y) / ( Support(X) · Support(Y) )
Support & lift example: Diapers → Beer 0.8; Diapers → Candles 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
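Support and lift are easy to compute directly from transactions; a toy sketch (Python; the transactions are invented):

transactions = [
    {'diapers', 'beer'}, {'diapers', 'beer', 'chips'},
    {'diapers', 'candles'}, {'beer'}, {'milk', 'chips'},
]

def support(*items):
    # fraction of transactions containing all the given items
    return sum(1 for t in transactions if set(items) <= t) / len(transactions)

def lift(x, y):
    # lift > 1: people with X are more likely than average to also have Y
    return support(x, y) / (support(x) * support(y))

print(support('diapers', 'beer'))   # 0.4
print(lift('diapers', 'beer'))      # 0.4 / (0.6 * 0.6) = 1.11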
METHOD: ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;
   METHOD arm / label = ARM;
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = svd
      Num = 3
      users = ("Longhow Lam");
RUN;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience; (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
Predicted review score vs. given review score: R² linear regression = 0.5, R² neural net = 0.6.
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
MNIST DATA IN SAS: MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES
70/30 training/validation split; models tried:
• PCA regression on the 50 largest PC's
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[Audio samples of the spoken digits "1" and "2".]
PRINCIPLE COMPONENTS ANALYSIS
Scaling the inputs is important here.
Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PC's instead of the original inputs
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X · W   [(10000 × 100) = (10000 × 100) · (100 × 100)]
Now only take the first L columns of W:
PL = X · WL   [(10000 × 2) = (10000 × 100) · (100 × 2) for L = 2]
For example, for visualization only use the first 2 or 3 columns, so that PL has only 2 or 3 columns that can be visualized in scatter or contour plots (see the sketch below).
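The same projection in a few lines (Python with numpy; the 10000 × 100 shape follows the slide, the data is random):

import numpy as np

X = np.random.rand(10000, 100)
Xc = X - X.mean(axis=0)                      # centering (and scaling) the inputs matters
eigval, W = np.linalg.eigh(np.cov(Xc, rowvar=False))
W = W[:, np.argsort(eigval)[::-1]]           # columns of W: directions of decreasing variance
P_L = Xc @ W[:, :2]                          # (10000 x 100) @ (100 x 2) = (10000 x 2)
print(P_L.shape)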
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ a diagonal matrix with r singular values [r could be a large number].
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ a diagonal matrix with r singular values [r could be a large number].
Take only k << r singular values: A ≈ Ak = Uk Σk Vkᵀ.
A data point d can now be represented by a k-dimensional point.
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 ≈ 8 mln numbers.
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 15 largest SV's: 1% of the data.
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 75 largest SV's: 5% of the data.
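A sketch of the compression behind these pictures (Python with numpy; a random matrix stands in for the actual photo):

import numpy as np

img = np.random.rand(2448, 3264)             # stand-in for the grayscale photo
U, s, Vt = np.linalg.svd(img, full_matrices=False)

def compress(k):
    # keep the k largest singular values: U_k Sigma_k V_k^T
    return U[:, :k] * s[:k] @ Vt[:k, :]

img15 = compress(15)   # storing U_k, s_k, V_k costs ~1% of the original numbers
img75 = compress(75)   # ~5% of the data, a much sharper reconstruction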
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling (a sketch follows below).
X1, X2, X3, …, X500
Cluster: X1, X21, X35, X430, … → keep X35
Cluster: X17, X29, X353, X490, … → keep X29
Cluster: X37, X95, X251, X393, … → keep X251
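SAS offers PROC VARCLUS for this; as a language-agnostic illustration, a greedy sketch in Python (the 0.7 threshold and the random data are made-up):

import numpy as np

def cluster_vars(X, threshold=0.7):
    # group inputs whose absolute correlation with a seed variable is high,
    # then keep one representative input per cluster
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned, clusters = list(range(X.shape[1])), []
    while unassigned:
        seed = unassigned.pop(0)
        members = [seed] + [j for j in unassigned if corr[seed, j] > threshold]
        unassigned = [j for j in unassigned if j not in members]
        clusters.append(members)
    return clusters

X = np.random.rand(1000, 50)
keep = [c[0] for c in cluster_vars(X)]   # one input per cluster for modeling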
BAGGING amp BOOSTING
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging). Each model is fitted on a random sample of the data, and the final model combines their votes.
This only makes sense if the underlying models are different enough and have some predictive power.
Bagging & Boosting: Random Forests
Random forests ≈ bagging with trees. Apply the underlying steps repeatedly (a sketch follows below):
1. Generate a bootstrap sample.
2. Choose randomly m inputs, m << P.
3. Fit a tree on the bootstrap sample with the m inputs (do not prune).
In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
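A sketch of exactly these three steps (Python with scikit-learn trees; note that production forests draw the m inputs at every split rather than once per tree):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def forest_predict(X, y, x_new, n_trees=100, m=2):
    rng = np.random.default_rng(0)
    votes = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), len(X))           # 1. bootstrap sample
        cols = rng.choice(X.shape[1], m, replace=False)  # 2. m randomly chosen inputs, m << P
        tree = DecisionTreeClassifier()                  # 3. fully grown tree (no pruning)
        tree.fit(X[rows][:, cols], y[rows])
        votes.append(tree.predict(x_new[cols].reshape(1, -1))[0])
    return max(set(votes), key=votes.count)              # majority vote of all trees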
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
A decision tree and a random forest (100 sub-trees) fitted on the simulated data.
FOREST VS TREE: EXAMPLE ON SIMULATED DATA
It is clear that the forest can produce much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW
Gradient boosting: M iterations, m = 1, 2, …, M.
[Diagram: the inputs x and the pseudo residuals r1, r2, …, rM feed successive base learners, ending in the final model FM.]
At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals ri,m of that step, using the inputs x, to "correct" the previous learner (sketched in code below):
Fm = Fm−1 + γ·hm
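The whole scheme in a few lines for squared error (Python with scikit-learn trees; the depth, γ and M are arbitrary choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, gamma=0.1):
    F = np.full(len(y), y.mean())                      # F_0: start from the mean
    trees = []
    for _ in range(M):
        r = y - F                                      # pseudo residuals r_m (negative gradient)
        h = DecisionTreeRegressor(max_depth=3).fit(X, r)   # base learner h_m fit on the residuals
        F = F + gamma * h.predict(X)                   # F_m = F_{m-1} + gamma * h_m
        trees.append(h)
    return trees, F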
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PRINCIPLE COMPONENTS DIMENSION REDUCTION
P = X WNow only take the first L columns of W
PL = X WL
For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots
XW
P=
XWLPL
=
(10000 by 100 ) (100 by 100)(10000 by 100 )
(10000 by100 ) (100 by2)(10000 by 2)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform the reviews into data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
   R² linear regression = 0.5
   R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784 dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8-nearest neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

 70/30 training/validation split
 PCA regression on the 50 largest PC's
 Seven single-layer neural nets: 3, 6, 12, 24, 48, 100 and 200 neurons
 Seven multi-layer neural nets
 Three random forests: 100, 500 and 1000 trees
 8, 16 and 24 nearest neighbours
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels.

Our best model predicted the label for these digits. The first 100 predicted digits, together with the handwritten digits, are displayed here.

Red numbers are the predicted labels. We see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Figure: recordings of the spoken digits 1 and 2]
SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ a diagonal matrix holding the r singular values [r could be a large number].

Take only k << r singular values: A ≈ A_k = U_k Σ_k V_kᵀ.

A data point d can now be represented by a k dimensional point.
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original: 2448 x 3264 pixels ~ 8 mln numbers.
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 15 largest SV's: ~ 1% of the data.
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 75 largest SV's: ~ 5% of the data.
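These percentages follow directly from the storage count of a truncated SVD: keeping k singular values of an m x n matrix takes k·(m + n + 1) numbers. Here 15·(2448 + 3264 + 1) = 85,695 numbers, about 1% of the 7,990,272 original pixels, and 75·5713 = 428,475 numbers, about 5%.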
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.

[Diagram: the inputs X1, X2, X3, …, X500 fall into clusters, each represented by one member: the cluster X1, X21, X35, X430, … by X35; the cluster X17, X29, X353, X490, … by X29; the cluster X37, X95, X251, X393, … by X251.]
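One way to carry this out in SAS is PROC VARCLUS in SAS/STAT (the Variable Clustering node in Enterprise Miner wraps the same idea). A minimal sketch, assuming a hypothetical data set mylib.inputs holding x1-x500:

proc varclus data = mylib.inputs maxclusters = 10 short;
   var x1-x500;   /* divisively cluster the 500 correlated inputs */
run;

From each cluster one then keeps a single representative, typically the variable with the lowest 1-R² ratio in the cluster summary.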
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging).

This only makes sense if the underlying models are different enough and have some predictive power.

[Figure: random samples are drawn from the data, a model is fitted on each sample, and the models are combined into a final model.]
Bagging & Boosting: Random Forests

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
 1. Generate a bootstrap sample
 2. Randomly choose m inputs, m << P
 3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
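As an illustration of how such a forest can be requested in SAS, a minimal, hypothetical PROC HPFOREST call (the high-performance procedure shipped with Enterprise Miner); data set, target and input names are made up, and the option values are placeholders:

proc hpforest data = mylib.train
              maxtrees = 500        /* number of trees in the forest */
              vars_to_try = 22      /* m inputs sampled per split, ~sqrt(500) */
              trainfraction = 0.632;/* bootstrap-like sampling fraction */
   target default / level = binary;
   input x1-x500 / level = interval;
run;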
FOREST VS TREE EXAMPLE ON SIMULATED DATA
A decision tree and a random forest (100 sub-trees) fitted on the simulated data.
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is easy to see that the forest produces much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient boosting: M iterations, m = 1, 2, …, M.

Start from the inputs x and the initial pseudo-residuals r1. At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo-residuals rm, using the inputs x, to "correct" the previous learner:

   Fm = Fm-1 + γm · hm

The pseudo-residuals rim are recomputed at each step (for squared-error loss they are simply the current residuals yi − Fm-1(xi)), and the final model is FM.
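To make the scheme concrete, a toy SAS/IML sketch of gradient boosting with depth-1 stumps under squared-error loss; the simulated data, the shrinkage γ = 0.1, M = 100 and the candidate split grid are arbitrary illustration choices (in practice one would use the Gradient Boosting node in Enterprise Miner):

proc iml;
   call randseed(1);
   n = 200;
   x = j(n, 1);  call randgen(x, 'UNIFORM');  x = 10 * x;
   eps = j(n, 1);  call randgen(eps, 'NORMAL', 0, 0.3);
   y = sin(x) + eps;                        /* simulated data */

   F = j(n, 1, y[:]);                       /* F0 = mean(y) */
   gamma = 0.1;  M = 100;                   /* shrinkage and number of iterations */
   splits = do(0.5, 9.5, 0.5);              /* candidate split points for the stump */

   do m = 1 to M;
      r = y - F;                            /* pseudo-residuals (squared-error loss) */
      bestSSE = .;
      do si = 1 to ncol(splits);            /* fit the best one-split tree on r */
         s = splits[si];
         left = loc(x < s);  right = loc(x >= s);
         if ncol(left) > 0 & ncol(right) > 0 then do;
            mL = r[left][:];  mR = r[right][:];
            sse = ssq(r[left] - mL) + ssq(r[right] - mR);
            if bestSSE = . | sse < bestSSE then do;
               bestSSE = sse;  bestS = s;  bestL = mL;  bestR = mR;
            end;
         end;
      end;
      h = choose(x < bestS, bestL, bestR);  /* the fitted stump hm */
      F = F + gamma * h;                    /* Fm = Fm-1 + gamma*hm */
   end;
   print "training MSE:" (ssq(y - F)/n);
quit;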
SUPPORT VECTOR MACHINES
Support vector machines (SVM): suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If the data are not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

 Separable classification
 Non-separable classification
 Non-separable classification rewritten using the Lagrange dual problem
 Kernels to model non-linear behaviour
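In standard form (cf. The Elements of Statistical Learning, ch. 12), the four bullets above read:

   Separable:      \min_{\beta, \beta_0} \tfrac{1}{2}\lVert\beta\rVert^2
                   \quad \text{s.t. } y_i (x_i^T \beta + \beta_0) \ge 1

   Non-separable:  \min_{\beta, \beta_0, \xi} \tfrac{1}{2}\lVert\beta\rVert^2 + C \sum_i \xi_i
                   \quad \text{s.t. } y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i,\; \xi_i \ge 0

   Lagrange dual:  \max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
                   \quad \text{s.t. } 0 \le \alpha_i \le C,\; \sum_i \alpha_i y_i = 0

   Kernel trick:   replace \langle x_i, x_j \rangle by a kernel K(x_i, x_j),
                   e.g. the radial basis kernel K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)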
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linearly not separable in 2D, but in 3D space they are.
K - NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: of the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.
K-NN METHOD
[Figure: decision boundaries for 1 nearest neighbour vs. 15 nearest neighbours]
K-NN METHOD
Use different numbers k of nearest neighbours and compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.
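The idea in miniature, as a SAS/IML sketch; the coordinates and prices are made up for illustration (the slides that follow do this properly in Enterprise Miner):

proc iml;
   xy    = {0 0, 1 0, 0 1, 2 2, 3 1};      /* locations of houses for sale */
   price = {200, 220, 210, 350, 300};       /* their asking prices (x 1000 euro) */
   q = {1 1};                               /* postal code without a price */
   k = 3;
   d = sqrt( (xy[,1] - q[1])##2 + (xy[,2] - q[2])##2 );
   call sortndx(ndx, d, 1);                 /* indices sorted by distance */
   est = price[ ndx[1:k] ][:];              /* average of the k nearest prices */
   print est;
quit;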
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE DUTCH HOUSE PRICES
30% of the data was used as a validation set. In Enterprise Miner different values of k were tried; k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK LINEAR REGRESSION
[Diagram: the inputs 1, X2, X3, X4 feed a compute node through the weights w1, w2, w3, w4]

   Y = f(X, w) = f( w1 + w2·X2 + w3·X3 + w4·X4 )

Neural network compute node: f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w's that have to be determined.
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula form, the prediction formula for a NN is given as follows. The inputs X1, X2, X3, X4 (e.g. age, income, region, gender) feed a hidden layer Z1, Z2, Z3 through weights α, and the hidden layer feeds the output Y through weights β:

   Z_j = σ( α_0j + α_jᵀ X )
   Y   = g( β_0 + βᵀ Z )

Here σ is the sigmoid (logistic) function and g is the output function: the identity for regression, and in case of a binary classifier the logistic function. The model weights α and β have to be estimated from the data.
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back-propagation algorithm:

Randomly choose small values for all weights wi. Then, for each data point (observation):
 1. Calculate the neural net prediction.
 2. Calculate the error E (for example E = (actual − prediction)²).
 3. Adjust the weights w according to the gradient-descent update w_new = w_old − η · ∂E/∂w, with learning rate η.
 4. Stop if the error E is small enough.
DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use the inputs to predict the inputs.

[Diagram: X1, X2, X3, X4 → ENCODE → 2 dimensional middle layer → DECODE → X1, X2, X3, X4]

A linear activation function corresponds with 2 dimensional principal components analysis. The 2 dimensional middle layer can be used for visualisation.
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes:

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]
NEURAL NET CARS EXAMPLE
2 dimensional PCA vs. an autoencoder network 25 - 15 - 2 - 15 - 25.
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So: a 400 dimensional input vector X = (x1, …, x400)
• Compare two dimensional PCA with a neural net autoencoder
NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two dimensional representation of the 400 dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from the training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.
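A small textbook illustration of what the graph buys you: the joint distribution factorizes as P(X1, …, Xn) = Π_i P(Xi | parents(Xi)). For the classic three-node network Rain → Sprinkler and (Rain, Sprinkler) → WetGrass this gives P(Rain, Sprinkler, WetGrass) = P(Rain) · P(Sprinkler | Rain) · P(WetGrass | Rain, Sprinkler), so only three small conditional probability tables have to be estimated instead of one table with 2³ entries.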
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting.

Parse & filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Then apply traditional data mining: clustering, prediction, machine learning.
TEXT MINING BASICS
Three (Dutch) example documents:

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $%$&"

TERM-DOCUMENT MATRIX A:

   Terms                      Doc 1   Doc 2   Doc 3
   +Fiets (znmw)                1       1       1
   Fietsen (ww)                 0       1       0
   Blauwe (bvg)                 0       1       0
   Amsterdam (locatie)          1       0       0
   +Lopen (ww)                  1       1       0
   Straat (znmw)                1       0       0
   Kapot (bijw)                 0       0       1
   Slecht                       0       0       1
   Stuk Ijzer                   0       0       1
   1057DK (postcode)            1       0       0
   bitlycomsdrtw (Internet)     0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term-document matrix:

• often more terms than documents,
• rows could be strongly correlated,
• the matrix is often very sparse.

Apply a singular value decomposition first.
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ a diagonal matrix holding the r singular values [r could be many thousands]. Take only the first k << r singular values: A ≈ A_k = U_k Σ_k V_kᵀ.
TEXT MINING APPLICATIONS
Combine structured customer data and unstructured data to better predict behaviour (churn / fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER - ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives ~ 0.01% filled.

User - Item Matrix (data):

             Item 1   Item 2   Item 3   Item 4   Item 5
   User 1      3        2        5        4        5
   User 2      -        -        -        1        1
   User 3      1        -        2        5        -
   User 4      -        -        1        2        5
   User 5      2        1        4        2        3
   User 6      2        3        -        5        1
   User 7      5        1        -        3        4
   User 8      -        1        -        4        1
   User 9      2        3        2        4        2
   User 10     -        1        3        -        1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predicted ratings for User 4's missing items 1 and 2 are 3.21 and 4.82.

Recommend item 2.
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms: slope one (slope1), k nearest neighbours (knn)
Model-based algorithms: matrix factorization (SVD - L-BFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble
RE METHODS SLOPE ONE
Item-item based: predict along the line y = x + b, i.e. a regression with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r̄_uj: the average rating computed from item j.

Sample rating database:

   Customer   Item A   Item B   Item C
   John          5        3        2
   Mark          3        4        -
   Lucy          -        2        5
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings "in the neighborhood" N, weighted by similarities w:

 How to determine the neighbours, and how many (k) to use?
 How to compute the similarity / distance measure?
 • Pearson's correlation coefficient
 • Cosine distance
 • Other adjustments
RE METHODS: PEARSON CORRELATION

a, b : users
r_a,p : rating of user a for item p
P : set of items rated both by a and b
Possible similarity values between −1 and 1.

   sim(a, b) = [ Σ_{p ∈ P} (r_a,p − r̄_a)(r_b,p − r̄_b) ] / [ √( Σ_{p ∈ P} (r_a,p − r̄_a)² ) · √( Σ_{p ∈ P} (r_b,p − r̄_b)² ) ]
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SINGULAR VALUE DECOMPOSITION
A datapoint d can now be represented by k dimensional point
Matrix SVD decomposition
Diagonal with r singular values [ could be a large number]
UAVT
Σ
Take only k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original2448 X 3264 ~ 8 mln numbers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 15 largest SVrsquos1 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2 / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
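For intuition, a minimal Python sketch (with made-up binary variables highLTV and default) of how one node's conditional probability table can be derived from training data by simple frequency counts:

import numpy as np

rng = np.random.default_rng(1)
highLTV = rng.integers(0, 2, size=1000)                     # parent node (toy data)
default = (rng.random(1000) < np.where(highLTV == 1, 0.3, 0.1)).astype(int)

for parent in (0, 1):
    mask = highLTV == parent
    p = default[mask].mean()                                # frequency estimate of the CPT entry
    print(f"P(default=1 | highLTV={parent}) = {p:.2f}")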
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting.

Parse & filter: part of speech, entity detection, mixed/numeric/abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.
TEXT MINING BASICS
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

TERM DOCUMENT MATRIX A

Terms                      Doc 1  Doc 2  Doc 3
+Fiets (noun)                1      1      1
Fietsen (verb)               0      1      0
Blauwe (adjective)           0      1      0
Amsterdam (location)         1      0      0
+Lopen (verb)                1      1      0
Straat (noun)                1      0      0
Kapot (adverb)               0      0      1
Slecht                       0      0      1
Stuk Ijzer                   0      0      1
1057DK (postal code)         1      0      0
bitlycomsdrtw (Internet)     0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition first.
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
Matrix SVD decomposition: A = U Σ V^T, with Σ a diagonal matrix of r singular values [r could be many thousands].

Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.

A document d is then no longer a long vector of m word counts but a much shorter vector, say of length 300.
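A minimal sketch with scikit-learn (toy documents; the deck itself uses SAS Text Miner) of building the term-document matrix and truncating its SVD. Note that scikit-learn orients the matrix as documents × terms, the transpose of the slide's A:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["ik loop over straat met mijn fiets",       # toy documents
        "zij liep met haar blauwe fiets",
        "mijn tweewieler is kapot"]

A = CountVectorizer().fit_transform(docs)           # sparse documents x terms counts
svd = TruncatedSVD(n_components=2, random_state=0)  # keep k = 2 << r singular values
docs_k = svd.fit_transform(A)                       # each document becomes a short vector
print(docs_k.shape)                                 # (3, 2)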
TEXT MINING APPLICATIONS
Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f that predicts the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (Topic 1, Topic 2, Topic 3, …).
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users × 100K items, ~0.01% filled.

User - Item Matrix - Data
           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:                   -     -     1     2     5
After some math, the predictions are:  3.21  4.82    1     2     5

Recommend item 2.
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE

Item-item based: y = x + b, a regression with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j. Rating r̄_j: the average rating computed from item j.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
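A minimal Python sketch of weighted slope one on exactly this sample database, predicting Lucy's missing rating for item A. The average deviation of A vs. B comes from John and Mark, of A vs. C from John alone, each weighted by the number of co-raters:

ratings = {"John": {"A": 5, "B": 3, "C": 2},
           "Mark": {"A": 3, "B": 4},
           "Lucy": {"B": 2, "C": 5}}

def predict(user, item):
    num = den = 0.0
    for j, r_uj in ratings[user].items():              # items the user did rate
        devs = [r[item] - r[j] for r in ratings.values()
                if item in r and j in r]               # users who rated both item and j
        if devs:                                       # w = number of co-raters
            num += (sum(devs) / len(devs) + r_uj) * len(devs)
            den += len(devs)
    return num / den

print(predict("Lucy", "A"))    # ((0.5 + 2)*2 + (3 + 5)*1) / 3 = 4.33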
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood":

r̂_ui = Σ_{j ∈ N} w_ij · r_uj / Σ_{j ∈ N} w_ij    (similarity weights w, neighbors N)

How to determine the neighbors, and how many (k) to use? How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments
RE METHODS
PEARSON CORRELATION

For users a and b: r_{a,p} is the rating of user a for item p, and P is the set of items rated by both a and b. Possible similarity values lie between −1 and 1:

sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )
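The same formula as a minimal Python/numpy sketch, with two hypothetical users' ratings:

import numpy as np

r_a = {"item1": 3, "item2": 2, "item4": 4, "item5": 5}   # made-up ratings of user a
r_b = {"item1": 1, "item2": 2, "item3": 5, "item4": 2}   # made-up ratings of user b

P = sorted(set(r_a) & set(r_b))                          # items rated by both
a = np.array([r_a[p] for p in P], dtype=float)
b = np.array([r_b[p] for p in P], dtype=float)

sim = ((a - a.mean()) * (b - b.mean())).sum() / (
    np.sqrt(((a - a.mean()) ** 2).sum()) * np.sqrt(((b - b.mean()) ** 2).sum()))
print(sim)                                               # between -1 and 1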
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?

Factorize the m × n user-item rating matrix R into an m × k matrix U and a k × n matrix V: R ≈ U V.

Predict a new rating: R̂_ij = U_i^T V_j.

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Minimize the prediction error:

min_{U,V} Σ_{i,j} (R_ij − U_i^T V_j)² + λ(‖U_i‖² + ‖V_j‖²)
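A minimal sketch of this optimization with plain gradient descent on a toy rating matrix (real implementations use L-BFGS or ALS as noted above; the step size, λ and k are arbitrary here):

import numpy as np

R = np.array([[3, 2, 5, 4, 5],
              [0, 0, 0, 1, 1],
              [1, 0, 2, 5, 0]], dtype=float)    # toy ratings, 0 = missing
mask = R > 0
m, n, k, lam, lr = *R.shape, 2, 0.1, 0.01

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(m, k))
V = rng.normal(scale=0.1, size=(k, n))

for step in range(2000):
    err = mask * (R - U @ V)                    # error on observed cells only
    U += lr * (err @ V.T - lam * U)             # gradient step on the regularized loss
    V += lr * (U.T @ err - lam * V)

print((U @ V).round(2))                         # filled-in rating matrix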
RE METHODS CLUSTER
First cluster the users/items on their profiles and ratings; then apply knn within one subgroup (cluster) to compute the predictions.
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = (# transactions with X and Y) / (total # transactions)
Lift = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift examples: Diapers → Beer 0.8; Diapers → Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
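A minimal sketch computing support and lift over a handful of made-up transactions:

transactions = [{"diapers", "beer", "bread"},
                {"diapers", "beer"},
                {"milk", "bread"},
                {"diapers", "candles"},
                {"beer", "bread"}]

def support(*items):
    s = set(items)
    return sum(s <= t for t in transactions) / len(transactions)

sup_xy = support("diapers", "beer")
lift = sup_xy / (support("diapers") * support("beer"))
print(f"support = {sup_xy:.2f}, lift = {lift:.2f}")   # support = 0.40, lift = 1.11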
METHOD ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lambda = 0.2
      technique = lbfgs;
   RUN;
   METHOD arm /
      label = "ARM";
   RUN;
   /* Information on the recommender system */
   INFO;
QUIT;
/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
run;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING (COMPARED TO TRADITIONAL LINEAR / LOGISTIC REGRESSION)

CONS
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and to transform them to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split. Techniques tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
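For illustration, a minimal scikit-learn sketch of the same recipe — a 70/30 split and a k-nearest-neighbour classifier — on the small built-in 8×8 digits set, a stand-in for the 42,000 MNIST images (the slide's numbers come from SAS Enterprise Miner, not from this sketch):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)              # 8x8 digits, 64-dimensional vectors
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)

knn = KNeighborsClassifier(n_neighbors=8).fit(X_tr, y_tr)
print("misclassification rate:", 1 - knn.score(X_va, y_va))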
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we obviously see some mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio fragments: the spoken digits "1" and "2".]
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 15 largest singular values: ≈ 1% of the data.
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 75 largest singular values: ≈ 5% of the data.
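The idea in a minimal numpy sketch: keep only the k largest singular values of an image matrix (a random stand-in here, since the original photo is not available); the stored fraction shrinks as the image grows:

import numpy as np

img = np.random.default_rng(0).random((512, 512))   # stand-in for the photo
U, s, Vt = np.linalg.svd(img, full_matrices=False)

k = 15
approx = U[:, :k] * s[:k] @ Vt[:k, :]               # rank-k reconstruction of the image
stored = k * (U.shape[0] + Vt.shape[1] + 1)         # numbers kept: k left/right vectors + k values
print(stored / img.size)                            # ~ 0.06 for this 512x512 example
print(np.linalg.norm(img - approx) / np.linalg.norm(img))   # relative reconstruction error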
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling.

X1, X2, X3, …, X500
Cluster {X1, X21, X35, X430, …} → use X35
Cluster {X17, X29, X353, X490, …} → use X29
Cluster {X37, X95, X251, X393, …} → use X251
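A minimal sketch of the idea in Python (in SAS this is the territory of PROC VARCLUS): hierarchically cluster the inputs on 1 − |correlation| and keep one representative per cluster. The data and the choice of ten clusters are arbitrary here:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 500))               # toy data: 1000 obs x 500 inputs

D = 1 - np.abs(np.corrcoef(X, rowvar=False))   # distance between variables
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=10, criterion="maxclust")

keep = [np.where(labels == c)[0][0] for c in np.unique(labels)]
print("representative inputs:", keep)          # one input per cluster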
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging).

This only makes sense if the underlying models are different enough and have some predictive power.

[Diagram: random samples of the data → separate models → vote → final model.]
Bagging & Boosting: Random Forests

Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample.
2. Randomly choose m inputs, m << P.
3. Fit a tree on the bootstrap sample with the m inputs (do not prune).

In case of a classification tree: the random forest prediction is the majority vote of all trees. In case of a regression tree: the random forest prediction is the average of all trees. A sketch of these steps follows below.
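A minimal sketch with scikit-learn decision trees on made-up data; note that a real random forest re-draws the m inputs at every split, while this sketch draws them once per tree:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)
trees, feats = [], []

for b in range(100):
    i = rng.integers(0, len(X), len(X))                # 1. bootstrap sample
    f = rng.choice(X.shape[1], size=4, replace=False)  # 2. m = 4 << P = 20 inputs
    t = DecisionTreeClassifier().fit(X[i][:, f], y[i]) # 3. unpruned tree on the sample
    trees.append(t)
    feats.append(f)

votes = np.array([t.predict(X[:, f]) for t, f in zip(trees, feats)])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)   # majority vote of all trees
print("train accuracy:", (forest_pred == y).mean())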
FOREST VS TREE EXAMPLE ON SIMULATED DATA
A decision tree and a random forest (100 sub-trees) fitted on the simulated data.
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear that the forest produces much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient boosting: M iterations, m = 1, 2, …, M.

At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals r_m, using the inputs x, to "correct" the previous learner:

F_m = F_{m−1} + γ·h_m

[Diagram: inputs x and pseudo-residuals r_1, r_2, …, r_M flow through the M boosting steps into the final model F_M.]
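A minimal sketch of this scheme for regression with squared loss, where the pseudo-residuals are simply actual − current prediction (toy data; M = 100 and γ = 0.1 chosen arbitrarily):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

F = np.full(300, y.mean())                   # F_0: start from the mean
gamma, learners = 0.1, []

for m in range(100):                         # M iterations
    r = y - F                                # pseudo-residuals r_m
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)   # base learner h_m on residuals
    F = F + gamma * h.predict(X)             # F_m = F_(m-1) + gamma * h_m
    learners.append(h)

print("training MSE:", np.mean((y - F) ** 2))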
SUPPORT VECTOR MACHINES
Support vector machines (SVM): suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M; so the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculation of the decision boundary we do not need to use these transformations explicitly: "the kernel trick".
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model nonlinear behaviour
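In standard form (as in Hastie, Tibshirani & Friedman's The Elements of Statistical Learning), these four problems read:

Separable:
$\min_{\beta,\beta_0} \tfrac{1}{2}\lVert\beta\rVert^2 \quad \text{s.t. } y_i(x_i^T\beta+\beta_0)\ge 1 \text{ for all } i$

Non-separable (slack variables $\xi_i$, budget $C$):
$\min_{\beta,\beta_0,\xi} \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i \quad \text{s.t. } y_i(x_i^T\beta+\beta_0)\ge 1-\xi_i,\ \xi_i\ge 0$

Lagrange dual:
$\max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,i'} \alpha_i\alpha_{i'}\,y_i y_{i'}\,x_i^T x_{i'} \quad \text{s.t. } 0\le\alpha_i\le C,\ \sum_i \alpha_i y_i = 0$

Kernels: replace the inner product $x_i^T x_{i'}$ by $K(x_i, x_{i'})$, e.g. the radial basis kernel $K(x, x') = \exp(-\gamma\lVert x-x'\rVert^2)$.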
https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.
K – NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

[Example: among the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.]
K-NN METHOD
1 nearest neighbour vs. 15 nearest neighbours
K-NN METHOD
Using different numbers k of nearest neighbours: test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest house-for-sale prices, as in the sketch below.
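A minimal sketch of exactly this estimator with scikit-learn (made-up coordinates and prices standing in for the scraped housing data):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
coords = rng.uniform(size=(1000, 2))            # postal codes with a known price
prices = 200_000 + 150_000 * coords[:, 0] + rng.normal(scale=10_000, size=1000)

knn = KNeighborsRegressor(n_neighbors=5).fit(coords, prices)
unknown = np.array([[0.5, 0.5]])                # a postal code without a price
print(knn.predict(unknown))                     # average of the 5 closest prices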
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE DUTCH HOUSE PRICES
30% of the data was used as a validation set. In Enterprise Miner different values of k were tried; k=5 nearest neighbours has the lowest average squared error.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD 75 largest Vrsquos5 of the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling
X1 X2 X3 hellip X500
X1 X21 X35 X430hellip X35
X17 X29 X353 X490hellip X29
X37 X95 X251 X393hellip X251
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM): suppose we have a separable classification problem.
Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The problem might not be linear in the input space. We could apply non-linear mappings to the inputs, e.g. x2, x3, or spline(x).
The beauty of SVM is that in the calculation of the decision boundary we do not need to use these transformations explicitly: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model nonlinear behaviour
The corresponding standard formulations are sketched below.
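For reference, these four labels correspond to the standard formulations (a sketch, following the notation of Hastie, Tibshirani & Friedman):

\begin{aligned}
\text{Separable:}\quad & \min_{\beta,\beta_0}\ \tfrac{1}{2}\|\beta\|^2 \quad \text{s.t. } y_i(x_i^{T}\beta+\beta_0) \ge 1 \\
\text{Non-separable:}\quad & \min_{\beta,\beta_0}\ \tfrac{1}{2}\|\beta\|^2 + C\sum_i \xi_i \quad \text{s.t. } y_i(x_i^{T}\beta+\beta_0) \ge 1-\xi_i,\ \ \xi_i \ge 0 \\
\text{Lagrange dual:}\quad & \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,i'} \alpha_i\alpha_{i'} y_i y_{i'} \langle x_i, x_{i'}\rangle \quad \text{s.t. } 0\le\alpha_i\le C,\ \ \sum_i \alpha_i y_i = 0 \\
\text{Kernel trick:}\quad & \text{replace } \langle x_i, x_{i'}\rangle \text{ by } K(x_i,x_{i'}),\ \text{e.g. } K(x,x') = \exp(-\gamma\|x-x'\|^2)
\end{aligned}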
httpswwwyoutubecomwatchv=3liCbRZPrZA
Not linearly separable in the plane, but in 3D space they are.
K-NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
Example: take the 5 nearest neighbours of x0. 3 of them are red, 2 of them are green, so we predict x0 to be red. A SAS sketch of this voting follows below.
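A minimal sketch in SAS (the datasets work.train, with class variable colour and coordinates x and y, and work.query holding the query points, are hypothetical; PROC DISCRIM's nonparametric method does nearest-neighbour voting):

proc discrim data=work.train test=work.query testout=work.pred
             method=npar k=5;
   /* classify each query point by majority vote of its 5 nearest neighbours */
   class colour;
   var x y;
run;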
(Decision boundaries with 1 nearest neighbour vs 15 nearest neighbours.)
Using different numbers k of nearest neighbours gives different test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?
For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.
Comparing different nearest neighbours in SAS Enterprise Miner
30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k=5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION
Y = f(X, w) = f(w1·1 + w2·X2 + w3·X3 + w4·X4)
(Diagram: the inputs 1, X2, X3, X4 enter a compute node with weights w1, w2, w3, w4.)
In a neural network compute node, f is the so-called activation function. This could be the logit function, but other choices are possible.
There are four weights w's that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION
In formula form, the prediction formula for a NN is given below.
(Network diagram: four inputs X1, X2, X3, X4 (age, income, region, gender) feed a hidden layer Z1, Z2, Z3, which feeds the output Y; α denotes the input-to-hidden weights, β the hidden-to-output weights.)
The functions g and σ are defined as sketched below, with a special case for a binary classifier. The model weights α and β have to be estimated from the data.
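A sketch of the standard single-hidden-layer formulation these definitions refer to (notation follows Hastie, Tibshirani & Friedman; assuming σ is the sigmoid):

\begin{aligned}
Z_m &= \sigma\big(\alpha_{0m} + \alpha_m^{T}X\big), \quad m = 1,\dots,M, \qquad \sigma(v)=\frac{1}{1+e^{-v}} \\
Y &= g\big(\beta_0 + \beta^{T}Z\big), \qquad g(T)=T \ \text{for regression}, \quad g(T)=\frac{1}{1+e^{-T}} \ \text{in case of a binary classifier}
\end{aligned}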
NEURAL NETWORKS: ESTIMATING THE WEIGHTS
Back-propagation algorithm:
Randomly choose small values for all wi's. For each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w according to the gradient step shown below.
4. Stop if the error E is small enough.
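The update in step 3 is the usual gradient-descent step (a sketch; η denotes the learning rate, which is not named on the slide):

w_{\text{new}} = w_{\text{old}} - \eta\,\frac{\partial E}{\partial w}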
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
(Diagram: the inputs X1, X2, X3, X4 are ENCODEd to a small middle layer and DECODEd back to X1, X2, X3, X4.)
A linear activation function with a 2-dimensional middle layer corresponds to 2-dimensional principal components analysis; the 2-dimensional middle layer can be used for visualisation.
Often there are more hidden layers with many nodes.
(Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT.)
NEURAL NET: CARS EXAMPLE
2-dimensional PCA vs an autoencoder network 25 – 15 – 2 – 15 – 25.
NEURAL NETS: AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR               */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6    */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED          */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS – ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting.
Parse & filter: part of speech, entity detection, mixed/numeric/abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.
Apply traditional data mining: clustering, prediction, machine learning.
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" [I walk down the street in Amsterdam 1057DK with my bicycle]
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" [She did not walk but cycled with her blue bicycle]
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $%^&" [My two-wheeler is broken, what a bad piece of iron]
Terms                       Doc 1  Doc 2  Doc 3
+Fiets (znmw)                 1      1      1
Fietsen (ww)                  0      1      0
Blauwe (bvg)                  0      1      0
Amsterdam (locatie)           1      0      0
+Lopen (ww)                   1      1      0
Straat (znmw)                 1      0      0
Kapot (bijw)                  0      0      1
Slecht                        0      0      1
Stuk Ijzer                    0      0      1
1057DK (postcode)             1      0      0
bitlycomsdrtw (Internet)      0      1      0
TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.
TEXT MINING: TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix:
• often more terms than documents
• rows could be strongly correlated
• the matrix is often very sparse
Apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
Matrix SVD decomposition: A = U Σ V^T, where Σ is diagonal with r singular values [could be many thousands].
Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.
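A minimal SAS/IML sketch of this truncated SVD (the 5×4 count matrix is hypothetical; in practice SAS Text Miner performs this step for you):

proc iml;
   /* hypothetical term-document counts: 5 documents x 4 terms */
   A = {1 1 1 0,
        0 1 0 2,
        1 0 0 1,
        0 0 1 1,
        2 1 0 0};
   call svd(U, q, V, A);                       /* A = U*diag(q)*V` */
   k = 2;                                      /* keep only the first k << r singular values */
   Ak   = U[, 1:k] * diag(q[1:k]) * V[, 1:k]`; /* rank-k approximation of A */
   docs = U[, 1:k] * diag(q[1:k]);             /* k-dimensional document coordinates */
   print q, docs;
quit;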
TEXT MINING APPLICATIONS
Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (Topic 1, Topic 2, Topic 3, …).
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items → only ~0.01% of the cells are filled.
User–Item Matrix – Data
           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:             -      -      1      2      5
After some math… predictions are: 3.21   4.82    1      2      5
→ Recommend item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD – LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE
y = x + b, with slope equal to 1.
Item–item based.
Weight wij: the number of users having rated both items i and j. Rating ruj: the average rating computed from item j.
Sample rating database:
Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
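A worked step on this table (assuming the '-' cells above are the unknowns to predict): the average difference between Item A and Item B over the users who rated both is ((5−3) + (3−4))/2 = 0.5, so slope one predicts Lucy's rating for Item A as

\hat r_{\text{Lucy},A} = r_{\text{Lucy},B} + 0.5 = 2 + 0.5 = 2.5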
RE METHODS: K NEAREST NEIGHBORS
The rating rui is determined by the ratings "in the neighborhood".
How to determine the neighbors, and how many (k) to use?
How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• cosine distance
• other adjustments
(Diagram: similarity w, neighbors N.)
RE METHODS: PEARSON CORRELATION
a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1.

sim(a,b) = \frac{\sum_{p \in P}(r_{a,p}-\bar r_a)(r_{b,p}-\bar r_b)}{\sqrt{\sum_{p \in P}(r_{a,p}-\bar r_a)^2}\ \sqrt{\sum_{p \in P}(r_{b,p}-\bar r_b)^2}}
RE METHODS: MATRIX FACTORIZATION
How do we fill in the missing data? Factor the m × n rating matrix R (users × items) into U (m × k) and V (k × n), with k hidden factors: R ≈ U V.
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS
Predict a new rating: \hat R_{ij} = U_i^{T} V_j
Minimize the prediction error:

\min_{U,V} \sum_{i,j} \left(R_{ij} - U_i^{T}V_j\right)^2 + \lambda\left(\|U_i\|^2 + \|V_j\|^2\right)
RE METHODS: CLUSTER
(Schematic: user/item profiles and ratings → clustering → knn within one subgroup → predictions.)
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C; IF item X THEN item Y.
Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.
Support(X,Y) = (# trxs with X and Y) / (total # trxs)
Lift = Support(X,Y) / (Support(X) · Support(Y))
Support & lift examples: Diapers → Beer 0.8; Diapers → Candles 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
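A hypothetical arithmetic check: if 10 of 100 transactions contain diapers, 5 contain beer, and 2 contain both, then Support(X,Y) = 2/100 = 0.02 and

\text{Lift} = \frac{0.02}{0.10 \times 0.05} = 4

i.e. diaper buyers are 4 times more likely to also buy beer than independence would suggest.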
RE METHOD: ENSEMBLE
Linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3
      maxiter = 100 maxfeval = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;
   METHOD arm / label = "ARM";
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT / method = svd label = "svd" num = 3 users = ("Longhow Lam");
RUN;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it into the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easy deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter the reviews, and to transform them to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
(Scatter plot: predicted review score vs given review score.)
R² linear regression = 0.5; R² neural net = 0.6.
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAGGING amp BOOSTING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
LAST SLIDE: PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach ("You are rejected. The computer says NO.")
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data preparation (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High-performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES

• Text mining
• Image recognition
• Sound recognition
• Strange faces

So, can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

[Figure: predicted review score vs. given review score]
R² linear regression = 0.5
R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
[Figure: the first 100 digits of the MNIST data and their KNOWN labels in red]
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.
[Figure: the first 100 predicted digits together with the handwritten digits; red numbers are the predicted labels. We see some obvious mistakes…]
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio samples: spoken digits 1 and 2]
BAGGING & BOOSTING
COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough: let multiple models vote for a prediction, Bootstrap Aggregation (Bagging). This only makes sense if the underlying models are different enough and have some predictive power.
[Diagram: random samples drawn from the data, a model fitted on each sample, combined into a final model]
BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees. Apply the following steps repeatedly:
1. Generate a bootstrap sample
2. Choose m inputs at random, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
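As a sketch, a random forest along these lines can be fitted with the HPFOREST procedure (the data set and variable names here are hypothetical):

proc hpforest data = work.credit
              maxtrees = 500          /* number of trees in the forest */
              vars_to_try = 5;        /* m inputs sampled at each split */
   target default / level = binary;   /* classification: majority vote */
   input LTV income age / level = interval;
run;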
FOREST VS TREE: EXAMPLE ON SIMULATED DATA

[Figures: a decision tree and a random forest (100 sub-trees) fitted on the simulated data]
It is clear that the forest can produce much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting runs M iterations, m = 1, 2, …, M. At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals, using the inputs x, to "correct" the previous learner. The pseudo-residuals r_m are recomputed at each step.
[Diagram: inputs x and residuals r_1, r_2, …, r_M feeding successive base learners, combined into the final model F_M]

F_m = F_{m-1} + γ · h_m
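To make the loop concrete, a small PROC IML sketch of gradient boosting with regression stumps under squared-error loss (the data, M and the learning rate are made up):

proc iml;
x = T(1:20);
y = sin(x/3) + 0.1*x;               /* some nonlinear target  */
M = 50;  gamma = 0.3;               /* iterations, shrinkage  */
F = j(nrow(x), 1, y[:]);            /* F0 = mean of y         */
do m = 1 to M;
   r = y - F;                       /* pseudo-residuals       */
   bestSSE = 1e300;                 /* fit the best stump on r */
   do c = 1 to 19;
      L = loc(x <= c);  Rt = loc(x > c);
      mL = r[L][:];  mR = r[Rt][:];
      sse = ssq(r[L] - mL) + ssq(r[Rt] - mR);
      if sse < bestSSE then do;
         bestSSE = sse;  bestC = c;  bestL = mL;  bestR = mR;
      end;
   end;
   h = bestL * (x <= bestC) + bestR * (x > bestC);
   F = F + gamma * h;               /* Fm = Fm-1 + gamma * hm */
end;
print (ssq(y - F) / nrow(x))[label = "training MSE"];
quit;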
SUPPORT VECTOR MACHINES
Support vector machines (SVM). Suppose we have a separable classification problem:
• Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
• If not separable, you have to allow that some points are on the wrong side. These points are penalized; the SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
• The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).
• The beauty of SVM is that in the calculation of the decision boundary we do not need to use these transformations explicitly: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
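The slide's formulas did not survive the transcript; for reference, the standard textbook formulations (not reproduced from the slide images) are:

$$ \min_{\beta,\beta_0}\ \tfrac12\|\beta\|^2 \quad \text{s.t.}\quad y_i(\beta^\top x_i+\beta_0)\ge 1 \ \forall i \qquad \text{(separable)} $$

$$ \min_{\beta,\beta_0,\xi}\ \tfrac12\|\beta\|^2 + C\sum_i\xi_i \quad \text{s.t.}\quad y_i(\beta^\top x_i+\beta_0)\ge 1-\xi_i,\ \xi_i\ge 0 \qquad \text{(non-separable)} $$

$$ \max_{\alpha}\ \sum_i\alpha_i-\tfrac12\sum_{i,j}\alpha_i\alpha_j y_i y_j\,\langle x_i,x_j\rangle \quad \text{s.t.}\quad 0\le\alpha_i\le C,\ \sum_i\alpha_i y_i=0 \qquad \text{(Lagrange dual)} $$

In the dual the inputs appear only through inner products ⟨x_i, x_j⟩, so a kernel K(x_i, x_j) can replace them to model non-linear behaviour.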
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K – NEAREST NEIGHBOUR
K-NN METHOD

• No model is fitted. Given a query point x₀, find the k points x₁, x₂, …, x_k that are closest in distance to x₀.
• Classify x₀ using the majority vote among the k neighbours.

Example: of the 5 nearest neighbours of x₀, 3 are red and 2 are green, so we predict x₀ to be red.
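As a sketch, a k-NN classifier like this can be run in SAS with nonparametric discriminant analysis (the data set and variable names are hypothetical):

proc discrim data = work.train testdata = work.queries
             method = npar k = 5;     /* 5 nearest neighbours */
   class colour;                      /* red / green target   */
   var x1 x2;                         /* coordinates          */
run;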
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD

Use different numbers k of nearest neighbours and compare the test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price? For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.
[Figure: comparing different numbers of nearest neighbours in SAS Enterprise Miner]

K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w₁ + w₂X₂ + w₃X₃ + w₄X₄

[Diagram: inputs 1, X₂, X₃, X₄ with weights w₁, w₂, w₃, w₄ feeding a single neural network compute node]
f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION

[Diagram: inputs X₁…X₄ (age, income, region, gender), hidden layer Z₁, Z₂, Z₃ and output Y, with weights α on the input-to-hidden links and β on the hidden-to-output links]

In formula form, the prediction formula for a NN is given by the composition of the layers. The functions g and σ are defined as shown below; in case of a binary classifier, g is the logistic function. The model weights α and β have to be estimated from the data.
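The formulas on the slide did not survive the transcript; the standard single-hidden-layer form (cf. Hastie, Tibshirani & Friedman) is:

$$ Z_m = \sigma(\alpha_{0m} + \alpha_m^\top X), \quad m = 1,\dots,M, \qquad Y = g(\beta_0 + \beta^\top Z), \qquad \sigma(v) = \frac{1}{1+e^{-v}} $$

with g the identity for regression; in case of a binary classifier, g(t) = 1/(1+e^{−t}).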
NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm: randomly choose small values for all weights wᵢ, then for each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w according to the gradient step below
4. Stop if the error E is small enough
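The update rule in step 3 is the usual gradient-descent step (reconstructed; the learning rate η is an assumption):

$$ w_{\text{new}} = w_{\text{old}} - \eta\,\frac{\partial E}{\partial w} $$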
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS
https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.
[Diagram: inputs X₁…X₄ ENCODEd to a two-dimensional middle layer and DECODEd back to X₁…X₄]
A linear activation function corresponds with two-dimensional principal components analysis; a two-dimensional middle layer is useful for visualisation.
NEURAL NETS: AUTOENCODERS

Often more hidden layers with many nodes are used.
[Diagram: INPUT → ENCODE → bottleneck → DECODE → OUTPUT = INPUT]
NEURAL NET: CARS EXAMPLE

[Figure: two-dimensional PCA vs. an autoencoder network 25 – 15 – 2 – 15 – 25]
NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x₁, …, x₄₀₀)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;
   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR           */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6  */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED        */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;
   /* BEFORE PRELIMINARY TRAINING THE WEIGHTS WILL BE RANDOM */
   initial random = 123;
   prelim 10 preiter = 10;
run;
[Figure: two-dimensional representation of the 400-dimensional 'digit' data]

BAYESIAN NETWORKS
BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data

TEXT MINING
TEXT MINING BASICS

"Advanced" word counting:
• Parse & filter: part of speech, entity detection, mixed / numeric / abbrev.
• Stemming, spell checks, stop list, synonym list, multi-term words
• Then apply traditional data mining: clustering, prediction, machine learning
TEXT MINING BASICS

Three Dutch example documents (about bicycles):
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

TERM-DOCUMENT MATRIX A:
Terms                        Doc 1   Doc 2   Doc 3
+Fiets (znmw)                  1       1       1
Fietsen (ww)                   0       1       0
Blauwe (bvg)                   0       1       0
Amsterdam (locatie)            1       0       0
+Lopen (ww)                    1       1       0
Straat (znmw)                  1       0       0
Kapot (bijw)                   0       0       1
Slecht                         0       0       1
Stuk Ijzer                     0       0       1
1057DK (postcode)              1       0       0
bitlycomsdrtw (Internet)       0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING: TERM-DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term-document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse
Therefore, apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be many thousands]. Take only the first k << r singular values: A ≈ A_k = U_k Σ_k V_kᵀ. A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.
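A tiny PROC IML sketch of this truncation on the toy term-document matrix above (k = 2 is an arbitrary choice):

proc iml;
/* the 11 x 3 term-document matrix from the example above */
A = {1 1 1, 0 1 0, 0 1 0, 1 0 0, 1 1 0, 1 0 0,
     0 0 1, 0 0 1, 0 0 1, 1 0 0, 0 1 0};
call svd(U, q, V, A);                 /* A = U * diag(q) * V` */
k = 2;                                /* keep k << r values   */
Ak = U[, 1:k] * diag(q[1:k]) * V[, 1:k]`;
/* each document is now a k-dimensional vector: */
docCoords = diag(q[1:k]) * V[, 1:k]`;
print q, docCoords;
quit;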
TEXT MINING: APPLICATIONS

• Combine structured customer data and unstructured data to better predict behaviour (churn, fraud); apply machine learning to create a model f to predict the target.
• Automatically generate topics within large document collections; apply clustering techniques to classify documents into clusters (Topic 1, Topic 2, Topic 3, …).
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users × 100K items is ~0.01% filled.

User-Item Matrix:
           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings: (-, -, 1, 2, 5). After some math… the scores for User 4 are (3.21, 4.82, 1, 2, 5): recommend item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE

Item-item based: fit y = x + b, a line with slope equal to 1, between pairs of items.
Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         2        -        5
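To make this concrete, the usual worked computation on this table, predicting Lucy's rating for item B:
• Average difference B − A over users rating both: ((3 − 5) + (4 − 3)) / 2 = −0.5, so via item A: 2 − 0.5 = 1.5, with weight w = 2 users.
• Average difference B − C: 3 − 2 = 1, so via item C: 5 + 1 = 6, with weight w = 1 user.
• Weighted prediction: (2 × 1.5 + 1 × 6) / (2 + 1) = 3.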
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
COMBINE MODELS BAGGING amp BOOSTING
If one model is not good enough let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This makes only sense if underlying models are different enough and have some predictive power
Random sample
Final modeldata
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Bagging amp Boosting Random Forests
Random forests asymp Bagging with trees
Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree The random forest prediction is the majority vote of all trees
In case of a regression tree The random forest prediction is the average of all trees
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100 sub trees) fitted on the simulated data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: slope one (slope1), k-nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixtures of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE

• y = x + b, a regression line with slope equal to 1
• Item-item based
• Weight w_ij: the number of users having rated both items i and j
• Rating r_uj: the rating of user u for item j; the prediction for item i is the weighted average of r_uj plus the average deviation dev_ij between the ratings of items i and j:

p_ui = Σ_{j≠i} w_ij · (r_uj + dev_ij) / Σ_{j≠i} w_ij

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
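A minimal Slope One sketch (toy data matching the table above; the assignment of Lucy's two ratings to items B and C is a reconstruction): predict Mark's missing rating for item C.

    ratings = {                     # user -> {item: rating}
        "John": {"A": 5, "B": 3, "C": 2},
        "Mark": {"A": 3, "B": 4},
        "Lucy": {"B": 2, "C": 5},
    }

    def slope_one(user, target):
        num, den = 0.0, 0
        for j, r_uj in ratings[user].items():
            # average deviation dev(target, j) over users who rated both items
            diffs = [r[target] - r[j] for r in ratings.values()
                     if target in r and j in r]
            if diffs:
                dev = sum(diffs) / len(diffs)
                num += (r_uj + dev) * len(diffs)   # weight by number of co-raters
                den += len(diffs)
        return num / den if den else None

    print(slope_one("Mark", "C"))   # weighted Slope One prediction: 3.33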
RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood":
• How to determine the neighbors, and how many (k) to use?
• How to compute the similarity/distance measure?
  - Pearson's correlation coefficient
  - Cosine distance
  - Other adjustments

(figure: similarity weights w over the set of neighbors N)
RE METHODS: PEARSON CORRELATION

• a, b: users
• r_{a,p}: rating of user a for item p
• P: set of items rated by both a and b
• Possible similarity values between −1 and 1

sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )
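A small sketch (toy ratings, assumed) of this similarity, computed over the items two users have both rated:

    import numpy as np

    def pearson_sim(ra: dict, rb: dict) -> float:
        common = sorted(set(ra) & set(rb))        # items rated by both users
        if len(common) < 2:
            return 0.0
        a = np.array([ra[p] for p in common], float)
        b = np.array([rb[p] for p in common], float)
        a -= a.mean()                             # center on each user's mean rating
        b -= b.mean()
        denom = np.sqrt((a**2).sum()) * np.sqrt((b**2).sum())
        return float(a @ b / denom) if denom else 0.0

    print(pearson_sim({"A": 5, "B": 3, "C": 2}, {"A": 3, "B": 4, "C": 1}))  # ~0.5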
RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factorize the m × n user-item matrix R (users in the rows, items in the columns) as R ≈ U·V, with U an m × k matrix and V a k × n matrix.

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: R̂_ij = U_iᵀ V_j

Minimize the prediction error:
min_{U,V} Σ_{i,j} (R_ij − U_iᵀ V_j)² + λ(‖U_i‖² + ‖V_j‖²)
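A minimal sketch of this factorization (toy ratings assumed; plain gradient descent on the regularized squared error instead of the L-BFGS or ALS solvers named above):

    import numpy as np

    rng = np.random.default_rng(0)
    R = np.array([[3, 2, 5, 4, 5],
                  [0, 0, 1, 2, 5],
                  [2, 1, 4, 2, 3]], dtype=float)   # 0 marks an unobserved rating
    observed = R > 0

    m, n = R.shape
    k, lam, lr = 2, 0.1, 0.01                      # hidden factors, lambda, step size
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((k, n))

    for _ in range(5000):
        E = observed * (R - U @ V)                 # errors on observed cells only
        U += lr * (E @ V.T - lam * U)              # gradient step for U
        V += lr * (U.T @ E - lam * V)              # gradient step for V

    print(np.round(U @ V, 2))                      # R-hat: predictions incl. missing cells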
RE METHODS: CLUSTER

First apply clustering on the user/item profiles and user/item ratings; then run k-NN within one subgroup (cluster) to produce the predictions.
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rule mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X → Y) = (# transactions with X and Y) / (total # transactions)
Lift(X → Y) = Support(X and Y) / ( Support(X) · Support(Y) )

Support & lift example: Diapers → Beer: 0.8; Diapers → Candles: 0.018.

For example, a lift of 2.5 means: people who have X are 2.5 times more likely to buy Y than people who don't have X.
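A small sketch (assumed toy baskets) computing these two measures for a rule X → Y:

    transactions = [
        {"diapers", "beer"}, {"diapers", "beer", "milk"},
        {"diapers", "candles"}, {"milk", "bread"}, {"diapers", "beer", "bread"},
    ]

    def support(itemset):
        # fraction of transactions containing all items in itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    def lift(x, y):
        return support(x | y) / (support(x) * support(y))

    print(support({"diapers", "beer"}))   # 0.6
    print(lift({"diapers"}, {"beer"}))    # 1.25 > 1: diaper buyers are more likely to buy beer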
METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rsIENS;
   /* Add a recommendation system */
   ADD rsIENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rsIENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;

   METHOD arm label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rsIENS;
   PREDICT
      method = svd
      label = svd
      Num = 3
      users = (Longhow Lam);
run;
QUIT;
LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING (COMPARED TO TRADITIONAL LINEAR / LOGISTIC REGRESSION)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it into the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance and scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data with their KNOWN labels in red.
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

• 70/30 training/validation split
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the labels for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are the predicted labels; we see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
FOREST VS TREE: EXAMPLE ON SIMULATED DATA

A decision tree and a random forest (100 sub-trees) fitted on the simulated data. It is clear that the forest can produce much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient boosting runs M iterations, m = 1, 2, …, M. At each successive iteration a base learner h_m (a decision tree) is fit on the pseudo-residuals r_m, using the inputs x, to "correct" the previous learner:

F_m = F_{m-1} + γ·h_m

The final model is F_M.
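A short sketch of this loop for squared-error loss (toy data assumed; with squared error the pseudo-residuals are simply y minus the current prediction):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(1)
    x = np.linspace(0, 6, 200).reshape(-1, 1)
    y = np.sin(x).ravel() + 0.1 * rng.standard_normal(200)

    gamma, M = 0.1, 100                  # step size gamma and number of iterations
    F = np.full_like(y, y.mean())        # F_0: constant starting model
    for m in range(M):
        r = y - F                                          # pseudo-residuals r_m
        h = DecisionTreeRegressor(max_depth=2).fit(x, r)   # base learner h_m fit on r_m
        F = F + gamma * h.predict(x)                       # F_m = F_{m-1} + gamma * h_m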
SUPPORT VECTOR MACHINES
Support vector machines (SVM): suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If the data are not separable, you have to allow that some points are on the wrong side. These points are penalized; the SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The classes might not be linearly separable in the input space. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of the SVM is that in the calculation of the decision boundary we do not need to use these transformations explicitly: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
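The formulas themselves did not survive the transcript; as a sketch, the textbook formulations these bullets refer to are:

Separable: min_{β,β₀} ½‖β‖²  subject to  y_i(x_iᵀβ + β₀) ≥ 1 for all i

Non-separable: min_{β,β₀} ½‖β‖² + C·Σ_i ξ_i  subject to  y_i(x_iᵀβ + β₀) ≥ 1 − ξ_i, ξ_i ≥ 0

Lagrange dual: max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j  subject to  0 ≤ α_i ≤ C, Σ_i α_i y_i = 0

Kernel trick: replace the inner product x_iᵀx_j by a kernel K(x_i, x_j), so the non-linear mapping never has to be computed explicitly.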
https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable in the plane, but in 3D space they are.
K-NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x₀, find the k points x₁, x₂, …, x_k that are closest in distance to x₀.
• Classify x₀ using the majority vote among the k neighbours.

Example: take the 5 nearest neighbours of x₀. 3 of them are red and 2 of them are green, so we predict x₀ to be red.
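A minimal sketch of that vote (toy 2-D points, assumed):

    import numpy as np
    from collections import Counter

    X = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 6], [6, 6]], dtype=float)
    labels = ["red", "red", "red", "green", "green", "green"]

    def knn_predict(x0, k=5):
        dist = np.linalg.norm(X - x0, axis=1)        # distances to all points
        nearest = np.argsort(dist)[:k]               # indices of the k closest points
        return Counter(labels[i] for i in nearest).most_common(1)[0][0]

    print(knn_predict(np.array([2.0, 2.0]), k=5))    # 3 red vs 2 green -> "red"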
K-NN METHOD
(figure: decision boundaries for a 1-nearest-neighbour vs a 15-nearest-neighbour classifier)
K-NN METHOD
Using different numbers k of nearest neighbours: test and training errors.

Despite its simplicity, k-nearest neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest for-sale house prices.
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE DUTCH HOUSE PRICES
30% of the data was used as the validation set. In Enterprise Miner different values for k were tried; k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = f(w₁·1 + w₂·X₂ + w₃·X₃ + w₄·X₄)

(diagram: inputs 1, X₂, X₃, X₄ with weights w₁ … w₄ feeding a single neural network compute node)

f is the so-called activation function; this could be the logit function, but other choices are possible (with f the identity, this is exactly linear regression). There are four weights w that have to be determined.
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula form, the prediction formula for a NN with one hidden layer is given by

Z_m = σ(α_{0m} + α_mᵀ X), m = 1, …, M
Y = g(β_0 + βᵀ Z)

(diagram: inputs X₁ … X₄, here age, income, region and gender; hidden layer Z₁, Z₂, Z₃; output Y; α and β label the weights on the two layers)

The functions g and σ are typically sigmoid (logistic) functions, e.g. σ(v) = 1 / (1 + e^{−v}). In the case of a binary classifier, g is a logistic function as well, so the output can be read as a probability.

The model weights α and β have to be estimated from the data.
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back-propagation algorithm:

Randomly choose small values for all the w_i's. Then, for each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w in the direction that decreases E (a gradient step, w ← w − η·∂E/∂w).
4. Stop if the error E is small enough.
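A tiny sketch of this loop on a single logistic compute node (toy data assumed; the cross-entropy error is used here, for which the gradient step works out to the one-liner below):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((100, 3))                       # toy inputs
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # toy 0/1 target

    w = rng.normal(scale=0.1, size=3)     # randomly chosen small start weights
    eta = 0.5                             # learning rate
    for _ in range(200):
        p = 1 / (1 + np.exp(-(X @ w)))    # 1. neural net prediction (logistic node)
        E = p - y                         # 2. prediction errors
        w -= eta * X.T @ E / len(y)       # 3. adjust w along the error gradient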
DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

(diagram: inputs X₁ … X₄ are encoded into a small middle layer and then decoded back to X₁ … X₄)

A linear activation function with a 2-dimensional middle layer corresponds to 2-dimensional principal components analysis; such a 2-dimensional middle layer can be used for visualisation.
Often more hidden layers with many nodes are used: an encoding stack followed by a decoding stack, trained so that OUTPUT = INPUT.
NEURAL NET CARS EXAMPLE
(figure: 2-dimensional PCA vs an autoencoder network 25-15-2-15-25)
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x₁, …, x₄₀₀)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR            */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear to see that the forest can produce much smoother predictions
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient Boosting M iterations m = 12hellipM
Inputs x
r1
Final model FMhellip M
At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner
Pseudo residuals rim at each step
r2 rMInputs
xInputs
x
Fm = Fm-1 + γmiddothm
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decomposition:

    A = U \Sigma V^{T}

where \Sigma is diagonal with r singular values [could be many thousands].

Take only the first k << r singular values:

    A_k = U_k \Sigma_k V_k^{T}
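A small numpy sketch of this truncation, with toy random data standing in for a real term-document matrix (k = 300 matches the document-vector length mentioned above):

    import numpy as np

    A = np.random.rand(20000, 1000)                # toy stand-in: 20,000 terms x 1,000 documents
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 300                                        # keep only the first k << r singular values
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation A_k
    docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T        # every document as a short vector of length 300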
TEXT MINING APPLICATIONS
Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud).

Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

(Figure: word clouds for Topic 1, Topic 2, Topic 3.)
RECOMMENDATION ENGINE: Which products should I recommend to my customers?
RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users × 100K items → roughly 0.01% of the cells filled.
User – Item Matrix – Data

            Item 1   Item 2   Item 3   Item 4   Item 5
User 1        3        2        5        4        5
User 2        -        -        -        1        1
User 3        1        -        2        5        -
User 4        -        -        1        2        5
User 5        2        1        4        2        3
User 6        2        3        -        5        1
User 7        5        1        -        3        4
User 8        -        1        -        4        1
User 9        2        3        2        4        2
User 10       -        1        3        -        1

User 4's item ratings:               -      -      1      2      5
After some math… predicted ratings:  3.21   4.82   1      2      5

Recommend item 2 (the highest predicted rating among User 4's unrated items).
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms: Slope one (slope1), K nearest neighbors (knn)
Model-based algorithms: Matrix factorization (SVD - LBFGS)
Market basket analysis: Association rules mining (arm)
Mixture of different methods: Clustering (cluster), Ensemble
RE METHODS SLOPE ONE
y = x + b, a line with slope equal to 1.

Item-item based:
Weight w_ij = the number of users having rated both items i and j.
Rating r_uj = the rating of user u for item j (worked example after the table below).
Sample rating database:

Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
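Reading the flattened table as Mark having rated items A and B, and Lucy items B and C (the layout was lost in this transcript, so this reading is an assumption), weighted slope one predicts Lucy's rating for item A from the average item differences:

    \mathrm{dev}(A,B) = \frac{(5-3) + (3-4)}{2} = 0.5, \quad w_{AB} = 2
    \mathrm{dev}(A,C) = \frac{5-2}{1} = 3, \quad w_{AC} = 1

    \hat r_{\mathrm{Lucy},A} = \frac{(2 + 0.5)\cdot 2 + (5 + 3)\cdot 1}{2 + 1} = \frac{13}{3} \approx 4.33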
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

(Figure: similarity weights w link a user to the neighbors N.)
RE METHODS: PEARSON CORRELATION

a, b : users
r_{a,p} : rating of user a for item p
P : set of items rated both by a and b
• Possible similarity values between −1 and +1

    sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar r_a)(r_{b,p} - \bar r_b)}
                    {\sqrt{\sum_{p \in P} (r_{a,p} - \bar r_a)^2}\,\sqrt{\sum_{p \in P} (r_{b,p} - \bar r_b)^2}}
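A direct Python transcription of this formula, computing the similarity over the co-rated items P (here centering on the mean over P; some variants center on each user's overall mean):

    import numpy as np

    def pearson_sim(ratings_a, ratings_b):
        # ratings_x: dict mapping item -> rating for one user
        common = sorted(set(ratings_a) & set(ratings_b))   # the set P of co-rated items
        if len(common) < 2:
            return 0.0
        ra = np.array([ratings_a[p] for p in common], float)
        rb = np.array([ratings_b[p] for p in common], float)
        da, db = ra - ra.mean(), rb - rb.mean()
        denom = np.sqrt((da ** 2).sum()) * np.sqrt((db ** 2).sum())
        return float((da * db).sum() / denom) if denom > 0 else 0.0

    print(pearson_sim({"A": 5, "B": 3, "C": 2}, {"A": 3, "B": 4, "C": 1}))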
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?

Factorize the m × n user–item matrix R into two low-rank matrices (users U: m × k, items V: n × k):

    R \approx U V^{T}

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: \hat R_{ij} = U_i^{T} V_j

Minimize the prediction error over the observed ratings:

    \min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^{T} V_j \right)^2 + \lambda \left( \| U_i \|^2 + \| V_j \|^2 \right)
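A toy sketch of this factorization in Python / numpy; for brevity it uses plain stochastic gradient steps on the penalized squared error rather than the L-BFGS or ALS optimizers named on the slide:

    import numpy as np

    R = {(0, 0): 3, (0, 2): 5, (1, 3): 1, (2, 0): 1, (2, 2): 2}   # observed (user, item) -> rating
    m, n, k, lam, lr = 3, 4, 2, 0.2, 0.01                          # sizes, penalty, step size

    rng = np.random.default_rng(123)
    U = rng.normal(0, 0.1, (m, k))                                 # user factors
    V = rng.normal(0, 0.1, (n, k))                                 # item factors

    for epoch in range(500):
        for (i, j), r in R.items():
            err = r - U[i] @ V[j]                                  # error on one observed cell
            U[i] += lr * (err * V[j] - lam * U[i])                 # gradient steps on the
            V[j] += lr * (err * U[i] - lam * V[j])                 # penalized squared error

    print(U @ V.T)                                                 # filled-in rating matrix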
RE METHODS CLUSTER
(Diagram: user/item profiles and ratings → clustering → kNN within one subgroup → predictions.)
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

    Support(X,Y) = (# trxs with X and Y) / (total # trxs)
    Lift(X → Y) = Support(X,Y) / ( Support(X) · Support(Y) )

Example values from the slide: Diapers → Beer 0.8; Diapers → Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
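A small Python sketch of support and lift computed from raw transactions (toy data, illustrative only):

    transactions = [
        {"diapers", "beer"}, {"diapers", "beer", "chips"},
        {"diapers"}, {"beer"}, {"candles"},
    ]
    n = len(transactions)

    def support(*items):
        s = set(items)
        return sum(1 for t in transactions if s <= t) / n    # fraction of trxs containing all items

    sup_xy = support("diapers", "beer")                      # Support(X,Y) = 2/5 = 0.4
    lift = sup_xy / (support("diapers") * support("beer"))   # 0.4 / (0.6 * 0.6) ≈ 1.11
    print(sup_xy, lift)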
METHOD ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS / item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR / recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors   = 20
      label     = "svd"
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      maxfeval  = 5000
      function  = L2
      lambda    = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = "svd"
      num    = 3
      users  = ("Longhow Lam");
run;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS
• Unfamiliar to a broader audience; (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews into data points in SVD space.
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R² linear regression = 0.5; R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

Set-up (see the sketch after this list):
• 70/30 training/validation split
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
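An illustrative Python / scikit-learn sketch of the same experiment shape (70/30 split, k-NN at the three k values); it uses sklearn's small 8×8 digits set as a stand-in, not the actual 42,000 MNIST records scored in Enterprise Miner:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)                  # small stand-in for the MNIST digits
    Xtr, Xval, ytr, yval = train_test_split(X, y, test_size=0.30, random_state=1)

    for k in (8, 16, 24):                                # the k values tried on the slide
        knn = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
        print(k, round(1 - knn.score(Xval, yval), 4))    # validation misclassification rate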
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels. We see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

(Audio clips of the spoken digits "1" and "2".)
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient boosting: M iterations, m = 1, 2, …, M.

At each successive iteration a base learner h_m (a decision tree) is fit on the pseudo-residuals r_m, using the inputs x, to "correct" the previous learner:

    F_m = F_{m-1} + γ·h_m

The pseudo-residuals r_1, r_2, …, r_M are recomputed at each step; after M steps the final model is F_M.
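A minimal Python sketch of this loop for squared loss, where the pseudo-residuals are plain residuals and each h_m is a shallow regression tree (illustrative only, not the Enterprise Miner gradient boosting node):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 6, (200, 1))
    y = np.sin(x).ravel() + rng.normal(0, 0.1, 200)

    gamma, M = 0.1, 100                       # learning rate and number of iterations
    F = np.full_like(y, y.mean())             # F0: start from a constant model
    trees = []
    for m in range(M):
        r = y - F                             # pseudo-residuals (squared loss: plain residuals)
        h = DecisionTreeRegressor(max_depth=2).fit(x, r)   # base learner hm fit on the residuals
        trees.append(h)
        F += gamma * h.predict(x)             # Fm = Fm-1 + gamma * hm

    def predict(x_new):
        # final model FM: the initial constant plus all correction trees
        return y.mean() + gamma * sum(t.predict(x_new) for t in trees)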
SUPPORT VECTOR MACHINES
Support vector machines (SVM): suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If the classes are not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The decision boundary in the input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS (textbook forms below)
• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
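The slide's formulas were images that did not survive this transcript; the textbook forms (cf. The Elements of Statistical Learning, the deck's own reference) for the four bullets are:

    Separable:      \min_{w,b} \tfrac{1}{2}\|w\|^2   subject to   y_i (w^T x_i + b) \ge 1
    Non-separable:  \min_{w,b,\xi} \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i   subject to   y_i (w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0
    Lagrange dual:  \max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j   subject to   0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0
    Kernels:        replace every inner product x_i^T x_j by K(x_i, x_j)

With this scaling the margin is M = 1/‖w‖, and the earlier "total penalty smaller than C" phrasing is the equivalent constrained form of the C-weighted penalty term.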
https://www.youtube.com/watch?v=3liCbRZPrZA
Linearly not separable in the original space, but separable in 3D after a mapping.
K – NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: of the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.
K-NN METHOD
(Decision boundaries for 1 nearest neighbour versus 15 nearest neighbours.)
K-NN METHOD
Using different numbers k of nearest neighbours: test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices (sketch below).
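A hedged Python / scikit-learn sketch of this idea, with made-up coordinates and prices (the deck's actual comparison was done in SAS Enterprise Miner):

    from sklearn.neighbors import KNeighborsRegressor

    # toy lat/lon coordinates of postal codes with a known house price
    coords = [[52.37, 4.89], [52.36, 4.90], [51.92, 4.48], [52.09, 5.12]]
    prices = [450000, 430000, 280000, 350000]

    knn = KNeighborsRegressor(n_neighbors=3).fit(coords, prices)
    print(knn.predict([[52.35, 4.91]]))    # estimate for a postal code without a listed price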
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE DUTCH HOUSE PRICES
30% of the data was used as validation set. In Enterprise Miner different values for k were used; k=5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK LINEAR REGRESSION
(Diagram: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding a single compute node f.)

    Y = f(X, w) = w1·1 + w2·X2 + w3·X3 + w4·X4

Neural network compute node: f is the so-called activation function. This could be the logit function, but other choices are possible; for plain linear regression f is the identity.

There are four weights w's that have to be determined.
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula, the prediction of a single-hidden-layer NN is given by the equations below.

(Diagram: inputs X1 = age, X2 = income, X3 = region, X4 = gender; hidden layer Z1, Z2, Z3 reached via weights α; output Y reached via weights β.)

The functions g and σ are the output and activation functions. In case of a binary classifier, g is the logistic function. The model weights α and β have to be estimated from the data.
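The equations themselves were lost in transcription; in the notation of The Elements of Statistical Learning (the deck's recommended reference), a single-hidden-layer network computes

    Z_m = \sigma(\alpha_{0m} + \alpha_m^{T} X), \quad m = 1, \dots, M
    Y = g(\beta_0 + \beta^{T} Z)

with the sigmoid activation \sigma(v) = 1 / (1 + e^{-v}); for a binary classifier g is again the sigmoid, so Y can be read as a probability.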
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back-propagation algorithm:

Randomly choose small values for all weights w_i. Then, for each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w according to the gradient-descent update below.
4. Stop if the error E is small enough.
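The adjustment formula on the slide was likewise lost; the standard gradient-descent update it refers to is

    w \leftarrow w - \eta \, \frac{\partial E}{\partial w}

where η is the learning rate that controls the step size.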
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use inputs to predict the inputs
(Diagram: ENCODE — inputs X1…X4 compressed to a small middle layer; DECODE — reconstructed back to X1…X4.)
A linear activation function corresponds with 2-dimensional principal components analysis. A 2-dimensional middle layer is used for visualisation.
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Often more hidden layers with many nodes
ENCODE / DECODE — INPUT → OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SUPPORT VECTOR MACHINES
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS

Combine structured customer data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

Topic 1 | Topic 2 | Topic 3
RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users x 100K items, roughly 0.01% filled.

User-Item Matrix (ratings):

           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:                 -      -      1      2      5
After some math, the predictions are:  3.21   4.82   1      2      5

Recommend item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE

y = x + b, i.e. a regression line with slope equal to 1 (see notes).

Item-item based:
Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the rating of item j predicted from the average deviation between the items.

Sample rating database:

Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
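A minimal sketch of the weighted slope one prediction on exactly this sample database (plain Python, not PROC RECOMMEND): predicting Lucy's rating for Item A from her ratings of B and C gives 4.33.

# Ratings from the sample database; missing entries simply absent
ratings = {
    "John": {"A": 5, "B": 3, "C": 2},
    "Mark": {"A": 3, "B": 4},
    "Lucy": {"B": 2, "C": 5},
}

def predict(user, target):
    """Weighted slope one prediction of `target` for `user`."""
    num, den = 0.0, 0
    for j, r_uj in ratings[user].items():                  # items the user did rate
        # users who rated both the target item and item j
        both = [r for r in ratings.values() if target in r and j in r]
        if not both:
            continue
        dev = sum(r[target] - r[j] for r in both) / len(both)  # average deviation
        num += (dev + r_uj) * len(both)                    # weight = number of co-raters
        den += len(both)
    return num / den

print(round(predict("Lucy", "A"), 2))  # 4.33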
RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood", typically as a similarity-weighted average with similarity weights $w$ over the neighborhood $N$:

$$\hat r_{ui} = \frac{\sum_{v \in N} w_{uv}\, r_{vi}}{\sum_{v \in N} |w_{uv}|}$$

• How to determine the neighbors, and how many (k) to use?
• How to compute the similarity/distance measure?
  - Pearson's correlation coefficient
  - Cosine distance
  - Other adjustments
RE METHODS: PEARSON CORRELATION

$a, b$: users; $r_{a,p}$: rating of user $a$ for item $p$; $P$: set of items rated both by $a$ and $b$.
• Possible similarity values between $-1$ and $1$.

$$sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar r_a)(r_{b,p} - \bar r_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar r_a)^2}\;\sqrt{\sum_{p \in P} (r_{b,p} - \bar r_b)^2}}$$
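A minimal sketch of this similarity (plain Python; the helper name pearson_sim and the two toy users are mine, not from any SAS library):

from math import sqrt

def pearson_sim(ra, rb):
    """Pearson similarity between two users' rating dicts over co-rated items."""
    common = [p for p in ra if p in rb]               # P: items rated by both
    if len(common) < 2:
        return 0.0
    ma = sum(ra[p] for p in common) / len(common)     # mean rating of a on P
    mb = sum(rb[p] for p in common) / len(common)
    num = sum((ra[p] - ma) * (rb[p] - mb) for p in common)
    da = sqrt(sum((ra[p] - ma) ** 2 for p in common))
    db = sqrt(sum((rb[p] - mb) ** 2 for p in common))
    return num / (da * db) if da and db else 0.0

a = {"item1": 3, "item2": 2, "item3": 5}
b = {"item1": 1, "item2": 2, "item3": 4}
print(pearson_sim(a, b))  # about 0.79: the two users rank items similarly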
RE METHODS: K NEAREST NEIGHBORS METHOD
RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Factor the $m \times n$ user-item matrix into two low-rank matrices: $R \approx U V$, with $U$ an $m \times k$ users matrix and $V$ a $k \times n$ items matrix.

• Select a loss function (squared error).
• Select the number of hidden factors k.
• Optimization problem: L-BFGS, ALS.

Predict a new rating: $\hat R_{ij} = U_i^T V_j$.

Minimize the prediction error:

$$\min_{U,V} \sum_{i,j} \big(R_{ij} - U_i^T V_j\big)^2 + \lambda \big(\|U_i\|^2 + \|V_j\|^2\big)$$
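A minimal stochastic-gradient sketch of this objective (plain Python with numpy; PROC RECOMMEND itself uses L-BFGS, and the toy matrix, learning rate and epoch count are invented):

import numpy as np

rng = np.random.default_rng(123)

# Toy rating matrix; np.nan marks missing entries
R = np.array([[3,      2,      5,      4, 5],
              [np.nan, np.nan, np.nan, 1, 1],
              [1,      np.nan, 2,      5, np.nan],
              [np.nan, np.nan, 1,      2, 5]], dtype=float)

m, n = R.shape
k, lam, lr = 2, 0.2, 0.01            # hidden factors, L2 penalty, learning rate
U = rng.normal(scale=0.1, size=(m, k))
V = rng.normal(scale=0.1, size=(n, k))

observed = [(i, j) for i in range(m) for j in range(n) if not np.isnan(R[i, j])]

for epoch in range(2000):
    for i, j in observed:
        err = R[i, j] - U[i] @ V[j]             # prediction error on one rating
        U[i] += lr * (err * V[j] - lam * U[i])  # gradient step with L2 shrinkage
        V[j] += lr * (err * U[i] - lam * V[j])

print(np.round(U @ V.T, 2))   # filled-in rating matrix, including the missing cells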
RE METHODS: CLUSTER

Figure: user/item profiles and user/item ratings -> clustering -> knn within one subgroup -> predictions.

First cluster the users (or items) on their profiles and ratings; then apply knn within one subgroup to generate the predictions.
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C; IF item X THEN item Y.

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = (# transactions with X and Y) / (total # transactions)

Lift = Support(X, Y) / (Support(X) * Support(Y))

Support & lift examples: Diapers -> Beer: 0.8; Diapers -> Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
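A minimal sketch of these two measures (plain Python; the five toy transactions are invented):

transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "candles"},
    {"beer", "chips"},
    {"diapers", "beer", "candles"},
]

def support(*items):
    """Fraction of transactions containing all the given items."""
    s = set(items)
    return sum(s <= t for t in transactions) / len(transactions)

sup_xy = support("diapers", "beer")
lift = sup_xy / (support("diapers") * support("beer"))
print(f"support={sup_xy:.2f} lift={lift:.2f}")  # support=0.60 lift=0.94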
METHOD: ENSEMBLE

Take a linear combination of the previous methods to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lambda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
RUN;
QUIT;
LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience; (more) difficult to explain.
• Black-box approach ("you are rejected: the computer says NO").
• Often the relations can already be modeled with classical regression models.
• It allows you to not think about the business problem.

PROS:
• Often less data preparation (manual tuning) necessary: just throw it in the algorithm...
• Interactions are often "automatically" taken into account.
• Superior for text mining, image & speech recognition.
• Better lift possible (a few percent "for free").
• It allows you to not think about the business problem.
WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High-performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces. So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5; R² neural net = 0.6.
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of handwritten digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

With a 70/30 training/validation split, the following techniques were tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA: APPLY MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits. The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes...
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

(Audio samples: 1, 2.)
SUPPORT VECTOR MACHINES (SVM)

Suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M; so the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, i.e. x², x³ or spline(x).

The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
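The four optimization problems are shown as images on the slide and were lost in extraction; for reference, a standard textbook reconstruction (following e.g. Hastie, Tibshirani & Friedman; the slide's exact notation may differ) is:

$$\begin{aligned}
&\text{Separable:} && \min_{\beta,\beta_0}\ \tfrac12\|\beta\|^2 \quad \text{s.t. } y_i(x_i^T\beta+\beta_0)\ge 1 \\
&\text{Non-separable:} && \min_{\beta,\beta_0}\ \tfrac12\|\beta\|^2 + C\sum_i \xi_i \quad \text{s.t. } y_i(x_i^T\beta+\beta_0)\ge 1-\xi_i,\ \xi_i\ge 0 \\
&\text{Lagrange dual:} && \max_{\alpha}\ \sum_i \alpha_i - \tfrac12\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t. } 0\le\alpha_i\le C,\ \sum_i \alpha_i y_i = 0 \\
&\text{Kernels:} && x_i^T x_j \ \to\ K(x_i, x_j)
\end{aligned}$$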
https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.
K - NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, ..., xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: among the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.
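A minimal sketch of this majority vote (plain Python; the six labeled points are invented so that the 5 nearest neighbours of the query split 3 red vs 2 green):

from collections import Counter
from math import dist

# Labeled training points: ((x, y), colour)
points = [((0, 0), "red"), ((1, 0), "red"), ((0, 1), "red"),
          ((3, 3), "green"), ((3, 4), "green"), ((4, 3), "green")]

def knn_classify(x0, k=5):
    """Majority vote among the k points closest to x0."""
    nearest = sorted(points, key=lambda p: dist(p[0], x0))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((1, 1)))  # 3 red vs 2 green among the 5 nearest -> "red"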
K-NN METHOD

Panels: 1 nearest neighbour vs. 15 nearest neighbours.
K-NN METHOD

Use different numbers k of nearest neighbours; compare the test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.

Comparing different nearest neighbours in SAS Enterprise Miner.
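A minimal sketch of this estimate as k-NN regression (plain Python; the coordinates and prices are invented, and a real implementation would use geographic distance between postal codes):

from math import dist

# Known (x, y) locations with a house-for-sale price
priced = [((0, 0), 250_000), ((1, 0), 300_000), ((0, 2), 280_000),
          ((5, 5), 450_000), ((6, 5), 475_000)]

def estimate_price(location, k=3):
    """Average price of the k nearest priced locations."""
    nearest = sorted(priced, key=lambda p: dist(p[0], location))[:k]
    return sum(price for _, price in nearest) / k

print(estimate_price((1, 1)))  # mean of the 3 closest prices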
K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner, different values for k were used; k=5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Figure: a single compute node with inputs 1, X2, X3, X4 and weights w1, w2, w3, w4.

Y = f(X, w) = f(w1 + w2*X2 + w3*X3 + w4*X4), where the constant input 1 carries the intercept w1.

f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION

Figure: inputs X1...X4 (here: age, income, region, gender) feed a hidden layer Z1...Z3, which feeds the output Y, with weight vectors α and β.

In formula, the prediction formula for a NN is given by

$$Z_m = \sigma\big(\alpha_{0m} + \alpha_m^T X\big), \qquad Y = g\big(\beta_0 + \beta^T Z\big),$$

where σ is typically the sigmoid $\sigma(v) = 1/(1+e^{-v})$ and, in the case of a binary classifier, g is a sigmoid as well, so the output is a probability between 0 and 1.

The model weights α and β have to be estimated from the data.
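A minimal forward-pass sketch of this formula (plain Python with numpy; the weights are random stand-ins for values that would be estimated from data):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# One observation with 4 inputs (e.g. age, income, region, gender, encoded numerically)
x = np.array([0.3, 1.2, 0.0, 1.0])

# Random stand-ins for the estimated weights: 3 hidden units, 1 output
alpha0, alpha = rng.normal(size=3), rng.normal(size=(3, 4))
beta0, beta = rng.normal(), rng.normal(size=3)

z = sigmoid(alpha0 + alpha @ x)   # hidden layer: Z_m = sigma(alpha_0m + alpha_m' X)
y = sigmoid(beta0 + beta @ z)     # binary-classifier output in (0, 1)
print(z, y)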
NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all weights w_i. Then, for each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual - prediction)²).
3. Adjust the weights w according to the gradient-descent update $w_i \leftarrow w_i - \alpha\, \partial E/\partial w_i$.
4. Stop if the error E is small enough.
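A minimal sketch of these four steps for a single linear node with squared error, where the gradient works out to ∂E/∂w = -2*(actual - prediction)*x (plain Python; the toy data follows y = 1 + 2x and the learning rate is invented):

# One-node network: prediction = w[0] + w[1]*x; train on toy points of y = 1 + 2x
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w = [0.1, -0.2]          # small starting weights
lr = 0.05                # learning rate (alpha)

for epoch in range(500):
    for x, actual in data:
        pred = w[0] + w[1] * x
        err = actual - pred
        # Gradient of E = err**2: dE/dw0 = -2*err, dE/dw1 = -2*err*x
        w[0] += lr * 2 * err
        w[1] += lr * 2 * err * x

print([round(v, 2) for v in w])   # close to [1.0, 2.0]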
DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS
https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

Figure: inputs X1...X4 are ENCODEd to a small middle layer and DECODEd back to X1...X4.

A linear activation function corresponds with two-dimensional principal components analysis. A two-dimensional middle layer is used for visualisation.
NEURAL NETS: AUTOENCODERS

In practice there are often more hidden layers with many nodes: INPUT -> ENCODE -> DECODE -> OUTPUT = INPUT.
NEURAL NET: CARS EXAMPLE

Two-dimensional PCA compared with an autoencoder network 25 – 15 – 2 – 15 – 25.
NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits.
• Each image has 400 pixels.
• So: a 400-dimensional input vector X = (x1, ..., x400).
• Compare two-dimensional PCA with a neural net autoencoder (the PROC NEURAL code and the resulting two-dimensional plot are shown at the top of this section).
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Support vector machines (SVM) Suppose we have a separable classification problem
Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line
If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C
The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)
The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification
Non Separable classification
Non Separable classification rewritten using Lagrange Dual problem
Kernels to model nonlinear behaviour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
httpswwwyoutubecomwatchv=3liCbRZPrZA
Linear not separable but in 3D space they are
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.
So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
• R² linear regression = 0.5
• R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS
(MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data with their KNOWN labels in red.
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

• 70/30 training/validation split
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA: APPLY MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are the predicted labels; we see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Two audio clips: spoken digits 1 and 2]
https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.
K-NEAREST NEIGHBOUR
K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

[Figure: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red]
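A minimal Python sketch of exactly this majority vote (the training points, labels and query point are toy assumptions):

import numpy as np

def knn_predict(X_train, y_train, x0, k=5):
    """Majority vote among the k training points closest to query x0."""
    d = np.linalg.norm(X_train - x0, axis=1)        # Euclidean distances
    nearest = np.argsort(d)[:k]                     # indices of the k neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array(["red", "red", "red", "green", "green", "green"])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=5))   # -> "red" (3 red vs 2 green)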
K-NN METHOD

[Decision boundary plots: 1 nearest neighbour vs. 15 nearest neighbours]
K-NN METHOD

Using different numbers k of nearest neighbours: test and training errors.

Despite its simplicity, k-nearest neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices, as sketched below.
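A minimal Python sketch of this k-nearest price average (the coordinates and prices are made-up toy data, standing in for postal-code locations):

import numpy as np

# toy data: postal-code centroids (x, y in km) with a known asking price
coords = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
prices = np.array([200_000, 210_000, 190_000, 450_000, 470_000])

def knn_price(query, k=3):
    """Estimate a price as the mean of the k nearest known prices."""
    d = np.linalg.norm(coords - query, axis=1)
    return prices[np.argsort(d)[:k]].mean()

print(knn_price(np.array([0.5, 0.5]), k=3))   # averages the three nearby prices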
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were used; k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION

$$ Y = f(X, w) = w_1 + w_2 X_2 + w_3 X_3 + w_4 X_4 $$

[Diagram: inputs 1, X2, X3, X4 feed a single compute node with weights w1, w2, w3, w4]

f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w's that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula, the prediction formula for a NN is given by

$$ Y = g\Big( \beta_0 + \sum_m \beta_m Z_m \Big), \qquad Z_m = \sigma\big( \alpha_{0m} + \alpha_m^T X \big) $$

[Diagram: inputs X1 (age), X2 (income), X3 (region), X4 (gender) → hidden layer Z1, Z2, Z3 → output Y, with weights α and β]

The functions g and σ are activation functions; typically σ is the sigmoid, σ(v) = 1 / (1 + e^{-v}). In case of a binary classifier, g is the logistic function as well.

The model weights α and β have to be estimated from the data.
NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back propagation algorithm:

Randomly choose small values for all wi's. For each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual - prediction)²)
3. Adjust the weights w according to the gradient descent step w ← w - α · ∂E/∂w
4. Stop if the error E is small enough
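As a toy illustration of this loop, a minimal Python sketch fitting a single linear neuron with plain gradient descent (the data, learning rate and stopping threshold are made-up assumptions):

import numpy as np

# one linear neuron y = w0 + w1*x, squared error, plain gradient descent
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 50)
y = 2 + 3 * x + rng.normal(scale=0.1, size=50)     # toy data around y = 2 + 3x

w = rng.normal(scale=0.01, size=2)                 # small random starting weights
eta = 0.1                                          # learning rate
for _ in range(5000):
    pred = w[0] + w[1] * x
    err = y - pred
    grad = np.array([-2 * err.mean(), -2 * (err * x).mean()])  # dE/dw
    w -= eta * grad                                # w <- w - eta * dE/dw
    if (err ** 2).mean() < 1e-4:                   # stop if the error is small
        break
print(np.round(w, 2))                              # approximately [2, 3]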
DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS

https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: X1…X4 → ENCODE → 2-dimensional middle layer → DECODE → X1…X4]

A linear activation function corresponds with 2-dimensional principal components analysis. The 2-dimensional middle layer can be used for visualisation.
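Since a linear autoencoder with a 2-node middle layer recovers the same subspace as 2-dimensional PCA, here is a minimal numpy sketch of that PCA projection (illustrative only, not the proc neural code; the random matrix stands in for real images):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 400))          # toy stand-in for 1000 digit images

Xc = X - X.mean(axis=0)                   # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                    # project onto the first 2 components
print(scores.shape)                       # (1000, 2) -> plot for visualisation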
NEURAL NETS: AUTOENCODERS

https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes.

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]
NEURAL NET: CARS EXAMPLE

2-dimensional PCA vs. an autoencoder network 25 - 15 - 2 - 15 - 25.
NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
  data = autoencoderTraining
  dmdbcat = work.autoencoderTrainingCat;
  performance compile details cpucount = 12 threads = yes;

  /* DEFAULTS: ACT = TANH, COMBINE = LINEAR          */
  /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
  /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
  archi MLP hidden = 5;
  hidden 300 / id = h1;
  hidden 100 / id = h2;
  hidden 2   / id = h3 act = linear;
  hidden 100 / id = h4;
  hidden 300 / id = h5;
  input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
  target pixel1 - pixel400 / act = identity id = t level = int std = std;

  /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
  initial random = 123;
  prelim 10 preiter = 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
TEXT MINING
TEXT MINING BASICS

"Advanced" word counting:

Parse & filter: part-of-speech tagging, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.
TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"  [I walk down the street in Amsterdam 1057DK with my bike]
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"  [She did not walk but cycled with her blue bike bitlycomsdrtw]
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"  [My two-wheeler is broken, what a bad piece of iron $$]

TERM DOCUMENT MATRIX A

Terms                          Doc 1   Doc 2   Doc 3
+Fiets (noun)                    1       1       1
Fietsen (verb)                   0       1       0
Blauwe (adjective)               0       1       0
Amsterdam (location)             1       0       0
+Lopen (verb)                    1       1       0
Straat (noun)                    1       0       0
Kapot (adverb)                   0       0       1
Slecht                           0       0       1
Stuk Ijzer                       0       0       1
1057DK (postal code)             1       0       0
bitlycomsdrtw (Internet)         0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse
Apply a singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ V^T, where Σ is diagonal with r singular values [could be many thousands].

Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.
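A minimal numpy sketch of this truncation (the tiny term-document matrix and k are made-up assumptions; real matrices would be far larger and sparser):

import numpy as np

# toy term-document matrix A (terms x documents), mostly zeros
A = np.array([[1, 1, 1],
              [0, 1, 0],
              [1, 0, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U Sigma V^T
k = 2                                              # keep only k << r singular values
docs_k = (np.diag(s[:k]) @ Vt[:k]).T               # each document as a k-vector
print(docs_k.round(2))                             # short document representations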
TEXT MINING APPLICATIONS

Combine structured customer data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.
RECOMMENDATION ENGINE

Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives roughly 0.01% filled.

User-Item Matrix - Data
           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:             -     -     1     2     5
After some math… the scores are: 3.21  4.82    1     2     5

Recommend item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE

Item-item based: y = x + b, with the slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: predicted from the user's other ratings plus the average rating difference between item j and each rated item, as sketched below.

Sample rating database:

Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
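A minimal Python sketch of slope one on exactly this sample database, predicting Lucy's rating for item A (the weighting by co-rater counts w_ij follows the weighted variant of the scheme):

# Slope One on the sample rating database from the slide
ratings = {"John": {"A": 5, "B": 3, "C": 2},
           "Mark": {"A": 3, "B": 4},
           "Lucy": {"B": 2, "C": 5}}

def slope_one(user, target):
    num = den = 0.0
    for other, r_other in ratings[user].items():
        # average deviation dev(target, other) over users who rated both items
        diffs = [r[target] - r[other] for r in ratings.values()
                 if target in r and other in r]
        if diffs:                               # w_ij = number of co-raters
            w = len(diffs)
            num += (sum(diffs) / w + r_other) * w
            den += w
    return num / den

print(round(slope_one("Lucy", "A"), 2))         # -> 4.33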
RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood":
• How to determine the neighbours, and how many (k) to use?
• How to compute the similarity/distance measure? Pearson's correlation coefficient, cosine distance, other adjustments.

[Figure: similarity w and neighbours N]
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K ndash NEAREST NEIGHBOUR
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are
closest in distance to x0bull Classify x0 using the majority vote among the k neighbours
x05 nearest neighbours of x0
3 of them are red 2 of them are green so we predict x0 to be red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
1 nearest neighbour 15 nearest neighbour
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN METHOD
Use different numbers k of nearest neighbours test and traning errors
Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" ("I walk down the street in Amsterdam 1057DK with my bike")
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" ("She did not walk but cycled with her blue bike")
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" ("My two-wheeler is broken, what a bad piece of iron")

Terms                       Doc 1   Doc 2   Doc 3
+Fiets (znmw)                 1       1       1
Fietsen (ww)                  0       1       0
Blauwe (bvg)                  0       1       0
Amsterdam (locatie)           1       0       0
+Lopen (ww)                   1       1       0
Straat (znmw)                 1       0       0
Kapot (bijw)                  0       0       1
Slecht                        0       0       1
Stuk Ijzer                    0       0       1
1057DK (postcode)             1       0       0
bitlycomsdrtw (Internet)      0       1       0

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term-document matrix:
• Often more terms than documents
• Rows can be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition (SVD) first.
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.

Matrix SVD decomposition:

   A = U Σ V^T,   where Σ is a diagonal matrix with r singular values [r could be many thousands]

Take only the first k << r singular values:

   A_k = U_k Σ_k V_k^T

Each document is then represented by a short k-dimensional vector (a small SAS/IML sketch follows below).
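To make this concrete, here is a minimal SAS/IML sketch on an assumed toy 5 x 3 term-document matrix; in the real workflow SAS Text Miner computes the SVD for you.

proc iml;
   /* toy term-document matrix A: 5 terms (rows) x 3 documents (columns) */
   A = {1 1 1,
        0 1 0,
        1 0 0,
        1 1 0,
        0 0 1};
   call svd(U, Q, V, A);                       /* A = U * diag(Q) * V` */
   k = 2;                                      /* keep only the first k singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;   /* rank-k approximation of A */
   docCoords = V[, 1:k];                       /* each document as a short k-vector */
   print Q, docCoords;
quit;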
TEXT MINING APPLICATIONS
Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Diagram: documents grouped into Topic 1, Topic 2, Topic 3]
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE USER - ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives roughly 0.01% filled.

User - Item Matrix - Data
           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:                  -      -      1      2      5
After some math... the predictions are: 3.21   4.82   1      2      5

Recommend item 2 (the highest predicted rating among the unrated items).
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: slope one (slope1), k-nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS SLOPE ONE
y = x + b, a regression line with slope equal to 1.

Item-item based:
• Weight w_ij: the number of users having rated both items i and j
• Rating r_uj: the average rating computed from item j

Sample rating database (a worked prediction follows below):
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         2        -        5
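A worked slope-one prediction from the sample table above, for Lucy's unknown rating of item B:

   average difference B - A = ((3 - 5) + (4 - 3)) / 2 = -0.5   ->  prediction via A: 2 + (-0.5) = 1.5   (2 users rated both A and B)
   average difference B - C = (3 - 2) / 1 = 1.0                ->  prediction via C: 5 + 1.0 = 6.0      (1 user rated both B and C)

   weighted prediction for Lucy on B: (2 x 1.5 + 1 x 6.0) / (2 + 1) = 3.0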
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?
How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: similarity w between users; neighborhood N of the k most similar users]
RE METHODS PEARSON CORRELATION

• r_{a,p}: rating of user a for item p
• P: the set of items rated both by a and b
• Possible similarity values between -1 and 1

   sim(a,b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )

(A small SAS/IML sketch follows below.)
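A minimal SAS/IML sketch of this formula, with assumed toy ratings on the items that users a and b both rated:

proc iml;
   /* ratings of users a and b on the common item set P (illustrative numbers) */
   ra = {5, 3, 2, 4};
   rb = {3, 4, 1, 4};
   da = ra - mean(ra);                                   /* center the ratings */
   db = rb - mean(rb);
   sim = (da` * db) / ( sqrt(da` * da) * sqrt(db` * db) );
   print sim;                                            /* value between -1 and 1 */
quit;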
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?

[Diagram: the m x n rating matrix R (users x items) is factorized as R ≈ U V, with U an m x k matrix and V a k x n matrix]

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:  R̂_ij = U_i^T V_j

Minimize the prediction error:

   min_{U,V}  Σ_{i,j} ( R_ij − U_i^T V_j )²  +  λ ( ‖U_i‖² + ‖V_j‖² )

(A small gradient-descent sketch follows below.)
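The following SAS/IML sketch makes the loss and the gradient step concrete on a made-up 4 x 3 rating matrix; it is a toy illustration under assumed data, not how PROC RECOMMEND implements SVD-LBFGS.

proc iml;
   /* toy rating matrix R: 4 users x 3 items, 0 = missing (illustrative) */
   R = {5 3 0,
        4 0 1,
        1 1 0,
        0 2 5};
   m = nrow(R);  n = ncol(R);  k = 2;         /* k hidden factors */
   call randseed(123);
   U = randfun(m || k, "Uniform") * 0.1;      /* m x k factor matrix */
   V = randfun(k || n, "Uniform") * 0.1;      /* k x n factor matrix */
   eta = 0.02;  lambda = 0.2;
   do iter = 1 to 5000;
      E = (R - U * V) # (R > 0);              /* error on observed cells only */
      U = U + eta * (E * V` - lambda * U);    /* gradient steps with L2 penalty */
      V = V + eta * (U` * E - lambda * V);
   end;
   Rhat = U * V;                              /* predicted ratings, incl. missing cells */
   print Rhat[format = 6.2];
quit;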
RE METHODS CLUSTER
[Diagram: user/item profiles and user/item ratings are first grouped by clustering; predictions are then made with kNN within one subgroup]
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

   IF item A and B THEN item C
   IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

   Support(X,Y) = (# transactions with X and Y) / (total # transactions)
   Lift(X -> Y) = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift examples: Diapers -> Beer 0.8; Diapers -> Candles 0.018.

For example, a lift of 2.5 means: people who have X are 2.5 times more likely to buy Y than people who don't have X. (A small numeric illustration follows below.)
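A small numeric illustration with assumed round numbers: if Support(X,Y) = 0.02, Support(X) = 0.10 and Support(Y) = 0.08, then

   Lift(X -> Y) = 0.02 / (0.10 x 0.08) = 2.5

so X and Y occur together 2.5 times as often as they would if they were independent.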
METHOD ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd
      factors = 20 label = svd
      fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000
      function = L2 lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD ARM label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;
/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT
      method = svd label = svd
      num = 3
      users = ("Longhow Lam");
RUN;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach ("you are rejected: the computer says NO")
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used Text Miner to parse and filter the reviews, and to transform them into data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5, R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

[Figure: the first 100 digits of the MNIST data with their KNOWN labels in red]
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
A 70/30 training/validation split was used, trying:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100 and 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbors has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels. Our best model predicted the label for these digits.

[Figure: the first 100 predicted digits together with the handwritten digits; red numbers are the predicted labels. We see some obvious mistakes...]
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio: spoken digits "1" and "2"]
K-NN METHOD
Use different numbers k of nearest neighbours; compare test and training errors.

Despite its simplicity, k-nearest-neighbors has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest for-sale house prices (a small sketch follows below).
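A minimal SAS/IML sketch of that estimate, with assumed toy coordinates and prices; the actual comparison in the deck was done in SAS Enterprise Miner, shown next.

proc iml;
   /* postal codes with a known for-sale price: x,y location and price (toy data) */
   xy    = {0 0,  1 0,  0 1,  2 2,  3 1};
   price = {200, 220, 210, 350, 300};            /* in thousands, illustrative */
   new   = {0.4 0.3};                            /* postal code without a price */
   k = 3;
   d = sqrt( (xy[,1] - new[1])##2 + (xy[,2] - new[2])##2 );
   call sortndx(ndx, d, 1);                      /* order postal codes by distance */
   est = mean( price[ ndx[1:k] ] );              /* average of the k nearest prices */
   print est;
quit;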
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE DUTCH HOUSE PRICES
30% of the data was used as a validation set. In Enterprise Miner different values for k were tried; k = 5 nearest neighbors has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK LINEAR REGRESSION
Y = f(X, w) = f( w1·1 + w2·X2 + w3·X3 + w4·X4 )

[Diagram: inputs 1, X2, X3, X4 with weights w1-w4 feed a neural network compute node]

f is the so-called activation function. This could be the logit function, but other choices are possible.
There are four weights w's that have to be determined.
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula form, the prediction formula for a NN is given by the standard single-hidden-layer network:

[Diagram: inputs X1-X4 (age, income, region, gender) feed the hidden layer Z1, Z2, Z3 with weights α; the hidden layer feeds the output Y with weights β]

   Z_m = σ( α_{0m} + α_m^T X ),  m = 1, ..., M
   Y = g( β_0 + β^T Z )

The functions g and σ are defined as follows: σ is typically the sigmoid σ(v) = 1 / (1 + e^(−v)); in case of a binary classifier, g is the logistic function as well.

The model weights α and β have to be estimated from the data.
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back-propagation algorithm:

Randomly choose small values for all w_i's. For each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w according to the gradient step  w_new = w_old − η · ∂E/∂w
4. Stop if the error E is small enough

(A minimal one-neuron sketch follows below.)
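A minimal one-neuron SAS/IML illustration of these four steps, on assumed toy data; in practice PROC NEURAL does all of this, and more, for you.

proc iml;
   /* toy data: one input x, binary target y (illustrative) */
   x = {0.5, 1.0, 1.5, 2.0, 3.0};
   y = {0,   0,   1,   1,   1};
   w = {0.01, 0.01};                                /* small starting weights w0, w1 */
   eta = 0.5;                                       /* learning rate */
   do iter = 1 to 500;
      do i = 1 to nrow(x);
         p = 1 / (1 + exp(-(w[1] + w[2] * x[i])));  /* 1. neural net prediction (logit) */
         e = y[i] - p;                              /* 2. error */
         g = -2 * e * p * (1 - p);                  /* 3. gradient of (actual - prediction)^2 */
         w[1] = w[1] - eta * g;                     /*    adjust the weights */
         w[2] = w[2] - eta * g * x[i];
      end;
   end;
   print w;                                         /* 4. in practice: stop when E is small */
quit;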
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price
For a Postal code with no price estimate the price by taking the k closest house for sale prices
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Comparing different nearest neighbours in SAS Enterprise Miner
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
K-NN EXAMPLE DUTCH HOUSE PRICES
30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
NEURAL NET CARS EXAMPLE
2-dimensional PCA versus an autoencoder network 25 – 15 – 2 – 15 – 25.
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR            */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1 - pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
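As a toy illustration of nodes, links and conditional probability tables, consider a two-node network Rain → WetGrass in Python (all probabilities invented):

# CPTs of a two-node Bayesian network: Rain -> WetGrass.
p_rain = 0.2                               # P(Rain = true)
p_wet_given = {True: 0.9, False: 0.1}      # P(WetGrass = true | Rain)

# The joint distribution factorizes along the graph: P(R, W) = P(R) * P(W | R).
joint = {(r, w): (p_rain if r else 1 - p_rain)
                 * (p_wet_given[r] if w else 1 - p_wet_given[r])
         for r in (True, False) for w in (True, False)}

# Inference by Bayes' rule: P(Rain = true | WetGrass = true) ~ 0.69.
p_wet = joint[(True, True)] + joint[(False, True)]
print(joint[(True, True)] / p_wet)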
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting.

Parse & Filter: part of speech, entity detection, mixed / numeric / abbrev., stemming, spell checks, stop list, synonym list, multi-term words.

Apply: traditional data mining, clustering, prediction, machine learning.
TEXT MINING BASICS
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitly.com/sdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

Terms                        Doc 1   Doc 2   Doc 3
+Fiets (noun)                  1       1       1
Fietsen (verb)                 0       1       0
Blauwe (adjective)             0       1       0
Amsterdam (location)           1       0       0
+Lopen (verb)                  1       1       0
Straat (noun)                  1       0       0
Kapot (adverb)                 0       0       1
Slecht                         0       0       1
Stuk Ijzer                     0       0       1
1057DK (postcode)              1       0       0
bitly.com/sdrtw (internet)     0       1       0

TERM DOCUMENT MATRIX A:
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
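Counting up matrix A for the three documents above takes only a few lines. A Python sketch (the parsing is done by hand here; stemming, spell checking and the synonym list — e.g. tweewieler → fiets — are assumed to have been applied already):

import numpy as np

docs = [
    ["fiets", "lopen", "straat", "amsterdam", "1057dk"],        # Document 1, after parsing
    ["fiets", "fietsen", "lopen", "blauwe", "bitly.com/sdrtw"], # Document 2 (fieets -> fiets)
    ["fiets", "kapot", "slecht", "stuk ijzer"],                 # Document 3 (tweewieler -> fiets)
]

terms = sorted({t for d in docs for t in d})
A = np.zeros((len(terms), len(docs)), dtype=int)   # the term-document matrix
for j, d in enumerate(docs):
    for t in d:
        A[terms.index(t), j] += 1                  # count term t in document j

for t, row in zip(terms, A):
    print(f"{t:16s}", row)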
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term-document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse
Apply singular value decomposition first.
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.

Matrix SVD decomposition:  A = U Σ Vᵀ,  where Σ is a diagonal matrix with r singular values [r could be many thousands].

Take only the first k << r singular values:  A_k = U_k Σ_k V_kᵀ.
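In numpy the truncation is a couple of lines (a sketch on random stand-in data; the matrix size and k are made up):

import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(0.05, size=(5000, 300)).astype(float)   # sparse-ish term-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)        # A = U * diag(s) * V^T

k = 50                                                  # keep only the first k << r singular values
A_k = U[:, :k] * s[:k] @ Vt[:k, :]                      # rank-k approximation A_k

# Each document (column) is now a k-dimensional vector instead of 5000 term counts:
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T            # shape: (300 documents, k)
print(A_k.shape, doc_vectors.shape)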
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (Topic 1, Topic 2, Topic 3, …).
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items → ~0.01% filled.

User - Item Matrix – Data:
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:        -       -       1       2       5
After some math… the recommendations for User 4 are:  3.21    4.82    1       2       5
→ Recommend item 2.
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms: Slope one (slope1), K nearest neighbors (knn)
Model-based algorithms: Matrix factorization (SVD – LBFGS)
Market basket analysis: Association rules mining (arm)
Mixture of different methods: Clustering (cluster), Ensemble
RE METHODS SLOPE ONE
Item-item based: predict with y = x + b, with the slope fixed at 1.

• Weight w_ij: the number of users having rated both items i and j.
• Rating r_uj: computed as the w-weighted average, over the items i that user u rated, of (the average deviation between items j and i) + r_ui (see the sketch below).

Sample rating database:
Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        ?
Lucy          ?        2        5
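A compact weighted slope-one scorer over this table (Python/numpy sketch; np.nan marks a missing rating):

import numpy as np

#             Item A   Item B   Item C
R = np.array([[5.0,     3.0,     2.0   ],   # John
              [3.0,     4.0,     np.nan],   # Mark
              [np.nan,  2.0,     5.0   ]])  # Lucy

def slope_one(R, u, j):
    # Weighted slope-one prediction of user u's rating for item j.
    num = den = 0.0
    for i in range(R.shape[1]):
        if i == j or np.isnan(R[u, i]):
            continue                                # only items the user did rate
        both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])
        w = both.sum()                              # w_ij: users who rated both i and j
        if w == 0:
            continue
        dev = np.mean(R[both, j] - R[both, i])      # average deviation of item j vs item i
        num += (dev + R[u, i]) * w
        den += w
    return num / den

print(slope_one(R, 1, 2))    # Mark's predicted rating for Item C: 3.33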
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?
How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: similarity w between users; neighborhood N of the k most similar users.]
RE METHODS: PEARSON CORRELATION

• a, b: users
• r_{a,p}: rating of user a for item p
• P: set of items rated both by a and b
• Possible similarity values between −1 and 1

sim(a,b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )
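The formula translates one-to-one into code (sketch; the two rating vectors are invented, np.nan = not rated):

import numpy as np

def pearson_sim(ra, rb):
    # sim(a,b) computed over P, the items rated by both users.
    P = ~np.isnan(ra) & ~np.isnan(rb)
    da = ra[P] - ra[P].mean()
    db = rb[P] - rb[P].mean()
    return (da * db).sum() / (np.sqrt((da ** 2).sum()) * np.sqrt((db ** 2).sum()))

a = np.array([3.0, 2.0, 5.0, 4.0, np.nan])
b = np.array([np.nan, 1.0, 4.0, 2.0, 3.0])
print(pearson_sim(a, b))    # a value between -1 and 1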
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?

Factorize the m × n rating matrix R (users × items) into two low-rank matrices: R ≈ U V, where U is m × k and V is k × n.

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:  R̂_ij = U_iᵀ V_j

Minimize the prediction error:  min_{U,V} Σ_{i,j} (R_ij − U_iᵀ V_j)² + λ(‖U_i‖² + ‖V_j‖²)
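A bare-bones factorization with simple per-rating gradient steps, standing in for the L-BFGS/ALS solvers (Python/numpy sketch; sizes, step size and λ are invented):

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 10, 5, 2                            # users, items, hidden factors
R = rng.integers(1, 6, size=(m, n)).astype(float)
mask = rng.uniform(size=R.shape) < 0.5        # only ~half of the ratings are observed
R[~mask] = np.nan

U = rng.normal(scale=0.1, size=(m, k))        # user factors U (m x k)
V = rng.normal(scale=0.1, size=(n, k))        # item factors V (n x k)
lam, eta = 0.2, 0.01                          # regularization and step size

for step in range(2000):
    for i, j in zip(*np.where(mask)):         # loop over the observed ratings
        e = R[i, j] - U[i] @ V[j]             # error R_ij - U_i^T V_j
        Ui = U[i].copy()
        U[i] += eta * (e * V[j] - lam * U[i]) # gradient step on the penalized squared error
        V[j] += eta * (e * Ui - lam * V[j])

print(np.round(U @ V.T, 2))                   # filled-in rating matrix R-hat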
RE METHODS CLUSTER
First cluster the users/items on their profiles or ratings; then apply knn within one subgroup to compute the predictions.

[Diagram: user/item profiles and ratings → clustering → knn within one subgroup → predictions.]
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc. rules mining: identify frequent itemsets (rules) in the transaction data:

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X,Y) = #trxs containing X and Y / #total trxs
Lift(X→Y) = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift example: Diapers–Beer 0.8; Diapers–Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
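Support and lift are direct counts over the transactions. A Python sketch (the baskets are made up):

trxs = [{"diapers", "beer"}, {"diapers", "beer", "bread"},
        {"beer"}, {"candles"}, {"diapers", "candles"},
        {"bread", "beer"}, {"diapers", "beer", "candles"}]

def support(*items):
    # Fraction of transactions containing all of the given items.
    s = set(items)
    return sum(s <= t for t in trxs) / len(trxs)

def lift(x, y):
    # Lift of the rule X -> Y: support(X,Y) / (support(X) * support(Y)).
    return support(x, y) / (support(x) * support(y))

print(support("diapers", "beer"))   # support of the itemset {diapers, beer}
print(lift("diapers", "beer"))      # > 1: diapers make beer more likely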
METHOD ENSEMBLE
A linear combination of the previous methods, to achieve better performance (see the sketch below).
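For explicit ratings the ensemble can be as simple as a weighted average of the individual predictions (sketch; the two prediction vectors and the 0.6/0.4 weights are invented and would normally be tuned on a validation set):

import numpy as np

pred_svd = np.array([3.2, 4.8, 1.0, 2.1])   # ratings predicted by matrix factorization
pred_knn = np.array([3.6, 4.4, 1.4, 2.0])   # ratings predicted by k nearest neighbors

pred_ens = 0.6 * pred_svd + 0.4 * pred_knn  # linear combination of the methods
print(pred_ens)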
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD-LBFGS with 20 factors */
   METHOD svd
      factors = 20
      label = svd
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD ARM
      label = ARM;
   RUN;

   /* Information on the recommender system */
   INFO;
QUIT;
/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT
      method = svd
      label = svd
      Num = 3
      users = (Longhow Lam);
run;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach ("you are rejected: the computer says NO")
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and to transform the reviews into data points in SVD space.
Predicted review score vs. given review score.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
R² linear regression = 0.5; R² neural net = 0.6
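For reference, the reported R² is one minus the ratio of residual to total sum of squares. A tiny Python sketch (the scores are invented):

import numpy as np

def r_squared(actual, predicted):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

actual = np.array([8.0, 6.5, 9.0, 7.0, 5.5])       # given review scores
predicted = np.array([7.5, 6.8, 8.4, 7.2, 6.1])    # predicted review scores
print(r_squared(actual, predicted))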
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKSDEEP LEARNING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORK LINEAR REGRESSION
f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41
X2
X3
X4w4
w3
w1
w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible
There are four weights wrsquos that have to be determined
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS MATHEMATICAL FORMULATION
In formula the prediction forumla for a NN is geiven by
Leeftijd
Inkomen
Regio
Geslacht
X1
X2
X3
X4
Z1
Z2
Z3
Y
N
X inputs Hidden layer z outputs
α1
β1
De functions g and σ are defined as
In case of a binary classifier
The model weights α and β have to be estimated from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1 - pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
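As a toy illustration of such conditional probability tables (the variables, arrows and all numbers below are invented for illustration, not from the workshop): a three-node chain age → income → default, queried by straightforward enumeration.

# P(age, inc, dft) = P(age) * P(inc | age) * P(dft | inc)
p_age = {"young": 0.4, "old": 0.6}
p_inc_given_age = {("young", "low"): 0.7, ("young", "high"): 0.3,
                   ("old", "low"): 0.4, ("old", "high"): 0.6}
p_dft_given_inc = {("low", 1): 0.2, ("low", 0): 0.8,
                   ("high", 1): 0.05, ("high", 0): 0.95}

def joint(age, inc, dft):
    return p_age[age] * p_inc_given_age[(age, inc)] * p_dft_given_inc[(inc, dft)]

# query P(default = 1 | age = "young") by enumerating over the hidden variable
num = sum(joint("young", inc, 1) for inc in ("low", "high"))
den = sum(joint("young", inc, d) for inc in ("low", "high") for d in (0, 1))
print("P(default | young) =", num / den)   # 0.155 with these invented tables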
TEXT MINING
TEXT MINING BASICS
"Advanced" word counting:
Parse & filter: part-of-speech tagging, entity detection, mixed/numeric/abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.
Apply traditional data mining: clustering, prediction, machine learning.
TEXT MINING BASICS
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" ("I walk down the street in Amsterdam 1057DK with my bicycle")
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitly.com/sdrtw" ("She did not walk but cycled with her blue bike"; note the misspelled "fieets")
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" ("My two-wheeler is broken, what a bad piece of iron")

TERM-DOCUMENT MATRIX A
Terms                         Doc 1   Doc 2   Doc 3
+Fiets (znmw)                   1       1       1
Fietsen (ww)                    0       1       0
Blauwe (bvg)                    0       1       0
Amsterdam (locatie)             1       0       0
+Lopen (ww)                     1       1       0
Straat (znmw)                   1       0       0
Kapot (bijw)                    0       0       1
Slecht                          0       0       1
Stuk Ijzer                      0       0       1
1057DK (postcode)               1       0       0
bitly.com/sdrtw (Internet)      0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A (a rough sketch follows)
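A rough Python equivalent of the word-counting step (plain counts only; the Text Miner features listed above such as stemming, entity detection and synonym lists are not reproduced):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Ik loop over straat in Amsterdam 1057DK met mijn fiets",
    "Zij liep niet maar fietste met haar blauwe fieets",
    "Mijn tweewieler is kapot wat een slecht stuk ijzer",
]

vec = CountVectorizer()
A = vec.fit_transform(docs)   # sparse matrix: rows = documents, columns = terms
print(vec.get_feature_names_out())
print(A.toarray())            # the slide's term-document matrix is its transpose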
TEXT MINING: THE TERM-DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term-document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse
→ Apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [could be many thousands].
Take only the first k << r singular values: A_k = U_k Σ_k V_kᵀ.
A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
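A numpy sketch of this truncation on a random stand-in matrix (in practice A is the sparse term-document matrix, and k would be around 300 as above):

import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(0.05, size=(2000, 500)).astype(float)   # stand-in: 2000 terms x 500 documents

U, s, Vt = np.linalg.svd(A, full_matrices=False)        # A = U @ diag(s) @ Vt
k = 50                                                  # keep only the k << r largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]                # rank-k approximation of A
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T               # one k-dimensional vector per document
print(doc_vectors.shape)                                # (500, 50)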
TEXT MINING APPLICATIONS
Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users × 100K items → ~0.01% filled.

User–item matrix (data):
           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predictions for User 4 are: 3.21, 4.82, 1, 2, 5.
→ Recommend item 2 (highest predicted rating).
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND
• Memory-based algorithms: slope one (slope1), k-nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE
Item-item based: y = x + b, with the slope equal to 1.
Weight w_ij: the number of users having rated both items i and j.
Rating r̄_j: the average rating computed for item j.

Sample rating database (a Python sketch follows):
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
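A small Python sketch of the weighted slope-one prediction on this table (the cell placement for Mark and Lucy is assumed from the classic slope-one example, since the slide layout was lost):

ratings = {"John": {"A": 5, "B": 3, "C": 2},
           "Mark": {"A": 3, "B": 4},
           "Lucy": {"B": 2, "C": 5}}

def dev(j, i):
    """Average deviation (item j minus item i) over users who rated both, plus the count."""
    diffs = [r[j] - r[i] for r in ratings.values() if j in r and i in r]
    return sum(diffs) / len(diffs), len(diffs)

def predict(user, j):
    num = den = 0.0
    for i, r_ui in ratings[user].items():
        if i == j:
            continue
        d, c = dev(j, i)
        num += (r_ui + d) * c        # weight by the number of co-rating users
        den += c
    return num / den

print("Mark's predicted rating for item C:", round(predict("Mark", "C"), 2))   # ~3.33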
RE METHODS: K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood":
• How to determine the neighbors N, and how many (k) to use?
• How to compute the similarity/distance measure w?
  • Pearson's correlation coefficient
  • Cosine distance
  • Other adjustments
RE METHODS: PEARSON CORRELATION
a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1.

sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )
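A direct Python translation of this formula (here the means r̄_a and r̄_b are taken over the common items P, one common variant):

import math

def pearson_sim(a, b):
    """a and b are dicts {item: rating}; P is the set of items both users rated."""
    P = set(a) & set(b)
    if len(P) < 2:
        return 0.0                       # not enough overlap to compare
    ma = sum(a[p] for p in P) / len(P)   # mean rating of user a on the common items
    mb = sum(b[p] for p in P) / len(P)
    num = sum((a[p] - ma) * (b[p] - mb) for p in P)
    da = math.sqrt(sum((a[p] - ma) ** 2 for p in P))
    db = math.sqrt(sum((b[p] - mb) ** 2 for p in P))
    return num / (da * db) if da and db else 0.0

u1 = {"item1": 3, "item2": 2, "item3": 5, "item4": 4}
u2 = {"item2": 1, "item3": 4, "item4": 2, "item5": 3}
print(pearson_sim(u1, u2))   # ~0.93 on this toy pair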
RE METHODS: K NEAREST NEIGHBORS METHOD
RE METHODS: MATRIX FACTORIZATION
How do we fill in the missing data? Factorize the m × n user–item matrix R into a user matrix U (m × k) and an item matrix V (k × n): R ≈ U V.
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: R̂_ij = U_iᵀ V_j
Minimize the prediction error: min_{U,V} Σ_{i,j} (R_ij − U_iᵀ V_j)² + λ(‖U_i‖² + ‖V_j‖²)
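A minimal numpy sketch of this regularized objective, fitted with stochastic gradient descent over the observed cells only (the slide names L-BFGS/ALS; SGD is used here just to keep the sketch short). R is the user–item table from the earlier slide, with nan for missing cells.

import numpy as np

rng = np.random.default_rng(0)
R = np.array([[3, 2, 5, 4, 5],
              [np.nan, np.nan, np.nan, 1, 1],
              [1, np.nan, 2, 5, np.nan],
              [np.nan, np.nan, 1, 2, 5],
              [2, 1, 4, 2, 3],
              [2, 3, np.nan, 5, 1],
              [5, 1, np.nan, 3, 4],
              [np.nan, 1, np.nan, 4, 1],
              [2, 3, 2, 4, 2],
              [np.nan, 1, 3, np.nan, 1]])

k, lam, lr = 2, 0.2, 0.02                       # hidden factors, regularization, step size
U = 0.1 * rng.normal(size=(R.shape[0], k))
V = 0.1 * rng.normal(size=(R.shape[1], k))
obs = [(i, j) for i in range(R.shape[0]) for j in range(R.shape[1]) if not np.isnan(R[i, j])]

for epoch in range(500):
    for i, j in obs:
        err = R[i, j] - U[i] @ V[j]             # prediction error on one observed cell
        u_old = U[i].copy()
        U[i] += lr * (err * V[j] - lam * U[i])  # gradient step on the user factors
        V[j] += lr * (err * u_old - lam * V[j]) # gradient step on the item factors

pred = U @ V.T
print("predicted ratings for User 4:", np.round(pred[3], 2))  # fills in the two missing cells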
RE METHODS: CLUSTER
First cluster the users/items on their profiles or ratings; then apply knn within one subgroup to make the predictions.
[Diagram: user/item profiles and ratings → clustering → knn within one subgroup → predictions]
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, for example
IF item A and B THEN item C
IF item X THEN item Y
Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule (a small sketch follows).

Support(X → Y) = (# trxs with X and Y) / (total # trxs)
Lift(X → Y) = Support(X, Y) / (Support(X) · Support(Y))

Support & lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.
For example, a lift of 2.5 means: people who have X are 2.5 times more likely to buy Y than people who don't have X.
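A tiny Python sketch of these two formulas on invented basket data:

transactions = [{"diapers", "beer"}, {"diapers", "beer", "milk"},
                {"beer"}, {"diapers", "candles"}, {"milk", "beer"}]

def support(*items):
    """Fraction of transactions containing all the given items."""
    hits = sum(1 for t in transactions if set(items) <= t)
    return hits / len(transactions)

def lift(x, y):
    return support(x, y) / (support(x) * support(y))

print("support(diapers -> beer):", support("diapers", "beer"))   # 0.4
print("lift(diapers -> beer):   ", round(lift("diapers", "beer"), 2))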
METHOD: ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method: SVD with L-BFGS and 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;
   METHOD arm / label = "ARM";
   RUN;
   /* Information on the recommender system */
   INFO;
QUIT;
/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach ("you are rejected: the computer says NO")
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING?
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So, can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all wirsquo s For each data point (observation)
1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to
4 Stop if error E is small enough
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Neural networks that use inputs to predict the inputs
X1
X2
X3
X4
X1
X2
X3
X4
ENCODE DECODE
Linear activation function corresponds with 2 dimensional principle components analysis
2 dimensional middle layerFor visualisation
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODERS
httpsupportsascomresourcespapersproceedings14SAS313-2014pdf
Often more hidden layers with many nodes
ENCODE DECODE
INPUT OUTPUT = INPUT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NET CARS EXAMPLE
2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings "in the neighborhood".
• How to determine the neighbors, and how many (k) to use?
• How to compute the similarity/distance measure?
  - Pearson's correlation coefficient
  - Cosine distance
  - Other adjustments
[Diagram: similarity w, neighbors N]
RE METHODS: PEARSON CORRELATION
a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b.
• Possible similarity values between -1 and 1.

sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2} \, \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}
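A hedged Python sketch (my own, using the notation above, not the proc's implementation) of the Pearson similarity and a k-nearest-neighbour rating prediction that weights each neighbour's deviation from their own mean rating.

from math import sqrt

def pearson(ra: dict, rb: dict) -> float:
    """Pearson similarity over P, the items rated by both users; value in [-1, 1]."""
    P = set(ra) & set(rb)
    if len(P) < 2:
        return 0.0
    ma = sum(ra[p] for p in P) / len(P)
    mb = sum(rb[p] for p in P) / len(P)
    num = sum((ra[p] - ma) * (rb[p] - mb) for p in P)
    den = sqrt(sum((ra[p] - ma) ** 2 for p in P)) * sqrt(sum((rb[p] - mb) ** 2 for p in P))
    return num / den if den else 0.0

def predict(ratings: dict, user: str, item: str, k: int = 2) -> float:
    """r_ui = user's mean + similarity-weighted average of neighbours' deviations."""
    mu = sum(ratings[user].values()) / len(ratings[user])
    neigh = [(pearson(ratings[user], rv), v) for v, rv in ratings.items()
             if v != user and item in rv]
    neigh = sorted(neigh, reverse=True)[:k]          # keep the k most similar raters
    num = sum(w * (ratings[v][item] - sum(ratings[v].values()) / len(ratings[v]))
              for w, v in neigh)
    den = sum(abs(w) for w, v in neigh)
    return mu + num / den if den else mu

ratings = {
    "u1": {"i1": 5, "i2": 3, "i3": 4},
    "u2": {"i1": 3, "i2": 1, "i3": 2, "i4": 3},
    "u3": {"i1": 4, "i2": 3, "i3": 4, "i4": 5},
}
print(predict(ratings, "u1", "i4"))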
RE METHODS K NEAREST NEIGHBORS METHOD
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?
Factorize the m × n user–item rating matrix R as R ≈ U V, with U an m × k user-factor matrix and V a k × n item-factor matrix.
• Select loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict new rating: \hat{R}_{ij} = U_i^T V_j

Minimize prediction error: \min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^T V_j \right)^2 + \lambda \left( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \right)
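The slides optimise this objective with L-BFGS or ALS inside the proc; as a toy stand-in, the following Python/numpy sketch (mine, not the SAS implementation) minimises the same squared-error-plus-L2 objective with plain gradient descent on the observed cells only.

import numpy as np

rng = np.random.default_rng(1)

# Toy rating matrix; 0 entries are treated as missing.
R = np.array([
    [3, 2, 5, 4, 5],
    [0, 0, 0, 1, 1],
    [1, 0, 2, 5, 0],
    [0, 0, 1, 2, 5],
], dtype=float)
mask = R > 0
m, n = R.shape
k, lam, lr = 2, 0.2, 0.02          # hidden factors, L2 weight, learning rate

U = 0.1 * rng.standard_normal((m, k))
V = 0.1 * rng.standard_normal((k, n))

for _ in range(2000):
    E = mask * (R - U @ V)         # prediction error on observed cells only
    U += lr * (E @ V.T - lam * U)  # gradient step on the regularised squared error
    V += lr * (U.T @ E - lam * V)

print(np.round(U @ V, 2))          # filled-in matrix: missing cells are now predicted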
RE METHODS CLUSTER
First cluster the users/items on their profiles and ratings; then apply knn within one subgroup to generate the predictions.
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.
IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X,Y) = (# transactions with both X and Y) / (total # transactions)
Lift(X,Y) = Support(X,Y) / (Support(X) · Support(Y))

Support & lift examples: Diapers → Beer 0.8; Diapers → Candles 0.018.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
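A short Python sketch (mine, with made-up transactions) that computes support and lift for a candidate rule exactly as defined above.

transactions = [                      # hypothetical market baskets
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"diapers", "candles"},
    {"bread", "milk"},
    {"beer", "milk"},
]

def support(itemset: set) -> float:
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(X: set, Y: set) -> float:
    """Lift of the rule X -> Y: joint support over the product of marginal supports."""
    return support(X | Y) / (support(X) * support(Y))

print(support({"diapers", "beer"}))   # 0.4
print(lift({"diapers"}, {"beer"}))    # 0.4 / (0.6 * 0.6) ~ 1.11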
METHOD ENSEMBLE
Linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD L-BFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = svd
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = svd
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;
LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach ("you are rejected: the computer says NO")
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining, image recognition, sound recognition, strange faces.
So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform the reviews into data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
Predicted review score vs. given review score:
R² linear regression = 0.5, R² neural net = 0.6
IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA
42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8-nearest neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
• 70/30 training/validation split
• PCA regression on the 50 largest PCs
• Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels.
Our best model predicted the label for these digits.
The first 100 predicted digits, together with the handwritten digits, are displayed here.
Red numbers are the predicted labels. We see some obvious mistakes...
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[Figure: waveforms of the spoken digits "1" and "2"]
NEURAL NET CARS EXAMPLE
2-dimensional PCA / autoencoder network: 25 – 15 – 2 – 15 – 25
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, ..., x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1 - pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING THE WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data.
BAYESIAN NETWORKS
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLE
bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
NEURAL NETS AUTOENCODER EXAMPLEproc neural
data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes
DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED
archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10
run
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Two dimensional representation of 400 dimensial lsquodigitrsquo data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items give a fill rate of roughly 0.01%.

User - Item Matrix – Data:

           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:                  -     -     1     2     5
After some math… the predictions are:  3.21  4.82   1     2     5

Recommend item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE

• Item-item based: predictors of the form y = x + b, with slope equal to 1.
• Weight w_ij: the number of users having rated both items i and j.
• Rating r_uj: the average rating computed from item j.

Sample rating database:

Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
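Worked example (predicting Lucy's rating for Item A): the average difference between Item A and Item B over users who rated both is ((5 − 3) + (3 − 4)) / 2 = 0.5 with weight w_AB = 2, and between Item A and Item C it is (5 − 2) / 1 = 3 with weight w_AC = 1. Lucy's predicted rating for Item A is then (2 × (2 + 0.5) + 1 × (5 + 3)) / (2 + 1) = 13/3 ≈ 4.33.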
RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood": a set of neighbors N with similarity weights w.

• How to determine the neighbors, and how many (k) to use?
• How to compute the similarity / distance measure?
  • Pearson's correlation coefficient
  • Cosine distance
  • Other adjustments
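A common form of the prediction (assuming similarity weights w_uv between user u and each neighbor v in N) is the weighted average

   r̂_ui = Σ_{v ∈ N} w_uv · r_vi / Σ_{v ∈ N} |w_uv|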
RE METHODS: PEARSON CORRELATION

a, b  : users
r_a,p : rating of user a for item p
P     : set of items rated both by a and b
• Possible similarity values between −1 and 1

sim(a, b) = Σ_{p ∈ P} (r_a,p − r̄_a)(r_b,p − r̄_b) / ( √( Σ_{p ∈ P} (r_a,p − r̄_a)² ) · √( Σ_{p ∈ P} (r_b,p − r̄_b)² ) )
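A minimal sketch of this similarity (hypothetical ratings on the co-rated items; SAS/IML assumed available):

   proc iml;
      /* ratings of users a and b on the items both have rated */
      ra = {5, 3, 2, 4};
      rb = {4, 2, 1, 5};
      da = ra - mean(ra);                    /* center on each user's average rating */
      db = rb - mean(rb);
      sim = sum(da # db) / (sqrt(ssq(da)) * sqrt(ssq(db)));
      print sim;                             /* value between -1 and 1 */
   quit;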
RE METHODS: K NEAREST NEIGHBORS METHOD (illustration)
RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Approximate the m × n user-item rating matrix R as R ≈ U V, with U an m × k matrix of user factors and V a k × n matrix of item factors.

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: R̂_ij = U_i^T V_j

Minimize the regularized prediction error:

   min_{U,V} Σ_{i,j} ( R_ij − U_i^T V_j )² + λ ( ‖U_i‖² + ‖V_j‖² )
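A small sketch of the prediction step (toy factor matrices with k = 2 hidden factors; hypothetical numbers, SAS/IML assumed available):

   proc iml;
      /* m = 3 users, n = 4 items, k = 2 hidden factors */
      U = {0.8 0.1,
           0.2 0.9,
           0.5 0.5};                 /* m x k user factors */
      V = {1.0 0.2 0.7 0.4,
           0.3 1.1 0.6 0.9};         /* k x n item factors */
      Rhat = U * V;                  /* predicted ratings: Rhat[i,j] = U_i^T V_j */
      print Rhat;
      /* the regularized squared error on the observed cells of R would be
         sum((R - Rhat)##2) + lambda # (ssq(U) + ssq(V)) */
   quit;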
RE METHODS: CLUSTER

First cluster the users / items on their profiles and ratings; then apply knn within one subgroup (cluster) to generate the predictions.
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining:
• Identify frequent itemsets (rules) in the transaction data:
  IF item A and B THEN item C
  IF item X THEN item Y
• Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X, Y) = (# trxs with X and Y) / (total # trxs)

Lift = Support(X, Y) / ( Support(X) × Support(Y) )

Support & lift: Diapers → Beer: 0.8; Diapers → Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
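Worked example of the arithmetic (hypothetical numbers): if 10% of all transactions contain both X and Y (Support(X, Y) = 0.1), 20% contain X and 20% contain Y, then Lift = 0.1 / (0.2 × 0.2) = 2.5.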
METHOD: ENSEMBLE

• A linear combination of the previous methods
• Used to achieve better performance
PROC RECOMMEND recom = rsIENS;

   /* Add a recommendation system */
   ADD rsIENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rsIENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd
      factors = 20  label = svd  fconv = 1e-3  gconv = 1e-3
      maxiter = 100  MAXFEVAL = 5000  function = L2
      lamda = 0.2  technique = lbfgs;
   RUN;

   METHOD ARM
      label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;
/* prediction with the SVD method */
PROC RECOMMEND recom = rsIENS;
   PREDICT
      method = svd
      label  = svd
      Num    = 3
      users  = ("Longhow Lam");
RUN;
QUIT;
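Here Num = 3 asks for the top three recommendations, generated with the fitted SVD model, for the listed user.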
LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience; (more) difficult to explain
• Black-box approach ("You are rejected: the computer says NO")
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS: MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

The first 100 digits of the MNIST data and their KNOWN labels (in red) are displayed.
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

With a 70/30 training/validation split, the following models were tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets (3, 6, 12, 24, 48, 100, 200 neurons)
• Seven multi-layer neural nets
• Three random forests (100, 500 and 1000 trees)
• 8, 16 and 24 nearest neighbors

8-nearest neighbors has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
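A minimal sketch of that validation check (hypothetical labels; SAS/IML assumed available):

   proc iml;
      /* actual vs. predicted labels on a validation set */
      actual = {1, 7, 3, 8, 8, 0};
      pred   = {1, 7, 5, 8, 3, 0};
      misclass = mean(actual ^= pred);   /* proportion misclassified */
      print misclass;
   quit;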
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS
bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node
bull Random variables are typically binary or discrete
bull The graph structure can be learned from the data
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
ldquoAdvancedrdquo word counting
Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words
Apply Traditional data mining Clustering Prediction machine learning
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING BASICS
Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo
Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0
TERM DOCUMENT MATRIX Abull Each text document is (very) long vector
of word counts (often with many zeros)
bull Apply further mining on this matrix A
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document matrix
bull Often more terms than documents
bull Rows could be strongly correlated
bull Matrix is often very sparse
Apply Singular value decomposition first
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROS AND CONS OF MORE MODERN MACHINE LEARNING (COMPARED TO TRADITIONAL LINEAR / LOGISTIC REGRESSION)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.
So, can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

[Figure: predicted review score vs. given review score]
R² linear regression = 0.5
R² neural net = 0.6
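A hedged Python sketch of this kind of comparison (scikit-learn; the 300 features and the target below are synthetic stand-ins, not the original IENS data, so the R² values will not match the slide):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# stand-in for the reviews-by-300-SVD-dimensions input matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))
y = X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=1000)  # synthetic review score

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
lin = LinearRegression().fit(X_tr, y_tr)
net = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=0).fit(X_tr, y_tr)
print("R2 linear regression:", lin.score(X_te, y_te))
print("R2 neural net:", net.score(X_te, y_te))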
IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
[Figure: first 100 digits of the MNIST data and their KNOWN labels in red]
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbors has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified. (A sketch of this setup follows below.)
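A minimal scikit-learn sketch of the winning 8-nearest-neighbor setup. It assumes the OpenML copy of MNIST and subsamples for speed, so the error rate will only roughly match the slide's figure:

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:10000], y[:10000]   # subsample so the example runs quickly

# 70/30 training/validation split, as on the slide
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.7, random_state=0)
knn = KNeighborsClassifier(n_neighbors=8).fit(X_tr, y_tr)
print("misclassification rate:", 1 - knn.score(X_va, y_va))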
MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.
[Figure: the first 100 predicted digits together with the handwritten digits; red numbers are the predicted labels. We see some obvious mistakes…]
SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE
[Audio waveforms of the spoken digits "1" and "2"]
TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse

Apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ V^T, where Σ is diagonal with r singular values [could be many thousands].

Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
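A small Python sketch of this dimension reduction (scikit-learn; the three toy documents are made up, and note that CountVectorizer builds the document-term orientation of the matrix rather than term-document):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the food was great", "great service great food", "slow service, bland food"]
A = CountVectorizer().fit_transform(docs)   # document-term count matrix (sparse)
svd = TruncatedSVD(n_components=2)          # keep only k = 2 singular values
docs_k = svd.fit_transform(A)               # each document becomes a length-2 vector
print(docs_k)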
TEXT MINING: APPLICATIONS

Combine structured customer data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.
RECOMMENDATION ENGINE: Which product should I recommend to my customers?
RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items, ~0.01% filled.

User-Item Matrix:
           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predictions for User 4 are: 3.21, 4.82, 1, 2, 5. Recommend item 2.
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble
RE METHODS: SLOPE ONE

Item-item based: y = x + b, a regression with slope equal to 1.
Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        ?
Lucy         ?        2        5
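A tiny Python illustration of weighted slope one on this sample database (a sketch of the idea, not the PROC RECOMMEND internals):

def slope_one(ratings, user, target):
    """Weighted slope one: predict ratings[user][target] from co-rated items."""
    num = den = 0.0
    for j, r_uj in ratings[user].items():
        # users who rated both the target item and item j
        both = [r for r in ratings.values() if target in r and j in r]
        if not both:
            continue
        dev = sum(r[target] - r[j] for r in both) / len(both)  # average deviation
        num += (r_uj + dev) * len(both)   # weight by the number of co-raters
        den += len(both)
    return num / den

ratings = {
    "John": {"A": 5, "B": 3, "C": 2},
    "Mark": {"A": 3, "B": 4},
    "Lucy": {"B": 2, "C": 5},
}
print(slope_one(ratings, "Mark", "C"))  # predicted rating of Mark for item C (≈ 3.33)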
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
A document d is not a long vector of m word counts but a much shorter vector say of length 300
Matrix SVD decompositie
Diagonal with r singular values [ could be many thousands ]
UAVT
Σ
take only the first k ltlt r singular values
Uk
Ak
VTk
Σk
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
TEXT MINING APPLICATIONS
Combine customer structured data and unstructured data to better predict behaviour (churn fraud)
Apply machine learning to create a model f to predict the target
Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)
Topic 1 Topic 2 Topic 3
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE Which product should I recommend my customers
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001
User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5
User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1
User 4s Item RatingsUser 4 - - 1 2 5
After some mathhellip recommendations are User 4 321 482 1 2 5
Recommend item 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)
Model-based algorithms Matrix factorization (SVD - LBFGS)
Market basket analysis Association rules mining (arm)
Mixture of different methods Clustering(cluster) Ensemble
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1
See notes
Item-item based
Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j
Sample rating databaseCustomer Item A Item B Item C
John 5 3 2
Mark 3 4
Lucy 2 5
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: "the computer says NO")
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS
• Often less data prep (manual tuning) necessary (just throw it into the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.
So: can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
Predicted review score vs. given review score

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

R² linear regression = 0.5
R² neural net = 0.6
IENS REVIEWS APPLY MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)
MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest neighbors has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

• 70/30 training/validation split (see the sketch after this list)
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors
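The 70/30 split mentioned in the list can be drawn, for example, with PROC SURVEYSELECT (the dataset name and seed are assumptions):

/* Flag 70% of the rows for training; the rest form the validation set */
proc surveyselect data=mnist out=mnist_split outall
                  method=srs samprate=0.70 seed=42;
run;

/* In mnist_split, Selected = 1 marks training rows, 0 validation rows */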
MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the labels for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. The red numbers are the predicted labels; obviously we see some mistakes…
SPEECH RECOGNITION DIGITS RECORDED WITH IPHONE

(Audio samples: the digits 1 and 2)
RE METHODS K NEAREST NEIGHBORS
The rating rui is determined by the ratings ldquoin the neighborhoodrdquo
How to determine the neighbors and how many (k) to use
How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments
Similarity w
Neighbors N
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS
PEARSON CORRELATION users
rating of user for item
set of items rated both by and bull Possible similarity values between and
119956119946119950 (119938 119939 )=sum119953isin119927
(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)
radic sum119953isin119927
(119955119938 119953minus119955119938 )120784radic sum119953 isin119927
(119955119939 119953minus119955 119939)120784
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS K NEAREST NEIGHBORS METHOD
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data
m n
R U=
V
m k k n
Select loss function (squared error)
Select the number of hidden factors k
Optimization problem L-BFGS ALS
user
s
items
119894119895=119880 119894119879119881 119895Predict New Rating R
Minimize prediction error min119906 119907
sum119894 119895
(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)
2+120582(iquest119880 1198942+119881 119895
2)iquestiquest
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHODS CLUSTER
Knn within one subgroup
Useritem profile
Useritem rating
Predictions
Clustering
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data
IF item A and B THEN item C IF item X THEN item Y
Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule
trxs X Y
Total trxs Support (XY) =
Lift = Support (XY)
Support (X) Support(Y)
Support amp Lift Diapers Beer 08
Diapers Candles 0018
For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PROC RECOMMEND recom = rsIENS
Add a recommendation system ADD rsIENS item = item user = user rating = rating
Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)
Method SVD LBFGS met 20 factoren METHOD svd
factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs
RUN
METHOD ARM label = ARM
RUN
information on the recommender system INFOQUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
prediction with the SVD method
PROC RECOMMEND recom = rsIENS PREDICT
method = svdlabel = svdNum = 3users = (Longhow Lam)
run
QUIT
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
LAST SLIDE
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA APPLY MODEL ON TEST SET
28000 digits without known labels
Our best model predicted the label for these digits
First 100 predicted digits together with the handwritten digits are displayed here
Red numbers are predicted labels We see obvious some mistakeshellip
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE
1 2
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
OF MORE MODERN MACHINE LEARNING
CONS Unfamilar with broader audiance (more) difficult to explain
Black box approach (you are rejected The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)
Interactions often ldquoautomaticallyrdquo taken into account
Superior for Text mining Image amp Speech recognition
Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem
(compared to traditional linear logistic regression)PROS AND CONS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
WHY SAS FOR MACHINE LEARNING
bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
SOME MACHINE LEARNING EXAMPLES
Text mining Image recognition Sound recognition Strange faces
So can a machine read see and hear
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews
and transform reviews to data points in SVD space
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
Predicted review score vs Given review score
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
R2 Linear regression = 05R2 Neural Net = 06
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST TRAINING DATA
42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector
First 100 digits of the MNIST data and there KNOWN labels in red
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified
7030 trainingvalidation split
PCA regression on 50 largest PCrsquos
Seven singel layer neural nets 3 6 12 24 48 100 200 neurons
Seven multi layer neural nets
Three Random forest 100 500 and 1000 trees
8 16 and 24 nearest neighbors
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved
MNIST DATA: APPLY MODEL ON TEST SET
28,000 digits without known labels.
Our best model predicted the label for these digits.
The first 100 predicted digits together with the handwritten digits are displayed here.
Red numbers are predicted labels. We see some obvious mistakes…
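Scoring the unlabeled test digits follows the same pattern; a hedged sketch, assuming the tables and variable names from the split above plus an unlabeled table MNIST_TEST:

proc discrim data=train testdata=mnist_test testout=scored
             method=npar k=8 noprint;
   class digit;
   var pix1-pix784;
run;
/* SCORED holds the predicted label in the automatic _INTO_ variable */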
SPEECH RECOGNITION: DIGITS RECORDED WITH AN IPHONE
(audio samples: a spoken "one" and a spoken "two")
SPEECH RECOGNITION
WAV files consist of ~30,000 points: too much redundancy.
Use spectral analysis to convert the signal to the frequency domain.
Still too much: apply principal components.
TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files
(A sketch of the spectral and PCA steps follows below.)
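PROC SPECTRA (SAS/ETS) computes the periodogram of a sampled signal, and PROC PRINCOMP then reduces the frequency features. A rough sketch, assuming one table per recording with the sampled amplitude in SIGNAL, and a reshaped table SPECTRA_WIDE with one row per recording (table and variable names, and the bin count, are assumptions; the transpose step is omitted):

proc spectra data=wav_one1 out=spec_one1 p;
   var signal;                 /* periodogram ordinates appear as P_01 */
run;

/* after transposing: one row per recording, one column per frequency bin */
proc princomp data=spectra_wide out=pc_scores n=10;
   var p_1-p_256;
run;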
SPEECH RECOGNITION
Zero errors on training data.
Zero errors on test data (also 8 'ones' and 8 'twos').
In Enterprise Miner: a neural network with 9 neurons in one hidden layer.
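PROC NEURAL is the code-level counterpart of the Enterprise Miner node. A rough sketch of the network above, assuming principal-component scores PC1-PC10 and the target DIGIT (all names are assumptions, and the exact options used in the talk are not shown on the slide):

proc dmdb batch data=speech_train out=speech_db dmdbcat=speech_cat;
   var pc1-pc10;
   class digit;
run;

proc neural data=speech_train dmdbcat=speech_cat;
   input pc1-pc10 / level=interval;
   target digit / level=nominal;
   hidden 9;                        /* one hidden layer with 9 neurons */
   train;
   score data=speech_test out=speech_scored;
run;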
STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS
Little joke on my colleagues…
STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS
Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).
Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.
Apply advanced analytics on the ABT:
Which faces are look-alikes? proc cluster (hierarchical clustering)
Sales faces? Predictive modeling / machine learning
Who is the Brad Pitt? Nearest neighbour
Strange faces? proc neural auto-encoder
(The clustering step is sketched below.)
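The look-alike step can be sketched with PROC CLUSTER on the landmark ABT. A minimal sketch, assuming a table FACES with a NAME identifier and landmark coordinates X1-X83 and Y1-Y83 (variable names and the cluster count are assumptions):

proc cluster data=faces method=ward outtree=face_tree noprint;
   var x1-x83 y1-y83;
   id name;
run;

proc tree data=face_tree nclusters=8 out=face_clusters noprint;
   id name;
run;
/* faces sharing a CLUSTER value in FACE_CLUSTERS are the look-alikes */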
STRANGE FACE DETECTION: LOOK-ALIKE FACES
STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES
STRANGE FACE DETECTION: STRANGE FACES
SAS faces vs. actors' faces
Read more on my blog.
STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS
SAS faces vs. actors' faces
Read more on my blog.
SLIDE OVERVIEW
- Machine learning with SAS workshop
- Agenda
- Longhow Lam
- Intro
- Machine learning
- SAS software
- The Analytics Lifecycle
- Easy to use GUI
- High performance
- Easy deployable
- Predict
- Machine learning (2)
- Machine learning (3)
- Machine learning (4)
- Machine learning (5)
- Machine learning (6)
- Overview of specific machine learning methods
- "Classical" regression
- Linear & logistic
- Spline regression
- Spline regression (2)
- Spline regression (3)
- Spline regression (4)
- Slide 24
- Spline regression (5)
- Decision trees
- Decision Trees
- Decision trees (2)
- Decision trees (3)
- Decision trees (4)
- Decision trees (regression & classification)
- Decision trees (5)
- Decision trees pros and cons
- Dimension reduction
- Principal Components
- Principal Components (2)
- Principal Components (3)
- Principal Components (4)
- Principal Components (5)
- Principal components
- Singular value
- Singular value (2)
- SVD example
- SVD example (2)
- SVD example (3)
- Variable Clustering
- Variable Clustering (2)
- Variable Clustering (3)
- Bagging & Boosting
- Combine models
- Bagging & Boosting: Random Forests
- Forest vs tree
- FOREST vs TREE
- Gradient boosting
- Gradient boosting (2)
- Support vector machines
- Support vector machines (SVM)
- Support vector machines (SVM) (2)
- Support vector machines (SVM) (3)
- SVM
- Slide 61
- K-nearest neighbour
- k-NN
- K-NN
- K-nn
- K-NN example
- Slide 67
- K-NN example (2)
- Slide 69
- Neural networks
- Neural network
- Neural networks (2)
- Neural networks (3)
- Deep learning
- Neural nets
- Neural nets (2)
- Neural net
- Neural nets (3)
- Neural nets (4)
- Slide 80
- Bayesian networks
- Bayesian
- Slide 83
- Text mining
- Text mining
- Text mining (2)
- Text mining (2)
- Text mining (3)
- Text mining (3)
- Recommendation engine
- Recommendation engine
- Recommendation engine (2)
- RE methods
- RE methods (2)
- RE METHODS
- RE Methods
- RE Methods (2)
- RE Methods (3)
- RE Method
- Method
- Slide 101
- Slide 102
- Last slide
- Pros and cons
- Why SAS
- Some machine learning examples
- Predicting sentiment from restaurant reviews
- Iens reviews
- Use machine
- Use machine (2)
- Iens reviews (2)
- MNIST data in SAS
- MNIST
- MNIST data
- MNIST data (2)
- Speech recognition
- speech
- Speech
- speech
- Strange Face detection
- Strange Face detection (2)
- Strange Face detection (3)
- Strange Face detection (4)
- Strange Face detection (5)
- Strange Face detection (6)